DE102013018376A1

DE102013018376A1 - System for executing sequential code using a single-instruction, multiple-thread processor, has a pipeline control unit configured to create a group of counterpart threads of the sequential code, where one of the counterpart threads is a master thread

Info

Publication number: DE102013018376A1
Application number: DE201310018376
Authority: DE
Inventors: Gautam CHAKRABARTI; Yuan Lin; Jaydeep MARATHE; Okwan Kwon; Amit Sabne
Original assignee: Nvidia Corp
Current assignee: Nvidia Corp
Priority date: 2012-11-05
Filing date: 2013-11-04
Publication date: 2014-05-08

Abstract

The system has a pipeline control unit configured to create a group of counterpart threads of sequential code, where one of the counterpart threads is a master thread and the remaining counterpart threads are slave threads. Lanes are configured to execute certain instructions of the sequential code only in the master thread, with corresponding instructions in the slave threads being predicated on the certain instructions. The lanes are also configured to broadcast branch conditions in the master thread to the slave threads. An independent claim is included for a single-instruction, multiple-thread (SIMT) processor with local memories.

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. Provisional Application Serial No. 61/722,661, filed by Lin et al. on November 5, 2012, entitled "EXECUTING SEQUENTIAL CODE USING A GROUP OF THREADS," and to U.S. Application Serial No. 13/723,981, filed by Lin et al. on December 21, 2012, entitled "SYSTEM AND METHOD FOR EXECUTING SEQUENTIAL CODE USING A GROUP OF THREADS AND SINGLE-INSTRUCTION, MULTIPLE-THREAD PROCESSOR INCORPORATING THE SAME," both of which are commonly assigned with the present application and are incorporated herein by reference.

TECHNICAL FIELD

This application is directed, in general, to parallel processors and, more specifically, to a system and method for executing sequential code using a group of threads and a single-instruction, multiple-thread (SIMT) processor incorporating the system or the method.

BACKGROUND

As those skilled in the art are aware, applications can be executed in parallel to improve their performance. Data-parallel applications carry out the same process concurrently on different data. Task-parallel applications carry out different processes concurrently on the same data. Static parallel applications are applications whose degree of parallelism can be determined before they execute. In contrast, the parallelism achievable by dynamic parallel applications can only be determined while they are executing. Whether an application is data-parallel or task-parallel, or static or dynamic parallel, it may be executed in a pipeline, which is often the case for graphics applications.

A SIMT processor is particularly well suited to executing data-parallel applications. A pipeline control unit in the SIMT processor creates a group of threads and schedules the group for execution, during which all threads in the group execute the same instruction concurrently. In one particular processor, each group has 32 threads, corresponding to 32 execution pipelines, or lanes, in the SIMT processor.

Parallel applications typically contain regions of sequential code and regions of parallel code. Sequential code cannot be executed in parallel and is therefore executed in a single thread. When parallel code is encountered, the pipeline control unit splits execution, creating groups of worker threads to execute the parallel code in parallel. When sequential code is encountered again, the pipeline control unit merges the results of the parallel execution, creates another single thread for the sequential code, and execution continues.

It is important to synchronize the threads in a group. Synchronization involves, in part, reconciling the states of the local memories associated with the respective lanes. It has been recognized that synchronization can be made faster if, when sequential code is executed, a counterpart thread of the sequential code is executed in each of the lanes. The local memory states are then assumed to be already aligned when execution is later split.

SUMMARY

One aspect provides a system for executing sequential code. In one embodiment, the system includes: (1) a pipeline control unit configured to create a group of counterpart threads of the sequential code, one of the counterpart threads being a master thread and the remaining counterpart threads being slave threads, and (2) lanes configured to: (2a) execute certain instructions of the sequential code only in the master thread, corresponding instructions in the slave threads being predicated on the certain instructions, and (2b) broadcast branch conditions in the master thread to the slave threads.

Another aspect provides a method of executing sequential code. In one embodiment, the method includes: (1) creating a group of counterpart threads of the sequential code, one of the counterpart threads being a master thread and the remaining counterpart threads being slave threads, (2) executing certain instructions of the sequential code only in the master thread, corresponding instructions in the slave threads being predicated on the certain instructions, and (3) broadcasting branch conditions in the master thread to the slave threads.

Yet another aspect provides a SIMT processor. In one embodiment, the SIMT processor includes: (1) lanes, (2) local memories associated with corresponding ones of the lanes, (3) a shared memory device for the lanes and (4) a pipeline control unit configured to create a group of counterpart threads of the sequential code and cause the group to be executed in the lanes, one of the counterpart threads being a master thread and the remaining counterpart threads being slave threads. The lanes are configured to: (1) execute certain instructions of the sequential code only in the master thread, corresponding instructions in the slave threads being predicated on the certain instructions, and (2) broadcast branch conditions in the master thread to the slave threads.

BRIEF DESCRIPTION

Reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram of a SIMT processor operable to contain or carry out a system or method for executing sequential code using a group of threads;

FIG. 2 is a block diagram of one embodiment of a system for executing sequential code using a group of threads; and

FIG. 3 is a flow diagram of one embodiment of a method of executing sequential code using a group of threads.

DETAILED DESCRIPTION

As stated above, it has been recognized that synchronization among the lanes, or cores, of a SIMT processor can be made faster if a counterpart thread of the sequential code is executed in each of the lanes. Because the counterpart threads have the same code (i.e., the same instructions in the same order), and because the local memory states are aligned when the counterpart threads begin executing, the assumption that the local memory states remain aligned appears to be a foregone conclusion. However, it is recognized herein that conditions can arise under which the memory states diverge.

As one example, assume that the counterpart threads of the sequential code are to execute the same load instruction. The memory location to be loaded is specified by either a register or an address. If it is specified by a register, the value of the register may differ per thread, because each thread has its own copy of the register. If it is specified by an address, the address value may point to different thread-local memory locations in the system. In either case, the threads may load different values from different memory locations, causing the thread-local memory states to diverge. Were the counterpart threads then to branch based on the loaded data, some of the branches taken would be correct and others would be wrong.
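The load example can be illustrated with a small software model. The following Python sketch is not taken from the application; the lane count, the register and memory contents, and the function name `run_load_and_branch` are assumptions chosen only to show how identical instructions can yield different branch decisions once a per-thread register differs.

```python
# Illustrative model only: each lane has its own copy of the address
# register and its own local memory (all values are made up).

def run_load_and_branch(address_registers, local_memories):
    """Every lane runs the same code: load a value, then branch if it is > 0."""
    taken = []
    for lane, addr in enumerate(address_registers):
        value = local_memories[lane][addr]  # same load, per-lane operand
        taken.append(value > 0)             # same branch, per-lane outcome
    return taken

# Lane 2 holds a different register value, so it loads a different word.
address_registers = [0, 0, 1, 0]
local_memories = [[5, -1], [5, -1], [5, -1], [5, -1]]

decisions = run_load_and_branch(address_registers, local_memories)
print(decisions)  # [True, True, False, True]
```

Only one of the two outcomes matches what a single sequential thread would have done, which is the divergence the paragraph above describes.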

Similarly, assume that the counterpart threads of the sequential code are to execute the same store instruction. The memory being stored to differs per thread for the same reasons described above for the load instruction.

Memory locations that are not modified in sequential execution would then be erroneously modified in parallel execution.

As another example, assume that the counterpart threads of the sequential code are to store data simultaneously to the same location in shared memory. The shared memory could again be overwhelmed and, as a result, corrupted. The problems illustrated in these two examples are sometimes encountered in vector operations.

As yet another example, assume that an exception handler is a resource shared among the various lanes. Regions of sequential code often contain numerous instructions that can potentially raise exceptions. When these instructions are executed in parallel and an exception occurs, the parallel processes could raise exceptions simultaneously and overwhelm the shared exception handler, which would expect at most one exception and possibly none at all.
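A minimal Python sketch of the exception scenario follows. The class `SharedExceptionHandler`, the function `divide_in_lanes` and the four-lane group are hypothetical names and sizes introduced for illustration; the application does not specify how the handler is implemented.

```python
# Illustrative model only: a single handler shared by all lanes.

class SharedExceptionHandler:
    def __init__(self):
        self.pending = []  # exceptions raised and not yet handled

    def raise_exception(self, lane, reason):
        self.pending.append((lane, reason))

def divide_in_lanes(handler, numerator, denominator, active_lanes):
    """Execute the same division in every active lane."""
    results = {}
    for lane in active_lanes:
        if denominator == 0:
            handler.raise_exception(lane, "divide by zero")
            results[lane] = None
        else:
            results[lane] = numerator // denominator
    return results

# All four counterpart threads execute the faulting division.
all_lanes = SharedExceptionHandler()
divide_in_lanes(all_lanes, 8, 0, active_lanes=[0, 1, 2, 3])

# Only one thread (lane 0, assumed here to be the master) executes it.
master_only = SharedExceptionHandler()
divide_in_lanes(master_only, 8, 0, active_lanes=[0])

print(len(all_lanes.pending), len(master_only.pending))  # 4 1
```

Restricting the instruction to one lane keeps the handler at the single exception it expects.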

It is therefore recognized herein that the assumption that local memory states necessarily remain aligned while counterpart threads of sequential code execute is untenable. It is further recognized herein that certain operations, including not only loads from and stores to shared memory but also divisions and other instructions that can potentially raise exceptions, can corrupt shared memory or cause local memory states to diverge as a "side effect." It is further recognized herein that a mechanism is needed to ensure that the semantics of sequential code are not altered by diverging thread-local memory states.

Accordingly, various embodiments of a system and method for executing sequential code using a group of threads are introduced herein. Viewed at a very high level, the various embodiments cause counterpart-thread execution of sequential code to emulate master-thread execution of the sequential code.

According to the various embodiments, one of the counterpart threads is designated a master thread, and the other threads are designated slave threads. Certain instructions in the slave threads (typically those that may, or actually do, use shared resources) are then predicated on corresponding instructions in the master thread, and only the corresponding instructions in the master thread are executed. When a branch instruction is encountered in the master thread, the branch conditions in the master thread are broadcast to the slave threads.
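The effect of predicating a shared-resource instruction on the master thread can be sketched as follows. This Python model is an illustration under stated assumptions, not the claimed implementation: lane 0 is assumed to host the master thread, the lanes are stepped serially, and the function names are hypothetical.

```python
# Illustrative model only: a store to one shared-memory location.

MASTER_LANE = 0  # assumption: lane 0 hosts the master thread

def store_all_lanes(shared_memory, address, values_per_lane):
    """Unpredicated: every lane writes; in this serial model the last lane wins."""
    for value in values_per_lane:
        shared_memory[address] = value

def store_master_only(shared_memory, address, values_per_lane):
    """Predicated: the store takes effect only in the master lane."""
    for lane, value in enumerate(values_per_lane):
        if lane == MASTER_LANE:
            shared_memory[address] = value

lane_values = [10, 11, 12, 13]  # diverged per-lane values (made up)
unpredicated = [0, 0, 0, 0]
predicated = [0, 0, 0, 0]
store_all_lanes(unpredicated, 2, lane_values)
store_master_only(predicated, 2, lane_values)

print(unpredicated)  # [0, 0, 13, 0]
print(predicated)    # [0, 0, 10, 0]
```

With the predicate in place, the shared location holds the value a single sequential thread would have written, regardless of what the slave lanes computed.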

FIG. 1 is a block diagram of a SIMT processor 100 operable to contain or carry out a system or method for executing sequential code using a group of threads. The SIMT processor 100 includes multiple thread processors, or cores 106, organized into thread groups 104, or "warps." The SIMT processor 100 contains J thread groups 104-1 through 104-J, each having K cores 106-1 through 106-K. In certain embodiments, the thread groups 104-1 through 104-J may be further organized into one or more thread blocks 102. One specific embodiment has thirty-two cores 106 per thread group 104. Other embodiments may include fewer or more cores per thread group, up to several tens of thousands. Certain embodiments organize the cores 106 into a single thread group 104, while other embodiments have hundreds or even thousands of thread groups 104. In alternative embodiments of the SIMT processor 100, the cores 106 may be organized into thread groups 104 only, omitting the thread-block level of organization.
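The J-by-K organization can be made concrete with a small indexing helper. The sketch below is purely illustrative: the flat core index and the function `locate_core` are assumptions, and the 1-based numbering simply mirrors the reference numerals 104-1 through 104-J and 106-1 through 106-K.

```python
# Illustrative helper only: map a flat core index onto the group/core
# numbering used in FIG. 1 (the flat index itself is an assumption).

def locate_core(flat_index, cores_per_group):
    """Map a 0-based flat core index to 1-based (group, core) numbers."""
    group, core = divmod(flat_index, cores_per_group)
    return group + 1, core + 1

K = 32  # the specific embodiment with thirty-two cores per thread group
print(locate_core(0, K))   # (1, 1)
print(locate_core(70, K))  # (3, 7)
```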

The SIMT processor 100 further includes a pipeline control unit 108, shared memory 110 and an array of local memories 112-1 through 112-J associated with the thread groups 104-1 through 104-J. The pipeline control unit 108 distributes tasks to the various thread groups 104-1 through 104-J over a data bus 114. The pipeline control unit 108 creates, manages, schedules and executes the thread groups 104-1 through 104-J and provides a mechanism to synchronize them. Certain embodiments of the SIMT processor 100 are found in graphics processing units (GPUs). Some GPUs provide a group synchronization instruction, such as bar.sync in GPUs manufactured by Nvidia Corporation of Santa Clara, California. Certain embodiments support execution of divergent conditional branches by thread groups. Given a branch, some threads within a thread group 104 take the branch because the branch condition predicate evaluates to "true," and other threads fall through to the next instruction because the branch condition predicate evaluates to "false." The pipeline control unit 108 tracks active threads by first executing one of the paths, either the taken branch or the not-taken branch, and then executing the alternate path, enabling the appropriate threads for each.
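The two-pass handling of a divergent branch can be sketched in software. The following Python model is an assumption-laden illustration: the active masks are plain lists of booleans, the two paths are passed in as functions, and the choice to run the taken path first is arbitrary, since the text allows either order.

```python
# Illustrative model only: serialize the two sides of a divergent branch,
# enabling only the lanes whose predicate selects the path being run.

def execute_divergent_branch(predicates, taken_fn, not_taken_fn, values):
    taken_mask = list(predicates)
    not_taken_mask = [not p for p in predicates]
    out = list(values)
    # Pass 1: taken path, with only the "true" lanes active.
    for lane, active in enumerate(taken_mask):
        if active:
            out[lane] = taken_fn(out[lane])
    # Pass 2: not-taken path, with only the "false" lanes active.
    for lane, active in enumerate(not_taken_mask):
        if active:
            out[lane] = not_taken_fn(out[lane])
    return out

values = [1, 2, 3, 4]
predicates = [v % 2 == 0 for v in values]  # even lanes take the branch
result = execute_divergent_branch(
    predicates, lambda x: x * 10, lambda x: x + 1, values
)
print(result)  # [2, 20, 4, 40]
```

Each lane is updated by exactly one of the two paths, but the group as a whole spends time on both, which is why divergence inside a thread group is costly.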

Continuing with the embodiment of FIG. 1, the cores 106 within a thread group execute in parallel with one another. The thread groups 104-1 through 104-J communicate with the shared memory 110 over a memory bus 116. The thread groups 104-1 through 104-J communicate with the local memories 112-1 through 112-J, respectively, over local buses 118-1 through 118-J. For example, thread group 104-J uses local memory 112-J by communicating over local bus 118-J. Certain embodiments of the SIMT processor 100 allocate a shared portion of the shared memory 110 to each thread block 102 and allow all thread groups 104 within a thread block 102 to access that shared portion. Certain embodiments include thread groups 104 that use only local memory 112. Many other embodiments include thread groups 104 that balance use of local memory 112 and shared memory 110.

The embodiment of FIG. 1 includes a master thread group 104-1. Each of the remaining thread groups 104-2 through 104-J is considered a "worker" thread group. The master thread group 104-1 includes numerous cores, one of which is a master core 106-1, which ultimately executes a master thread. Programs executed on the SIMT processor 100 are structured as a sequence of kernels. Typically, each kernel completes execution before the next kernel begins. In certain embodiments, the SIMT processor 100 may execute multiple kernels in parallel, depending on the size of the kernels. Each kernel is organized as a hierarchy of threads to be executed on the cores 106.

FIG. 2 is a block diagram of one embodiment of a system 200 for executing sequential code using a group of threads. The system 200 includes a program 202 having a sequential region 204 and a parallel region 206, a memory 208, a predication module 210, a thread identifier 212, a thread launcher 214 and a thread group 104. The thread group 104 of FIG. 1 consists of K cores 106-1 through 106-K, or lanes.

The thread group 104 is coupled to the memory 208, which is partitioned into sections associated with the respective cores 106-1 through 106-K. The thread launcher 214 creates processing threads in the cores 106-1 through 106-K. A single core, often the first, core 106-1, is assigned to execute the master thread. The remaining threads are worker threads. Conventionally, the master thread executes the sequential region 204 of the program 202, and the parallel region 206 is executed in the worker threads. When the parallel region 206 is reached, the thread launcher 214 creates the worker threads needed to carry out the parallel processing.

In the embodiment of FIG. 2, the sequential region 204 of the program 202 is processed by the predication module 210. The predication module marks certain operations that may be executed only by the master thread. The predication is implemented through the thread identifier 212, which designates the master thread to process those operations. The remainder of the sequential region 204 is executed in all threads of the thread group 104. When the worker threads reach a predicated segment of the sequential region 204, they skip it and continue until a branch is reached. At a branch, the worker threads wait for direction from the master thread, because only the master thread can reliably evaluate the branch conditions. Once the master thread has processed the predicated segment, reached the branch, and evaluated the branch conditions, it broadcasts the branch conditions to each of the worker threads. The worker threads can then resume their progress through the sequential region 204 of the program 202.
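The flow described for FIG. 2 can be approximated by a small lockstep simulation. The following Python sketch is illustrative only and rests on several assumptions that do not come from the application: the `Lane` class, the names `regs` and `local`, the choice of lane 0 as the master, and the sample condition `x > 0`.

```python
from dataclasses import dataclass, field

MASTER_ID = 0  # assumption: lane 0 hosts the master thread


@dataclass
class Lane:
    tid: int
    regs: dict = field(default_factory=dict)   # per-thread registers
    local: dict = field(default_factory=dict)  # per-lane local memory


def run_sequential_region(lanes, shared):
    """Walk all lanes through the same sequential code (lanes indexed by tid)."""
    # Predicated segment: a load, executed by the master thread only.
    # Worker threads skip it, so they never see the loaded value.
    for lane in lanes:
        if lane.tid == MASTER_ID:
            lane.regs["x"] = shared["x"]

    # Branch: only the master holds x, so only it can evaluate the
    # condition. It broadcasts the result into every lane's local memory.
    cond = lanes[MASTER_ID].regs["x"] > 0
    for lane in lanes:
        lane.local["cond"] = cond

    # Every thread now takes the same direction and keeps progressing.
    for lane in lanes:
        lane.regs["path"] = "then" if lane.local["cond"] else "else"
```

Running this with four lanes and `shared = {"x": 5}` leaves every lane on the `"then"` path, while only the master's registers contain `x`.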

FIG. 3 is a flow diagram of one embodiment of a method of executing sequential code using a group of threads. The sequential code may be part of a vector operation, part of a program developed according to an OpenMP or OpenACC programming model, or associated with another application of any type.

The method begins in a start step 310. In a step 320, a group of counterpart threads of the sequential code is created, one of the counterpart threads being a master thread and the remaining counterpart threads being subordinate threads. In a step 330, certain instructions of the sequential code are executed only in the master thread; the corresponding instructions in the subordinate threads are predicated upon those certain instructions. In various embodiments, the certain instructions may be load instructions, store instructions, divide instructions, or any other instruction that may produce, or may be perceived as producing, side effects. In one embodiment, the corresponding instructions are predicated using a condition based on a thread identifier.
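Why instructions with side effects need the guard of step 330 can be seen in a minimal sketch. The Python below is an assumption-laden illustration, not the claimed implementation: `execute_store`, the `counter` location, and the test `tid == 0` as the thread-identifier condition are all hypothetical.

```python
def execute_store(num_threads, predicated):
    """Run one store instruction in every thread of a group; return memory state."""
    memory = {"counter": 0}
    for tid in range(num_threads):
        # Predicate based on the thread identifier: when predication is
        # enabled, only thread 0 (the assumed master) performs the store.
        if not predicated or tid == 0:
            memory["counter"] += 1
    return memory["counter"]
```

With eight threads, the predicated version updates `counter` once, as sequential semantics require, while the unpredicated version updates it eight times.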

In a step 340, branch conditions in the master thread are broadcast to the subordinate threads. In one embodiment, the branch conditions are broadcast before a branch instruction is executed in the master thread, and the corresponding branch instructions are executed in the subordinate threads only after the broadcast. The method ends in an end step 350.
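The ordering in step 340 (broadcast first, branch afterwards) can be sketched with ordinary operating-system threads. This is a loose analogy under stated assumptions rather than SIMT hardware behavior: `run_group`, the `broadcast` dictionary, and the use of `threading.Event` as the broadcast signal are hypothetical choices for illustration.

```python
import threading


def run_group(num_threads, value):
    """Return the branch direction taken by each thread in the group."""
    cond_ready = threading.Event()  # set once the condition has been broadcast
    broadcast = {}                  # stand-in for the broadcast branch condition
    paths = [None] * num_threads

    def master():
        broadcast["cond"] = value > 0  # evaluate the branch condition
        cond_ready.set()               # broadcast before executing the branch
        paths[0] = "then" if broadcast["cond"] else "else"

    def subordinate(tid):
        cond_ready.wait()              # branch only after the broadcast
        paths[tid] = "then" if broadcast["cond"] else "else"

    # Subordinates are started first to show that they block at the branch.
    threads = [
        threading.Thread(target=subordinate, args=(tid,))
        for tid in range(1, num_threads)
    ]
    threads.append(threading.Thread(target=master))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return paths
```

Because every subordinate waits on the event, no thread can branch on a condition the master has not yet published, so all threads take the same direction.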

Embodiments of the present invention include the following concepts:

Concept 1. A system for executing sequential code, comprising: (i) a pipeline control unit configured to create a group of counterpart threads of the sequential code, one of the counterpart threads being a master thread and the remaining counterpart threads being subordinate threads; and (ii) lanes configured to: execute certain instructions of the sequential code only in the master thread, corresponding instructions in the subordinate threads being predicated upon the certain instructions, and broadcast branch conditions in the master thread to the subordinate threads.

Concept 2. The system as described in Concept 1, wherein local memories associated with the lanes that execute the subordinate threads are further configured to store the branch conditions.

Concept 3. The system as described in Concept 1 or 2, wherein the certain instructions are selected from the group consisting of: load instructions, store instructions, and exception-inducing instructions.

Concept 4. The system as described in any one of Concepts 1-3, wherein a lane executing the master thread is further configured to broadcast the branch conditions before a branch instruction is executed in the master thread, and wherein lanes executing the subordinate threads are further configured to execute corresponding branch instructions in the subordinate threads only after the lane has broadcast the branch conditions.

Concept 5. The system as described in any one of Concepts 1-4, wherein the pipeline control unit is further configured to predicate the corresponding instructions using a condition based on a thread identifier.

Concept 6. The system as described in any one of Concepts 1-5, wherein the sequential code is part of a vector operation.

Concept 7. The system as described in any one of Concepts 1-6, wherein the sequential code is part of an application selected from the group consisting of: an OpenMP program and an OpenACC program.

Concept 8. A method of executing sequential code, comprising: (i) creating a group of counterpart threads of the sequential code, one of the counterpart threads being a master thread and the remaining counterpart threads being subordinate threads; (ii) executing certain instructions of the sequential code only in the master thread, corresponding instructions in the subordinate threads being predicated upon the certain instructions; and (iii) broadcasting branch conditions in the master thread to the subordinate threads.

Concept 9. The method as described in Concept 8, further comprising: storing the branch conditions in local memories associated with the subordinate threads.

Concept 10. The method as described in Concept 8 or 9, wherein the certain instructions are selected from the group consisting of: load instructions, store instructions, and exception-inducing instructions.

Concept 11. The method as described in any one of Concepts 8-10, wherein the broadcasting is carried out before a branch instruction is executed in the master thread, the method further comprising: executing corresponding branch instructions in the subordinate threads after the broadcasting is carried out.

Concept 12. The method as described in any one of Concepts 8-11, wherein the executing comprises: predicating the corresponding instructions using a condition based on a thread identifier.

Concept 13. The method as described in any one of Concepts 8-12, wherein the sequential code is part of a vector operation.

Concept 14. The method as described in any one of Concepts 8-13, wherein the sequential code is part of an application selected from the group consisting of: an OpenMP program and an OpenACC program.

Concept 15. A single-instruction, multiple-thread (SIMT) processor, comprising: (i) lanes; (ii) local memories associated with corresponding ones of the lanes; (iii) a shared memory device for the lanes; and (iv) a pipeline control unit configured to create a group of counterpart threads of sequential code and to cause the group to be executed in the lanes, one of the counterpart threads being a master thread and the remaining counterpart threads being subordinate threads, the lanes being configured to: execute certain instructions of the sequential code only in the master thread, corresponding instructions in the subordinate threads being predicated upon the certain instructions, and broadcast branch conditions in the master thread to the subordinate threads.

Concept 16. The SIMT processor as described in Concept 15, wherein the local memories associated with the lanes that execute the subordinate threads are further configured to store the branch conditions.

Concept 17. The SIMT processor as described in Concept 15 or 16, wherein the certain instructions are selected from the group consisting of: load instructions, store instructions, and exception-inducing instructions.

Concept 18. The SIMT processor as described in any one of Concepts 15-17, wherein a lane executing the master thread is further configured to broadcast the branch conditions before a branch instruction is executed in the master thread, and wherein lanes executing the subordinate threads are further configured to execute corresponding branch instructions in the subordinate threads only after the lane has broadcast the branch conditions.

Concept 19. The SIMT processor as described in any one of Concepts 15-18, wherein the pipeline control unit is further configured to predicate the corresponding instructions using a condition based on a thread identifier.

Concept 20. The SIMT processor as described in any one of Concepts 15-19, wherein the sequential code is part of an application selected from the group consisting of: an OpenMP program and an OpenACC program.

Those skilled in the art to which this application relates will appreciate that other and further additions, deletions, substitutions, and modifications may be made to the described embodiments.

Claims

A system for executing sequential code, comprising: a pipeline control unit configured to create a group of counterpart threads of the sequential code, one of the counterpart threads being a master thread and the remaining counterpart threads being subordinate threads; and lanes configured to: execute certain instructions of the sequential code only in the master thread, corresponding instructions in the subordinate threads being predicated upon the certain instructions, and broadcast branch conditions in the master thread to the subordinate threads.

The system of claim 1, wherein local memories associated with the lanes that execute the subordinate threads are further configured to store the branch conditions.

The system of claim 1 or 2, wherein the certain instructions are selected from the group consisting of: load instructions, store instructions, and exception-inducing instructions.

The system of any one of claims 1-3, wherein a lane executing the master thread is further configured to broadcast the branch conditions before a branch instruction is executed in the master thread, and wherein lanes executing the subordinate threads are further configured to execute corresponding branch instructions in the subordinate threads only after the lane has broadcast the branch conditions.

The system of any one of claims 1-4, wherein the pipeline control unit is further configured to predicate the corresponding instructions using a condition based on a thread identifier.

The system of any one of claims 1-5, wherein the sequential code is part of a vector operation.

The system of any of claims 1-6, wherein the sequential code is part of an application selected from the group: an OpenMP program, and an OpenACC program.

A single-instruction, multiple-thread (SIMT) processor, comprising: lanes; local memories associated with corresponding ones of the lanes; a shared memory device for the lanes; and a pipeline control unit configured to create a group of counterpart threads of sequential code and to cause the group to be executed in the lanes, one of the counterpart threads being a master thread and the remaining counterpart threads being subordinate threads, the lanes being configured to: execute certain instructions of the sequential code only in the master thread, corresponding instructions in the subordinate threads being predicated upon the certain instructions, and broadcast branch conditions in the master thread to the subordinate threads.

The SIMT processor of claim 8, wherein the local memories associated with the lanes that execute the subordinate threads are further configured to store the branch conditions.

The SIMT processor of claim 8 or 9, wherein the certain instructions are selected from the group consisting of: load instructions, store instructions, and exception-inducing instructions.