DE102020127707A1 - HIGH PERFORMANCE SYNCHRONIZATION MECHANISMS FOR COORDINATING OPERATIONS ON A COMPUTER SYSTEM

Info

Publication number: DE102020127707A1
Application number: DE102020127707.5A
Authority: DE
Inventors: Olivier Giroux; Jack Choquette; Ronny Krashinsky; Steve Heinrich; Xiaogang Qiu; Shirish Gadre
Original assignee: Nvidia Corp
Current assignee: Nvidia Corp
Priority date: 2019-10-29
Filing date: 2020-10-21
Publication date: 2021-04-29

Abstract

To synchronize operations of a computing system, a new type of synchronization barrier is disclosed. In one embodiment, the disclosed synchronization barrier allows certain synchronization mechanisms, such as "Arrive" and "Wait," to be split to permit greater flexibility and efficiency in coordinating synchronization. In another embodiment, the disclosed synchronization barrier allows hardware components, such as certain copy or direct memory access (DMA) engines, to be synchronized with software-based threads.

Description

BACKGROUND & SUMMARY

Massively parallel high-performance multithreaded multicore processing systems (systems that contain many processing cores operating in parallel) process data much more quickly than was possible in the past. These processing systems break down complex computations into smaller tasks that are performed concurrently by parallel processing cores. This "divide and conquer" approach allows complex computations to be performed in a small fraction of the time that would be required if only one or a few processors worked on the same computations in sequence. But such parallel processing also creates the need for communication and coordination between parallel execution threads or blocks.

One way for different execution processes to coordinate their states with one another is barrier synchronization. In barrier synchronization, each process in a collection of parallel-executing processes typically waits at a barrier until all other processes in the collection catch up. No process can proceed past the barrier until all processes have reached it.

FIGS. 1A-1H together form a flipchart animation that illustrates an example of such barrier synchronization. FIG. 1A shows threads or other execution processes T1, T2, ... TK. Three threads are shown, but there could be hundreds or thousands of threads. The threads begin at a common point (e.g., an initialization point, a previous barrier, etc., designated here for purposes of illustration as the "start line").

One thread TK is a "hare": it executes faster than the other threads and proceeds more quickly toward a barrier (indicated here graphically by a railroad crossing gate and the designation "arrive-wait point," the significance of which is explained below).

Another thread T1 is a "tortoise": it executes more slowly than the other threads and proceeds more slowly toward the barrier.

FIGS. 1B, 1C and 1D show the various threads proceeding toward the barrier at different paces. FIG. 1D shows the "hare" thread TK arriving at the barrier before the "tortoise" thread T1. Because the "hare" thread TK cannot proceed past the barrier until the "tortoise" thread T1 also arrives there, the "hare" thread would wait at the barrier (in the prior art; but see below for the "split" arrive-wait functionality). This could involve many cycles of delay: if, for example, the "tortoise" thread T1 is slow because it is waiting for service from main memory, the "hare" thread TK may have to wait a long time (FIG. 1E) before the "tortoise" thread catches up and reaches the barrier.

As soon as the last straggling "tortoise" thread T1 reaches the barrier (FIG. 1F), the barrier unblocks (shown in FIG. 1G by the opening of the railroad crossing gate) and all threads can proceed past the barrier.
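The classic blocking behavior animated above can be approximated in ordinary host code. The following is a minimal C++ sketch, not the disclosed hardware mechanism: the class `ClassicBarrier` and the function `count_threads_seeing_full_sum` are hypothetical names introduced only for illustration, and a mutex with a condition variable stands in for whatever the platform actually uses.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Classic blocking barrier: arriving and waiting are one combined operation.
// (Hypothetical helper for illustration only.)
class ClassicBarrier {
public:
    explicit ClassicBarrier(std::size_t expected) : expected_(expected) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t my_generation = generation_;
        if (++arrived_ == expected_) {
            // Last arrival (the "tortoise"): reset for reuse and release everyone.
            arrived_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            // Early arrival (a "hare"): blocked until the generation changes.
            cv_.wait(lock, [&] { return generation_ != my_generation; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const std::size_t expected_;
    std::size_t arrived_ = 0;
    std::size_t generation_ = 0;
};

// Each thread publishes one value, crosses the barrier, then sums all values.
// Returns how many threads observed the complete sum after the barrier.
int count_threads_seeing_full_sum(int num_threads) {
    std::vector<int> slots(num_threads, 0);
    std::vector<int> seen(num_threads, 0);
    ClassicBarrier barrier(static_cast<std::size_t>(num_threads));
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            slots[t] = t + 1;           // work before the barrier
            barrier.arrive_and_wait();  // block until every thread has arrived
            int sum = 0;
            for (int v : slots) sum += v;
            seen[t] = sum;
        });
    }
    for (auto& th : threads) th.join();
    const int expected_sum = num_threads * (num_threads + 1) / 2;
    int ok = 0;
    for (int s : seen) ok += (s == expected_sum) ? 1 : 0;
    return ok;
}
```

Calling `count_threads_seeing_full_sum(4)` should report that all four threads saw every value, because no thread can read the slots until the last one has arrived. The cost is that early threads sit idle inside `arrive_and_wait()`.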

An example of a useful application that benefits from barrier synchronization is "asynchronous compute." With asynchronous compute, GPU utilization is increased by scheduling tasks out of order rather than in strict sequence, so that "later" (in the sequence) computations can be performed at the same time as "earlier" (in the sequence) computations. For example, when rendering graphics, asynchronous compute may allow a shader to execute concurrently with other work instead of sequentially with other workloads. While the GPU API may be designed to assume that most or all calls are independent, the developer is also given control over how tasks are scheduled and can implement barriers to ensure correctness, e.g., when one operation depends on the result of another. See, for example, U.S. Patent Nos. 9,117,284 and 10,217,183.

Hardware-based synchronization mechanisms have been included in GPUs to support such kinds of barrier synchronization functions. See, e.g., Xiao et al., "Inter-Block GPU Communication via Fast Barrier Synchronization," 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS) (April 19-23, 2010). Compute-capable GPUs with such hardware-based synchronization capabilities have usually been programmed in the bulk-synchronous style: wide parallel tasks with barrier synchronization within, and fork/join between. See, for example, U.S. Patent Application Publication No. 2015020558.

In modern GPU architectures, many execution threads execute concurrently, and many warps, each comprising many threads, also execute concurrently. When threads in a warp need to perform more complicated communication or collective operations, the developer can use, for example, NVIDIA's CUDA "__syncwarp" primitive to synchronize threads. The __syncwarp primitive initializes hardware mechanisms that cause an executing thread to wait, before resuming execution, until all threads specified in a mask have called the primitive with the same mask. For more details see, e.g., U.S. Patent Nos. 8,381,203; 9,158,595; 9,442,755; 9,448,803; 10,002,031; and 10,013,290; see also https://devblogs.nvidia.com/usingcuda-warp-level-primitives/ and https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions.

Although hardware-implemented barriers have proven useful, it is sometimes helpful for a program to use more than one barrier at the same time. For example, a program can potentially use a first synchronization barrier to block a first group of threads and a second, different synchronization barrier to block another group of threads (or sometimes the same synchronization barrier is reused to block the same group of threads again and again as they progress down their execution paths). In the past, to perform multiple barrier operations, a software developer typically had to indicate to the compiler in advance how many barriers were needed. In systems where synchronization barriers were hardware-implemented, a limited number of synchronization barriers was available. Some programs needed, or could have used, more synchronization barriers than the hardware platform supported in hardware.

Because of additional uses of and demand for synchronization barriers, there is a need to improve the allocation of synchronization barriers. In particular, certain past implementations of and approaches to hardware-accelerated barriers had significant shortcomings:

1. Programs that required more than one physical hardware barrier had difficulty allocating them.
2. Barriers with the classic "arrive-and-wait" interface do not hide synchronization latency well (in terms of the flipchart animation of FIGS. 1A-1H, the "hare" thread may have to wait a while, doing nothing, until the "tortoise" threads catch up).
3. Copy engines (such as direct memory access (DMA) units) generally cannot participate directly in hardware-based synchronization because they are not software threads.

It has long been possible to implement synchronization barriers in software, but software-implemented barriers do not necessarily provide the same level of performance as hardware-implemented barriers. For example, some developers in the past used hardware to implement as many barriers as the platform hardware supported and, when more (or different kinds of) barriers were needed, implemented the additional barriers in software. Developers who implemented synchronization barriers in software often suffered a loss of performance. In particular, over-allocating barriers could mean fewer execution paths and correspondingly lower performance. It was not always easy for developers to judge the trade-off: whether performance would improve by using more barriers and fewer tasks.

There is a need to improve the allocation of hardware-accelerated and/or hardware-supported synchronization barriers in a way that provides the flexibility of software allocation without adversely affecting performance.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1H together form a time sequence of images comprising a flipchart animation that illustrates the use of a synchronization barrier (to view the animation, start at FIG. 1A of an electronic copy of this patent and repeatedly press the "page down" key);
FIG. 2 shows an example non-limiting instruction stream;
FIGS. 2A & 2B are block diagrams of an example non-limiting system that uses the present barrier synchronization technique;
FIG. 3 is a block diagram of an example non-limiting memory-based synchronization barrier primitive;
FIG. 4 is a block diagram of an example non-limiting combination of hardware and software functions that can be used to manage the barrier of FIG. 3;
FIGS. 5A-5F show example flowcharts of operations associated with barrier functions;
FIGS. 6 & 7 show illustrative examples of two-thread barriers;
FIG. 8 shows an illustrative example of three-stage pipeline streaming code; and
FIG. 9 shows an example non-limiting loop code structure.

DETAILED DESCRIPTION OF EXAMPLE NON-LIMITING EMBODIMENTS

A new type of barrier is introduced that solves the problems described above:

1. It is implemented in memory and therefore allocates like memory.
2. The "arrive" and "wait" operations are split so that unrelated work can be performed in between.
3. Asynchronous copy hardware of the same streaming multiprocessor (SM) can participate as a virtual or "moral" thread.
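Two of the points listed above, memory-resident barriers and a copy unit participating like a thread, can be loosely illustrated in host code. The following C++ sketch rests on stated assumptions: `MemoryBarrierWord` and `copy_with_barrier` are hypothetical names, a `std::thread` merely stands in for asynchronous copy hardware, and nothing here reflects the actual barrier layout or instructions of any GPU.

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical memory-resident barrier: its state is just a word in memory,
// so a program can allocate as many barriers as it has memory for.
struct MemoryBarrierWord {
    std::atomic<int> pending{0};  // arrivals still outstanding

    void init(int expected) { pending.store(expected, std::memory_order_relaxed); }
    void arrive() { pending.fetch_sub(1, std::memory_order_release); }
    bool complete() const { return pending.load(std::memory_order_acquire) == 0; }
    void wait() const {
        while (!complete()) std::this_thread::yield();
    }
};

// Allocates num_barriers barriers, then uses barrier number `which` to
// synchronize a software thread with a stand-in "copy unit".
// Returns true if the waiting thread observes the fully copied data.
bool copy_with_barrier(std::size_t num_barriers, std::size_t which) {
    std::vector<MemoryBarrierWord> barriers(num_barriers);  // allocated like memory
    MemoryBarrierWord& bar = barriers.at(which);
    bar.init(1);  // one expected arrival: the copy unit

    const char src[] = "payload";
    const std::size_t bytes = sizeof(src);
    char dst[sizeof(src)] = {};

    // Assumption: a std::thread plays the role of the asynchronous copy hardware.
    std::thread copy_unit([&] {
        std::memcpy(dst, src, bytes);  // the "asynchronous" copy
        bar.arrive();                  // arrives as though it were a thread
    });

    bar.wait();  // the software thread waits on the same barrier
    const bool ok = std::memcmp(dst, src, bytes) == 0;
    copy_unit.join();
    return ok;
}
```

For example, `copy_with_barrier(8, 3)` allocates eight barriers at run time and uses the fourth. The number of barriers is a run-time quantity bounded by memory in this sketch, not a fixed count of hardware slots.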

Implementing barriers in memory is entirely feasible and is commonly done in software. Split software barriers are less common, but they too exist in practice. Hardware acceleration is provided for these idiom(s), and hardware copy units are integrated with the hardware acceleration as though the hardware copy units were "moral" threads.

The programming effort required to create a barrier with richer functionality and good performance is substantially reduced. It also becomes possible to introduce more asynchronous copy operations to the streaming multiprocessors (SMs) of a GPU, by providing an innovative way to synchronize with the copy operations that improves SM performance by offloading work from the core threads.

The present example non-limiting technology thus moves additional functionality into barriers, which in turn may increase the prevalence of synchronization barriers. In particular, much more code than in the past will potentially need to (and be able to) use multiple synchronization barriers.

Synchronization function with split arrive-wait barrier

As explained above, barrier synchronization is often or typically defined as a construct in which a given set of threads or other processes is blocked at the synchronization point, and when all threads or other processes in the set have arrived at the synchronization point, they are all released. See FIGS. 1A-1H.

In particular, in many prior barrier synchronization implementations, threads such as the "hare" thread of FIGS. 1A et seq. that arrive early at the synchronization point simply waited and did no useful work until they were unblocked by the arrival of later-finishing threads. A typical scenario was that, as more and more threads arrived at the synchronization point, soon only one straggler thread remained outstanding, and all other threads waited on it before they could all be released and move on to the next processing phase.

Suppose a programmer implements a barrier in software, i.e., the programmer/developer has written code to implement the barrier. An economical operation would be "arrive and wait." That is what an OpenMP barrier implements and how earlier CUDA barriers were implemented. In an example implementation, each thread arriving at the synchronization point would announce, "I have arrived." The program would then count the number of threads that have arrived. If not all threads have arrived, the program would block. Such a system would simply wait and keep polling the count until the count reaches the correct value indicating that all threads have arrived. The polling loop is wasteful: the threads do no useful work while they wait, the processor spends a lot of time polling the count, and the system thereby consumes resources that could otherwise be used for useful work.
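The counting-and-polling scheme just described can be sketched as follows. This is an illustrative C++ approximation under stated assumptions, not code from OpenMP or CUDA: the names `PollingBarrier` and `run_polling_barrier` are hypothetical, the barrier is single-use, and the `wasted_polls` counter exists only to make the cost of polling visible.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Single-use counting barrier that polls (hypothetical, for illustration).
class PollingBarrier {
public:
    explicit PollingBarrier(int expected) : expected_(expected) {}

    void arrive_and_wait() {
        arrived_.fetch_add(1, std::memory_order_acq_rel);  // "I have arrived"
        // Poll the count until every participant has arrived.
        while (arrived_.load(std::memory_order_acquire) < expected_) {
            wasted_polls_.fetch_add(1, std::memory_order_relaxed);
            std::this_thread::yield();  // no useful work is done here
        }
    }

    int arrived() const { return arrived_.load(std::memory_order_acquire); }
    long wasted_polls() const { return wasted_polls_.load(std::memory_order_relaxed); }

private:
    const int expected_;
    std::atomic<int> arrived_{0};
    std::atomic<long> wasted_polls_{0};
};

// Returns the number of threads that crossed the barrier only after the
// arrival count had reached num_threads.
int run_polling_barrier(int num_threads) {
    PollingBarrier barrier(num_threads);
    std::atomic<int> crossed_correctly{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            barrier.arrive_and_wait();
            if (barrier.arrived() == num_threads) crossed_correctly.fetch_add(1);
        });
    }
    for (auto& th : threads) th.join();
    return crossed_correctly.load();
}
```

The value of `wasted_polls()` depends on scheduling and is not predictable. The point of the sketch is only that every iteration of the loop is time in which the waiting thread accomplishes nothing.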

Some more recent software-based approaches have decoupled the two events of arriving at the synchronization point and blocking at the synchronization point. Example non-limiting embodiments similarly decouple how a thread or other process arrives at a synchronization point from how it waits there. In example non-limiting embodiments, this technique is used so that threads first arrive at the synchronization point, where their arrival is accounted for. In example non-limiting implementations, however, the threads need not block when they arrive at the synchronization point. Rather, they can perform other work unrelated to the synchronization point (i.e., work that does not need to be synchronized by this particular synchronization barrier, but is instead asynchronous with respect to it). Once they complete this other work and need to return to work that requires synchronization by the barrier, they may block if needed. However, if the other work is significant enough, then by the time a thread completes it, all other threads will already have arrived at the synchronization point and no blocking occurs at all. In such situations, example non-limiting implementations simply note that all threads have arrived at the synchronization point and atomically block and unblock, without actually delaying or stopping any thread from continuing its processing (except for threads that run out of other work they could perform while waiting for release at the synchronization point).

Example non-limiting embodiments thus split the "arrive and wait" function into two different atomic functions: (1) arrive, and (2) wait. The "arrive" part of the function is all of the accounting and other management that typically needs to be done before the barrier is implemented, but it does not cause any thread to actually block. Because the threads are not blocked, they are permitted to do work that is unrelated to the barrier.
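A split of this kind can be sketched in software. The C++ below is an assumption-laden illustration rather than the disclosed implementation: `SplitBarrier`, its `Token` type, and `run_split_barrier` are hypothetical names, and a mutex with a condition variable stands in for the hardware-accelerated bookkeeping. The `arrive()` call only records the arrival and returns a phase token, and `wait()` blocks only if that phase has not yet completed.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier whose arrive and wait steps are separate operations.
// (Hypothetical helper for illustration only.)
class SplitBarrier {
public:
    using Token = unsigned long;
    explicit SplitBarrier(int expected) : expected_(expected) {}

    // Arrive: bookkeeping only, never blocks. Returns the current phase.
    Token arrive() {
        std::lock_guard<std::mutex> lock(mutex_);
        const Token phase = phase_;
        if (++arrived_ == expected_) {
            arrived_ = 0;  // reset for the next phase
            ++phase_;      // completing the phase releases any waiters
            cv_.notify_all();
        }
        return phase;
    }

    // Wait: blocks only while the phase named by the token is incomplete.
    void wait(Token phase) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return phase_ != phase; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const int expected_;
    int arrived_ = 0;
    Token phase_ = 0;
};

// Returns how many threads saw all synchronized data after wait() and also
// finished their unrelated work between arrive() and wait().
int run_split_barrier(int num_threads) {
    std::vector<int> shared(num_threads, 0);
    std::vector<int> ok(num_threads, 0);
    SplitBarrier barrier(num_threads);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            shared[t] = t + 1;                                   // synchronized work
            const SplitBarrier::Token token = barrier.arrive();  // does not block
            long unrelated = 0;                                  // work unrelated to the barrier
            for (int i = 1; i <= 1000; ++i) unrelated += i;
            barrier.wait(token);                                 // blocks only if still needed
            int sum = 0;
            for (int v : shared) sum += v;
            const int expected_sum = num_threads * (num_threads + 1) / 2;
            ok[t] = (sum == expected_sum && unrelated == 500500) ? 1 : 0;
        });
    }
    for (auto& th : threads) th.join();
    int count = 0;
    for (int v : ok) count += v;
    return count;
}
```

If the unrelated work takes long enough, every other thread has already arrived by the time `wait()` is called, and the call returns without blocking.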

For example, a barrier is often used to implement phases of a computation over a data structure. The synchronization barrier is thus used to block threads from using the data structure until all threads have completed their updates to it. In example non-limiting embodiments, threads that have already arrived at the synchronization point may be permitted to perform other useful work that does not involve that data structure while they wait for other threads to arrive at the synchronization point.

Example - Red and Green Data Structures

As an example in FIG. 2, assume there are two data structures. Although patent drawings are not in color, for convenience reference is made to two different data structures that the system updates as the "red" data structure and the "green" data structure. FIG. 2 is actually more general than updating two different data structures (for example, it also applies to updating a single data structure and then performing "other work" unrelated to that data structure while waiting for the updates to that data structure to complete), but it is useful to explain how the process shown could be used to update two different (e.g., "red" and "green") data structures.

A "red" synchronization barrier is created (2100') to provide a barrier for the "red" data structure. Once threads have finished updating the "red" data structure (2702) and have arrived (2200') at the "red" synchronization point after completing their respective operations on the "red" data structure, they can begin other work, such as work on the "green" data structure (2704), and update that data structure while the work on the "red" data structure, which the "red" synchronization barrier protects, runs to completion. An additional "green" synchronization barrier could be used in a similar manner, if desired, to protect the "green" data structure.

Once the threads have finished their work on the "green" data structure, they can return to working on the "red" data structure, but before doing anything more with the "red" data structure they must make sure the previous processing phase has completed. If at that point the previous processing, as managed by the "red" synchronization primitive, has not yet completed, the threads may need to wait (2300') until the processing phase is complete. However, because the "arrive" (2200') and "wait" (2300') atomic operations are separated in time by an arbitrarily long interval, which may span thousands of cycles for example, a great deal of useful work (2704) can be performed collectively by all the threads that have arrived (2200') at the "red" synchronization point but are not blocked and are instead free to do any useful work other than work on the "red" data structure.

It may turn out that the synchronization primitive never actually blocks any thread. If all threads are designed to begin working on the "green" data structure (2704) upon arriving (2200') at the "red" synchronization point, and if the time the threads spend working on the "green" data structure exceeds the time the last straggling thread needs to arrive at the synchronization point after working on the "red" data structure, then none of the threads will block. Instead, when the last straggling thread arrives at the synchronization point, the synchronization primitive transitions its state to the next processing phase, and when the threads check the state of the synchronization point they find that the processing phase has changed and that they have no need to block. Accordingly, no threads block and no cycles are wasted.
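The non-blocking case described above can be illustrated with a minimal sketch. It is not the disclosed hardware mechanism: `ArriveWaitBarrier`, its `phase` counter, and the boolean returned by `wait()` are hypothetical names chosen for illustration, and the participants are interleaved on a single thread so the outcome is deterministic.

```python
import threading


class ArriveWaitBarrier:
    """Toy split barrier: arrive() never blocks; wait() blocks only if needed."""

    def __init__(self, expected):
        self._expected = expected   # participants per phase
        self._pending = expected    # arrivals still outstanding in this phase
        self.phase = 0              # current processing phase
        self._cond = threading.Condition()

    def arrive(self):
        """Record one arrival and return the phase it belongs to (a token)."""
        with self._cond:
            token = self.phase
            self._pending -= 1
            if self._pending == 0:          # last straggler: advance the phase
                self._pending = self._expected
                self.phase += 1
                self._cond.notify_all()
            return token

    def wait(self, token):
        """Return True if the caller had to block, False otherwise."""
        with self._cond:
            had_to_block = self.phase == token
            while self.phase == token:
                self._cond.wait()
            return had_to_block


red = [0, 0, 0]
green = [0, 0, 0]
red_barrier = ArriveWaitBarrier(expected=3)

tokens = []
for i in range(3):
    red[i] = i + 1                        # finish the "red" update
    tokens.append(red_barrier.arrive())   # arrive: bookkeeping only, no blocking
    green[i] = 10 * (i + 1)               # "green" work inside the window

# Every participant arrived before anyone waits, so the phase has already
# advanced and no call to wait() has to block.
blocked = [red_barrier.wait(t) for t in tokens]
print(blocked)
```

In this sketch the "green" work simply stands in for any work long enough to cover the last straggler's arrival; with real concurrent threads the same property would depend on timing.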

Another way of describing this scenario is as follows. The synchronization barrier requires all threads or other processes to arrive at the synchronization point at the same time and does not let any thread or process proceed until all of them have arrived. Instead of an "arrive and wait" scenario, example non-limiting embodiments turn the "arrive" event into a window of execution between the "arrive" (2200') and the "wait" (2300'). No thread may pass the "wait" point (2300') until all threads have at least reached the "arrive" point (2200'). This does not, however, prevent the "arrived" threads from performing other tasks (2704) that are not protected by the synchronization barrier.

In one non-limiting example, when all threads and other processors have arrived at the synchronization point, the synchronization barriers are reset to start the next processing phase. A synchronization barrier according to example non-limiting implementations is thus a reusable object that can be used to manage multiple synchronization points for the same set of threads. Once a first phase completes, the next phase begins and can be managed by the same synchronization barrier, then the phase after that begins and can likewise be managed by the same synchronization barrier, and so on.
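The reuse of one barrier object across successive phases can be sketched as follows. This is only an illustration under stated assumptions: the class `ArriveWaitBarrier`, the phase counter, and the per-phase `data` rows are hypothetical and are not taken from the disclosure.

```python
import threading


class ArriveWaitBarrier:
    """Toy reusable split barrier; resets itself when a phase completes."""

    def __init__(self, expected):
        self._expected = expected
        self._pending = expected
        self.phase = 0
        self._cond = threading.Condition()

    def arrive(self):
        with self._cond:
            token = self.phase
            self._pending -= 1
            if self._pending == 0:
                self._pending = self._expected   # reset for the next phase
                self.phase += 1
                self._cond.notify_all()
            return token

    def wait(self, token):
        with self._cond:
            while self.phase == token:
                self._cond.wait()


N_THREADS, N_PHASES = 4, 3
barrier = ArriveWaitBarrier(expected=N_THREADS)
data = [[0] * N_THREADS for _ in range(N_PHASES)]     # one row per phase
sums = [[None] * N_PHASES for _ in range(N_THREADS)]  # what each thread observed


def worker(i):
    for p in range(N_PHASES):
        data[p][i] = (p + 1) * (i + 1)   # this thread's update for phase p
        token = barrier.arrive()
        barrier.wait(token)              # phase p is complete past this point
        sums[i][p] = sum(data[p])


threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(barrier.phase, sums[0])
```

The same barrier object manages all three synchronization points; each phase writes to its own row so that a fast thread starting the next phase cannot disturb a slower thread still reading the previous one.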

More specifically, in one example non-limiting embodiment, each thread participating in the arrive-wait barrier calls two functions in order, first the ARRIVE function and then the WAIT function. The arrive-wait barrier model divides participation in a barrier within a program into three sections: a BEFORE_ARRIVAL_SECTION, a MIDDLE_SECTION, and an AFTER_WAIT_SECTION, with the following axioms:

BEFORE_ARRIVAL_SECTION
ARRIVE [arrive-wait barrier address]
MIDDLE_SECTION
WAIT [arrive-wait barrier address]
AFTER_WAIT_SECTION

where, for example embodiments:

1. load/store operations in a BEFORE_ARRIVAL_SECTION of a thread are guaranteed to be visible to load/store operations in an AFTER_WAIT_SECTION of other participating threads;
2. load/store operations in an AFTER_WAIT_SECTION of a thread are guaranteed not to be visible to load/store operations in a BEFORE_ARRIVAL_SECTION of other participating threads; and
3. load/store operations in a MIDDLE_SECTION of a thread have no guaranteed visibility ordering with respect to other threads.
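The three sections and the first axiom can be illustrated with a small threaded sketch. It assumes a hypothetical lock-based `ArriveWaitBarrier` class written only for this example; the lock acquisition inside `arrive()` and `wait()` is what provides the ordering here, not any hardware mechanism from the disclosure.

```python
import threading


class ArriveWaitBarrier:
    """Toy split barrier built on a condition variable."""

    def __init__(self, expected):
        self._expected = expected
        self._pending = expected
        self._phase = 0
        self._cond = threading.Condition()

    def arrive(self):
        with self._cond:
            token = self._phase
            self._pending -= 1
            if self._pending == 0:
                self._pending = self._expected
                self._phase += 1
                self._cond.notify_all()
            return token

    def wait(self, token):
        with self._cond:
            while self._phase == token:
                self._cond.wait()


N = 4
barrier = ArriveWaitBarrier(expected=N)
slots = [None] * N       # shared data written before ARRIVE
snapshots = [None] * N   # what each thread sees after WAIT


def participant(i):
    # Section before ARRIVE: stores that other threads must see after WAIT.
    slots[i] = f"result-{i}"
    token = barrier.arrive()
    # Middle section: private work only; nothing is assumed about the
    # visibility of other threads' stores here.
    _scratch = sum(range(1000))
    barrier.wait(token)
    # Section after WAIT: every participant's earlier store is visible.
    snapshots[i] = list(slots)


threads = [threading.Thread(target=participant, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(snapshots[0])
```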

In non-limiting embodiments, for example, arrive-wait barriers allow overlapping barriers to be outstanding awaiting resolution.

Such an implementation can be realized as a software-implemented barrier, a hardware-implemented barrier, or a hybrid hardware/software-implemented barrier. For example, the earlier CUDA hardware-based _synch primitive could be modified with circuit changes to implement the arrive-and-wait strategy as two separate atomic functions (arrive and wait), as described above. The example non-limiting technology, however, obtains additional advantages by implementing the synchronization barrier as a memory-backed, hardware-accelerated barrier.

Memory-Backed Synchronization Barriers

For purposes of terminological clarity, the term "barrier" can mean different things depending on the level of abstraction. At a lower level of abstraction, systems typically have physical storage that implements memory. A facility that implements loads and stores is used to read from and write to that memory.

At a next level of abstraction, when memory is used to communicate data between processes, a mechanism may be provided to ensure that all relevant data has been written to physical memory before a flag is set telling another process that the data is available to be communicated to it. Without some kind of barrier, another process could attempt to read the data before it has been written or while it is being written, and the message could be incomplete or incorrect. Barriers that guard against this are commonly called "memory barriers".

A memory barrier in the above example context is not a synchronization primitive but rather a side-effect-free instruction used to make operations visible in program order on machines, such as many modern GPUs, that reorder memory transactions. Many modern GPUs have such memory barriers, e.g., an instruction referred to as a "memory fence", such as the CUDA command "__threadfence_block()". These memory barriers sit at a level of abstraction below typical synchronization barriers.

It is possible to use memory barriers to implement synchronization primitives. Locks and mutexes (mutual exclusion objects) are further examples of synchronization primitives. A mutex grants exclusive access to a critical resource to one thread at a time. A barrier does something similar, but with certain differences.

In general, a synchronization barrier, as shown in FIG. 1A and following, separates different phases of computation: many threads working in one phase arrive at a barrier and are blocked until all other relevant threads have completed their work in that phase. Once all threads have arrived at the barrier, the barrier is released and a new processing phase begins. Such synchronization barriers are typically defined as a synchronization "primitive" and can in some contexts be regarded as an object. Thus, although synchronization primitives may in some contexts be implemented in part using "memory barriers", the discussion in this specification is intended to keep the concept of a "memory barrier" separate from that of a "synchronization barrier".

Example non-limiting embodiments use a memory-backed synchronization barrier, i.e., they implement synchronization barriers using memory (and, in some cases, associated memory barriers). Implementing a synchronization barrier as memory virtualizes the synchronization barrier in the same way that memory is virtualized. In addition, there is no practical limit on the number of barrier objects that can be instantiated, at least where any virtual memory location can be used to back and store a synchronization barrier.

Suppose, for example, that a synchronization barrier object requires 64 bytes of memory. It follows that a memory-backed synchronization barrier scheme lets the developer have as many synchronization barriers as the available memory can hold additional 64-byte storage elements. In modern GPU architectures with unified memory, the global memory can be extremely large, which means that a very large number of synchronization barriers can be accommodated. This is an improvement over hardware-backed synchronization barriers, where typically only a limited number of barriers can be used, depending on the particular hardware implementation, chip design, available chip area, and so on.
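The arithmetic behind this claim is simple to check. In the sketch below, the 64-byte footprint comes from the text, while the 16 GiB memory size and the helper name `max_barriers` are assumptions made only for illustration.

```python
BARRIER_BYTES = 64   # per-barrier footprint assumed in the text
GIB = 1 << 30


def max_barriers(memory_bytes, barrier_bytes=BARRIER_BYTES):
    """Upper bound on the number of barrier objects that fit in memory_bytes."""
    return memory_bytes // barrier_bytes


hypothetical_memory = 16 * GIB   # illustrative size, not from the source
print(max_barriers(hypothetical_memory))  # 268435456
```

Even if only a small fraction of such a memory were set aside for barriers, the count would far exceed what a fixed set of hardware barrier circuits typically provides.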

Instantiating synchronization barrier objects in memory greatly eases the performance tradeoffs discussed above, because implementing objects in memory is a simple problem that most developers know how to solve. Because a developer can instantiate so many barrier objects (the number is not truly unlimited, but it is practically unlimited as main or global memory sizes grow), there is no need to trade off the number of synchronization barriers against task capacity.

Because synchronization barriers are memory-backed, i.e., implemented in memory in example embodiments, they also benefit from memory sharing and the memory hierarchy. Earlier hardware synchronization barrier circuits were often implemented directly in the processor. Such hardware-implemented barriers therefore generally did not extend across different processors. In other words, each processor could have its own hardware-based barriers, which it could use to manage multithreaded tasks executing on that processor, but those hardware barriers were of no help in coordinating activity outside that processor, e.g., in a system with multiple processors that might be involved in implementing the same processing phase in parallel. Such coordination typically required the use of shared global (CPU) main memory, which could be slow and have other performance issues.

In contrast, example non-limiting embodiments that implement a synchronization barrier using memory instructions make it possible to support functionality beyond the confines of a processor, GPU, or SOC (system-on-a-chip). In particular, synchronization barriers can now be implemented at any level of the memory hierarchy, including, for example, levels shared across multiple cores, multiple streaming multiprocessors, multiple GPUs, multiple chips, multiple SOCs, or multiple systems, and in some cases cached in memory caches based on such hierarchies.

Example Non-Limiting Memory-Backed System Implementation

For example, referring to FIG. 2, in an example system having one or more CPUs 101, one or more GPUs 110 with one or more local graphics memories 114, one or more main memories 115, and one or more displays 112, the synchronization barrier may be stored in the local graphics memory 114 and shared by all streaming multiprocessors (SMs) 204 (see FIG. 2B) within the GPU 110, or it could be stored in main memory 115 and shared between the CPU 101 and the GPU 110. More specifically, the local graphics memory 114 may be organized in various hierarchies, such as level 3 cache, level 2 cache, level 1 cache, shared memory, etc., all managed by a memory management unit (MMU) 212 as shown in FIG. 2B. Because the memory hierarchy is used to determine which memory locations are shared by which resources, storing a synchronization barrier at different levels of the memory hierarchy controls which compute resources can share the synchronization barrier (and likewise which compute resources cannot access the barrier).

Such memory-implemented synchronization barriers can therefore be used to synchronize between threads running on a common core 206, between different warps running on different cores, between different processes running on the same or different GPUs 200, on the same or different SOCs, etc. A particular barrier is thus no longer limited to synchronizing threads processed in parallel on one particular processor, but can also be used to synchronize many more threads or other executions across any number of different cores, GPUs, processors, processor architectures, chips, and systems.

This capability is enabled by implementing the memory-backed synchronization barrier at an appropriate level of the memory hierarchy so that it can be accessed and shared by multiple processes and/or processors while being protected, for example, by commonly used memory barriers. As for scalability, as the memory hierarchy expands, more and more threads can use these synchronization barriers, while smaller hierarchies may support more limited scopes of barrier use.

Using Synchronization Barriers to Synchronize Hardware

A further limitation of most earlier hardware synchronization barrier implementations was that they could block software execution but not necessarily hardware processes. One process typically performed by prior GPU hardware is copying. While a processor can use memory instructions such as loads and stores, executed by the load/store unit 208, to copy from one memory location to another, so-called "copy engines" 210 as shown in FIG. 2B, or "direct memory access" (DMA) controllers (e.g., hardware-accelerated or hardware-based copy operators), have long been used to accelerate copying data to and from memory. In many systems, software can send commands to a dedicated copy/DMA engine 210 that performs the data transfer. Such a copy engine can copy data for a thread from a plurality of contiguous locations in memory 214, each corresponding to an offset from a base address of the thread, and store the data contiguously in system memory. The copy engine 210 can also copy data for a thread from locations in system memory and store the data in GPU memory 214. See, e.g., US Patent Publication No. 20140310484.

Often, such copy operations must complete before a next processing phase can begin, because, for example, moving on to the next processing phase may depend on completing an update to a data structure in memory. However, because prior synchronization barrier implementations were built around blocking threads rather than hardware, an additional mechanism beyond the hardware-based synchronization primitives was needed to ensure that the right data was present after all threads monitored by the synchronization primitive had completed and before the next processing phase could begin. In other words, in prior traditional approaches, such additional hardware-based engines (e.g., copy engine 210 or any other hardware) generally could not participate in the same synchronization barrier process as executing threads. While one prior solution was to wrap a software operator around the hardware-based DMA/copy operator so that the software operator finished only when the hardware operator finished, this approach imposed additional constraints on the software design that were not always desirable or efficient.

In contrast to such prior approaches, in accordance with one exemplary non-limiting feature, direct memory access (DMA) copy operations performed by a copy engine 210 (or similar or other operations performed by hardware such as a computation engine) are integrated into the software-implemented but hardware-accelerated synchronization primitive, so that the same synchronization primitive can be used to provide a barrier for software processes, hardware-based processes, and hybrids of hardware-based and software-based processes.

One exemplary non-limiting feature of the implementations presented herein is thus synchronizing copy engine 210 transactions with synchronization barrier technology. Such integration can, for example, be performed with purely hardware-based barrier implementations, but the exemplary non-limiting memory-backed synchronization barrier technique described herein provides additional benefits in performance and flexibility over a hardware-only implementation.

In one non-limiting embodiment described herein, hardware operations, such as copy engine 210 operations once initiated, behave from the standpoint of the synchronization barrier as if they were full-fledged threads, i.e., as if they were an execution stream of software instructions that a programmer wrote or a compiler compiled. The implementation is elegant in that it is simple to describe: the hardware operation behaves as if it were "morally" a thread. In some exemplary non-limiting embodiments, there can be many fine-grained hardware operations, such as copy operations, executing simultaneously and concurrently, and all of them can be synchronized on one or more common synchronization barriers.

On massively parallel modern GPUs, complex computations are most often performed collectively. The computations can thus be performed collectively by a large number of threads, which in turn can collectively launch an even larger number of hardware-based operations, such as DMA operations by one or any number of copy engines 210. For example, suppose 100 threads are executing concurrently and each of these 100 threads initiates a DMA operation by one or more associated copy engines 210. Using exemplary non-limiting features of the technique presented herein, the same synchronization barrier can synchronize the 100 DMA operations and the 100 threads (i.e., from the standpoint of synchronization by the synchronization primitive, the DMA operations "look" like threads), thereby providing synchronization for 200 total processes (100 software threads and 100 hardware-based DMA operations). Such functionality is provided, for example, by hardware acceleration circuits that provide interfaces between the MMU 212 and the copy engine(s) 210, so that the copy engine(s) 210 can change values of the memory-backed synchronization primitive (e.g., increment and reset counter values). The present technique is extensible, so that any number of fine-grained DMA operations can synchronize on the same barrier.
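By way of illustration only, the 100-thread/100-DMA example above can be modeled in a few lines of C++. This is a minimal sequential sketch under stated assumptions, not the disclosed hardware: the names `SharedBarrier`, `simulated_dma_copy`, and `run_model` are hypothetical, the copy engine is stood in for by `std::memcpy`, and the barrier is single-use. What it shows is that one counter, sized to the sum of thread arrivals and copy arrivals, serves both kinds of participant without telling them apart.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical single-use barrier: one counter covers threads and copies alike.
struct SharedBarrier {
    std::atomic<std::uint32_t> pending;   // arrivals still outstanding
    std::atomic<std::uint32_t> phase{0};  // advances once everything has arrived

    explicit SharedBarrier(std::uint32_t expected) : pending(expected) {
        assert(expected > 0);
    }

    // Called identically by a software thread and by a completed copy.
    void arrive() {
        // fetch_sub returns the previous value, so 1 means "last arrival".
        if (pending.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            phase.fetch_add(1, std::memory_order_release);
        }
    }
};

// Stand-in for a copy-engine transaction: move the data, then arrive.
inline void simulated_dma_copy(SharedBarrier& b, const int* src, int* dst,
                               std::size_t count) {
    std::memcpy(dst, src, count * sizeof(int));
    b.arrive();
}

// Sequential model of n "threads", each starting one copy and then arriving.
// Returns the barrier phase after all 2 * n arrivals.
inline std::uint32_t run_model(std::uint32_t n, std::vector<int>& dst) {
    assert(n > 0);
    std::vector<int> src(n);
    for (std::uint32_t i = 0; i < n; ++i) src[i] = static_cast<int>(i);
    dst.assign(n, -1);
    SharedBarrier barrier(2 * n);  // n threads + n copy operations
    for (std::uint32_t i = 0; i < n; ++i) {
        simulated_dma_copy(barrier, &src[i], &dst[i], 1);  // hardware arrival
        barrier.arrive();                                   // thread arrival
    }
    return barrier.phase.load(std::memory_order_acquire);
}
```

Calling `run_model(100, dst)` corresponds to the 200-arrival case: the phase advances only after the last of the 200 arrivals, whichever kind of participant produces it.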

In a massively parallel architecture capable of supporting many threads, it could be inefficient to program each individual thread to wait for each hardware-based operation to complete. The present exemplary non-limiting technique instead provides a synchronization primitive that allows the large number of threads (and, in some embodiments, also copy operations) to wait collectively for completion of one or more hardware-based operations (e.g., copies) on which the next processing phase depends.

Here, the barrier primitive is a different kind of mechanism than a semaphore or a flag (which were sometimes used in prior approaches to synchronize with hardware-based processes), because the new synchronization primitive provides collective synchronization. It differs from one thread setting a flag or a semaphore for another thread. Instead, it allows N threads to block until potentially M hardware-based copy operations complete, where N and M are any non-negative integers. Such collective functionality need not be limited to software-based or memory-backed barrier techniques; it can be implemented in software, in hardware, or in both.

Hardware-accelerated synchronization barrier

To enable the functionality above and to achieve higher performance, exemplary non-limiting embodiments provide a hardware-accelerated implementation of memory-backed barriers. The implementation is invoked by software commands but integrates additional hardware functionality, such as hardware-based copying, into the same synchronization barrier mechanism that is used to block software processes such as threads.

While it would be possible, e.g., in accordance with non-limiting implementations, to implement such a function entirely in software, hardware acceleration is used to implement at least the barrier reset function more efficiently, and it is also used to interface with hardware-based processes such as DMA copy engines and so enable hardware functions to reset a barrier. In some embodiments, a dedicated hardware-based accelerator could be used to cache the synchronization barrier.

In prior software-implemented versions, the last-arriving thread recognized that it was the last to arrive and modified a counter accordingly, adding the complement to the counter's current value to reset the counter to a starting value. See, for example, the Java "Phaser" commonly implemented in Java virtual machines. In some exemplary implementations, DMA engines are not written in software. Because the DMA engines may in some cases be responsible for resetting the barrier, and because they are not software, it is desirable in these implementations to perform such a reset in hardware. For this reason, exemplary non-limiting implementations provide a hardware-accelerated reset operation. However, other exemplary non-limiting implementations of the technique presented herein could be applied to phasers, latches, or other synchronization primitives that are not barriers. Such a technique could also be applied for use with semaphores.
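The software-only reset mentioned above, in which the last arriver adds a complement to the counter, might look roughly like the following C++ sketch. It reflects one possible reading of that sentence and is not taken from any particular prior implementation: the type `SoftwareBarrier` and the function `arrive` are hypothetical, the counter is assumed to count arrivals upward from zero, and the "complement" is taken to be the two's complement of the expected count.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical software-only barrier state; the counter counts arrivals up from 0.
struct SoftwareBarrier {
    std::atomic<std::uint32_t> arrived{0};
    std::atomic<std::uint32_t> phase{0};
    std::uint32_t expected;

    explicit SoftwareBarrier(std::uint32_t n) : expected(n) { assert(n > 0); }
};

// Records one arrival. Returns true if the caller was the last arriver
// and therefore performed the reset.
inline bool arrive(SoftwareBarrier& b) {
    // fetch_add returns the value held before the increment.
    std::uint32_t before = b.arrived.fetch_add(1, std::memory_order_acq_rel);
    if (before + 1 != b.expected) {
        return false;  // someone else will arrive later and do the reset
    }
    // Last arriver: adding the two's complement of `expected` wraps the
    // unsigned counter from `expected` back to the starting value 0.
    b.arrived.fetch_add(~b.expected + 1u, std::memory_order_acq_rel);
    // Publish the new phase only after the counter has been reset.
    b.phase.fetch_add(1, std::memory_order_release);
    return true;
}
```

The point of the sketch is that the reset is ordinary software executed by whichever thread happens to arrive last, which is exactly what a DMA engine cannot do for itself.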

Software implementation

In non-limiting embodiments, each arrive-wait-barrier state is a data structure 1900 (e.g., 64 bits) stored in memory. As shown in FIG. 3, this data structure 1900 contains the following:

1. The expected number of arrivals for each time the barrier is used (field 1908).
2. The remaining number of arrivals required to release the barrier (counter field 1904).
3. The barrier phase (for reusing the barrier) (field 1902).
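The three fields listed above can be pictured as bit ranges of a single 64-bit word. The following C++ sketch is illustrative only: the source states that the structure can be, e.g., 64 bits but does not give field widths or positions, so the 20-bit count fields, the one-bit phase in the top bit, and the names `BarrierFields`, `pack`, and `unpack` are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout (not from the source): two 20-bit counts and a 1-bit phase.
constexpr unsigned kCountBits = 20;
constexpr std::uint64_t kCountMask = (1ull << kCountBits) - 1;
constexpr unsigned kPendingShift = 0;            // remaining arrivals (field 1904)
constexpr unsigned kExpectedShift = kCountBits;  // expected arrivals (field 1908)
constexpr unsigned kPhaseShift = 63;             // barrier phase (field 1902)

struct BarrierFields {
    std::uint32_t expected;  // arrivals expected each time the barrier is used
    std::uint32_t pending;   // arrivals still required to release the barrier
    std::uint32_t phase;     // 0 or 1 in this sketch
};

// Combine the three fields into one 64-bit word.
inline std::uint64_t pack(const BarrierFields& f) {
    assert(f.expected <= kCountMask && f.pending <= kCountMask && f.phase <= 1);
    return (std::uint64_t{f.phase} << kPhaseShift) |
           (std::uint64_t{f.expected} << kExpectedShift) |
           (std::uint64_t{f.pending} << kPendingShift);
}

// Split a 64-bit word back into its three fields.
inline BarrierFields unpack(std::uint64_t word) {
    BarrierFields f;
    f.pending = static_cast<std::uint32_t>((word >> kPendingShift) & kCountMask);
    f.expected = static_cast<std::uint32_t>((word >> kExpectedShift) & kCountMask);
    f.phase = static_cast<std::uint32_t>(word >> kPhaseShift);
    return f;
}
```

Keeping all three fields in one machine word is what would allow a single atomic read-modify-write to update the count and the phase together.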

In accordance with exemplary non-limiting implementations, arrive-wait-barriers allow individual threads to cooperate, so the counts in fields 1904, 1908 can be regarded as thread counts.

As shown in FIG. 3, the first field 1902 is a phase counter indicating the current processing phase of the arrive-wait-barrier. When the barrier is reset, the phase counter 1902 can be incremented to the next phase number. In one exemplary implementation, the phase counter 1902 can be a one-bit state flag (i.e., a one-bit counter) that is toggled (incremented) each time the barrier is reset. Such a state flag can be used to indicate to threads that the state of the barrier has changed to the next processing phase, meaning that the threads are not (or are no longer) blocked by the barrier. In other non-limiting embodiments, a phase indicator with greater resolution might be desired, e.g., to distinguish between more than just the current phase and the next phase. Such other implementations could, for example, increment a multi-bit counter each time the barrier is reset and thereby track which of N processing phases is currently in effect, where N is an integer.
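The difference between a one-bit and a multi-bit phase indicator can be made concrete with a small C++ sketch. The helper names `next_phase` and `released` are hypothetical, and the rule that a waiter is released when the phase differs from the value it observed on arrival is an assumption about how a phase flag might be consumed.

```cpp
#include <cassert>
#include <cstdint>

// Advance a phase indicator that is `bits` wide; it wraps modulo 2^bits,
// so bits == 1 gives a flag that simply toggles on each reset.
inline std::uint32_t next_phase(std::uint32_t phase, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    const std::uint64_t modulus = 1ull << bits;
    return static_cast<std::uint32_t>((std::uint64_t{phase} + 1) % modulus);
}

// A waiter that recorded `observed` when it arrived is no longer blocked
// once the barrier's current phase differs from that value.
inline bool released(std::uint32_t observed, std::uint32_t current) {
    return observed != current;
}
```

With one bit, two consecutive resets bring the flag back to its original value, so a waiter that missed a whole phase could not tell; with two or more bits the same sequence remains distinguishable, which is the kind of added resolution the paragraph above alludes to.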

As further shown in FIG. 3, an additional field comprises an arrival counter 1904 that indicates how many threads/processes have arrived (or, in one implementation, the remaining arrivals needed for the barrier to release). In one exemplary implementation, each time a thread or other process arrives at the barrier, the arrival counter 1904 is incremented. When the arrival counter 1904 has been incremented to a predetermined known value, this indicates that all threads and processes have arrived and that the barrier can change state and move on to the next processing phase.

When the last straggling thread (or hardware process) arrives and the barrier is satisfied/reset, the arrival counter 1904 can be reset to an initial value contained, for example, in the third field 1908 of the non-limiting embodiment shown in FIG. 3, namely an expected number of arrivals that applies each time the arrive-wait-barrier is used.

As discussed below, exemplary non-limiting embodiments allow software to dynamically change the values of the arrival counter 1904 and of the "expected number of threads" field 1908 after the barrier has been created.

In exemplary non-limiting embodiments, the data structure 1900 shown in FIG. 3 can be stored anywhere in memory. As discussed above, the memory type/hierarchy level used to store the data structure 1900 is chosen to provide the desired scope of sharing between threads. When a thread implements a synchronization barrier of the exemplary non-limiting embodiments herein, the thread's call to the synchronization primitive includes, directly or indirectly, a memory address for the stored instance of the data structure 1900 that represents the synchronization primitive. This differs from prior hardware-based synchronization primitives, where the call to the synchronization primitive would likely specify a reference number or ID. A memory address (which can be a physical or virtual memory address, depending on the system) indicates where in memory the instance of the synchronization primitive 1900 referenced by the thread is to be found.

Example hardware-accelerated implementation

FIG. 4 is a block diagram of an exemplary non-limiting hardware-accelerated memory-backed implementation of the synchronization primitive 1900 of FIG. 3. As described above, the current phase counter 1902, the arrival counter 1904, and the expected number field 1908 are stored in memory and are accessible by the load/store unit 208, which in exemplary embodiments is able to perform atomic operations (stored in a processor instruction register 1952 and decoded by a conventional instruction decoder 1954), such as arithmetic functions (e.g., atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicCAS()), bitwise functions (e.g., atomicAnd(), atomicOr(), atomicXor()), and other functions. Such atomic operations allow a streaming multiprocessor to change the value of the synchronization primitive 1900 "in place." An MMU 212 is modified to include hardware circuitry that allows the DMA controller/copy engine 210 to similarly modify the arrival counter 1904 "in place" and to reset the synchronization primitive 1900 when a copy engine operation causes the arrival counter decoder 1956 (which acts as a comparator that compares the counter's count with a predetermined value and resets the counter and the phase indicator depending on the result of the comparison) to determine that no further threads or copy engine operations are awaited before the synchronization barrier can be reset. In such a case, the decoder 1956 initiates a hardware-controlled reset of the instance of the synchronization primitive 1900 to "flip" the phase indicator 1902 and reload the expected number 1908 into the arrival counter 1904. In one exemplary embodiment, the LSU 208 and/or MMU 212 circuitry performs a kind of direct-memory-access atomic operation to initiate/perform the appropriate atomic operations on the instance of the synchronization primitive 1900 stored in memory in order to make these changes.
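As a rough behavioral model of the comparator-and-reset step just described, the following C++ sketch expresses one arrival as a pure function on the barrier state. It is an assumption-laden illustration rather than a description of the circuit: the names `BarrierState` and `on_arrival` are hypothetical, the arrival counter is modeled as a count of outstanding arrivals, the predetermined comparison value is taken to be zero, and the phase is a single bit.

```cpp
#include <cassert>
#include <cstdint>

// Behavioral model of the stored barrier instance.
struct BarrierState {
    std::uint32_t phase;     // phase indicator (field 1902), one bit here
    std::uint32_t pending;   // arrival counter (field 1904), as arrivals outstanding
    std::uint32_t expected;  // expected number of arrivals (field 1908)
};

// Model of one arrival, whether from a thread or from a completed copy.
// In hardware the whole step would have to take effect as one atomic update.
inline BarrierState on_arrival(BarrierState s) {
    assert(s.pending > 0 && s.expected > 0);
    s.pending -= 1;
    if (s.pending == 0) {        // comparator match: nothing further is awaited
        s.phase ^= 1u;           // "flip" the phase indicator
        s.pending = s.expected;  // reload the expected count for the next phase
    }
    return s;
}
```

For a barrier expecting two arrivals, the first call leaves the phase unchanged and the second flips it and restores the count to two, so the same instance is immediately reusable.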

Thus, in exemplary non-limiting embodiments, the processor architecture is modified to provide additional circuitry that ties hardware-based processes such as DMA into the synchronization barrier implementation, so that the hardware is able to reset the blocking, memory-backed synchronization barrier when it has completed its DMA task and recognizes that it is the last straggling process needed to move on to the next processing phase. In exemplary non-limiting implementations, the hardware modifications need not be concerned with separating an arrive from a wait as described above, because the hardware DMA controller generally returns and is ready for a next task as soon as it has completed the previous one. However, in exemplary non-limiting implementations, the count maintained by the barrier's completion counter 1904 includes both the number of threads needed to complete and the number of DMA hardware operations needed to complete. That is, the counts 1904, 1908 do not distinguish between a thread count and a DMA/copy engine count; each holds a single value that aggregates the number of threads and the number of copy operations that must complete for the barrier to reset.

Example Non-Limiting Implementation

Example non-limiting embodiments herein implement changes to the Instruction Set Architecture (ISA) so that it includes instructions for accessing a new synchronization primitive that references a memory-backed synchronization barrier stored in memory. Furthermore, in example non-limiting implementations, instead of one primitive call, a thread includes two different primitive calls: an "arrive" primitive call to the barrier and a "wait" primitive call to the barrier. Between these two calls, as explained above and shown in FIG. 2, the thread may include instructions that are unrelated to the barrier and that can be executed without violating the barrier.

During initialization, the system first sets up the synchronization barrier instance in memory and stores there the data the system needs to maintain in order to implement the barrier (e.g., arrival counter, phase counter). Typically, the SDK (software development kit) provided by the system designer may include a library containing the various function calls for initializing a synchronization barrier. Similarly, the ISA of the processing system is modified to include new instructions for the synchronization barrier arrive and the synchronization barrier wait.

In one example non-limiting embodiment, the following software functions may be used to manage an instance of an arrive-wait barrier primitive 1900 stored in memory:

• a _create function 2100 (see FIG. 5A) is used to set up an arrive-wait barrier in memory.
• an _arrive function 2200 (see FIG. 5B) is used by a thread to indicate its arrival at an arrive-wait barrier. The barrier phase is returned by this function for use with the _wait function.
• a _wait function 2300 (see FIG. 5C) is used to wait for an arrive-wait barrier to clear for the intended phase.
• a _dropthread function 2400 (see FIG. 5D) permanently removes a thread from an arrive-wait barrier. This is useful when a thread exits.
• an _addonce function 2500 (see FIG. 5E) adds a specified value to the count of an arrive-wait barrier.
• a _droponce function 2600 (see FIG. 5F) reduces the count of an arrive-wait barrier by a specified value.

In addition, some non-limiting embodiments include an additional instruction, ARRIVES.LDGSTSBAR.64, that signals that all DMA transfers from this thread have completed and updates the arrival count in the arrive-wait barrier accordingly.

_create

FIG. 5A shows an example _create function 2100 that creates a new barrier. In this example, the calling code is responsible for ensuring that _create is invoked in only one thread. The _create function 2100 takes as parameters "Ptr" (the direct or indirect memory location where the barrier data structure is to be stored in memory) and "ExpectedCount" (the expected total of the collection of threads and DMA operations that are synchronized by this particular barrier). In the example shown, the counters in the data structure are initialized (2102). The barrier counter is set to "ExpectedCount" (increment on arrive) (2108), and the now-initialized barrier is stored at the specified location in memory (2110). The function then returns (2112). As described above, in one example embodiment the barrier can be stored in shared memory, but in other embodiments it can be stored in any desired memory in the memory hierarchy that is consistent with the scope of the synchronization.
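As a rough illustration of the initialization steps in FIG. 5A, the following sketch stores an initialized barrier at a caller-supplied location. The struct layout, field widths, and the name `barrier_create` are assumptions made for illustration; the disclosure treats the barrier state as opaque and does not fix its layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical in-memory layout of the barrier state.
struct ArriveWaitBarrier {
    std::uint32_t expected;  // ExpectedCount: threads plus DMA operations
    std::uint32_t pending;   // arrivals outstanding in the current phase
    std::uint32_t phase;     // current barrier phase
};

// Initializes the barrier stored at *ptr.
// As in the text, the caller must ensure only one thread calls this.
void barrier_create(ArriveWaitBarrier* ptr, std::uint32_t expected_count) {
    assert(ptr != nullptr);
    ptr->expected = expected_count;  // remembered so the barrier can reset
    ptr->pending = expected_count;   // nothing has arrived yet
    ptr->phase = 0;                  // first phase
}
```

Because the barrier lives wherever `ptr` points, the same routine would apply to any memory level whose visibility matches the scope of the synchronization.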

A more specific example is:

_create(BarPtr, NumThreads);
Input: BarPtr, NumThreads;
Initializes the barrier for the specified number of threads.
Core primitive: Cooperative Thread Array (CTA) wide split barrier (allocated through CTA-wide synchronization).
BarPtr = pointer to the allocated barrier, stored in shared memory or any other memory location where the opaque barrier state is stored.
NumThreads: the number of threads participating in this barrier that must arrive before the wait is cleared.

_arrive

FIG. 5B shows an example _arrive function 2200 that indicates that a thread has arrived at a barrier and returns the barrier phase to be used for the _wait() function 2300. In the example shown, the _arrive function 2200 reads the phase of the barrier at memory location [Ptr] and stores the phase locally (e.g., in a register the thread has access to). Note that in the example non-limiting embodiment, from the software perspective, the _arrive function 2200 does not change the state of the barrier but merely reads the current phase value from the barrier. In the example implementation, however, calling the _arrive function 2200 has the effect (based on the hardware acceleration associated with the barrier implementation) of reducing the number of threads the barrier's thread counter indicates, thereby "registering" with the barrier that the thread has arrived at the synchronization point, so that the barrier no longer waits for that particular thread to reach its defined synchronization point.

In the example non-limiting embodiment, the call to the _arrive function 2200 can be placed anywhere in a thread, and it is the position of the _arrive function call 2200 that defines the synchronization point within the thread. The developer and/or an optimizing compiler ensure that the number of threads containing an _arrive function call 2200 (plus DMA or other appropriate hardware calls) matches the expected number of arrivals programmed into the barrier.

A more specific, non-limiting example:

_arrive(BarPhase, BarPtr);
Input: BarPtr;
Output: BarPhase

Indicates that a thread has arrived; returns the barrier phase to be used for the Wait instruction.

In one example embodiment, a Wait instruction can be initiated by the copy engine 210 independently of any software thread of execution. This can occur by hardware in the MMU 212 or LSU 209 generating a fused atomic load/store command (LDGSTS) to shared memory, which essentially performs a direct memory access ("DMA") by the hardware engine to the instance of the primitive stored in shared memory.

_wait(BarPtr, BarPhase)

As shown in FIG. 5C, the _wait function 2300 waits for the arrive-wait barrier to clear for the intended phase. The _wait function 2300 again reads the phase of the barrier at memory location [Ptr] and compares it (2304) with the phase previously read by the _arrive function 2200. FIG. 5C thus defines an endless loop that waits until the barrier phase changes before allowing the thread to proceed. A thread that executes the _wait function 2300 is blocked at the barrier until the barrier phase changes.

_wait(BarPtr, BarPhase) must consume the BarPhase that was returned by a prior call to _arrive. In a non-limiting example embodiment in which _arrive produces side effects, _wait must also produce those side effects. It is much more typical, however, for the opposite to occur: _wait produces side effects, and _arrive must also produce those side effects.

Because in the example shown the _wait function 2300 uses a value returned by an _arrive function 2200, _wait 2300 should be called only after _arrive 2200 has been called. The two functions could be called immediately one after the other, or any number of instructions or other functions unrelated to the barrier could be placed between the _arrive function call 2200 and the _wait function call 2300. The developer (and/or an optimizing compiler) may wish to place useful work between an _arrive function call 2200 and a _wait function call 2300 so that processor cycles are not needlessly wasted. If the thread calls the _wait function 2300 after the barrier phase has already changed, the thread is not blocked by the barrier but instead executes the next instruction after the function call with only a short delay (e.g., one or two cycles) incurred in performing the operations 2302, 2304 of FIG. 5C.
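The pairing of _arrive and _wait through the returned phase can be sketched with a small single-threaded model. This is a simplified illustration, not the hardware behavior: `Barrier`, `barrier_arrive`, and `barrier_phase_done` are hypothetical names, the counter is assumed to count down to zero as in the micro-architecture pseudocode of this document, and `barrier_phase_done` is a non-blocking stand-in for the comparison that _wait repeats until the phase changes.

```cpp
#include <cassert>

struct Barrier {
    int expected;  // arrivals required per phase
    int pending;   // arrivals still outstanding
    int phase;     // toggles when the phase completes
};

// Registers one arrival and returns the phase observed at arrival time.
// The caller keeps this value and later hands it to the wait check.
int barrier_arrive(Barrier& b) {
    assert(b.pending > 0);
    const int observed = b.phase;
    if (--b.pending == 0) {      // last arrival clears the barrier
        b.phase ^= 1;
        b.pending = b.expected;
    }
    return observed;
}

// True once the barrier has left the phase in which the caller arrived.
// A blocking _wait would repeat this comparison until it returns true.
bool barrier_phase_done(const Barrier& b, int arrived_phase) {
    return b.phase != arrived_phase;
}
```

A thread would call `barrier_arrive`, perform unrelated work, and only then test `barrier_phase_done`; if the other participants arrived in the meantime, the test succeeds immediately and no time is spent blocked.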

A more specific example:

_wait(BarPtr, BarPhase)
Input: BarPhase;

Waits until all expected arrivals for the barrier have occurred for the specified phase of the barrier.

For a thread and a barrier BarPtr:

• Every Wait(BarPtr) has a corresponding Arrive(BarPtr), and the BarPhase from Arrive(BarPtr) is provided as the input to _wait(BarPtr); a call to _wait(BarPtr) cannot follow a _wait(BarPtr) without an intervening _arrive(BarPtr).
• A call to _arrive(BarPtr) should not follow an _arrive(BarPtr) without an intervening _wait(BarPtr).

_dropthread

The _dropthread function 2400 of FIG. 5D is used in example embodiments to permanently remove a thread from an arrive-wait barrier. This function is useful when a thread needs to exit. The effect of this function is to reduce (decrement) the expected number of arrivals at the barrier so that the barrier no longer waits for this particular thread (block 2402). In the example shown, the operation increments the value stored in the barrier, thereby reducing the number of counts required to reach a "waiting on no more threads" count.
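The effect of dropping a thread can be sketched as follows. The sketch follows the decrementing convention of the micro-architecture pseudocode in this document (reduce the initial count, then perform one arrival) rather than the incrementing encoding mentioned for FIG. 5D; `Barrier`, `arrive_internal`, and `barrier_drop_thread` are hypothetical names.

```cpp
#include <cassert>

struct Barrier {
    int expected;  // arrivals required in each future phase
    int pending;   // arrivals still outstanding in the current phase
    int phase;     // toggles when the phase completes
};

// One arrival: the last one flips the phase and reloads the counter.
void arrive_internal(Barrier& b) {
    if (--b.pending == 0) {
        b.phase ^= 1;
        b.pending = b.expected;
    }
}

// Permanently removes the calling thread from the barrier.
void barrier_drop_thread(Barrier& b) {
    assert(b.expected > 0);
    --b.expected;        // later phases no longer count this thread
    arrive_internal(b);  // the current phase stops waiting for it as well
}
```

With three participants, one drop leaves two outstanding arrivals in the current phase, and every later phase also needs only two.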

A more specific example:

_dropThread(BarPtr)
Input: BarPtr;
Removes a thread from the barrier. Useful when a thread wants to exit.

_addonce(BarPtr, Count)

The _addonce function 2500 of FIG. 5E adds a specified value to the count of the arrive-wait barrier, and the _droponce function 2600 of FIG. 5F reduces the count of the arrive-wait barrier by a specified value. Any overflows may be handled explicitly by software.

A more specific example:

_add(BarPtr, AddCnt)
Input: BarPtr, AddCnt;

AddCnt adds additional expected arrivals for this barrier. It is added only for a single use of the barrier.

For all threads participating in the barrier:

• The sum of all AddCnt matches the number of _arrive() calls.
• The thread providing the AddCnt may be different from the thread that executes the _arrive().
• A thread should not execute _add(BarPtr) between _arrive(BarPtr) and _wait(BarPtr).
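The "single use" property of _add can be sketched as follows: the extra arrivals raise only the outstanding count of the current phase, so the next phase starts again from the original expected count. The names `Barrier`, `barrier_add_once`, and `barrier_arrive` are hypothetical, and the count-down convention is taken from the micro-architecture pseudocode in this document.

```cpp
#include <cassert>

struct Barrier {
    int expected;  // arrivals required per phase, fixed at creation
    int pending;   // arrivals still outstanding in the current phase
    int phase;     // toggles when the phase completes
};

// Adds extra expected arrivals for the current use of the barrier only.
void barrier_add_once(Barrier& b, int add_cnt) {
    assert(add_cnt >= 0);
    b.pending += add_cnt;  // b.expected is deliberately left unchanged
}

// One arrival: the last one flips the phase and reloads the counter.
void barrier_arrive(Barrier& b) {
    assert(b.pending > 0);
    if (--b.pending == 0) {
        b.phase ^= 1;
        b.pending = b.expected;  // the extra arrivals are not carried over
    }
}
```

For example, two threads that also expect two copy operations would add 2 to a barrier created with a count of 2, so that four arrivals complete the phase.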

Other ISA approaches are possible. The main approaches differ in how the expected barrier arrival count is specified. Some options include:

• Specifying the expected arrival count when the barrier is created.
• Specifying the expected arrival count each time the barrier is used by the Arrive (as with the existing SM CTA barrier, all threads must specify the same expected arrival count).
• Specifying the expected arrival count each time the barrier is used by the Wait (as with the existing SM CTA barrier, all threads must specify the same expected arrival count).
• Hybrid: specifying the expected arrival count when the barrier is created, with additional expected arrivals able to be added by the Arrive.

Example non-limiting micro-architecture (which may be used to implement the block diagram of FIG. 4):

Barrier state
• Phase: phase of the barrier.
• Count: count of the barrier.
• ThreadCnt: number of threads participating in this barrier.

 _create(BarPtr, initCnt)
      barrier[BarPtr].initCnt = initCnt;
      barrier[BarPtr].cnt = initCnt;
      barrier[BarPtr].phase = 0;

 _add(BarPtr, addCnt)
      // increase the arrival count
      barrier[BarPtr].cnt += addCnt;

 Arrive_function(BarPtr) // this is not part of the API in one embodiment
      // decrement the arrival count
      barrier[BarPtr].cnt--;
      // check whether this arrival clears the barrier
      if (barrier[BarPtr].cnt == 0) {
            // update the phase and reset the count
            if (barrier[BarPtr].phase == 0) { barrier[BarPtr].phase = 1; } else { barrier[BarPtr].phase = 0; }
            barrier[BarPtr].cnt = barrier[BarPtr].initCnt;
            // unstall all warps waiting on the barrier
            unstall(BarPtr);
      }

 Arrive(BarPtr, addCnt, BarPhase)
      // return the phase (optional)
      BarPhase = barrier[BarPtr].phase;
      // perform the arrive function
      Arrive_function(BarPtr);

 LDGSTS_Arrive(BarPtr)
      // perform the arrive function
      Arrive_function(BarPtr);

 BarWait(BarPtr, BarPhase)
      // if the barrier is still in the same phase, it has not yet
      // been cleared, so stall; otherwise continue
      if (barrier[BarPtr].phase == BarPhase) { stall(BarPtr); }

 DropThread(BarPtr)
      // the thread removes itself from the barrier
      barrier[BarPtr].initCnt--;
      // perform the arrive function
      Arrive_function(BarPtr);
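The stall and unstall behavior of the micro-architecture above can be modeled in software as follows. This is a sketch under assumptions: warps are represented by integer ids, the `stalled` list stands in for whatever hardware mechanism parks warps, and all names are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct Barrier {
    int init_cnt;              // count reloaded at the start of each phase
    int cnt;                   // arrivals still outstanding
    int phase;                 // single-bit phase, 0 or 1
    std::vector<int> stalled;  // hypothetical: ids of warps parked on the barrier
};

// Internal arrival: decrement, and on reaching zero flip the phase,
// reload the count, and release every stalled warp.
void arrive_function(Barrier& b) {
    assert(b.cnt > 0);
    if (--b.cnt == 0) {
        b.phase ^= 1;
        b.cnt = b.init_cnt;
        b.stalled.clear();  // models unstall(BarPtr)
    }
}

// API-level arrive: report the phase seen on entry, then arrive.
int arrive(Barrier& b) {
    const int phase = b.phase;
    arrive_function(b);
    return phase;
}

// Returns true if the warp had to stall because the phase is unchanged.
bool bar_wait(Barrier& b, int warp_id, int phase) {
    if (b.phase == phase) {
        b.stalled.push_back(warp_id);  // models stall(BarPtr)
        return true;
    }
    return false;
}
```

A warp that waits before the last arrival is parked and later released by that arrival, while a warp that waits afterwards passes straight through.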

Cooperative Data Movement Without Threadsync: DMA Task-Based Split Barrier

The programming model can match that of a multi-thread barrier, except that the barrier is split.

An existing multi-thread barrier can be described as follows:

• <PRE>
• BARRIER
• <POST>

The visibility rules:

• The <PRE> load/stores of a thread are guaranteed to be visible to the <POST> load/stores of other participating threads.
• The <POST> load/stores of a thread are guaranteed not to be visible to the <PRE> load/stores of other participating threads.

The split multi-thread barrier can be described as follows:

• <PRE>
• ARRIVE
• <MIDDLE>
• WAIT
• <POST>

The visibility rules (the first two are the same as above):

• The <PRE> load/stores of a thread are guaranteed to be visible to the <POST> load/stores of other participating threads.
• The <POST> load/stores of a thread are guaranteed not to be visible to the <PRE> load/stores of other participating threads.
• The <MIDDLE> load/stores of a thread have no guaranteed visibility ordering with respect to other threads (at least no visibility guarantee provided by this particular barrier).

The LDGSTS "DMA" instruction is logically treated like an independent thread that was "forked" from the calling thread and executes an LDG/STS/ARRIVE, after which it "dies".

Split Barrier Visibility Issues

A split barrier means that multiple split barriers can overlap. All of the following overlaps are permitted and functionally correct without deadlock (see also FIG. 6).

Pipelined:

      Thread 0          Thread 1
      Arrive(BarA)      Arrive(BarA)
      Arrive(BarB)      Arrive(BarB)
      Wait(BarA)        Wait(BarA)
      Wait(BarB)        Wait(BarB)

Nested:

      Thread 0          Thread 1
      Arrive(BarA)      Arrive(BarA)
      Arrive(BarB)      Arrive(BarB)
      Wait(BarB)        Wait(BarB)
      Wait(BarA)        Wait(BarA)

Different order per thread:

      Thread 0          Thread 1
      Arrive(BarA)      Arrive(BarB)
      Arrive(BarB)      Arrive(BarA)
      Wait(BarA)        Wait(BarB)
      Wait(BarB)        Wait(BarA)

Different barrier and order per thread:

      Thread 0          Thread 1          Thread 2
      Arrive(BarA)      Arrive(BarC)      Arrive(BarB)
      Arrive(BarB)      Arrive(BarA)      Arrive(BarC)
      Wait(BarA)        Wait(BarA)        Wait(BarB)
      Wait(BarB)        Wait(BarC)        Wait(BarC)

In some example non-limiting embodiments, the following overlap should not be permitted because it would produce a deadlock.

Deadlock:

      Thread 0          Thread 1
      Arrive(BarA)      Arrive(BarB)
      Wait(BarA)        Wait(BarB)
      Arrive(BarB)      Arrive(BarA)
      Wait(BarB)        Wait(BarA)
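The difference between the permitted overlaps and the deadlocking one can be checked with a small deterministic simulation. This is an illustrative sketch only: each thread is modeled as a script of Arrive/Wait steps on named barriers, threads are stepped round-robin, and a deadlock is reported when no thread can advance. All type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class Op { Arrive, Wait };
struct Step { Op op; std::string bar; };
using Script = std::vector<Step>;  // the ordered barrier operations of one thread

struct Bar { int expected; int pending; int phase; };

// Returns true if every script runs to completion, false on deadlock.
bool runs_to_completion(const std::vector<Script>& threads,
                        const std::map<std::string, int>& expected) {
    std::map<std::string, Bar> bars;
    for (const auto& kv : expected) bars[kv.first] = Bar{kv.second, kv.second, 0};

    std::vector<std::size_t> pc(threads.size(), 0);                 // next step per thread
    std::vector<std::map<std::string, int>> token(threads.size());  // phase seen at arrive

    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t t = 0; t < threads.size(); ++t) {
            if (pc[t] == threads[t].size()) continue;  // thread already finished
            const Step& s = threads[t][pc[t]];
            Bar& b = bars.at(s.bar);
            if (s.op == Op::Arrive) {
                token[t][s.bar] = b.phase;
                if (--b.pending == 0) { b.phase ^= 1; b.pending = b.expected; }
            } else {
                assert(token[t].count(s.bar) == 1);  // a wait must follow an arrive
                if (b.phase == token[t][s.bar]) continue;  // still blocked
            }
            ++pc[t];
            progress = true;
        }
    }
    for (std::size_t t = 0; t < threads.size(); ++t)
        if (pc[t] != threads[t].size()) return false;
    return true;
}
```

In the deadlock pattern each thread waits on its first barrier before arriving at the second, so neither barrier ever receives its second arrival; in the permitted patterns every arrive precedes the waits that depend on it.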

State Example

FIG. 6 shows a state example. FIG. 6 first shows the change from phase 0 to phase 1 (once all arrivals have occurred, the barrier phase changes and the count is reset). The operation "Wait(BarPhase_T1, BarPtr)" executed by thread 1 ensures that all loads/stores performed in phase 0 by all participating threads are visible to thread 1 before it proceeds. Similar Wait operations in phase 1 and phase 2 provide the same guarantee for those phases. Each use of the barrier is a new phase, so in exemplary embodiments the internal phase indicator need only be a single bit to indicate which phase a particular Wait() must wait for.
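The interplay of the arrival count and the single phase bit can be illustrated with a small software model. This is a sketch under stated assumptions rather than the hardware implementation: the struct layout, the up-counting convention, and the names `SplitBarrier`, `make_barrier`, `arrive`, and `phase_complete` are all hypothetical, and the model is single-threaded, so it shows state transitions only, not memory ordering.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the barrier state: a count and one phase bit.
struct SplitBarrier {
    std::uint32_t expected;  // arrivals needed to complete a phase
    std::uint32_t arrived;   // arrivals recorded in the current phase
    bool phase;              // single phase bit, flips when a phase completes
};

inline SplitBarrier make_barrier(std::uint32_t expected) {
    assert(expected > 0);
    return {expected, 0, false};
}

// Arrive: record one arrival and return the phase bit seen at arrival.
// When the last arrival comes in, the phase flips and the count is reset.
inline bool arrive(SplitBarrier& bar) {
    const bool observed = bar.phase;
    if (++bar.arrived == bar.expected) {
        bar.phase = !bar.phase;
        bar.arrived = 0;
    }
    return observed;
}

// Non-blocking form of Wait: true once the barrier has left the phase
// that the caller observed when it arrived.
inline bool phase_complete(const SplitBarrier& bar, bool observed) {
    return bar.phase != observed;
}
```

In this model a thread keeps the value returned by `arrive` (playing the role of `BarPhase_T1`) and later polls `phase_complete` with it; the comparison of two single-bit values is all that is needed to tell whether the phase has advanced.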

FIG. 7 shows another state example, which includes load/store operations performed by hardware operators such as the copy machine 210. The "Add(BarPtr, 2)" instructions executed by Thread0 and Thread1 are used to add these two copy machine operations to the barrier.

In this example, the operation "Wait(BarPhase_T1, BarPtr)" ensures that all load/store operations performed in phase 0 by all participating threads are visible to Thread1 (marked by Arrives) and that all LDGSTS results are visible in shared memory (marked by LDGSTS-Arrive). Once all Arrives have occurred (see the lines with arrows indicating phase=0 and then phase=1), the barrier phase changes and the count is reset.
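A sketch of how Add and hardware-side arrivals might combine in one count is given below. It is a simplified, single-threaded model, and several points are assumptions: that each of the two threads issues Add(BarPtr, 2) for two copy operations of its own, that Add raises the expected count, and that a completed copy registers the same kind of arrival as a thread. The names `CountedBarrier`, `add`, `arrive`, `copy_arrive`, and `phase_complete` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: one count shared by thread and hardware arrivals.
struct CountedBarrier {
    std::uint32_t expected;  // arrivals required per phase (threads + copies)
    std::uint32_t arrived;   // arrivals recorded in the current phase
    bool phase;              // single phase bit
};

// Add: register n further expected arrivals, e.g. n outstanding copies.
inline void add(CountedBarrier& bar, std::uint32_t n) { bar.expected += n; }

// Thread-side Arrive: returns the phase bit observed at arrival.
inline bool arrive(CountedBarrier& bar) {
    const bool observed = bar.phase;
    if (++bar.arrived == bar.expected) {
        bar.phase = !bar.phase;  // all arrivals in: advance the phase
        bar.arrived = 0;         // and reset the count
    }
    return observed;
}

// Hardware-side arrival, standing in for an LDGSTS-Arrive on copy completion.
inline void copy_arrive(CountedBarrier& bar) { (void)arrive(bar); }

// Non-blocking Wait test against the phase observed at Arrive.
inline bool phase_complete(const CountedBarrier& bar, bool observed) {
    return bar.phase != observed;
}
```

With two threads and two copies added by each, the model requires six arrivals before `phase_complete` reports true, so a Wait cannot pass while any of the copies is still outstanding.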

FIG. 8 shows an example of three-stage pipelined streaming code. FIG. 8 shows the priming of the loop, followed by iterations of the loop. FIG. 9 shows an example of a loop code structure with the following details:

--- URF/RF values ---
URF1/URF2: point to the barrier to be used
RF1/RFT: point to the SMEM memory to be used

Iteration   UR1/R1        UR2/R2
0           BarA/SBufA    BarC/SBufC
1           BarB/SBufB    BarA/SBufB
2           BarC/SBufC    BarB/SBufB
...         repeat ...
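The barrier columns of the table follow a modulo-3 rotation, with the second register one stage behind the first. A minimal sketch of that rotation follows; it covers only the barrier names, and the names `StageRegs` and `barriers_for_iteration` are hypothetical and do not appear in the figures.

```cpp
#include <cassert>
#include <string>

// Barrier selected by each uniform register in a given loop iteration.
struct StageRegs {
    std::string ur1;
    std::string ur2;
};

// Three barriers are reused round-robin; UR2 trails UR1 by one stage,
// which modulo 3 is the same as leading it by two.
inline StageRegs barriers_for_iteration(unsigned iteration) {
    static const char* const kBarriers[3] = {"BarA", "BarB", "BarC"};
    return {kBarriers[iteration % 3], kBarriers[(iteration + 2) % 3]};
}
```

Iterations 0, 1, and 2 give the pairs BarA/BarC, BarB/BarA, and BarC/BarB, after which the sequence repeats.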

All documents cited herein are incorporated by reference as if expressly set forth.

Although the invention has been described in connection with what are presently considered to be the most practical and preferred embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiments but, on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

REFERENCES CITED IN THE DESCRIPTION

This list of the documents cited by the applicant was generated automatically and is included solely for the reader's information. The list is not part of the German patent or utility model application. The DPMA assumes no liability for any errors or omissions.

Patent literature cited

US 9117284 [0008]
US 10217183 [0008]
US 8381203 [0010]
US 9158595 [0010]
US 9442755 [0010]
US 9448803 [0010]
US 10002031 [0010]
US 10013290 [0010]

Non-patent literature cited

Xiao et al., "Inter-Block GPU Communication via Fast Barrier Synchronization", 2010 [0009]

Claims

1. A synchronization barrier comprising: a data structure stored in memory, the data structure including a counter; wherein the counter is modified by an operation performed by either an execution thread or a hardware operator.

2. The synchronization barrier according to claim 1, wherein the data structure stored in memory comprises a phase flag, an arrival counter, and a further value that is used to reinitialize the arrival counter when the barrier is reset.

3. The synchronization barrier according to claim 1 or 2, wherein the hardware operator comprises hardware that performs copying.

4. The synchronization barrier according to any one of the preceding claims, wherein the operation comprises an ARRIVE that is distinct from a WAIT and/or a WAIT that is distinct from an ARRIVE.

5. The synchronization barrier according to any one of the preceding claims, wherein the data structure is structured to be reset in response to a fused atomic load/store operation that can be initiated either by a hardware machine or by a software thread.

6. A computing system comprising: a synchronization barrier primitive stored in memory, the primitive including a counter and a phase indicator; and a memory access circuit that, depending on either (a) execution of an instruction by a software thread or (b) the counter indicating that all threads in a collection of threads and at least one copy operation have reached a synchronization point and all operations in the cluster have completed, resets the counter and changes the phase indicator.

7. The system according to claim 6, wherein the counter counts a sum of the number of copy operation completions and the number of execution thread arrival calls.

8. The system according to claim 6 or 7, wherein the instruction consists of an ARRIVE operation that does not include a WAIT operation, or a WAIT operation that does not include an ARRIVE operation.

9. The system according to any one of claims 6 to 8, wherein the primitive stored in memory further comprises a predetermined value, and wherein hardware resets the counter by loading the predetermined value when the counter indicates that all threads of a collection of threads and copy operations have reached a synchronization point and all copy operations in the cluster are complete.

10. The system according to claim 9, wherein the system is structured to allow a thread to dynamically change the predetermined value.

11. The system according to any one of claims 6 to 10, wherein the synchronization barrier primitive is stored in shared memory of a GPU.

12. The system according to any one of claims 6 to 11, wherein the synchronization barrier primitive is stored in a memory hierarchy that determines thread access to the primitive.

13. The system according to any one of claims 6 to 12, further comprising a comparator that compares the count of the counter with a predetermined value and resets the primitive depending on the result of the comparison.

14. The system according to any one of claims 6 to 13, wherein the phase indicator of the primitive is configured to be read first by an ARRIVE instruction and then by a WAIT instruction, so that a thread can determine whether the phase indicator of the primitive has changed phase state.

15. A GPU instruction set architecture comprising: an ARRIVE operation that reads at least a phase indicator portion of a synchronization barrier primitive stored in a memory and causes a counter to increment; and a WAIT operation that reads at least the phase indicator portion of the primitive stored in the memory and compares the phase indicator portion read by the ARRIVE operation with the phase indicator portion read by the WAIT operation to decide whether the phase state of the barrier has changed.

16. The GPU instruction set architecture according to claim 15, further comprising an ADD operation that adds to a field stored with the synchronization barrier primitive, the field being used to reinitialize the primitive upon a reset to a next phase state.

17. The GPU instruction set architecture according to claim 15 or 16, further comprising a CREATE instruction that initializes the synchronization barrier primitive and stores it in memory.

18. The GPU instruction set architecture according to any one of claims 15 to 17, further comprising a fused load/store instruction that enables a hardware-based machine to reset the synchronization barrier primitive when the hardware-based machine has completed an assigned task.

19. A synchronization method comprising: storing in memory a synchronization barrier indication comprising a phase indicator and a counter; executing an arrive instruction by at least one thread, thereby causing the counter to count and allowing the thread to read the phase indicator; completing a task with a hardware controller, thereby causing the counter to count; resetting the counter when the count of the counter indicates that a set of threads has executed arrive instructions and the hardware controller has completed tasks; and executing a wait instruction by the thread, thereby allowing the thread to read the phase indicator again, the thread being blocked depending on whether the phase indicator has changed value.

20. The synchronization method according to claim 19, comprising opening an execution window from when a thread executes the arrive instruction to when the thread executes the wait instruction, the thread performing work within the execution window that is asynchronous with respect to the synchronization barrier.

21. A synchronization barrier comprising: a counter that provides a synchronization barrier count; and circuitry operably connected to the counter that modifies the synchronization barrier count in response to completion of operations performed by execution threads and hardware operators.

22. The synchronization barrier according to claim 21, wherein the counter is stored in a memory.

23. The synchronization barrier according to claim 21, wherein the counter is stored in a shared memory of a GPU.

24. The synchronization barrier according to any one of claims 21 to 23, wherein the circuitry is configured to reset the synchronization barrier count in response to a fused atomic load/store operation that can be initiated by copy hardware and/or by a software thread execution.