GR20180200014U

GR20180200014U - Discarding of threads processed by a warp processing unit

Info

Publication number: GR20180200014U
Application number: GR20180200014U
Authority: GR
Inventors: Isidoros Sideris; Stephane Forey; Reimar Gisbert Doffinger
Original assignee: Arm Limited
Priority date: 2017-09-20
Filing date: 2017-09-20
Publication date: 2019-05-09

Abstract

A warp processing unit controls, in dependence on a warp program counter shared between a plurality of threads processing respective graphics fragments, fetching of a next instruction to be executed for at least some of the plurality of threads. In response to a determination that a given subset of threads is to be discarded when at least one other subset of threads is to continue, the warp processing unit processes the given subset of threads in a discarded state. For a thread processed in the discarded state, execution of instructions continues for the discarded thread, and at least one of: generation of data access messages triggered by the discarded thread is suppressed; and at least one processing operation, which would be deferred until completion of the discarded thread had the thread not been discarded, is enabled to be commenced independently of an outcome of the discarded thread.

Description

DISCARDING OF THREADS PROCESSED BY A WARP PROCESSING UNIT

The present technique relates to the field of graphics processing.

A graphics processing apparatus may need to perform a number of threads of processing on respective graphics fragments. For example, each thread may be a shader thread which performs shader processing on a graphics fragment to be drawn at a given pixel position within the image frame to be rendered. Some graphics processing apparatuses may have a warp processing unit, which processes a number of threads in dependence on a warp program counter shared between the threads, with fetching of a next instruction to be executed for at least some of those threads being controlled based on the warp program counter. As the threads of processing performed for nearby pixel positions may be similar and may require the same operations to be applied to different data inputs, controlling execution of instructions based on a shared program counter can be efficient because it allows the overhead of fetching and decoding instructions to be amortized across the batch of threads as a whole.
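The amortization described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the `Warp` class, its two-opcode instruction set and the `fetches` counter are hypothetical names introduced only to show that one fetch per instruction serves every thread in the warp.

```python
from dataclasses import dataclass, field


@dataclass
class Warp:
    """Toy warp (illustrative assumption): one shared program counter,
    one register value per thread."""
    program: list          # list of (opcode, operand) tuples
    inputs: list           # one input value per thread
    pc: int = 0            # warp program counter shared by all threads
    fetches: int = 0       # number of instruction fetches issued
    regs: list = field(init=False)

    def __post_init__(self):
        self.regs = list(self.inputs)

    def step(self):
        # The instruction is fetched and decoded once for the whole warp.
        op, operand = self.program[self.pc]
        self.fetches += 1
        # The same operation is then applied to each thread's own data.
        for t in range(len(self.regs)):
            if op == "add":
                self.regs[t] += operand
            elif op == "mul":
                self.regs[t] *= operand
        self.pc += 1

    def run(self):
        while self.pc < len(self.program):
            self.step()
        return self.regs


warp = Warp(program=[("mul", 2), ("add", 1)], inputs=[0, 1, 2, 3])
result = warp.run()
```

Here four threads execute two instructions with only two fetches, whereas four independent program counters would need eight.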

At least some examples provide a graphics processing apparatus comprising:

a warp processing unit to process a plurality of threads of processing on respective graphics fragments;

wherein the warp processing unit is configured to control, in dependence on a warp program counter shared between the plurality of threads, fetching of a next instruction to be executed for at least some of the plurality of threads;

the warp processing unit comprises registers to store architectural state data for the plurality of threads;

in response to a determination that a given subset of threads is to be discarded when at least one other subset of threads of the plurality of threads is to continue, the warp processing unit is configured to process the given subset of threads in a discarded state; and for a thread processed in the discarded state, the warp processing unit is configured to continue execution of instructions for the discarded thread, and at least one of the following applies:

the warp processing unit is configured to suppress generation of data access messages triggered by the discarded thread, said data access messages comprising messages requesting access to data other than said architectural state data stored in the registers of the warp processing unit; and

the graphics processing apparatus is configured to enable at least one processing operation, which would be deferred until completion of the discarded thread had the thread not been discarded, to be commenced independently of an outcome of the discarded thread.

At least some examples provide a graphics processing apparatus comprising:

means for processing a plurality of threads of processing on respective graphics fragments;

wherein the means for processing is configured to control, in dependence on a warp program counter shared between the plurality of threads, fetching of a next instruction to be executed for at least some of the plurality of threads;

the means for processing comprises means for storing architectural state data for the plurality of threads;

in response to a determination that a given subset of threads is to be discarded when at least one other subset of threads of the plurality of threads is to continue, the means for processing is configured to process the given subset of threads in a discarded state; and

for a discarded thread in the discarded state, the means for processing is configured to continue execution of instructions for the discarded thread, and at least one of the following applies:

the means for processing is configured to suppress generation of data access messages triggered by the discarded thread, said data access messages comprising messages requesting access to data other than said architectural state data stored in the means for storing of the means for processing; and

the graphics processing apparatus is configured to enable at least one processing operation, which would be deferred until completion of the discarded thread had the thread not been discarded, to be commenced independently of an outcome of the discarded thread.

At least some examples provide a graphics processing method comprising:

processing a plurality of threads of processing on respective graphics fragments using a warp processing unit configured to control, in dependence on a warp program counter shared between the plurality of threads, fetching of a next instruction to be executed for at least some of the plurality of threads, the warp processing unit comprising registers to store architectural state data for the plurality of threads; and

in response to a determination that a given subset of threads is to be discarded when at least one other subset of threads of the plurality of threads is to continue, the warp processing unit processing the given subset of threads in a discarded state;

wherein for a discarded thread in the discarded state, the warp processing unit continues execution of instructions for the discarded thread, and at least one of the following applies:

the warp processing unit suppresses generation of data access messages triggered by the discarded thread, said data access messages comprising messages requesting access to data other than said architectural state data stored in the registers of the warp processing unit; and

at least one processing operation, which would be deferred until completion of the discarded thread had the thread not been discarded, is enabled to be commenced independently of an outcome of the discarded thread.

Further aspects, features and advantages of the present technique will be apparent from the following description of examples, which is to be read in conjunction with the accompanying drawings, in which:

Figure 1 schematically illustrates an example of a graphics processing pipeline for processing graphics primitives for display;

Figure 2 shows an example of tile-based rendering of graphics primitives;

Figure 3 shows an example of rasterization to generate graphics fragments corresponding to the graphics primitives;

Figure 4 illustrates an example in which a thread of processing for a particular graphics fragment may be discarded because the fragment does not contribute to the final displayed frame, as it is obscured by another graphics fragment;

Figure 5 illustrates an example of a portion of a shader core for performing threads of shader processing, the shader core comprising at least one warp processing unit;

Figure 6 schematically illustrates an example of a warp processing unit of the shader core;

Figure 7 illustrates an example of execution of multiple threads in a warp controlled by a warp program counter in the warp processing unit;

Figures 8 and 9 show two examples of divergence between threads of the same warp;

Figure 10 shows an example of clause execution by the warp processing unit;

Figure 11 shows an example of different thread states for threads processed by the warp processing unit; and

Figure 12 is a flow diagram showing a method of controlling the state of threads depending on whether the threads are to be discarded.

Sometimes, after processing of a number of threads in a warp has started on the warp processing unit, it may be determined that some of the threads can be discarded, for example because the corresponding graphics fragments will not contribute to the final rendered image. When only some threads of the warp are to be discarded and other threads are to continue their processing, then, because fetching of instructions is controlled by the shared warp program counter, it is not possible to allocate different threads to the portions of the warp processing unit that were previously processing the discarded threads. Hence, in general, execution of the warp may continue. Although some implementations may be able to suppress execution of instructions for the portions of the warp processing unit corresponding to the discarded threads, it may not always be practical to do so immediately, if that capability is provided at all.

The present technique provides a discarded state for threads processed by the warp processing unit, which can be used for a subset of threads that are to be discarded when at least one other subset of threads in the warp is to continue. When a thread is processed in the discarded state, execution of instructions for the discarded thread may continue, but at least one of the following actions may be taken.

In one case, the warp processing unit may suppress generation of data access messages triggered by the discarded thread. These data access messages may include messages requesting access to data other than the architectural state data stored in registers of the warp processing unit. Hence, even if execution of instructions continues for the discarded threads, suppressing generation of the data access messages conserves bandwidth for accessing the storage units holding this non-architectural data, so that it can be used by other operations or threads, and saves the energy that would be spent on unnecessary accesses to this other data.

In another option, for a thread processed in the discarded state the graphics processing apparatus may enable at least one processing operation, which would be deferred until completion of the discarded thread had the thread not been discarded, to proceed independently of an outcome of the discarded thread. Hence, although execution of instructions for the discarded thread continues, a subsequent operation that would normally have to wait until the discarded thread has completed can start without waiting for that completion, improving the performance of those other operations.

In some implementations, only one of the two options (suppressing message generation, or allowing other processing operations to start) may be implemented. Other implementations may provide both of these improvements for the discarded thread. Hence, in general, even if it is not possible to suppress execution of the instructions themselves for the discarded thread, performance improvements and/or energy savings can still be achieved by suppressing effects that the thread would otherwise have on operations performed outside the warp processing unit. This can improve the performance of the graphics processing apparatus as a whole.
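The two options can be sketched together in a small Python model. Everything in it is an illustrative assumption rather than the patented design: `ToyWarpUnit`, the `"add"` and `"load"` opcodes, the `messages` list standing in for data access messages sent outside the unit, and the `blocking` set standing in for operations that wait on thread completion.

```python
from enum import Enum


class ThreadState(Enum):
    ACTIVE = "active"
    DISCARDED = "discarded"


class ToyWarpUnit:
    """Hypothetical model; names and structure are illustrative assumptions."""

    def __init__(self, num_threads):
        self.state = [ThreadState.ACTIVE] * num_threads
        self.regs = [0] * num_threads            # per-thread architectural state
        self.messages = []                       # data access messages leaving the unit
        self.blocking = set(range(num_threads))  # threads that later operations wait on

    def discard(self, threads):
        for t in threads:
            self.state[t] = ThreadState.DISCARDED
            # Second option: operations deferred on this thread may now start.
            self.blocking.discard(t)

    def execute(self, op, operand):
        for t in range(len(self.regs)):
            if op == "add":
                # Register-only work still runs for discarded threads.
                self.regs[t] += operand
            elif op == "load":
                # First option: no message is generated for a discarded thread.
                if self.state[t] is ThreadState.ACTIVE:
                    self.messages.append((t, operand))


unit = ToyWarpUnit(4)
unit.execute("add", 1)
unit.discard([2, 3])
unit.execute("load", 0x100)
unit.execute("add", 1)
```

After this sequence all four threads have executed both `add` instructions, but only threads 0 and 1 have issued a load message, and only those two threads still hold back dependent operations.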

Some implementations may allow threads to be placed in the discarded state at any granularity, for example by allowing each individual thread to be independently placed in the discarded state or transitioned out of it.

However, in practice the overhead associated with discarding threads individually may not be justified. In some cases the threads processed by the warp processing unit using the shared warp program counter may comprise at least two groups of threads. The warp processing unit may prevent a thread in a given group from transitioning from a non-discarded state to the discarded state partway through execution of that thread when at least one other thread of the given group is to continue in the non-discarded state. Hence, the transition of threads to the discarded state could be controlled per group of threads, rather than per individual thread.

The groups of threads could vary in size. However, a particularly useful implementation may be one in which each group comprises four threads corresponding to a 2 by 2 quad of graphics fragments (that is, four fragments corresponding to a block of pixels 2 pixels high and 2 pixels wide). It is common for graphics fragments to be grouped into quads so that differences between values associated with the fragments in the same quad can be calculated, to compute derivatives which may be useful for controlling rendering of gradients, for example. Hence, in practice, even if a thread corresponding to a given quad is to be discarded, it may not be worth discarding that thread unless all the other active threads in the same quad are also to be discarded, because if the other threads in the same quad are still active they may rely on values provided by the associated threads in the same quad. Hence, in some cases each group of threads comprises a quad of threads.
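The dependence between threads of a quad can be shown with a short Python sketch of per-quad differencing. The function name `quad_derivatives`, the fragment ordering and the particular differencing scheme (each row and each column differenced separately) are assumptions chosen for illustration; the text above only says that differences within a quad are used to compute derivatives.

```python
def quad_derivatives(values):
    """Compute horizontal and vertical differences within one 2x2 quad.

    values is ordered [top_left, top_right, bottom_left, bottom_right]
    (an assumed layout). Each fragment's result depends on a neighbor's
    value, which is why one thread of a quad cannot simply stop while the
    others are still active.
    """
    tl, tr, bl, br = values
    # Horizontal difference: each row uses its own left/right pair.
    ddx = [tr - tl, tr - tl, br - bl, br - bl]
    # Vertical difference: each column uses its own top/bottom pair.
    ddy = [bl - tl, br - tr, bl - tl, br - tr]
    return ddx, ddy


ddx, ddy = quad_derivatives([1.0, 3.0, 2.0, 6.0])
```

In this example the top-left fragment's horizontal difference needs the top-right value, so discarding the top-right thread alone would leave its neighbor without an input.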

In one example, in response to a determination that all the threads of a given group are to be discarded while the threads of at least one other group processed by the same warp processing unit are to continue, the warp processing unit may transition the threads of the given group from the non-discarded state to the discarded state.

However, sometimes the warp processing unit may also transition threads to the discarded state even if not all of the threads of the group are to be discarded. For example, it may sometimes have been determined from the outset that a certain thread in a quad is not required, so the warp processing unit could start with some threads inactive from the beginning. In this case, the threads that were always inactive do not need to be discarded in order to justify discarding the other, active threads.

The warp processing unit may maintain an active mask indicating the threads that are active and are to execute the next instruction fetched in dependence on the warp program counter. The warp processing unit may also maintain a pending mask indicating the threads that were previously active but are now inactive because of divergence between the control flow taken by the respective threads of the warp. The pending mask can be used to distinguish threads that have been inactive since the start of processing of the warp from threads that became inactive because of divergence in the control flow taken by the respective threads.

Hence, in some cases, if it is determined that all the active threads of a given group (e.g. a quad) are to be discarded, the pending mask indicates that there are no pending threads for the given group, and the threads of at least one other group are to continue, then the warp processing unit may transition all the active threads of the given group from the non-discarded state to the discarded state. This allows the discarded state to be used even when some threads have been inactive since processing of the warp started.
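The three-part condition can be written as a mask check. The following Python sketch assumes, purely for illustration, that each mask is an integer with one bit per thread, that quad `q` occupies bits `4*q` to `4*q + 3`, and that a hypothetical `kill_mask` marks the threads found to be unneeded; none of these encodings is specified in the text.

```python
QUAD_SIZE = 4  # threads per group, matching the 2 by 2 quad case


def quads_to_discard(active_mask, pending_mask, kill_mask, num_quads):
    """Return the quads whose active threads may enter the discarded state.

    A quad qualifies when every active thread in it is to be discarded and
    it has no pending threads. The result is empty unless at least one other
    quad is to continue, since the discarded state is meant for the case
    where part of the warp carries on.
    """
    quad_bits = (1 << QUAD_SIZE) - 1
    candidates = []
    continuing = []
    for q in range(num_quads):
        shift = q * QUAD_SIZE
        active = (active_mask >> shift) & quad_bits
        pending = (pending_mask >> shift) & quad_bits
        kill = (kill_mask >> shift) & quad_bits
        all_active_killed = active != 0 and (active & ~kill) == 0
        if all_active_killed and pending == 0:
            candidates.append(q)
        elif active or pending:
            continuing.append(q)
    return candidates if continuing else []


# Quad 0: threads 0-2 active, thread 3 inactive from the start; quad 1 fully active.
case_inactive_from_start = quads_to_discard(0b1111_0111, 0b0000_0000, 0b0000_0111, 2)
# Same, but thread 3 of quad 0 is pending after divergence, so quad 0 must wait.
case_pending = quads_to_discard(0b1111_0111, 0b0000_1000, 0b0000_0111, 2)
# Every thread of the warp is to be discarded, so no group continues.
case_whole_warp = quads_to_discard(0b1111_1111, 0b0000_0000, 0b1111_1111, 2)
```

The first case shows the point made above: a thread that was inactive from the start does not prevent the rest of its quad from entering the discarded state, whereas a pending thread does.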

The technique discussed above can be particularly useful where the warp processing unit is responsive to clauses of instructions within a common program executed for the threads of the warp, each clause being executed as a block of instructions with sequential control flow. The warp processing unit may restrict non-sequential changes of control flow to the boundaries between clauses. This approach can reduce the overhead of determining control flow for the warp, because the comparisons of individual thread program counters used to determine the next warp program counter can be restricted to clause boundaries rather than being performed after every instruction. In practice, for many graphics processing routines the basic blocks between successive conditional branches are relatively large, so significant energy savings can be achieved by using clauses: once a clause is entered, its instructions are executed sequentially until the end of the clause, and it is not possible to branch to a non-sequential instruction partway through a clause.
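As a rough sketch of the comparison performed at a clause boundary, the following function picks the next warp program counter from the per-thread program counters. The selection policy (follow the lowest program counter among live threads) is an assumption chosen for illustration; the text does not specify which policy is used, and the function name is hypothetical.

```python
def next_warp_pc(thread_pcs, alive):
    """At a clause boundary, choose the next warp PC and the active mask.

    thread_pcs: next-clause program counter of each thread.
    alive: booleans marking threads that have not terminated.
    Assumption: the warp follows the lowest PC among live threads.
    """
    live = [pc for pc, ok in zip(thread_pcs, alive) if ok]
    if not live:
        return None, []
    warp_pc = min(live)
    # Threads whose PC matches the warp PC execute the next clause.
    active = [ok and pc == warp_pc for pc, ok in zip(thread_pcs, alive)]
    return warp_pc, active
```

Because this function would only be called between clauses, the per-thread comparison cost is paid once per clause rather than once per instruction.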

When the warp processing unit supports such clause-based execution, it may be unable to update an active mask partway through the processing of a given clause, where the active mask, as discussed above, indicates the active threads which are to execute the next instruction fetched in dependence on the warp program counter. That is, because program flow proceeds sequentially within each clause, conditional branch instructions are placed at clause boundaries, so the grouping of threads does not change within a clause since threads cannot diverge from one another within a clause. Overhead can therefore be reduced by preventing updates, partway through a clause, to the active mask which indicates the current grouping of converged threads. If the active mask cannot be updated partway through a clause, it may also not be possible to suppress instruction execution itself partway through a clause, even if it is determined partway through the clause that a thread should be discarded. In such an implementation, the discarded state discussed above, which allows instruction execution to continue while suppressing certain effects of the thread on other threads, can be particularly useful. As some clauses may be relatively long, in the absence of such a discarded state a discarded thread could continue to generate messages that cause accesses to memory or to buffers outside the warp processing unit for some time after it has been determined that the thread is not needed, which would waste energy and could affect the performance of other threads. Similarly, where a discarded thread can allow a subsequent dependent operation to proceed, this makes it possible to improve performance for such dependent operations. Hence, performance can be improved by transitioning a thread from a non-discarded (active) state to the discarded state partway through processing of the given clause.

When a given thread is transitioned from the non-discarded state to the discarded state partway through processing of a given clause, then once processing of that clause is complete the warp processing unit may transition the given thread to a terminal state. For threads in the terminal state, the warp processing unit may suppress instruction execution.

For example, transitioning the thread to the terminal state may correspond to clearing the corresponding bit in the active mask, so that instructions are not executed for that thread by the warp processing unit in subsequent clauses. Hence, the discarded state is used between the point at which the thread was determined to be discarded and the end of the current clause, which in some shader programs could be a relatively long time.
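The life cycle of a thread discarded partway through a clause can be modelled with a small state machine. This is a simplified sketch under stated assumptions: the state names, the `run_clause` function, the two instruction kinds ("alu" and "mem") and the `discard_at` map are all hypothetical and exist only to illustrate the behaviour described above.

```python
from enum import Enum


class State(Enum):
    ACTIVE = 0     # non-discarded: executes and may emit data access messages
    DISCARDED = 1  # still executes, but emits no data access messages
    TERMINAL = 2   # active-mask bit cleared: no instruction execution


def run_clause(states, clause, discard_at):
    """Run one clause for all threads and update `states` in place.

    clause: list of instruction kinds, "alu" or "mem".
    discard_at: maps a thread index to the instruction index at which that
    thread is determined to be discarded (before the instruction executes).
    Returns (instructions executed per thread, messages emitted per thread).
    """
    executed = [0] * len(states)
    messages = [0] * len(states)
    for i, op in enumerate(clause):
        for t in range(len(states)):
            if states[t] is State.TERMINAL:
                continue  # suppressed entirely in later clauses
            if states[t] is State.ACTIVE and discard_at.get(t) == i:
                states[t] = State.DISCARDED  # mid-clause transition
            executed[t] += 1  # execution continues even when discarded
            if op == "mem" and states[t] is State.ACTIVE:
                messages[t] += 1  # only non-discarded threads send messages
    # Clause boundary: discarded threads move to the terminal state.
    for t in range(len(states)):
        if states[t] is State.DISCARDED:
            states[t] = State.TERMINAL
    return executed, messages
```

In this model a thread discarded at the second instruction of a three-instruction clause still executes all three instructions but sends no messages, and executes nothing in the following clause.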

The discarded state can be used where the threads of a given group or quad are determined to be discarded but at least one other group or quad still needs to continue. On the other hand, where all the threads processed by the warp processing unit are to be discarded (including all groups or quads), the warp processing unit may simply terminate processing of the threads of the warp partway through a current clause. In this case there is no need to wait until the end of the clause, because no threads remain for which instruction execution is still required.

The discarded state may also be used for threads other than those determined to require discarding after the warp has already started processing them. For example, a helper thread may be processed in the discarded state from the start of its processing. Sometimes the warp processing unit may be allocated helper threads which do not correspond to a real graphics fragment that will contribute to the rendered image frame, but which are processed in order to provide data values that can be used by other threads which do correspond to real graphics fragments. For example, such helper threads can be used where a rasterizer maps a graphics primitive to a number of graphics fragments but the boundary of the primitive passes through a particular quad, so that at least some of the fragment positions of that quad lie outside the boundary of the primitive and therefore do not need to be drawn. Nevertheless, to enable derivatives to be calculated for the threads corresponding to fragments inside the primitive boundary, helper threads may still be issued for the positions of the quad that lie outside the primitive boundary. These helper threads can be executed in the discarded state from the start of their processing, so that the helper thread does not generate messages, or so that other threads at the same pixel position can be issued without waiting for the helper thread to complete.
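To see why helper threads are needed for derivatives, consider a finite-difference calculation across a 2x2 group of threads. The sketch below is illustrative only; the thread ordering and the function name are assumptions, and real hardware may compute derivatives differently.

```python
def quad_derivatives(values):
    """Finite-difference derivatives across a 2x2 group of threads.

    values: one value per thread, ordered
            [top-left, top-right, bottom-left, bottom-right] (assumed order).
    Returns (ddx, ddy), where ddx holds the horizontal difference for the
    top and bottom rows and ddy the vertical difference for the left and
    right columns.
    """
    tl, tr, bl, br = values
    ddx = (tr - tl, br - bl)
    ddy = (bl - tl, br - tr)
    return ddx, ddy
```

Every difference uses the value of a neighboring thread, so if only the top-left position is inside the primitive, the other three positions must still compute their values as helper threads even though they are never drawn.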

There may be a number of reasons why the warp processing unit may determine that a given thread is to be discarded. In one example, the warp processing unit may determine that a thread is to be discarded in response to a kill signal indicating that its graphics fragment would be obscured in the rendered image by another graphics fragment processed by the apparatus. For example, the kill signal may be received from a forward pixel kill stage which is earlier in the graphics processing pipeline than the warp processing unit, and may be generated when the forward pixel kill stage has determined that a later received fragment will obscure the earlier received fragment currently being processed by the given thread of the warp processing unit. By suppressing unnecessary processing operations for hidden fragments which would not contribute to the rendered image, the computational workload of the graphics processing pipeline as a whole can be reduced and performance thereby improved.

Also, in some cases a thread processed by the warp processing unit may itself determine that it needs to be discarded. For example, based on the depth value for a given thread, it may sometimes be determined that the thread's fragment corresponds to a completely transparent pixel, in which case that fragment can be discarded. Hence, the shader program executed for each thread of the warp may have conditional operations which, if executed, include a discard instruction indicating that the corresponding thread should be discarded. When the warp processing unit encounters a discard instruction within an active thread, it can therefore determine that the corresponding thread should be discarded. Depending on whether the other threads of the same quad or group are also to be discarded, the warp processing unit can determine whether to place the discarded thread in the discarded state.

The data access messages which are suppressed in some examples for threads in the discarded state may include requests to load data from, or store data to, a storage location outside the warp processing unit. For example, the storage location could include a tile buffer or frame buffer which stores previously calculated pixel values for at least a portion of a rendered image frame, a depth buffer for storing depth values for pixels of at least a portion of the rendered image frame, a texture buffer which stores texture data referenced by threads processed by the warp processing unit, and/or an attribute storage for storing attributes calculated for a given graphics fragment before a corresponding thread is issued to the warp processing unit. For example, these attributes could specify parameters such as pixel position, color, depth, transparency or opacity, etc.

In examples which allow at least one other processing operation to proceed without waiting for the outcome of a discarded thread, that other operation may be any operation performed within the graphics processing apparatus which depends on an outcome of the discarded thread or is prevented from proceeding until the discarded thread completes. In some cases, the at least one processing operation may comprise another thread processed for another graphics fragment which corresponds to the same position in the rendered image as the graphics fragment of the thread in the discarded state. For example, to ensure that subsequent operations such as alpha blending or depth testing consider each fragment for the same pixel position in order, the scheduler which allocates threads to the warp processing unit, or to other warp processing units within a shader core, may delay issuing threads for a given pixel position until every earlier thread for the same position has completed. When a thread is in the discarded state, there is no need to wait for that thread to complete, and the other thread for a different graphics fragment at the same position can be issued for execution instead. In particular, where discarding of threads is managed per group or quad as discussed above, a different quad for the same pixel position may be allowed to be issued for thread processing earlier.
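The scheduling dependency described above can be sketched as a per-position tracker. This is a minimal model under assumptions: the class name, its methods and the use of quad identifiers and (x, y) tuples as keys are hypothetical and are not taken from the described apparatus.

```python
from collections import defaultdict


class PositionScheduler:
    """Tracks which in-flight quads block new work at each pixel position."""

    def __init__(self):
        # position -> set of quad ids that later quads must wait for
        self.blocking = defaultdict(set)

    def can_issue(self, position):
        """A new quad may issue only if nothing blocks its position."""
        return not self.blocking[position]

    def issue(self, quad_id, position):
        """Try to issue a quad; returns False if it has to wait."""
        if not self.can_issue(position):
            return False
        self.blocking[position].add(quad_id)
        return True

    def release(self, quad_id, position):
        """Called when a quad completes OR when it enters the discarded state."""
        self.blocking[position].discard(quad_id)
```

The point of the sketch is that `release` is triggered by the transition to the discarded state, not only by completion, so a later quad at the same position does not wait for the rest of the discarded quad's clause.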

Although some implementations could provide a single warp processing unit, in some cases the apparatus may comprise two or more separate warp processing units, each having a separate warp program counter. Hence, the threads processed within the same warp share a common warp program counter, so that instruction fetch and decode is shared among the threads of the warp, but the threads in one warp processing unit can execute independently of the threads in another warp processing unit, with different warp program counters controlling the parallel fetching and decoding of different instructions for the respective warps.

Figure 1 illustrates an example of a graphics processing pipeline 2 for processing graphics primitives to render a frame of image data. Geometry data 4 defining a number of graphics primitives to be drawn in the rendered image is input to the pipeline. The primitives may correspond to triangles or other polygons to be drawn, for example. The geometry input may specify coordinates of the vertices of each primitive, and could also specify other properties of the primitive, such as a color, transparency or depth associated with a given vertex.

A tiler stage 6 receives the geometry input and allocates each primitive to one or more tiles within the frame to be rendered. As shown in Figure 2, the frame 8 may be divided into a grid of smaller tiles 10 of a certain size (e.g. 16 x 16 pixels or 32 x 32 pixels). Some primitives 11 may span more than one tile and so may be allocated to more than one tile by the tiler stage 6. The tiles need not be square, and in some examples rectangular tiles could be used. The remaining stages of the pipeline 2 use tile-based rendering, in which the operations for drawing the pixels of a given tile are completed before moving on to the next tile of the image. Hence, the tiler stage 6 iterates through each successive tile of the frame 8, for each tile sending information on the set of primitives to be drawn in that tile to the primitive setup stage 12, and moving on to the next tile once all the primitives of the previous tile have been sent. It continues to iterate until all the tiles of the frame are complete, at which point the geometry input 4 for a subsequent frame can be received and processed.
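One simple way to allocate a primitive to tiles is by its bounding box. The text does not say how the tiler decides which tiles a primitive touches, so the bounding-box approach below is an assumption (it is conservative and may list tiles the primitive does not actually cover); the function name and the non-negative pixel coordinates are also assumed.

```python
def tiles_for_primitive(vertices, tile_w=16, tile_h=16):
    """Return the (tile_x, tile_y) indices overlapped by a primitive's
    bounding box. vertices: list of (x, y) pixel coordinates, assumed >= 0.
    """
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    tx0, tx1 = int(min(xs)) // tile_w, int(max(xs)) // tile_w
    ty0, ty1 = int(min(ys)) // tile_h, int(max(ys)) // tile_h
    return [(tx, ty)
            for ty in range(ty0, ty1 + 1)
            for tx in range(tx0, tx1 + 1)]
```

With 16 x 16 tiles, a triangle with vertices (4, 4), (20, 4) and (4, 10) spans two horizontally adjacent tiles and would therefore be sent to the later stages once for each of them.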

The primitive setup stage 12 performs various primitive setup operations on the group of primitives assigned to a given tile. For example, the primitive setup operations may identify additional properties of the primitives which are not explicitly indicated by the geometry data. For example, the primitive setup stage 12 may derive one or more edge functions which represent the positions of the edges linking the respective vertices of the primitives, a depth function which represents the variation in depth across the primitive, or an interpolation function which represents the variation of attributes such as color, shading or transparency/opacity values across the primitive. The attributes determined by the primitive setup stage 12 may be stored in an attribute storage 13.
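As an illustration of what an edge function can look like, the sketch below uses the common signed-area form for a triangle. This is a standard formulation chosen for the example and is not taken from the text; the function names are hypothetical.

```python
def edge(a, b, p):
    """Signed edge function of point p relative to the edge from a to b.
    The sign tells which side of the edge p lies on; zero means on the edge.
    """
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])


def inside(tri, p):
    """True if p is inside (or on the boundary of) triangle `tri`.
    Accepts either vertex winding by requiring all three signs to agree.
    """
    e = [edge(tri[i], tri[(i + 1) % 3], p) for i in range(3)]
    return all(v >= 0 for v in e) or all(v <= 0 for v in e)
```

Because each edge function is linear in the pixel coordinates, a rasterizer can evaluate it incrementally from one pixel to the next, which is one reason such functions are typically derived once per primitive during setup.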

As shown in Figure 3, the primitives 11 are passed to a rasterizer stage 14 which converts the vertices and any additional primitive setup information for a given primitive into graphics fragments 15 representing properties (x-y position, depth, color, transparency/opacity, etc.) of corresponding pixels of the area occupied by the primitive. In the embodiment discussed below, the fragments generated by the rasterizer 14 are processed by some downstream stages (e.g. the shader stage 26) in units of 2x2 blocks of pixels called quads 16, with processing performed in parallel for each fragment of the same quad. However, this is not essential, and other implementations could process each fragment individually or in groups of a different number of fragments. If a quad straddles a primitive boundary, then in addition to the conventional threads processed for the fragments inside the boundary, the pipeline may also process helper threads for the fragment positions 15' outside the boundary, which will not end up being drawn but are used to assist the calculations for other fragments of the same quad which lie inside the primitive boundary.
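The split of a 2x2 group into conventional and helper threads can be sketched as follows. The `covered` callback stands in for the rasterizer's coverage test and, like the function name and the string labels, is a hypothetical device for the example.

```python
def classify_quad(x0, y0, covered):
    """Classify the four positions of the 2x2 group whose top-left pixel
    is (x0, y0).

    covered: callable (x, y) -> bool, True if the pixel lies inside the
    primitive (hypothetical stand-in for the rasterizer's coverage test).
    Returns None if no position is covered (nothing is issued), otherwise
    a list of "fragment" / "helper" labels, one per position.
    """
    positions = [(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)]
    flags = [covered(x, y) for x, y in positions]
    if not any(flags):
        return None
    return ["fragment" if f else "helper" for f in flags]
```

For a primitive edge described by `x + y <= 4`, the group at (2, 2) has one covered position and three helper positions, while the group at (4, 4) lies entirely outside and produces no threads at all.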

As shown in Figure 4, different primitives associated with different depth values may include opaque fragments 15 or quads 16 at the same x-y position, so that only the frontmost fragment or quad is visible in the final image. The fragments generated by the rasteriser 14 are subject to an early depth testing stage 18, which tests whether the depth associated with a later received fragment is such that the later fragment would be obscured by an earlier fragment that has already been drawn to the tile buffers 20, which store the latest pixel values of the scene being rendered. The tile buffers 20 may include one buffer per tile, and each tile may include a number of entries each corresponding to one pixel of that tile. The pipeline may maintain a Z stencil buffer 22 which stores, on a per-pixel basis, a depth value Z indicating the depth of the frontmost fragment rendered at a given pixel position so far in the current tile. Hence, the early depth testing stage 18 may compare the depth associated with the latest received fragment with the depth indicated for the corresponding pixel coordinate in the Z stencil buffer 22, and suppress processing of the received fragment when it is determined that the depth position of the later received fragment is behind the depth position of the already drawn fragment whose depth is represented in the Z stencil buffer 22.
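
The comparison performed by the early depth testing stage 18 can be sketched as follows. This is an illustrative model under stated assumptions, not the circuitry itself: the class and method names are hypothetical, and the convention that a smaller depth value means nearer to the viewer is assumed.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: int
    y: int
    depth: float  # assumed convention: smaller value = nearer the viewer


class ZBuffer:
    """Per-pixel depth of the frontmost fragment drawn so far in a tile."""

    def __init__(self, width, height):
        # Start "infinitely far away" so the first fragment always passes.
        self.depth = [[float("inf")] * width for _ in range(height)]

    def early_depth_test(self, frag):
        """Return True if the fragment may proceed, False if suppressed."""
        # Suppress only when the new fragment is behind the drawn one.
        return not frag.depth > self.depth[frag.y][frag.x]

    def record_drawn(self, frag):
        """Update the stored depth once a fragment has been drawn."""
        current = self.depth[frag.y][frag.x]
        self.depth[frag.y][frag.x] = min(current, frag.depth)


zbuf = ZBuffer(4, 4)
near = Fragment(1, 1, 0.2)
far = Fragment(1, 1, 0.7)
zbuf.record_drawn(near)
passes = zbuf.early_depth_test(far)  # far fragment is behind near: suppressed
```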

The early depth testing stage 18 helps to eliminate processing of fragments in cases where the frontmost fragment at a given pixel position is received before the rearmost fragment, so that processing of the later received fragment can be suppressed because it would be obscured by an already drawn fragment. However, it is also possible for the rearmost fragment to be received before the frontmost fragment. Hence, fragments which pass the early depth testing stage 18 are provided to a forward pixel kill (FPK) stage 24, which identifies cases in which an earlier received fragment would be obscured by a later received fragment. On receiving a given fragment (a later received fragment), the FPK stage 24 tests whether an earlier received fragment still pending in the FPK stage 24 or in a later stage of the pipeline 2 would be obscured by the given later received fragment. If so, the FPK stage 24 generates a kill request which requests that further processing of the earlier received fragment is suppressed, to avoid expending further processing resource on a fragment which would not contribute to the final image.
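
A simplified model of the forward pixel kill decision is sketched below. It assumes that all fragments are opaque and that a smaller depth value is nearer the viewer; the class `ForwardPixelKill`, its `receive` method and the `frag_id` field are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class PendingFragment:
    frag_id: int
    x: int
    y: int
    depth: float          # assumed convention: smaller = nearer the viewer
    killed: bool = False


class ForwardPixelKill:
    def __init__(self):
        # Earlier received fragments whose processing has not yet completed.
        self.pending = []

    def receive(self, new):
        """Accept a later received fragment; return ids of fragments to kill."""
        kill_requests = []
        for old in self.pending:
            same_pixel = (old.x, old.y) == (new.x, new.y)
            # An earlier fragment is killed if the later one is in front of it.
            if same_pixel and not old.killed and new.depth < old.depth:
                old.killed = True
                kill_requests.append(old.frag_id)
        self.pending.append(new)
        return kill_requests


fpk = ForwardPixelKill()
fpk.receive(PendingFragment(0, 3, 3, 0.9))           # rearmost arrives first
kills = fpk.receive(PendingFragment(1, 3, 3, 0.4))   # frontmost arrives later
```

Here the second call would return a kill request for fragment 0, since the later received fragment would hide it.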

Fragments which are not killed by the FPK stage 24 are passed to a fragment shading stage 26, which includes thread processing circuitry for performing threads of fragment shading processing for each fragment. For example, the fragment shading may access texture data stored in a texture buffer 27, which defines functions representing a pattern or texture to be rendered within a given primitive, and may use this to determine the precise colour to assign to a given pixel (the colours initially assigned by the primitive setup and rasterisation stages 12, 14 may be initial values for the shader threads). The fragment shading stage may execute in parallel a number of processing threads corresponding to the respective fragments of the same quad 16. The fragment shader core may also have resources for processing multiple quads 16 in parallel. Fragment shading execution is relatively processor-intensive, and so it can be useful for the FPK stage 24 to be able to suppress a target thread of fragment shading execution if it is found that a later received fragment will obscure the earlier fragment corresponding to the target thread. The fragment shader core 26 is described in more detail below.

The shaded fragments are provided to a late depth testing stage 28, which tests whether the depth associated with the shaded fragment is such that the fragment is obscured by an already rendered fragment, as indicated by the depth in the Z stencil buffer 22. The late depth testing stage 28 is provided because there are some fragments for which the depth value may not be available in time for the early depth testing stage 18, or for which the depth may change during fragment shading execution. Late depth testing also enables detection of overdrawn fragments in cases where, at the time the later fragment is at the early depth testing stage 18, the earlier fragment which would obscure that later fragment is still pending in the pipeline and has not yet updated the Z stencil buffer 22 (but will have done so by the time the later fragment reaches the late depth testing stage 28). Also, the late depth testing stage 28 enables handling of transparent objects whose transparency may only become apparent during fragment shading execution. If any fragments are found by the late depth testing stage 28 to be obscured by already drawn fragments, they are suppressed and prevented from being drawn to the corresponding tile. The remaining fragments are passed to a blending stage 30, which performs blending to blend the properties of transparent fragments with the next frontmost fragments at corresponding pixel positions, and writes the resulting pixel values to corresponding entries of a currently active tile buffer. For opaque objects, the blending stage 30 may simply overwrite the previous pixel values in the tile buffer. When processing of all the fragments for a tile is complete, processing moves to the next tile, represented by a different tile buffer.
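
The behaviour of the blending stage 30 can be illustrated with a short sketch. The particular blend equation is an assumption: a conventional "source over" alpha blend is used here purely as an example, and the function name `blend` and its parameters are hypothetical.

```python
def blend(src_colour, src_alpha, dst_colour):
    """Combine a shaded fragment with the value already in the tile buffer.

    src_alpha is the assumed opacity of the incoming fragment in [0, 1].
    """
    if src_alpha >= 1.0:
        # Opaque fragment: simply overwrite the previous pixel value.
        return tuple(src_colour)
    # Transparent fragment: weight the new and previous colours (assumed
    # "source over" blend; other blend functions are equally possible).
    return tuple(
        src_alpha * s + (1.0 - src_alpha) * d
        for s, d in zip(src_colour, dst_colour)
    )


opaque = blend((1.0, 0.0, 0.0), 1.0, (0.0, 0.0, 1.0))
half = blend((1.0, 0.0, 0.0), 0.5, (0.0, 0.0, 1.0))
```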

Although Figure 1 shows an example that uses tile-based rendering, which can be beneficial for reducing the required cache capacity and memory bandwidth, other examples may use immediate mode rendering, in which the primitives for the entire frame are passed down the pipeline in any order, without first being grouped into tiles. Although Figure 1 shows various buffers 13, 22, 27, 20 implemented as storage units in the pipeline 2, in practice these may be implemented as storage structures in memory accessed by general purpose load/store instructions. Entries of the buffer storage structures can be cached locally in the pipeline 2. Using tile-based rendering can help improve caching efficiency, since grouping the processing of the pixels of a given tile together increases the likelihood that the required buffer data is present in the cache, as neighbouring pixels are more likely to need similar data.

The fragment shading stage 26 includes one or more shader cores which perform threads of fragment shading execution. Figure 5 shows a portion of a shader core. It will be appreciated that Figure 5 is highly simplified, and in practice the shader core may have many other elements not shown in Figure 5. Each shader core may include a number of warp processing units 40 and a warp manager 42. Each warp processing unit 40 may process a number of threads of fragment shading execution as a warp, under the control of a shared warp program counter 44. In the example of Figure 5, each warp processing unit 40 processes eight threads, corresponding to two quads 16, but other examples could process a greater number of quads within the same warp. Within a warp 40, the threads are processed according to a single instruction multiple thread (SIMT) model, in which execution of a common program by a number of threads acting on different data inputs is controlled based on a single shared warp program counter 44. The warp manager 42 controls the scheduling of the threads of particular quads to the respective warp processing units 40.

Figure 6 shows an example of a warp processing unit 40 in more detail. The warp processing unit 40 includes a number of execution pipelines 50, one for each thread of the warp. Hence, in this example, since two quads are processed by the warp, there are eight execution pipelines 50. Each execution pipeline 50 has access to a corresponding set of thread state 52 stored in architectural state registers 54. Each execution pipeline 50 executes instructions from a common shader program, but using different data inputs as defined by the thread state 52. Common instruction fetch/decode logic 56 is provided and shared between the respective execution units 50 of the warp, so that in each cycle the same instruction is issued for execution in parallel on at least a subset of the thread execution pipelines 50 (hence it is not possible to issue different instructions to the execution pipelines 50 of the same warp 76 in the same cycle). The warp program counter 44 represents the instruction address of the current point of execution in the common program reached by the warp as a whole, and controls the fetch logic 56 to fetch the instruction indicated by the warp program counter 44. Each set of thread state includes a corresponding thread program counter 58 which represents the next instruction to be executed by the corresponding execution unit 50 for the corresponding thread. Each thread program counter 58 is incremented (or updated non-sequentially in the case of a branch) based on the outcome of processing of the corresponding thread. Program counter voting logic 72 is provided to select, based on the individual thread program counters 58 for each thread of the warp, the value to be set in the overall warp program counter 44, which controls which instruction is fetched in the next cycle.

This type of processing may be referred to as single instruction multiple thread (SIMT) processing. Since the respective pixels in a quad are likely to have similar input values, they are likely to take similar paths through the shader program, and so it can be efficient to control execution of the corresponding threads using SIMT to reduce the instruction fetch/decode overhead. Similarly, neighbouring quads are more likely to take similar paths than quads further apart, so processing efficiency can be improved by scheduling the threads for neighbouring quads in the same warp.

Figure 7 shows an example of executing a simple sequence of instructions using SIMT processing for a warp comprising 4 threads. For conciseness 4 threads are shown, but it will be appreciated that other numbers of threads (e.g. 8 in the example of Figure 6) could also be processed in a SIMT manner. A common sequence of instructions is executed in lockstep by each thread of the warp (thread group). The warp program counter 44 indicates the address of the current instruction being processed by the warp as a whole. The warp program counter 44 is derived from the thread program counters 58 (tPC0-tPC3) of the individual threads in the warp (for example, by program counter voting). For example, the warp program counter 44 may correspond to the lowest address indicated by any of the thread program counters 58 for the group of threads processed in the warp. For each thread within the group, the instruction indicated by the warp program counter 44 is executed in the current processing cycle if the thread program counter 58 for that thread matches the warp program counter 44. If the thread program counter 58 for a given thread does not match the warp program counter 44, the corresponding warp execution unit 50 is idle for a cycle (although it will be appreciated that execution of instructions by each warp execution unit 50 may be pipelined, so that while one stage of an execution unit 50 is idle, other stages may still be carrying out operations relating to an instruction issued in an earlier or later cycle).
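
The lowest-address voting policy given as an example above can be sketched in a few lines. The function names are hypothetical, and other voting policies are possible; this only models the example policy.

```python
def vote_warp_pc(thread_pcs):
    """Example voting policy: the warp PC is the lowest thread PC."""
    return min(thread_pcs)


def threads_issued(thread_pcs):
    """Return the voted warp PC and the threads that execute this cycle.

    A thread executes only if its own program counter matches the warp
    program counter; the other threads are idle for the cycle.
    """
    warp_pc = vote_warp_pc(thread_pcs)
    executing = [t for t, pc in enumerate(thread_pcs) if pc == warp_pc]
    return warp_pc, executing


# Thread 2 is at a lower address than the others, so only it executes.
warp_pc, executing = threads_issued([16, 16, 12, 16])
```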

Hence, in the example of Figure 7:

• In cycle 0, the warp program counter 44 indicates address #add, and all the thread program counters 58 for threads 0 to 3 also indicate address #add. Hence, all the threads execute the ADD instruction at address #add. Different threads may execute the ADD instruction using different operands, so that a number of different additions are performed in parallel for the respective threads.

• In cycle 1, a compare instruction CMP at address #add+4 is executed to compare the result r2 of the ADD instruction with an immediate value 19. For threads 0, 1 and 3, the result r2 was not equal to 19, while for thread 2 the result r2 was equal to 19.

• In cycle 2, a branch instruction BNE branches to address #add+16 if the outcome of the CMP instruction in cycle 1 was not equal (NE). For threads 0, 1 and 3, the branch is taken, and so the thread program counters 58 for these threads are set to #add+16. However, for thread 2 the outcome of the CMP instruction was equal (EQ), and so the branch is not taken and the thread program counter 58 for thread 2 is incremented to #add+12. Hence, there are now threads with different values of the thread program counter 58, and the behaviour of the threads has diverged.

• In cycle 3, the warp program counter 44 is set to #add+12 to match the lowest of the thread program counters 58 (in this case, the thread program counter for thread 2). For thread 2, the multiply instruction MUL at address #add+12 is executed. However, no instruction is executed in cycle 3 for threads 0, 1 and 3, because the thread program counters 58 for these threads do not match the warp program counter 44. These threads wait until the warp program counter 44 reaches #add+16 before resuming instruction execution.

• In cycle 4, the warp program counter 44 is incremented to #add+16, and so threads 0 to 3 now reconverge and execute the STR instruction at address #add+16, which stores a value to a memory address determined based on a base address and an index, with different threads using different indices to determine the target address. The memory accesses for at least some of the threads can be coalesced into a single memory access if the indices are such that the accesses target the same region of memory (e.g. the same cache line or the same page of the address space).
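
The five cycles listed above can be reproduced with a small simulation. This is a sketch under stated assumptions: the ADD operands are invented so that only thread 2 produces the value 19, the BNE instruction is taken to sit at #add+8 (inferred from the addresses given), and the data operations of MUL and STR are not modelled. All names are hypothetical.

```python
# Instruction addresses as byte offsets from #add.
ADD, CMP, BNE, MUL, STR = 0, 4, 8, 12, 16


def run_warp(operands):
    """Simulate the warp; return one (warp_pc, active_flags) per cycle."""
    n = len(operands)
    tpc = [ADD] * n        # per-thread program counters
    r2 = [0] * n           # per-thread result register
    equal = [False] * n    # per-thread outcome of CMP
    trace = []
    while any(pc <= STR for pc in tpc):
        wpc = min(tpc)                        # lowest-address voting
        active = [pc == wpc for pc in tpc]    # threads matching the warp PC
        trace.append((wpc, active))
        for t in range(n):
            if not active[t]:
                continue                      # idle this cycle
            if wpc == ADD:
                r2[t] = operands[t][0] + operands[t][1]
            elif wpc == CMP:
                equal[t] = r2[t] == 19
            elif wpc == BNE and not equal[t]:
                tpc[t] = STR                  # branch taken
                continue
            tpc[t] = wpc + 4                  # sequential increment
    return trace


# Assumed operands: only thread 2 computes r2 == 19.
trace = run_warp([(1, 2), (3, 4), (9, 10), (5, 6)])
for cycle, (wpc, active) in enumerate(trace):
    print(cycle, f"#add+{wpc}", active)
```

With these operands the trace has five cycles, and in the cycle where the warp program counter is #add+12 only thread 2 is active, matching the one-cycle bubble for threads 0, 1 and 3.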

Hence, in the example of Figure 7, as the threads require the same instructions to be executed on different data values, they can be processed efficiently as a group, because this allows a single instruction fetch to be amortised across the group of threads and memory accesses to be coalesced. However, branch instructions, for example, may cause divergent behaviour, such as when thread 2 required a different instruction from the other threads. Although in Figure 7 this only resulted in a one-cycle bubble in the pipelines for threads 0, 1 and 3, in other cases the divergence could last longer. There may also be events other than branches which cause similar differences in behaviour. When the threads in a warp diverge, there may be a significant number of cycles in which threads cannot execute their next operation because they have to wait while other threads execute different operations, and this reduces efficiency. Hence, the efficiency of processing may depend on how the threads are grouped together. In practice, by allocating the threads for fragments in the same quad (2x2 pixel block) to the same warp, the likelihood of the threads in the warp diverging can be reduced.
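
Coalescing of the per-thread memory accesses can be sketched as a grouping of target addresses by memory region. The 64-byte cache line size, the 4-byte element size and the function name `coalesce` are assumptions made for this illustration only.

```python
from collections import defaultdict

CACHE_LINE_BYTES = 64  # assumed region size; not specified in the text


def coalesce(base, indices, element_bytes=4):
    """Group threads by the cache line targeted by their access.

    Each thread t accesses address base + indices[t] * element_bytes.
    Threads that map to the same line could share one memory access.
    """
    lines = defaultdict(list)
    for thread, index in enumerate(indices):
        address = base + index * element_bytes
        lines[address // CACHE_LINE_BYTES].append(thread)
    return dict(lines)


same_line = coalesce(0x1000, [0, 1, 2, 3])     # one access serves all threads
spread = coalesce(0x1000, [0, 16, 1, 32])      # three separate lines
```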

As shown in Figure 6, the warp processing unit 40 may maintain an active mask 60 and a pending mask 62 to track convergence or divergence between the paths taken by the respective threads of the warp. The use of the active mask 60 and the pending mask 62 is shown in more detail in Figure 8. The active mask 60 includes a number of bit flags 64, each corresponding to one thread, which indicate whether the corresponding thread is active. Similarly, the pending mask 62 includes per-thread bit flags 66 which indicate whether the corresponding thread is pending (a pending thread is a thread which has become inactive due to divergence from the other, active threads). Initially, when execution of the shader program starts, all eight threads Q0T0 to Q1T3 are active, so all bits 64 of the active mask 60 are equal to 1 and all bits 66 of the pending mask 62 are equal to 0. At a given point, the shader program reaches a conditional branch instruction whose outcome differs between threads, so that the control flow diverges. Two of the threads, Q0T0 and Q1T1, follow a first path 68 through the program, so while the pipelines execute instructions from this branch the active mask has 1 bits for these two threads and 0 bits for the other threads, while the pending mask 62 has the inverse value of the active mask, since all the remaining threads are pending and waiting for the warp program counter 44 to return to the address of an instruction which they are to execute. On the other hand, in a second branch of the program the remaining threads are active and the two threads that followed the first path are inactive, as indicated by the active and pending masks respectively. As shown in the example of Figure 8, if the threads later converge again then the active mask may return to all 1s and the pending mask may return to all 0s.

In cases where all threads were initially active and none of the threads has yet terminated, every thread has its bit set in either the active mask 60 or the pending mask 62, so that a bitwise OR of the active and pending masks always produces a result with all bits equal to 1. However, as shown in Figure 9, this is not essential, and sometimes the warp manager 42 may allocate threads to a given warp so that at least one thread is inactive from the start of processing of the warp. In this case at least one bit of the active mask 60 may start at 0 even when there are no pending threads. Hence, in the example of Figure 9, if divergence occurs, then again for those threads which are active in a given branch 68, 70 of the control flow the active mask bits are equal to 1, while for the other threads which were previously active but have been suppressed due to the divergence the pending mask bits are equal to 1. For threads which were never active because they were inactive from the start, the corresponding bits 64, 66 are equal to 0 in both the active and pending masks 60, 62. Similarly, if execution of a thread terminates, then both the active and pending mask bits may be 0 for that thread.
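By way of illustration only, the relationship between the two masks can be sketched as follows. This is a minimal sketch under stated assumptions, not the described hardware: the eight-thread warp, the bit ordering (bit i for thread i), the names `WARP_SIZE`, `thread_state` and `covers_warp`, and the example mask values are all hypothetical.

```python
WARP_SIZE = 8                      # assumed: two quads of four threads
ALL_ONES = (1 << WARP_SIZE) - 1    # assumed: bit i corresponds to thread i


def thread_state(active_mask: int, pending_mask: int, thread: int) -> str:
    """Classify one thread from its active and pending mask bits."""
    active = (active_mask >> thread) & 1
    pending = (pending_mask >> thread) & 1
    if active and pending:
        # Assumption: a pending thread is by definition not active.
        raise ValueError("a thread cannot be both active and pending")
    if active:
        return "active"
    if pending:
        return "pending"
    return "inactive or terminated"


def covers_warp(active_mask: int, pending_mask: int) -> bool:
    """True when the bitwise OR of the masks has every bit set."""
    return (active_mask | pending_mask) == ALL_ONES


# All eight threads allocated; threads 0 and 5 take one branch.
full_active, full_pending = 0b00100001, 0b11011110
# Threads 6 and 7 inactive from the start; threads 0 and 5 take one branch.
part_active, part_pending = 0b00100001, 0b00011110
```

In the first case the OR of the masks is all ones; in the second it is not, because the bits for the threads that were never allocated remain 0 in both masks.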

As shown in Figure 6, the active mask 60 and the pending mask 62 may be set as a result of program counter voting performed by program counter voting logic 72 based on the individual thread program counters 58 of each thread of the warp. Once the new value for the warp program counter 44 has been determined, the program counter voting logic 72 may set the bits of the active mask 60 to 1 for any previously active or pending threads whose thread program counter 58 matches the warp program counter 44, and may set the bits of the pending mask 62 to 1 for any previously active or pending threads whose thread program counters 58 do not match the newly determined value of the warp program counter 44.
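The mask update performed after voting can be sketched as follows. This is an illustrative sketch only: the function name `vote_masks`, the bit ordering (bit i for thread i) and the example program counter values are assumptions, and the selection of the warp program counter itself is not modeled, so `warp_pc` is simply taken as an input.

```python
def vote_masks(thread_pcs, warp_pc, active_mask, pending_mask):
    """Recompute (active, pending) masks once the warp PC has been chosen.

    Only threads that were previously active or pending take part. Threads
    with both bits clear (never allocated, or terminated) stay inactive.
    """
    participating = active_mask | pending_mask
    new_active = 0
    new_pending = 0
    for thread, pc in enumerate(thread_pcs):
        if not (participating >> thread) & 1:
            continue  # inactive or terminated thread: leave both bits at 0
        if pc == warp_pc:
            new_active |= 1 << thread   # thread PC matches the warp PC
        else:
            new_pending |= 1 << thread  # thread waits for its own path
    return new_active, new_pending


# Hypothetical divergence: threads 0 and 5 branch to 0x40, the rest to 0x80.
diverged_pcs = [0x40, 0x80, 0x80, 0x80, 0x80, 0x40, 0x80, 0x80]
active, pending = vote_masks(diverged_pcs, 0x40, 0xFF, 0x00)

# Later all threads reach the same address again and reconverge.
active2, pending2 = vote_masks([0xC0] * 8, 0xC0, active, pending)
```

Passing the previous masks in makes it explicit that a thread which was neither active nor pending before the vote cannot become active as a result of it.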

Figure 10 shows an example of clause-based execution, which may be performed by the warp processing unit 40. The program counter voting operation can be relatively expensive in terms of power consumption, because multiple thread program counters 58 may need to be compared to determine the value to which the warp program counter 44 is to be set. In practice, large portions of a shader program may execute sequentially without any conditional branch instruction, and in these cases the warp program counter 44 can simply be incremented from cycle to cycle, because there is no risk of threads converging or diverging during these sequential runs of program instructions. In this context, sequential instructions are instructions stored at consecutive (contiguous and successive) memory addresses which follow one another at some predetermined stride interval.

To avoid the need to invoke the program counter voting logic 72 after every instruction, the instructions can be grouped by a programmer or compiler into clauses 80 as shown in Figure 10. Each clause can be marked in some way within the program code, e.g. by preceding the instructions of a given clause with a clause header 82. It will be appreciated that in other examples the clauses could instead be delimited by a clause footer which follows the corresponding clause, or by both a clause header and a clause footer, or by some other technique such as annotating the first instruction of the clause in some way or using a predetermined type of instruction (e.g. a branch) to mark the boundary between clauses. As shown in Figure 10, the length of each clause may vary depending on the locations of the branches within the program being executed, and the program may include any number of clauses. Regardless of how the clause boundaries are identified in the program code fetched by the instruction fetch logic 56, by dividing the execution of instructions into clauses in this way, the programmer or compiler can choose the locations of the clause boundaries to correspond to the branch instructions which may lead to non-sequential changes of program flow, so that within a clause execution of instructions proceeds entirely sequentially. This means that the program counter voting logic 72 only needs to be invoked at the end of a clause, to determine the warp program counter to be used for the next clause. Within a clause, the warp program counter 44 can instead simply be incremented by an adder 86 based on a certain stride value corresponding to the length of the instructions, which consumes much less power than the program counter voting operation. Since program counter voting is performed at clause boundaries, the active mask 60 and the pending mask 62 cannot change in the middle of a clause, but can only be updated at the end of a clause before the start of the next clause.
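The saving from clause-based execution can be illustrated with a short sketch that counts how often each mechanism is used. It is a simplification under stated assumptions: the stride of 4 bytes, the name `run_clauses`, and the fall-through outcome of every vote are hypothetical, and clause headers and branch targets are not modeled.

```python
STRIDE = 4  # assumed instruction length in bytes


def run_clauses(clause_lengths, start_pc=0):
    """Walk a sequence of clauses, counting adder increments and votes.

    Each entry of `clause_lengths` is the number of instructions (>= 1) in
    one clause. Inside a clause the warp PC is advanced by the adder; at
    the end of each clause one program counter vote takes place.
    """
    pc = start_pc
    increments = 0
    votes = 0
    for length in clause_lengths:
        for _ in range(length - 1):
            pc += STRIDE          # cheap path: adder inside the clause
            increments += 1
        votes += 1                # expensive path: once per clause boundary
        pc += STRIDE              # assumed fall-through result of the vote
    return pc, increments, votes


final_pc, increments, votes = run_clauses([3, 5, 2])
```

For these three clauses holding ten instructions in total, the sketch performs three votes, where voting after every instruction would require ten.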

In some implementations, clause-based execution of this kind may also allow other performance or energy-saving optimizations within the pipelines 50 or in accesses to the architectural state registers 54. For example, a clause may be treated as an atomic set of operations which are generally performed as a whole. Hence, at an intermediate point of the clause, the thread state 52 within the architectural state registers 54 for a given thread is not guaranteed to be consistent. For example, to reduce the overhead of writing to and reading from the architectural state registers 54, where two instructions issued in succession within the same clause are such that the first instruction writes to a given register and the next instruction reads from the same register, and that register will then be overwritten by a subsequent instruction of the clause, the register write may not need to be performed at all, since the value to be read by the second instruction can simply be retained within the pipeline after the first instruction executes and forwarded directly to the inputs of the processing element performing the second instruction. By reducing the number of register reads and writes required, this can improve performance and save energy. Hence, the correctness of the current architectural state stored in the registers may not be guaranteed partway through a clause.
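One possible way to identify such skippable register writes is sketched below. This is a hypothetical illustration, not the described implementation: the instruction encoding as `(dest, sources)` tuples, the register names, the function name `elidable_writes`, and the extra condition that no other instruction reads the register before it is overwritten are all assumptions.

```python
def elidable_writes(clause):
    """Return indices of instructions whose register write can be skipped.

    `clause` is a list of (dest, sources) tuples. A write is treated as
    skippable when the value is consumed by the immediately following
    instruction (so it can be forwarded inside the pipeline) and the
    register is overwritten later in the same clause before any other
    instruction reads it.
    """
    skippable = []
    for i, (dest, _) in enumerate(clause):
        # The next instruction must read the value for forwarding to apply.
        if i + 1 >= len(clause) or dest not in clause[i + 1][1]:
            continue
        # The consumer may itself overwrite the register.
        if clause[i + 1][0] == dest:
            skippable.append(i)
            continue
        for later_dest, later_sources in clause[i + 2:]:
            if dest in later_sources:
                break                 # a later read needs the stored value
            if later_dest == dest:
                skippable.append(i)   # overwritten before any other read
                break
    return skippable


example_clause = [
    ("r0", ("r1", "r2")),   # writes r0
    ("r3", ("r0",)),        # reads r0 straight away (forwardable)
    ("r0", ("r4",)),        # overwrites r0
    ("r5", ("r3", "r0")),   # reads the new r0; clause ends afterwards
]
skipped = elidable_writes(example_clause)
```

Only the first write to `r0` is skippable here. The second write to `r0` is kept because the clause ends without overwriting it, so the value may be needed by a later clause.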

As shown in Figure 6, the warp processing unit may have discard logic 90 for determining whether any of the threads processed by the warp can be discarded. There can be a number of reasons why threads may be discarded. In some cases the pixel termination stage 24 of the pipeline 2 may determine that a thread already in progress within the fragment shading stage 26 would be overdrawn by a later received fragment which follows the earlier fragment through the pipeline, and in this case it may issue a termination signal 92, which may be forwarded by the warp manager to the relevant warp processing unit 40 handling the processing of the thread for that fragment. The discard logic 90 may compare the pixel position coordinates identified by the termination signal 92 with the coordinates specified in warp state information 94, which indicates the pixel positions corresponding to the fragments processed by each thread of the warp, in order to determine whether the termination signal applies to the current warp and, if so, to which thread.
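The coordinate comparison can be sketched as a simple lookup. This is an illustration only: the representation of the warp state information as a list of `(x, y)` tuples indexed by thread, the name `match_termination` and the example coordinates are assumptions.

```python
def match_termination(signal_xy, thread_positions):
    """Return the indices of warp threads whose pixel position matches.

    `thread_positions` stands in for the warp state information: entry i
    holds the (x, y) pixel position of the fragment processed by thread i.
    An empty result means the signal does not apply to this warp.
    """
    return [t for t, xy in enumerate(thread_positions) if xy == signal_xy]


# Hypothetical warp holding two horizontally adjacent 2x2 quads.
positions = [
    (10, 20), (11, 20), (10, 21), (11, 21),   # quad 0
    (12, 20), (13, 20), (12, 21), (13, 21),   # quad 1
]
hit = match_termination((11, 21), positions)
miss = match_termination((0, 0), positions)
```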

Also, in some cases the shader program itself could include a discard instruction on a particular branch of the control flow which, if executed for a given thread, results in that thread being discarded because it has been determined that the thread is no longer required. Hence, if the instruction fetch and decode stage 56 encounters a discard instruction, this can be signaled to the discard logic 90 and, in combination with the active mask 60, this can determine which threads should be discarded (the pending threads indicated by the pending mask 62 would not be discarded since, not being currently active, they would bypass the discard instruction).

Regardless of the reason for discarding a given thread, the warp processing unit 40 may manage the discarding of threads at the granularity of quads rather than individual threads. Hence, either an entire quad of threads is discarded or all of those threads continue. This is because the threads of a given quad may have interdependencies in which one thread of the quad refers to values calculated by another thread of the quad, which can be useful for computing derivative functions used to determine gradients to be rendered in the image. Hence, even if one thread in the quad is to be discarded, if other threads in the quad have not yet been discarded then the quad as a whole may continue.
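The quad-granularity rule can be sketched as a filter over a per-thread mask of discard requests. This is a minimal sketch under stated assumptions: the bit ordering (bits 0 to 3 for the first quad, bits 4 to 7 for the second), the names `QUAD_SIZE` and `quads_to_discard`, and the assumption that all four threads of each quad are allocated are hypothetical.

```python
QUAD_SIZE = 4  # a quad is a 2x2 block of pixels, one thread per pixel


def quads_to_discard(thread_discard_requests, num_quads=2):
    """Keep only the request bits of quads in which every thread is flagged.

    `thread_discard_requests` has bit i set when thread i is to be
    discarded. A quad is discarded as a whole or not at all.
    """
    result = 0
    for quad in range(num_quads):
        quad_bits = ((1 << QUAD_SIZE) - 1) << (quad * QUAD_SIZE)
        if (thread_discard_requests & quad_bits) == quad_bits:
            result |= quad_bits
    return result


partial = quads_to_discard(0b00001011)   # three threads of quad 0 flagged
one_quad = quads_to_discard(0b00101111)  # all of quad 0, one thread of quad 1
```

In the first call no quad is discarded, because one thread of the first quad may still need values from its neighbors. In the second call only the first quad is discarded.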

If all threads in the entire warp (i.e. all quads) are to be discarded, then processing of instructions by each of the threads can simply be terminated immediately, regardless of whether the end of a clause has been reached. This can free up the warp processing unit 40 to handle other quads sooner.

However, if only one quad (or two or more quads, but fewer than all of the quads, in cases where the warp processing unit processes more than two quads) is to be discarded, then execution of instructions for the non-discarded quads should continue.

Since execution of instructions for the warp as a whole is controlled based on a common warp program counter 44, it would not be possible to fully reallocate the warp to different quads until the continuing quads have completed.

Hence, instead, when all threads of a given quad are to be discarded but another quad continues, those threads can be marked as inactive by clearing the corresponding bits in both the active and pending masks 60, 62. For example, each execution pipeline 50 may include power gates which isolate portions of the pipeline from a power supply or a clock supply, so that they can be placed in a power saving state to save energy when the corresponding threads become inactive or are terminated. However, because of the clause-based execution described with respect to Figure 10, the active and pending masks 60, 62 may not be allowed to change partway through a clause, and so when the discard logic 90 determines partway through a clause that all threads of a given quad are to be discarded, the corresponding instructions may nevertheless need to be executed until the end of the clause. Since some clauses may be relatively long, this may take some time, and in the meantime the threads may have some effect on parts of the graphics processing pipeline outside the warp processing unit 40. For example, as shown in Figure 5, each warp processing unit 40 may have a message passing interface 100 for passing messages to other elements of the pipeline or to memory, to enable access to data required by a given thread processed by the warp processing unit 40 (data other than the architectural state 52 of the thread, which is stored in the architectural state registers 54 of the warp processing unit 40). For example, the messages 100 may be load/store requests for performing load or store operations on data within the attribute store 13, the texture buffer 27, the depth buffer 22 or the tile buffer 20 shown in Figure 1. The messages could also be general load/store operations performed on main memory. If, in the period between determining that the threads of a given quad should be discarded and the time when the threads are actually suppressed at the end of a clause, such messages 100 are triggered by instructions of the discarded threads, this would trigger unnecessary load or store operations to data storage elements outside the warp processing unit. As well as wasting power in reading or writing these storage locations, this can affect the performance of other threads which may be waiting for bandwidth in the memory or buffer being accessed.

Hence, in one example the warp processing unit 40 may support placing threads in a discarded state in which execution of instructions still continues, but generation of the messages 100 which request access to data other than the architectural state of the warp processing unit is suppressed, to save energy and improve the performance of other threads. A discard mask 102 may track which threads are in the discarded state, and corresponding bits of the discard mask may be provided to the respective pipelines 50 to control whether they trigger generation of the messages 100. Once the end of the current clause is reached, any threads which were placed in the discarded state within the clause can then be transitioned to the inactive state by clearing the corresponding bits of the active and pending masks 60, 62, so that for subsequent clauses the power gates can be used to suppress execution of instructions and save more energy. Nevertheless, in the period up to the end of the clause in which the discarded threads were identified, the use of the discard mask and the suppression of messages enables greater energy savings and improved performance.
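The life cycle of the discarded state can be sketched as follows. This is an illustrative software model only, under stated assumptions: the class name `WarpMasks`, its method names, the bit ordering (bit i for thread i) and the choice to clear the discard bits at the clause boundary are hypothetical, although the latter is consistent with an inactive thread having a discard mask bit of 0.

```python
class WarpMasks:
    """Toy model of the active, pending and discard masks of one warp."""

    def __init__(self, active, pending):
        self.active = active
        self.pending = pending
        self.discard = 0

    def mark_discarded(self, quad_bits):
        """Mid-clause: enter the discarded state.

        The active and pending masks are left unchanged, so instructions
        keep executing until the end of the clause.
        """
        self.discard |= quad_bits

    def may_send_message(self, thread):
        """Data access messages are generated only by active threads that
        are not in the discarded state."""
        bit = 1 << thread
        return bool(self.active & bit) and not (self.discard & bit)

    def end_of_clause(self):
        """Clause boundary: discarded threads become inactive."""
        self.active &= ~self.discard
        self.pending &= ~self.discard
        self.discard = 0


masks = WarpMasks(active=0xFF, pending=0x00)
masks.mark_discarded(0x0F)                 # first quad discarded mid-clause
mid_clause_active = masks.active           # still 0xFF until the boundary
suppressed = not masks.may_send_message(2)
allowed = masks.may_send_message(5)
masks.end_of_clause()
```

Keeping the discard mask separate from the active mask is what allows message suppression to take effect immediately while the active and pending masks remain stable within the clause.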

In another example, the warp manager 42 may have a pixel dependency tracker 106 which tracks the pixel positions of the quads/fragments currently being processed within each of the warp processing units 40. In some examples, the pixel dependency tracker 106 could instead be implemented in a different part of the pipeline, outside the warp manager 42. When a thread for a given pixel position is in flight within one of the warp processing units 40, the warp manager 42 may prevent any further threads being allocated for another graphics fragment corresponding to the same pixel position. This means that while threads are in flight, other threads for the same pixel position may be delayed until those threads complete. To speed up processing of such other threads, when a quad is discarded and the corresponding threads are placed in the discard state, those threads can be removed from the pixel dependency tracker 106 so that they are no longer indicated as being in flight, and any subsequent thread processing fragments at the same pixel position within the final image frame can then proceed and be allocated to a given warp processing unit 40. Again, this improves performance by allowing other quads to be processed earlier than would be possible if they had to wait for the warp to complete.
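
The early release of pixel dependencies can be illustrated with a toy software model. The class, its method names and the dictionary-based bookkeeping are all assumptions made for illustration; the description does not specify how the pixel dependency tracker 106 is organized internally.

```python
class PixelDependencyTracker:
    """Toy model that records which pixel positions have a thread in flight."""

    def __init__(self):
        self._in_flight = {}  # (x, y) -> id of the thread currently in flight

    def try_allocate(self, pos, thread_id):
        """Allocate a thread unless another thread at the same position is in flight."""
        if pos in self._in_flight:
            return False  # the new thread has to wait
        self._in_flight[pos] = thread_id
        return True

    def release(self, pos, thread_id):
        """Called on normal completion, or early when the thread enters the discard state."""
        if self._in_flight.get(pos) == thread_id:
            del self._in_flight[pos]


tracker = PixelDependencyTracker()
first = tracker.try_allocate((3, 5), thread_id=0)    # accepted
blocked = tracker.try_allocate((3, 5), thread_id=1)  # same pixel position still in flight
tracker.release((3, 5), thread_id=0)                 # thread 0's quad was discarded
retry = tracker.try_allocate((3, 5), thread_id=1)    # allowed without waiting for the warp
```

The point of the sketch is the ordering: the release happens when the thread is discarded, not when its warp completes, so the second allocation succeeds earlier.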

Figure 11 provides a table summarizing the various states in which the threads of a given warp can be placed. In an inactive or terminated state, the bits of both the active and pending masks 64, 66 may be 0 and the bit of the discard mask 102 may also be 0. The inactive state may be used where a given thread was allocated to the warp processing unit 40 in the inactive state, or where a thread was previously active but has been terminated, for example following discard of the corresponding thread once the end of the clause in which the discard was detected has been reached. When a thread is in the inactive or terminated state, the corresponding processing pipeline 50 may use power gating to suppress instruction execution and save energy.

In an active state, the thread's active mask bit 64 is 1, the corresponding pending mask bit 66 is 0 and the discard mask bit is 0, and in this case the instruction fetched and decoded by the fetch/decode stage 56 is executed for the thread. In a pending state, the active mask bit 64 is 0, the pending mask bit 66 is 1 and the corresponding discard mask bit 102 is 0. In this case the thread remains pending, and so may become active again later depending on the program counter voting 72; execution of the instructions fetched by the fetch/decode stage 56 is suppressed for that thread, although the corresponding thread state is retained in the registers 54 ready for when the thread becomes active again.

In the discard state, the bit of the discard mask 102 for that thread is 1, and the active and pending mask bits 64, 66 may take any value. In this case, execution of instructions may continue in the same way as if the thread were active, since the active mask 60 cannot be changed partway through a clause. However, the discard mask bit controls the relevant thread's pipeline 50 either to suppress generation of messages 100 for accessing data outside the warp processing unit 40, or to remove the thread from the pixel dependency tracker 106, or both. This enables energy savings and improved performance even where it is not yet possible to suppress the actual execution of the thread's instructions.
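
The four states described for Figure 11 can be written out as a small decoder. This is a sketch under stated assumptions: the names `ThreadState`, `decode_state`, `executes_instructions` and `may_generate_messages` are hypothetical, and the last helper models the variant in which message suppression is applied to discarded threads.

```python
from enum import Enum


class ThreadState(Enum):
    INACTIVE = "inactive or terminated"
    ACTIVE = "active"
    PENDING = "pending"
    DISCARD = "discard"


def decode_state(active_bit: int, pending_bit: int, discard_bit: int) -> ThreadState:
    """Map one thread's three mask bits to its state."""
    if discard_bit:
        return ThreadState.DISCARD  # active and pending bits may take any value
    if active_bit and pending_bit:
        raise ValueError("active=1, pending=1 is not among the described states")
    if active_bit:
        return ThreadState.ACTIVE
    if pending_bit:
        return ThreadState.PENDING
    return ThreadState.INACTIVE


def executes_instructions(state: ThreadState) -> bool:
    """Instructions run for active threads and keep running for discarded ones."""
    return state in (ThreadState.ACTIVE, ThreadState.DISCARD)


def may_generate_messages(state: ThreadState) -> bool:
    """Only active, non-discarded threads may trigger data access messages."""
    return state is ThreadState.ACTIVE
```

The two helper predicates capture the key asymmetry: a discarded thread still executes until the end of the clause, but it no longer causes traffic outside the warp processing unit.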

Figure 12 is a flow diagram illustrating a method of processing threads in the warp processing unit 40. It will be appreciated that this shows the steps taken by a single warp processing unit 40, but other warp processing units 40 could perform similar operations in parallel.

At step 200 the warp processing unit starts executing instructions from the next clause of the program being executed. If no previous clauses have been executed, this is the first clause of the program. At step 202 the discard logic 90 determines whether all active threads in the entire warp are to be discarded and there are no pending threads. If so, then at step 204 the entire warp is terminated, which frees the warp processing unit 40 for reallocation to other quads by the warp manager 42. The method then ends.
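
The warp-wide check of step 202 reduces to a simple mask test. The sketch below assumes integer bitmasks with one bit per thread; `to_discard_mask`, marking the threads determined to be discarded, is a hypothetical name introduced for illustration.

```python
def whole_warp_can_terminate(active_mask: int, pending_mask: int, to_discard_mask: int) -> bool:
    """Step 202: every active thread is to be discarded and no thread is pending."""
    surviving_active = active_mask & ~to_discard_mask
    return pending_mask == 0 and surviving_active == 0


# Hypothetical 4-thread examples.
all_gone = whole_warp_can_terminate(0b1111, 0b0000, 0b1111)      # every active thread discarded
one_left = whole_warp_can_terminate(0b1111, 0b0000, 0b0111)      # thread 3 continues
has_pending = whole_warp_can_terminate(0b0011, 0b1100, 0b0011)   # pending threads may resume later
```

Only the first case would lead to step 204; in the other two the method proceeds to the per-quad checks.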

If there is at least one active thread in the warp which has not been discarded, or at least one pending thread which has not been discarded, then at step 206 the discard logic 90 determines whether all threads in a given quad within the warp are to be discarded. These may include either active or pending threads. If all four threads of the same quad are to be discarded (and, because of the NO determination at step 202, there is at least one other quad which has not been discarded), then at step 208 all threads of that quad are transitioned to the discard state by setting the corresponding bits in the discard mask 102 to 1. This means that for the remainder of the clause these threads cannot trigger generation of messages 100 and/or can be removed from the pixel dependency tracker 106.

On the other hand, if it is determined at step 206 that not all threads of the quad are to be discarded, then at step 210 it is determined whether all active threads of the quad are to be discarded and there are no pending threads in that quad. For example, this may be determined based on the active mask 60 and the pending mask 62, such that all threads with a 1 in the active mask are to be discarded and there are no bits set to 1 for that quad in the pending mask 62. If these criteria are satisfied, then again at step 208 all threads of the quad are transitioned to the discard state, or at least all active threads of the quad are transitioned to the discard state (since pending threads will not execute instructions in this clause, they cannot trigger message generation anyway).
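
The per-quad decisions of steps 206, 210 and 208 can be sketched as below. This assumes integer bitmasks in which quad q occupies bits 4q to 4q+3; the function names and `to_discard_mask` are hypothetical. The sketch also assumes it is only called after step 202 has answered NO, and it requires at least one active thread in a quad before applying the step 210 rule, which is an assumption added to avoid marking an already inactive quad.

```python
QUAD_SIZE = 4  # four threads per quad


def quad_bits(quad_index: int) -> int:
    """Bitmask covering the four threads of one quad (assumed contiguous layout)."""
    return 0b1111 << (QUAD_SIZE * quad_index)


def update_discard_mask(active_mask, pending_mask, to_discard_mask, discard_mask, num_quads):
    """Return the discard mask after applying steps 206, 210 and 208 to each quad."""
    for q in range(num_quads):
        bits = quad_bits(q)
        # Step 206: are all four threads of the quad to be discarded?
        all_four = (to_discard_mask & bits) == bits
        # Step 210: are all active threads to be discarded, with none pending?
        active = active_mask & bits
        all_active_no_pending = (
            active != 0
            and (active & ~to_discard_mask) == 0
            and (pending_mask & bits) == 0
        )
        if all_four or all_active_no_pending:
            discard_mask |= bits  # step 208: whole quad enters the discard state
    return discard_mask


# Hypothetical 8-thread warp: quad 0 has three active threads, all to be discarded.
result = update_discard_mask(0b1111_0111, 0b0000_0000, 0b0000_0111, 0, num_quads=2)
```

In this example quad 0 fails the step 206 test (thread 3 is not marked) but passes the step 210 test, so all four of its discard bits are set while quad 1 is left untouched.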

At step 212 it is determined whether the end of the clause has been reached, and if not, the method returns to step 202 to continue checking for discarded threads. When the end of the clause is reached, at step 214 it is determined whether there are any further clauses to execute. If there are no further clauses, then at step 216 the warp is terminated and the results of the warp are forwarded to subsequent pipeline stages for late depth testing, alpha blending and so on.

If there is at least one more clause to execute, then at step 216 any threads which were transitioned to the discard state during execution of the clause that has just finished are transitioned to the terminated state, so that instruction execution is suppressed during the subsequent clause to save more energy. At step 218 the warp program counter 44 is determined by the program counter voting logic 72 based on the individual thread program counters 58, and at step 220 the active and pending masks 60, 62 are updated based on the result of the program counter vote. The method then returns to step 200 to begin the next clause.

In summary, when one quad is fully discarded, the warp can continue executing while another quad is not discarded, and energy can be saved and performance improved by partially terminating the warp and releasing pixel dependencies. A quad is suppressed if all four threads of the quad are discarded, or if all active threads of the quad are discarded and the threads are non-divergent (i.e. there are no pending threads). If these conditions hold for a quad, messages can be suppressed to save energy; at the end of the clause, a mask can be enabled and the quad terminated to save significant energy through power gating; and the quad can be removed from the pixel dependency system rather than waiting for the warp to fully complete, improving performance by allowing other quads at that pixel position to be issued.

In the present application, the words "configured to..." are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a "configuration" means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. "Configured to" does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.

Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.

Claims


1. An apparatus for processing graphics, comprising:

a warp processing unit to process a plurality of threads of processing on respective graphics fragments;

wherein the warp processing unit is configured to control, in dependence on a warp program counter shared between the plurality of threads, fetching of a next instruction to be executed for at least some of the plurality of threads;

the warp processing unit comprises registers to store architectural state data for the plurality of threads;

in response to a determination that a given subset of threads is to be discarded when at least one other subset of threads of the plurality of threads is to continue, the warp processing unit is configured to process the given subset of threads in a discard state; and

for a thread processed in the discard state, the warp processing unit is configured to continue execution of instructions for the discarded thread, and at least one of:

the warp processing unit is configured to suppress generation of data access messages triggered by the discarded thread, said data access messages comprising messages requesting access to data other than said architectural state data stored in the registers of the warp processing unit; and

the graphics processing apparatus is configured to enable at least one processing operation, which would be deferred until completion of the discarded thread had the thread not been discarded, to be commenced independently of an outcome of the discarded thread.

2. The apparatus according to claim 1, wherein the plurality of threads comprises at least two groups of threads, and the warp processing unit is configured to prevent a thread of a given group from switching from a non-discard state to the discard state partway through execution of that thread when at least one other thread in the given group is to continue in the non-discard state.

3. The apparatus according to claim 2, wherein each group of threads comprises four threads corresponding to a 2x2 quad of graphics fragments.

4. The apparatus according to any one of claims 2 and 3, wherein the warp processing unit is configured to transition the threads of a given group from the non-discard state to the discard state in response to a determination that all threads of the given group are to be discarded while the threads of at least one other group are to continue.

5. The apparatus according to any preceding claim, wherein the warp processing unit is configured to maintain an active mask indicative of which threads of the plurality of threads are active threads which are to execute the next instruction fetched in dependence on the warp program counter, and a pending mask indicative of which threads of the plurality of threads were previously active but are inactive due to divergence between the control flow followed by respective threads of the plurality of threads.

6. The apparatus according to any one of claims 2 to 4, wherein the warp processing unit is configured to maintain an active mask indicative of which threads of the plurality of threads are active threads which are to execute the next instruction fetched in dependence on the warp program counter, and a pending mask indicative of which threads of the plurality of threads were previously active but are inactive due to divergence between the control flow followed by respective threads of the plurality of threads, and

the warp processing unit is configured to transition all active threads indicated by the active mask for a given group from the non-discard state to the discard state in response to a determination that all active threads of the given group are to be discarded, when the pending mask indicates that there are no pending threads for the given group and the threads of at least one other group are to continue.

7. The apparatus according to any preceding claim, wherein the warp processing unit is responsive to clauses of instructions within a common program executed for the plurality of threads, to execute each clause of instructions as a series of instructions with sequential control flow.

8. The apparatus according to claim 7, wherein the warp processing unit is incapable of updating an active mask partway through processing of a given clause, the active mask indicating which threads of the plurality of threads are active threads which are to execute the next instruction fetched in dependence on the warp program counter.

9. The apparatus according to any one of claims 7 and 8, wherein the warp processing unit is capable of switching a thread from a non-discard state to the discard state partway through processing of a given clause.

10. The apparatus according to claim 9, wherein in response to completion of processing of the given clause when a given thread has been transitioned from the non-discard state to the discard state partway through processing of the given clause, the warp processing unit is configured to transition the given thread to a terminated state,

wherein the warp processing unit is configured to suppress execution of instructions for threads in the terminated state.

11. The apparatus according to any one of claims 7 to 10, wherein in response to a determination that all threads of the plurality of threads processed by the warp processing unit are to be discarded, the warp processing unit is configured to terminate processing of the plurality of threads partway through processing of a current clause.

12. The apparatus according to any preceding claim, wherein the warp processing unit is configured to process a helper thread in the discard state from the start of processing of the helper thread.

13. The apparatus according to any preceding claim, wherein the warp processing unit is configured to determine that a given thread is to be discarded in response to a termination signal indicating that the graphics fragment corresponding to the given thread will be hidden in a rendered image by another graphics fragment processed by the apparatus.

14. The apparatus according to any preceding claim, wherein the warp processing unit is configured to determine that a given thread is to be discarded in response to execution of a discard instruction for the given thread.

15. The apparatus according to any preceding claim, wherein said data access messages comprise requests to load data from, or store data to, at least one of:

a tile cache or a frame cache for storing previously calculated pixel values for at least a portion of a rendered image frame;

a depth buffer for storing depth values for the pixels of at least a portion of the rendered image frame;

a texture cache for storing texture data referenced by threads of processing performed by the warp processing unit; and

a feature store to store features calculated for a given graphics fragment before a corresponding thread of processing is issued to the warp processing unit.

16. The apparatus according to any preceding claim, wherein said at least one processing operation comprises a thread of processing performed for another graphics fragment at the same position in a rendered image as the graphics fragment corresponding to the thread in the discard state.

17. The apparatus according to any preceding claim, comprising a plurality of said warp processing units, each having a separate warp program counter.

18. A graphics processing apparatus, comprising:

means for processing a plurality of threads of processing on respective graphics fragments;

wherein the means for processing is configured to control, in dependence on a warp program counter shared between the plurality of threads, fetching of a next instruction to be executed for at least some threads of the plurality of threads;

the means for processing comprises means for storing architectural state data for the plurality of threads;

in response to a determination that a given subset of threads is to be discarded when at least one other subset of threads of the plurality of threads is to continue, the means for processing is configured to process the given subset of threads in a discard state, and for a thread processed in the discard state, the means for processing is configured to continue execution of instructions for the discarded thread, and at least one of:

the means for processing is configured to suppress generation of data access messages triggered by the discarded thread, said data access messages comprising messages requesting access to data other than said architectural state data stored in the means for storing of the means for processing; and

the graphics processing apparatus is configured to enable at least one processing operation, which would be deferred until completion of the discarded thread had the thread not been discarded, to be commenced independently of an outcome of the discarded thread.

19. A graphics processing method, comprising:

processing a plurality of threads of processing on respective graphics fragments using a warp processing unit configured to control, in dependence on a warp program counter shared between the plurality of threads, fetching of a next instruction to be executed for at least some threads of the plurality of threads, the warp processing unit comprising registers for storing architectural state data for the plurality of threads; and

in response to a determination that a given subset of threads is to be discarded when at least one other subset of threads of the plurality of threads is to continue, processing the given subset of threads in a discard state using the warp processing unit;

wherein for a thread processed in the discard state, the warp processing unit continues execution of instructions for the discarded thread, and at least one of:

the warp processing unit suppresses generation of data access messages triggered by the discarded thread, said data access messages comprising messages requesting access to data other than said architectural state data stored in the registers of the warp processing unit; and

at least one processing operation, which would be deferred until completion of the discarded thread had the thread not been discarded, is enabled to be commenced independently of an outcome of the discarded thread.