GB2522910A - Thread issue control - Google Patents
Thread issue control
- Publication number
- GB2522910A (application GB1402259.4A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- threads
- sequence
- pilot
- issue
- thread
- Prior art date
- Legal status: Granted (the status listed is an assumption by Google Patents, not a legal conclusion)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/3867—Concurrent instruction execution using instruction pipelines
- G06F9/3869—Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
- G06F12/0855—Overlapped cache accessing, e.g. pipeline
- G06F9/3816—Instruction alignment, e.g. cache line crossing
- G06F9/3842—Speculative instruction execution
- G06F9/3851—Instruction issuing from multiple instruction streams, e.g. multistreaming
- G06F9/3856—Reordering of instructions, e.g. using queues or age tags
- G06F2212/455—Caching of image or video data
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
Abstract
Disclosed is a data processing system 2 with a processing pipeline 8 for the parallel execution of threads within a predetermined logical sequence and an issue controller 14 that issues threads to the processing pipeline. The issue controller 14 issues the threads in accordance with both a pilot sequence, which is a proper subset of the logical sequence, and a main sequence that trails the pilot sequence. The main sequence trails the pilot through the logical sequence by a delay time and comprises those threads that are not within the pilot sequence. The system may have a cache memory 10 coupled to the processing pipeline to store data values fetched from main memory, and a stall manager 12 that controls the stalling and un-stalling of threads when a cache miss occurs within the cache memory. The delay time may correspond to the latency associated with a cache miss. The threads may be arranged in groups corresponding to blocks of pixels for processing within a graphics processing unit.
Description
THREAD ISSUE CONTROL
This invention relates to the field of data processing systems. More particularly, this invention relates to the control of thread issue into a processing pipeline within a data processing system.
It is known to provide data processing systems having processing pipelines which can execute a plurality of threads in parallel. As an example, the threads may correspond to different fragments of an image to be generated within a graphics processing system. The use of deep pipelines supporting multiple threads in parallel execution enables a high level of data throughput to be achieved.
One problem associated with such systems is the latency associated with fetching from main memory data required to be accessed during processing. It is known to provide cache memories close to the processing pipeline in order to provide rapid and low energy access to data to be processed. However, data needs to be moved between the cache memory and the main memory as the cache memory has insufficient capacity to hold all of the data which may be required. When a thread makes an access to a data value which is not held within the cache memory, then a cache miss arises and the cache line containing that data value is fetched from the main memory. The time taken to service such a cache miss may be many hundreds of clock cycles and the thread which triggered the cache miss is stalled (parked) during such a miss until the required data is returned. It is known to provide data processing pipelines with the ability to manage stalled threads in this way and still make forward progress with threads which are not stalled.
In order that the system should operate efficiently, it is desirable that the capacity to deal with stalled threads should not be exceeded. Conversely, the overhead associated with managing stalled threads is not insignificant and accordingly it is undesirable to provide an excess of this capacity. Furthermore, it is desirable that not too much of the processing capability of the processing pipeline should be stalled at any given time as a consequence of threads awaiting data for which a cache miss has occurred.
Viewed from one aspect the present invention provides apparatus for processing data comprising: a processing pipeline configured to execute in parallel a plurality of threads within a predetermined logical sequence of threads to be executed; and an issue controller configured to issue threads to said processing pipeline for execution; wherein said issue controller is configured to select threads from said predetermined logical sequence for issue in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
The present technique recognises that the threads to be processed will have a predetermined logical sequence in which the program or upstream hardware will order them as becoming eligible for issue to the processing pipeline. Conventionally the threads are then issued in this predetermined logical sequence. This predetermined logical sequence is not speculative, as it is known that the threads concerned are to be executed: the program or hardware has committed these threads for execution. The present technique recognises that the predetermined logical order may result in inefficiencies in the utilisation of the processing pipeline and the system as a whole.
In some embodiments, threads issued in the predetermined logical order may correspond to data accesses which are proximal to each other within the memory address space and accordingly result in a bunching of the cache misses which arise. When a large number of cache misses arise together, then the forward progress made by the processing pipeline slows as a relatively large number of threads are simultaneously stalled awaiting return of the data values for which a cache miss occurred. During such times, the cache memory and the processing pipeline are relatively idle and relatively little forward progress is made in the processing. The present technique recognises this behaviour and provides a system in which the predetermined logical sequence is modified to form both a pilot sequence and a main sequence.
The pilot sequence is formed of threads issued to the processing pipeline ahead of neighbouring threads within the predetermined logical sequence which form part of the main sequence. In some example embodiments, the pilot sequence threads are issued at a time greater than the memory latency for a cache miss ahead of their neighbouring threads within the main sequence, such that if a thread within the pilot sequence triggers a cache miss, then there is a high likelihood that the surrounding data values which may be required by neighbouring threads within the main sequence will have been returned to the cache memory by the time those threads within the main sequence are issued into the processing pipeline. It is expected that the pilot threads will result in a higher proportion of cache misses than the main threads, but that the cache line fills which result from the pilot threads will enable the main threads to more likely proceed without cache misses and associated stalling. The delay time could in other embodiments be less than the latency associated with a cache miss and still give an advantage by at least reducing the waiting for data values that miss.
The pilot threads can be considered as intended to provoke inevitable cache misses which will arise due to execution of the threads within the predetermined logical sequence, but to trigger these cache misses early such that the majority of the threads which will need the data associated with those cache misses will not be stalled (or will be stalled for a shorter time) awaiting the return of that data, as it will already have been fetched (or have started to be fetched) as a consequence of the early execution of the pilot thread. This reordering of the threads from the predetermined logical sequence into the pilot sequence and the main sequence takes place without the need for modification of the program instructions executing or of the upstream hardware systems which create the threads. Furthermore, the early processing and stalling of the pilot threads is not speculative, as those threads are required to be executed and would in any case have resulted in a cache miss. Rather, the reordering of the threads has moved the pilot threads earlier in execution so as to facilitate the execution of following main threads without (or with less) stalling.
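The reordering described above can be sketched in a few lines. This is an illustrative model only, not the patented hardware: the pilot-membership test (every fourth thread) and the thread count are arbitrary assumptions for the example, and the split preserves the original relative order within each sub-sequence, as the text requires.

```python
# Sketch: partition a predetermined logical sequence of thread IDs into a
# pilot sub-sequence (issued early to provoke cache misses) and a main
# sub-sequence (issued after a delay). Both sub-sequences preserve the
# relative order of the original logical sequence.

def split_sequence(logical_sequence, is_pilot):
    """Return (pilot, main) sub-sequences of the logical sequence."""
    pilot = [t for t in logical_sequence if is_pilot(t)]
    main = [t for t in logical_sequence if not is_pilot(t)]
    return pilot, main

threads = list(range(16))                            # predetermined logical sequence
pilot, main = split_sequence(threads, lambda t: t % 4 == 0)   # assumed pilot rule

print(pilot)   # issued first, ahead of their logical-sequence neighbours
print(main)    # trails the pilot sequence by the delay time
```

Note that no thread is dropped or duplicated: the pilot sequence is a proper subset of the logical sequence, and the main sequence is exactly the remainder.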
In some embodiments the predetermined logical sequence may comprise a sequence of groups of threads in which each group of threads comprises a plurality of threads adjacent within the predetermined logical sequence. Division of the predetermined logical sequence into groups matches many real life processing workloads in which groups of threads have a tendency to access data values which are located proximal to each other within the memory address space.
In the context of threads arranged into groups, in some embodiments the pilot sequence and the main sequence may be timed relative to each other such that the next pilot thread to be issued in accordance with the pilot sequence is in a group at least one group ahead of the next main thread to be issued in accordance with the main sequence. Thus, pilot threads are at least one group ahead of the main threads and accordingly will provoke cache misses which will fill the cache memory with data which can then be consumed by the later main threads without cache misses arising. This effectively hides the cache fill latency for the main threads.
In some systems the pilot sequence may extend through a plurality of pilot groups ahead of the next main thread, with decreasing numbers of pilot threads within each group as the issue time moves further ahead of the main thread issue time. This arrangement may be useful in increasing the efficiency of execution of the pilot threads themselves. Particularly early pilot threads may trigger cache misses, translation lookaside buffer updates, first level memory accesses etc. which will then be used by the subsequent pilot threads. The subsequent pilot threads then trigger their own individual cache misses to return data values which are to be used by the main threads.
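The decreasing-density arrangement above can be illustrated with a small helper. The halving schedule and the base count of eight pilot threads per group are invented for the example; the text specifies only that the count decreases with distance ahead of the main issue point.

```python
# Sketch: number of pilot threads per group, decreasing as the group lies
# further ahead of the main thread issue point. The halving schedule is an
# assumption; any monotonically decreasing schedule fits the description.

def pilot_counts(groups_ahead, base=8):
    """Halve the pilot-thread count per group with distance ahead of the
    main issue point, keeping at least one pilot thread per group."""
    return [max(1, base >> d) for d in range(groups_ahead)]

print(pilot_counts(4))   # [8, 4, 2, 1]
```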
In some embodiments, each group of threads may be associated with a block of pixel values within an image and each thread within a group may correspond to processing associated with a pixel value within the block of pixels. The thread may correspond to a fragment to be determined in respect of a pixel, such as a transparency value, depth, colour, etc., which will ultimately contribute to the final pixel value to be generated within an image.
Within the main sequence it is normal to use an interleaved scan pattern for each block of pixel values, as in many cases this is an efficient way of traversing the data to be processed.
The pilot sequence may be selected to have one of a variety of different forms. Some forms are better matched to given patterns of data value accesses to be made within the main memory address space than others. It is desirable that the pilot sequence should be one which would trigger required cache fill operations in a wide variety of circumstances independent of the particular block of pixel values being processed and what it represents. Examples of pilot sequences which may be used include: (i) a diagonal line through each said block of pixels; (ii) a line parallel to one of a row direction and a column direction within each said block of pixels; (iii) clusters of one or more pixels disposed at predetermined positions within an array of possible cluster positions within each said block of pixels, said array of cluster positions comprising cluster lines of adjacent cluster positions disposed parallel to one of a row direction and a column direction of said block of pixels, said array divisible into a plurality of adjacent parallel lines of cluster positions such that (a) within a given line each cluster is separated by three vacant cluster positions from any other nearest neighbour cluster within said given line and (b) each cluster in a neighbouring line adjacent said given line is positioned equidistant from any nearest neighbour cluster in said given line; and (iv) clusters of one or more pixels disposed at predetermined positions within an array of possible cluster positions within each said block of pixels, said clusters disposed within said array of cluster positions such that no cluster shares a cluster row, a cluster column or a cluster diagonal within said array of cluster positions.
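Pattern (i) above, the diagonal pilot sequence, is simple to generate. The helper below is a hypothetical illustration (its name and the block size are not from the text): it yields one pilot position per row and per column of a square block, which is the coverage property the diagonal form provides.

```python
# Sketch of pattern (i): a diagonal pilot sequence through a square block
# of pixels, giving exactly one pilot thread in each row and each column.

def diagonal_pilot(block_size):
    """Return (row, col) pilot positions along the main diagonal."""
    return [(i, i) for i in range(block_size)]

positions = diagonal_pilot(4)
print(positions)   # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Because every row and every column contains a pilot position, the diagonal form triggers any necessary cache fill regardless of whether the block's data is laid out row-major or column-major in memory.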
As previously mentioned, each group of threads may correspond to the partial calculation of values needed to generate a block of pixels. A group of threads may correspond to a layer within a plurality of layers for processing that generates the block of pixel values.
The use of pilot threads ahead of the main threads to trigger early cache misses may be used independently of the grouping of threads and the association of groups of threads with blocks of pixels. In such general circumstances, the pilot threads need not be evenly distributed in time ahead of the main threads and may be arranged such that, as time separation from the main thread issue time increases, the density of the pilot threads decreases, such that a small number of pilot threads are issued very early and these are then followed by a larger number of pilot threads which are closer to the issue point in the main sequence of threads.
The issue controller may store issue queue data identifying the plurality of threads waiting within an issue queue to be executed and select threads for execution following both the main sequence and the pilot sequence in accordance with this issue queue data. At each time, a single thread may be selected for issue to the processing pipeline from either the main sequence or the pilot sequence. The main sequence is followed in order and the pilot sequence is followed in order. The overall order is different from the predetermined logical sequence.
In some embodiments the issue queue data will identify threads within the pilot sequence as having a high priority and threads within the main sequence as having a low priority. Furthermore, threads may be added to the issue queue in the predetermined logical sequence and the issue queue may identify the time at which each thread was added to the issue queue.
Using a combination of time information and priority information within the issue queue data, the issue controller may select a next thread to issue in accordance with a hierarchy in which an oldest low priority thread exceeding a threshold waiting time in the issue queue is selected first, if present; followed by an oldest high priority thread waiting in the issue queue if less than a target number of high priority threads are currently in execution by the processing pipeline, if any; followed by an oldest low priority thread. Selecting in accordance with these rules has the effect of ensuring that not too many high priority threads are in progress simultaneously in a manner which would cause an excess to become stalled, and also that the main thread execution point does not drop too far behind the pilot thread execution point.
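This selection hierarchy can be sketched as a small function. The data-structure shape (a list of records with `priority` and `added` fields), the names `AGE_THRESHOLD` and `target_high`, and the numeric values are all assumptions made for illustration; the patent text specifies only the three-step priority hierarchy itself.

```python
# Sketch of the three-step selection hierarchy: (1) an oldest low-priority
# (main) thread older than the age threshold; else (2) an oldest
# high-priority (pilot) thread, if fewer than target_high pilots are in
# flight; else (3) an oldest low-priority thread.

AGE_THRESHOLD = 100   # assumed threshold waiting time, in arbitrary ticks

def select_next(queue, now, high_in_flight, target_high):
    """Return the next queued thread record to issue, or None if empty."""
    low = sorted((t for t in queue if t["priority"] == "low"),
                 key=lambda t: t["added"])
    high = sorted((t for t in queue if t["priority"] == "high"),
                  key=lambda t: t["added"])
    if low and now - low[0]["added"] > AGE_THRESHOLD:
        return low[0]                       # step 1: aged-out main thread
    if high and high_in_flight < target_high:
        return high[0]                      # step 2: pilot, under the cap
    return low[0] if low else (high[0] if high else None)   # step 3

demo = [{"priority": "low", "added": 0}, {"priority": "high", "added": 5}]
print(select_next(demo, now=50, high_in_flight=0, target_high=4))
```

Capping the number of in-flight pilot threads (step 2) directly limits how many threads can be stalled on pilot-induced cache misses at once, which is the efficiency concern raised earlier in the text.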
In some embodiments the target number of high priority threads to be kept in execution at any given time may be software programmable so as to match the particular data workload being executed at that time or a particular memory latency of a particular implementation.
Viewed from another aspect the present invention provides apparatus for processing data comprising: processing pipeline means for executing in parallel a plurality of threads within a predetermined logical sequence of threads to be executed; and issue control means for issuing threads to said processing pipeline means for execution; wherein said issue control means selects threads from said predetermined logical sequence for issue in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
Viewed from a further aspect the present invention provides a method of processing data comprising the steps of: executing in parallel within a processing pipeline a plurality of threads within a predetermined logical sequence of threads to be executed; and selecting threads from said predetermined logical sequence for issue to said processing pipeline in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which:
Figure 1 schematically illustrates a data processing system including a processing pipeline and an issue controller for issuing threads in accordance with both a pilot sequence and a main sequence which differ from a predetermined logical sequence;
Figure 2 schematically illustrates a predetermined logical sequence of thread issue;
Figure 3 schematically illustrates issue in accordance with a pilot sequence and a main sequence;
Figure 4 schematically illustrates an example of a main sequence order;
Figures 5, 6 and 7 schematically illustrate examples of pilot sequence orders; and
Figure 8 is a flow diagram schematically illustrating issue control.
Figure 1 schematically illustrates a data processing system 2 including a graphics processing unit 4 and a main memory 6. The graphics processing unit 4 includes a processing pipeline 8, a cache memory 10, a stall manager 12 and an issue controller 14. It will be appreciated that in practice the graphics processing unit 4 will typically include many further data processing elements, such as those which create the threads received by the issue controller 14 and queued therein prior to issue to the processing pipeline 8. Such additional circuit elements have been omitted from Figure 1 for the sake of clarity. When a thread (e.g. a sequence of program instructions executing to generate a particular value, such as a particular pixel fragment within an array of pixels) executing within the processing pipeline 8 accesses a data value, then a check is made as to whether or not that data value is held within the cache memory 10. If the data value is not held within the cache memory 10, then a cache miss arises and a cache line including that data value is fetched from the main memory 6 to the cache memory 10. This fetch from the main memory 6 is relatively slow and has a memory latency time associated with it which may be several hundred times greater than the time normally taken to access a data value within the cache memory 10. A thread will circulate within the processing pipeline 8 with its successive instructions being executed until the thread has finished, at which point it will be retired from the processing pipeline 8, freeing up a slot into which another thread may be issued.
The processing pipeline 8 executes a plurality of threads in parallel. The threads are issued into the processing pipeline 8 by the issue controller 14 in dependence upon issue queue data 16 (priority values, time stamps etc.) associated with the queued threads. The issue controller 14 receives the threads in the predetermined logical sequence in which they are generated by the software and/or upstream hardware. The issue controller 14 issues the threads to the processing pipeline 8 following both a main sequence and a pilot sequence. Threads are selected from the main sequence in the main sequence order. Threads are selected from the pilot sequence in the pilot sequence order. Selection as to whether the next thread to be issued should be taken from the main sequence or the pilot sequence is made in accordance with the issue queue data 16, as will be described later. The issue controller 14 may be considered to hold two queues, namely a pilot sequence queue and a main sequence queue. Threads are issued from each of these queues in their respective order, and a selection is made as to which queue the next thread is to be issued from in dependence upon the issue queue data 16. The issue queue(s) may be provided for other reasons in addition to the above (e.g. forward pixel kill) and so support for the present techniques may be added with little extra overhead.
When a thread is progressing along the processing pipeline 8 and a cache miss occurs, then the stall manager 12 coupled to the processing pipeline 8 serves to stall that thread until the data value which missed has been fetched to the cache memory 10, whereupon the thread is unstalled.
The thread (or at least the relevant stalled instruction of the thread) may be recirculated within the pipeline 8 while it is stalled and its partial processing state retained.
Figure 2 schematically illustrates the predetermined logical sequence in which threads are generated and received by the issue controller 14. In this example, the threads are formed into a sequence of groups of threads, with each group of threads corresponding to a group of pixels (e.g. 16×16) to be processed. As illustrated, the block "n" is encountered first within the logical sequence and is then followed by blocks "n+1", "n+2" and "n+3". Each of the groups of threads (one thread per position) corresponds to a block of values to be processed so as, for example, to form a layer associated with a block of pixels within an image to be generated. Each thread may effectively calculate a fragment contributing towards a pixel value to be generated within the block of pixels concerned. The predetermined logical sequence corresponds to threads which are to be executed.
Within the predetermined logical sequence illustrated in Figure 2, the group "n" is logically intended to be issued to the processing pipeline 8 earliest.
Figure 3 schematically illustrates groups (blocks) of threads corresponding to those illustrated in Figure 2, but in this case with threads being issued both in accordance with a pilot sequence and a main sequence. In the example illustrated, the current next thread issue point within the pilot sequence is marked with an "x". The current next thread issue point within the main sequence is marked with an "o". As illustrated, the pilot sequence extends more than one group ahead of the current next thread issue point of the main sequence. As the separation in time ahead of the main sequence thread issue point increases, the temporal spacing between threads which form part of the pilot sequence also increases. Accordingly, there are many more main threads to be issued from block "n+2" than there are pilot threads within block "n+1", and in turn many more pilot threads within block "n+1" than within block "n". The time gap between a given thread within the pilot sequence being issued and one of its neighbours within the logical sequence being issued as part of the main sequence is at least equal to the memory latency associated with a cache miss and preferably exceeds this time.
Figure 4 schematically illustrates an interleaved main sequence in which main sequence threads are issued. It will be appreciated that some threads within the path illustrated in Figure 4 which have already been issued as part of the pilot sequence will be omitted from the main sequence. Accordingly, the main sequence can be considered to be the remainder of the predetermined logical sequence which has not already been issued as part of the pilot sequence.
Figure 5 schematically illustrates a diagonal pilot sequence within a group of threads corresponding to a block of pixels. Such a diagonal path of the pilot sequence through the threads, when these are considered in their spatial positions corresponding to the block of pixels, has the result that one thread corresponding to each row and each column is included within the pilot sequence and accordingly will trigger any necessary cache miss for data values associated with the surrounding pixels.
Other possible pilot sequences include a horizontal pilot sequence and a vertical pilot sequence, as illustrated by the dashed lines in Figure 5. Such horizontal and vertical pilot sequences may be suitable for some layouts of the data values within the memory address space, but not for others. Accordingly, for example, a vertical pilot sequence suitable for accessing one data value within each row of a sequence of data values set out in a horizontal raster scan order within the memory address space would not be suitable if that image were rotated through 90 degrees, such that the vertical pilot sequence would then serve to access data values within a single horizontal raster line as the data values are arranged within the memory address space.
Figure 6 illustrates another example of a pilot sequence, in this case a tiled sequence. As will be seen, each horizontal row within the pilot sequence contains two pilot threads with three vacant spaces between them. The pilot threads within adjacent rows are equidistant from the pilot threads within their neighbouring rows. Also illustrated in Figure 6 is the idea of a cluster of pixels. In practice, threads can be issued in clusters corresponding to a cluster of four pixel values. These clusters of threads have corresponding cluster positions which may be arranged in lines corresponding to one of the rows or columns through the array of cluster positions.
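A tiled pattern of this kind can be generated with a short helper. This is a sketch under stated assumptions: a period of four cluster positions per line (one pilot, then three vacant) and a half-period offset between neighbouring lines, both inferred from the description of Figure 6 rather than given as exact parameters.

```python
# Sketch of the Figure 6 tiled pilot pattern: within each line, pilots are
# separated by three vacant cluster positions (period 4); neighbouring
# lines are shifted by half a period so each pilot sits equidistant from
# its nearest pilots in the adjacent lines.

def tiled_pilot(rows, cols, period=4):
    """Return (row, col) pilot cluster positions for the tiled pattern."""
    return [(r, c) for r in range(rows)
            for c in range(cols)
            if (c + (r % 2) * (period // 2)) % period == 0]

print(tiled_pilot(4, 8))
```

For an 8-wide block this places pilots at columns 0 and 4 on even lines and at columns 2 and 6 on odd lines: two pilots per row with three vacant positions between them, matching the figure's description.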
It will be appreciated that the pattern of pilot threads illustrated in Figure 6 provides good coverage spatially distributed across the group of threads. The particular order in which the pilot threads may be issued out of this pattern may vary whilst still giving the appropriate coverage. In practice, there may be a preference for issuing the pilot threads out of the pilot sequence positions illustrated in Figure 6 in an order corresponding roughly to the order in which the main threads will be issued out of the main sequence, so as to increase the spacing in time of a pilot thread from its neighbours within the main sequence.
Figure 7 illustrates another pilot sequence. This pilot sequence corresponds to a solution of the eight queens problem from the field of chess. The eight queens problem is how to position eight queens on a chess board so that no queen shares a row, column or diagonal with any other queen. The eight queens problem is analogous to the problem of triggering earlier prefetches with the pilot sequence, as it is desired to select the pilot threads forming part of the pilot sequence such that they provide good coverage among the different rows, columns and diagonals within the array of threads (pixels), but without unwanted redundancy.
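As a hedged illustration (the solver below is generic backtracking code, not anything disclosed in the patent), one solution of the eight queens problem can be computed and its positions reused as pilot thread positions:

```python
# Illustrative sketch only: backtracking search for one eight queens
# solution; queens()[row] gives the column of the queen (pilot thread)
# placed in that row.
def queens(n=8):
    def place(cols):
        row = len(cols)
        if row == n:
            return cols
        for col in range(n):
            # Reject columns and diagonals already attacked by earlier rows.
            if all(col != c and abs(col - c) != row - r
                   for r, c in enumerate(cols)):
                found = place(cols + [col])
                if found:
                    return found
        return None
    return place([])

solution = queens()
# A valid solution covers all eight rows and all eight columns.
assert sorted(solution) == list(range(8))
```

Because no two placements share a row, column or diagonal, such a pattern gives the coverage-without-redundancy property the text describes.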
Figure 8 is a flow diagram schematically illustrating operation of the issue controller 14 in controlling which thread to issue next. At step 18, processing waits until there is a slot available at the head of the processing pipeline 8 into which a thread may be issued (e.g. an existing thread is retired). Step 20 then determines whether there is any thread in the main sequence which is greater than a threshold age. This threshold age corresponds to a delay since that thread was added to the issue queue. Main sequence threads are given priority for issue if they are older than this threshold age. If there are any main sequence threads greater than the threshold age, then step 22 selects the oldest of these for issue from the main sequence.
If the determination at step 20 is that there are no such main sequence threads, then step 24 determines whether there are currently fewer than a target number of pilot threads in progress within the processing pipeline 8. If there are fewer than this target number of threads, then step 26 serves to issue a thread from the pilot sequence as the next thread.
If there are not fewer than this target number of threads, then processing again proceeds to step 22 where an oldest main sequence thread is issued. The processing illustrated in Figure 8 implements an issue hierarchy in which main sequence threads are given priority if they are greater than a threshold age. Following this, pilot threads are given priority if fewer than a target number of pilot threads are currently in execution. Following this, the oldest main sequence thread is given priority.
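The hierarchy of steps 20 to 26 can be sketched as follows; the function and field names, and the particular threshold and target values, are illustrative assumptions rather than the patent's hardware:

```python
# Illustrative sketch only: the three-level issue hierarchy of Figure 8.
from collections import deque, namedtuple

Thread = namedtuple('Thread', 'tid added')  # 'added' = time queued for issue

THRESHOLD_AGE = 100   # assumed age (cycles) forcing a main thread to issue
TARGET_PILOTS = 16    # assumed software-programmable in-flight pilot target

def select_next(main_q, pilot_q, pilots_in_flight, now):
    """Pick the next thread: over-age main, then pilot top-up, then oldest main."""
    # Steps 20/22: an over-age main sequence thread has absolute priority.
    if main_q and now - main_q[0].added > THRESHOLD_AGE:
        return 'main', main_q.popleft()
    # Steps 24/26: otherwise keep the target number of pilot threads running.
    if pilot_q and pilots_in_flight < TARGET_PILOTS:
        return 'pilot', pilot_q.popleft()
    # Step 22: otherwise issue the oldest waiting main sequence thread.
    if main_q:
        return 'main', main_q.popleft()
    return None
```

Each queue is assumed to be held in age order, so the head of each deque is its oldest thread.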
The issue queue data 16 held by the issue controller 14 includes priority data indicating whether a given thread is a high priority thread (pilot sequence) or a low priority thread (main sequence). In addition, time data is associated with each thread indicating the time at which it was added to the queues of threads awaiting issue by the issue controller 14. In practice, the issue controller 14 can be considered to maintain both a high priority pilot thread queue and a low priority main thread queue. A software programmable target number of high priority threads to be kept in execution within the processing pipeline 8 is input to the issue controller 14. For example, this target number of threads may be 16, 32 or 48 depending upon circumstances when using, for example, a processing pipeline capable of the parallel execution of 128 threads.
Claims (20)
- CLAIMS
- 1. Apparatus for processing data comprising: a processing pipeline configured to execute in parallel a plurality of threads within a predetermined logical sequence of threads to be executed; and an issue controller configured to issue threads to said processing pipeline for execution; wherein said issue controller is configured to select threads from said predetermined logical sequence for issue in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
- 2. Apparatus as claimed in claim 1, comprising: a cache memory coupled to said processing pipeline and configured to store data values fetched from a main memory, a cache miss within said cache memory triggering a fetch operation lasting a latency time to fetch a data value from said main memory to said cache memory; and a stall manager coupled to said processing pipeline and configured to stall a given processing thread executing in said processing pipeline upon detection of a miss within said cache memory for a data value to be accessed by said given thread and to unstall said given thread when said data value has been fetched to said cache memory.
- 3. Apparatus as claimed in claim 2, wherein said delay time is greater than or equal to said latency time.
- 4. Apparatus as claimed in any one of claims 1, 2 and 3, wherein said predetermined logical sequence comprises a sequence of groups of threads, each said group of threads comprising a plurality of threads adjacent within said predetermined logical sequence.
- 5. Apparatus as claimed in claim 4, wherein said issue controller selects threads to issue from said pilot sequence and said main sequence such that a next pilot thread to be issued in accordance with said pilot sequence is within a group of threads at least one pilot group ahead of a next main thread to be issued in accordance with said main sequence.
- 6. Apparatus as claimed in claim 5, wherein said pilot sequence extends through a plurality of pilot groups ahead of said next main thread, a number of pilot threads within each of said plurality of pilot groups ahead of said next main thread reducing as separation from said next main thread increases.
- 7. Apparatus as claimed in any one of claims 4, 5 and 6, wherein each group of threads is associated with a block of pixel values within an image and each of said threads within a group of threads corresponds to processing associated with a pixel value within said block of pixel values.
- 8. Apparatus as claimed in claim 7, wherein said main sequence corresponds to an interleaved scan pattern through each block of pixel values.
- 9. Apparatus as claimed in any one of claims 7 and 8, wherein said pilot sequence corresponds to one of: (i) a diagonal line through each said block of pixels; (ii) a line parallel to one of a row direction and a column direction within each said block of pixels; (iii) clusters of one or more pixels disposed at predetermined positions within an array of possible cluster positions within each said block of pixels, said array of cluster positions comprising cluster lines of adjacent cluster positions disposed parallel to one of a row direction and a column direction of said block of pixels, said array divisible into a plurality of adjacent parallel lines of cluster positions such that (a) within a given line each cluster is separated by three vacant cluster positions from any other nearest neighbour cluster within said given line and (b) each cluster in a neighbouring line adjacent said given line is positioned equidistant from any nearest neighbour cluster in said given line; and (iv) clusters of one or more pixels disposed at predetermined positions within an array of possible cluster positions within each said block of pixels, said clusters disposed within said array of cluster positions such that no cluster shares a cluster row, a cluster column or a cluster diagonal with any other cluster within said array of cluster positions.
- 10. Apparatus as claimed in any one of claims 7 to 9, wherein each group of threads corresponds to a layer within a plurality of layers of processing that generate said block of pixel values.
- 11. Apparatus as claimed in claim 1, wherein said pilot sequence extends through said predetermined logical sequence ahead of a next main thread to be issued in accordance with said main sequence such that positions of pilot threads within said predetermined logical sequence increase in separation from each other as separation from said next main thread increases.
- 12. Apparatus as claimed in any one of the preceding claims, wherein said issue controller stores issue queue data identifying a plurality of threads waiting within an issue queue to be executed and said issue controller selects threads to issue for execution by said processing pipeline following said main sequence and said pilot sequence in accordance with said issue queue data.
- 13. Apparatus as claimed in claim 12, wherein said issue queue data identifies threads within said pilot sequence as having a high priority and threads within said main sequence as having a low priority.
- 14. Apparatus as claimed in claim 13, wherein threads are added to said issue queue in said predetermined logical sequence and said issue queue data identifies a time at which a thread was added to said issue queue.
- 15. Apparatus as claimed in claim 14, wherein said issue controller selects a next thread to issue in accordance with a hierarchy comprising: an oldest low priority thread exceeding a threshold time waiting in said issue queue; an oldest high priority thread waiting in said issue queue if less than a target number of high priority threads are in execution by said processing pipeline; and an oldest low priority thread.
- 16. Apparatus as claimed in claim 15, wherein said target number is software programmable.
- 17. Apparatus for processing data comprising: processing pipeline means for executing in parallel a plurality of threads within a predetermined logical sequence of threads to be executed; and issue control means for issuing threads to said processing pipeline means for execution; wherein said issue control means selects threads from said predetermined logical sequence for issue in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
- 18. A method of processing data comprising the steps of: executing in parallel within a processing pipeline a plurality of threads within a predetermined logical sequence of threads to be executed; and selecting threads from said predetermined logical sequence for issue to said processing pipeline in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
- 19. Apparatus for processing data substantially as hereinbefore described with reference to the accompanying drawings.
- 20. A method of processing data substantially as hereinbefore described with reference to the accompanying drawings.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1402259.4A GB2522910B (en) | 2014-02-10 | 2014-02-10 | Thread issue control |
US14/596,948 US9753735B2 (en) | 2014-02-10 | 2015-01-14 | Thread issue control |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1402259.4A GB2522910B (en) | 2014-02-10 | 2014-02-10 | Thread issue control |
Publications (3)
Publication Number | Publication Date |
---|---|
GB201402259D0 GB201402259D0 (en) | 2014-03-26 |
GB2522910A true GB2522910A (en) | 2015-08-12 |
GB2522910B GB2522910B (en) | 2021-04-07 |
Family
ID=50390728
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1402259.4A Active GB2522910B (en) | 2014-02-10 | 2014-02-10 | Thread issue control |
Country Status (2)
Country | Link |
---|---|
US (1) | US9753735B2 (en) |
GB (1) | GB2522910B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20180037839A (en) * | 2016-10-05 | 2018-04-13 | 삼성전자주식회사 | Graphics processing apparatus and method for executing instruction |
GB2583061B (en) * | 2019-02-12 | 2023-03-15 | Advanced Risc Mach Ltd | Data processing systems |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020055964A1 (en) * | 2000-04-19 | 2002-05-09 | Chi-Keung Luk | Software controlled pre-execution in a multithreaded processor |
US20040128489A1 (en) * | 2002-12-31 | 2004-07-01 | Hong Wang | Transformation of single-threaded code to speculative precomputation enabled code |
US20040148491A1 (en) * | 2003-01-28 | 2004-07-29 | Sun Microsystems, Inc. | Sideband scout thread processor |
US20090199170A1 (en) * | 2008-02-01 | 2009-08-06 | Arimilli Ravi K | Helper Thread for Pre-Fetching Data |
US20110231612A1 (en) * | 2010-03-16 | 2011-09-22 | Oracle International Corporation | Pre-fetching for a sibling cache |
US20110296431A1 (en) * | 2010-05-25 | 2011-12-01 | International Business Machines Corporation | Method and apparatus for efficient helper thread state initialization using inter-thread register copy |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9659339B2 (en) * | 2003-10-29 | 2017-05-23 | Nvidia Corporation | Programmable graphics processor for multithreaded execution of programs |
US8621478B2 (en) * | 2010-01-15 | 2013-12-31 | International Business Machines Corporation | Multiprocessor system with multiple concurrent modes of execution |
US9021493B2 (en) * | 2012-09-14 | 2015-04-28 | International Business Machines Corporation | Management of resources within a computing environment |
US9348599B2 (en) * | 2013-01-15 | 2016-05-24 | International Business Machines Corporation | Confidence threshold-based opposing branch path execution for branch prediction |
Also Published As
Publication number | Publication date |
---|---|
US9753735B2 (en) | 2017-09-05 |
US20150227376A1 (en) | 2015-08-13 |
GB201402259D0 (en) | 2014-03-26 |
GB2522910B (en) | 2021-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9697125B2 (en) | Memory access monitor | |
US7366878B1 (en) | Scheduling instructions from multi-thread instruction buffer based on phase boundary qualifying rule for phases of math and data access operations with better caching | |
US9547530B2 (en) | Data processing apparatus and method for processing a plurality of threads | |
CN103809964B (en) | SIMT processors with the system and method for sets of threads execution sequence code and comprising it | |
US10679322B2 (en) | Primitive processing in a graphics processing system with tag buffer storage of primitive identifiers | |
US7206902B2 (en) | System, apparatus and method for predicting accesses to a memory | |
EP2593861B1 (en) | System and method to allocate portions of a shared stack | |
US20210065438A1 (en) | Primitive Processing in a Graphics Processing System | |
US20090113181A1 (en) | Method and Apparatus for Executing Instructions | |
KR100998929B1 (en) | Cache controller device, interfacing method and programming method using thereof | |
US20190088009A1 (en) | Forward killing of threads corresponding to graphics fragments obscured by later graphics fragments | |
CN105359089B (en) | Method and apparatus for carrying out selective renaming in the microprocessor | |
US7853751B2 (en) | Stripe caching and data read ahead | |
US10705849B2 (en) | Mode-selectable processor for execution of a single thread in a first mode and plural borrowed threads in a second mode | |
WO2007034232A2 (en) | Scalable multi-threaded media processing architecture | |
JPH0371354A (en) | Method and apparatus for processing memory access request | |
US20130124805A1 (en) | Apparatus and method for servicing latency-sensitive memory requests | |
US20100088475A1 (en) | Data processing with a plurality of memory banks | |
US9753735B2 (en) | Thread issue control | |
WO2016087831A1 (en) | Method of and apparatus for providing an output surface in a data processing system | |
US20140164743A1 (en) | Reordering buffer for memory access locality | |
WO2008007038A1 (en) | Data dependency scoreboarding | |
WO2021091649A1 (en) | Super-thread processor | |
CN102855122A (en) | Processing pipeline control | |
DE102012222391B4 (en) | Multichannel Time Slice Groups |