GB2522910A - Thread issue control - Google Patents

Thread issue control

Info

Publication number
GB2522910A
GB2522910A
Authority
GB
United Kingdom
Prior art keywords
threads
sequence
pilot
issue
thread
Prior art date
Legal status
Granted
Application number
GB1402259.4A
Other versions
GB201402259D0 (en)
GB2522910B (en)
Inventor
Andreas Engh-Halstvedt
Ian Victor Devereux
David Bermingham
Jakob Fries
Lars Oskar Flordal
Current Assignee
ARM Ltd
Original Assignee
ARM Ltd
Advanced Risc Machines Ltd
Priority date
Filing date
Publication date
Application filed by ARM Ltd, Advanced Risc Machines Ltd filed Critical ARM Ltd
Priority to GB1402259.4A priority Critical patent/GB2522910B/en
Publication of GB201402259D0 publication Critical patent/GB201402259D0/en
Priority to US14/596,948 priority patent/US9753735B2/en
Publication of GB2522910A publication Critical patent/GB2522910A/en
Application granted granted Critical
Publication of GB2522910B publication Critical patent/GB2522910B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/3816 Instruction alignment, e.g. cache line crossing
    • G06F9/3842 Speculative instruction execution
    • G06F9/3851 Instruction issuing from multiple instruction streams, e.g. multistreaming
    • G06F9/3856 Reordering of instructions, e.g. using queues or age tags
    • G06F9/3867 Concurrent instruction execution using instruction pipelines
    • G06F9/3869 Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
    • G06F12/0855 Overlapped cache accessing, e.g. pipeline
    • G06F2212/455 Caching of image or video data

Abstract

Disclosed is a data processing system 2 with a processing pipeline 8 for the parallel execution of threads within a predetermined logical sequence, and an issue controller 14 that issues threads to the processing pipeline. The issue controller 14 issues the threads in accordance with both a pilot sequence, which is a proper subset of the logical sequence, and a main sequence that trails the pilot sequence. The main sequence trails the pilot sequence through the logical sequence by a delay time and comprises those threads that are not within the pilot sequence. The system may have a cache memory 10 coupled to the processing pipeline to store data values fetched from main memory, and a stall manager 12 that controls the stalling and un-stalling of threads when a cache miss occurs within the cache memory. The delay time may correspond to the latency associated with a cache miss. The threads may be arranged in groups corresponding to blocks of pixels for processing within a graphics processing unit.

Description

THREAD ISSUE CONTROL
This invention relates to the field of data processing systems. More particularly, this invention relates to the control of thread issue into a processing pipeline within a data processing system.
It is known to provide data processing systems having processing pipelines which can execute a plurality of threads in parallel. As an example, the threads may correspond to different fragments of an image to be generated within a graphics processing system. The use of deep pipelines supporting multiple threads in parallel execution enables a high level of data throughput to be achieved.
One problem associated with such systems is the latency associated with fetching from main memory data required to be accessed during processing. It is known to provide cache memories close to the processing pipeline in order to provide rapid and low energy access to data to be processed. However, data needs to be moved between the cache memory and the main memory as the cache memory has insufficient capacity to hold all of the data which may be required. When a thread makes an access to a data value which is not held within the cache memory, then a cache miss arises and the cache line containing that data value is fetched from the main memory. The time taken to service such a cache miss may be many hundreds of clock cycles, and the thread which triggered the cache miss is stalled (parked) during such a miss until the required data is returned. It is known to provide data processing pipelines with the ability to manage stalled threads in this way and still make forward progress with threads which are not stalled.
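As a rough illustration of this stall-on-miss bookkeeping, the following is a minimal Python sketch; the class and method names are invented for illustration and do not come from the patent:

    # Minimal sketch of stall-on-miss bookkeeping: threads that miss in the
    # cache are parked against the missing cache line and released when the
    # line fill completes. All names here are illustrative assumptions.

    class StallManager:
        def __init__(self):
            self.parked = {}  # cache line address -> ids of stalled threads

        def on_miss(self, thread_id, line_addr):
            """Park a thread until the cache line it needs is fetched."""
            self.parked.setdefault(line_addr, []).append(thread_id)

        def on_fill(self, line_addr):
            """A line fill returned: unstall every thread waiting on it."""
            return self.parked.pop(line_addr, [])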
In order that the system should operate efficiently, it is desirable that the capacity to deal with stalled threads should not be exceeded. Conversely, the overhead associated with managing stalled threads is not insignificant and accordingly it is undesirable to provide an excess of this capacity. Furthermore, it is desirable that not too much of the processing capability of the processing pipeline should be stalled at any given time as a consequence of threads awaiting data for which a cache miss has occurred.
Viewed from one aspect the present invention provides apparatus for processing data comprising: a processing pipeline configured to execute in parallel a plurality of threads within a predetermined logical sequence of threads to be executed; and an issue controller configured to issue threads to said processing pipeline for execution; wherein said issue controller is configured to select threads from said predetermined logical sequence for issue in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
The present technique recognises that the threads to be processed will have a predetermined logical sequence in which the program or upstream hardware will order them as becoming eligible for issue to the processing pipeline. Conventionally the threads are then issued in this predetermined logical sequence. This predetermined logical sequence is not speculative, as it is known that the threads concerned are to be executed: the program or hardware has committed these threads for execution. The present technique recognises that the predetermined logical order may result in inefficiencies in the utilisation of the processing pipeline and the system as a whole.
In some embodiments, threads issued in the predetermined logical order may correspond to data accesses which are proximal to each other within the memory address space and accordingly result in a bunching of the cache misses which arise. When a large number of cache misses arise together, the forward progress made by the processing pipeline slows, as a relatively large number of threads are simultaneously stalled awaiting return of the data values for which a cache miss occurred. During such times, the cache memory and the processing pipeline are relatively idle and relatively little forward progress is made in the processing. The present technique recognises this behaviour and provides a system in which the predetermined logical sequence is modified to form both a pilot sequence and a main sequence.
The pilot sequence is formed of threads issued to the processing pipeline ahead of neighbouring threads within the predetermined logical sequence which form part of the main sequence. In some example embodiments, the pilot sequence threads are issued at a time greater than the memory latency for a cache miss ahead of their neighbouring threads within the main sequence, such that if a thread within the pilot sequence triggers a cache miss, then there is a high likelihood that the surrounding data values which may be required by neighbouring threads within the main sequence will have been returned to the cache memory by the time those threads within the main sequence are issued into the processing pipeline. It is expected that the pilot threads will result in a higher proportion of cache misses than the main threads, but that the cache line fills which result from the pilot threads will enable the main threads to more likely proceed without cache misses and associated stalling. The delay time could in other embodiments be less than the latency associated with a cache miss and still give an advantage by at least reducing the waiting for data values that miss.
The pilot threads can be considered as intended to provoke inevitable cache misses which will arise due to execution of the threads within the predetermined logical sequence, but to trigger these cache misses early such that the majority of the threads which will need the data associated with those cache misses will not be stalled (or will be stalled for a shorter time) awaiting the return of that data, as it will already have been fetched (or have started to be fetched) as a consequence of the early execution of the pilot thread. This reordering of the threads from the predetermined logical sequence into the pilot sequence and the main sequence takes place without the need for modification of the program instructions executing or the upstream hardware systems which create the threads. Furthermore, the early processing and stalling of the pilot threads is not speculative, as those threads are required to be executed and would have resulted in a cache miss in any case. Rather, the reordering of the threads has moved the pilot threads earlier in execution so as to facilitate the execution of following main threads without (or with less) stalling.
In some embodiments the predetermined logical sequence may comprise a sequence of groups of threads in which each group of threads comprises a plurality of threads adjacent within the predetermined logical sequence. Division of the predetermined logical sequence into groups matches many real life processing workloads in which groups of threads have a tendency to access data values which are located proximal to each other within the memory address space.
In the context of threads arranged into groups, in some embodiments the pilot sequence and the main sequence may be timed relative to each other such that the next pilot thread to be issued in accordance with the pilot sequence is in a group at least one group ahead of the next main thread to be issued in accordance with the main sequence. Thus, pilot threads are at least one group ahead of the main threads and accordingly will provoke cache misses which will fill the cache memory with data which can then be consumed by the later main threads without cache misses arising. This effectively hides the cache fill latency for the main threads.
In some systems the pilot sequence may extend through a plurality of pilot groups ahead of the next main thread, with decreasing numbers of pilot threads within each group as the issue time moves further ahead of the main thread issue time. This arrangement may be useful in increasing the efficiency of execution of the pilot threads themselves. Particularly early pilot threads may trigger cache misses, translation lookaside buffer updates, first level memory accesses etc. which will then be used by the subsequent pilot threads. The subsequent pilot threads then trigger their own individual cache misses to return data values which are to be used by the main threads.
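As a purely illustrative sketch of such a thinning schedule (the text specifies no concrete numbers; the halving rule and the base count below are assumptions):

    def pilot_budget(groups_ahead, base=16):
        """Assumed halving schedule: e.g. 16, 8, 4, ... pilot threads per
        group, thinning out as distance ahead of the main issue point grows."""
        return [max(1, base >> g) for g in range(groups_ahead)]

    # pilot_budget(3) -> [16, 8, 4]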
In some embodiments, each group of threads may be associated with a block of pixel values within an image and each thread within a group may correspond to processing associated with a pixel value within the block of pixels. The thread may correspond to a fragment to be determined in respect of a pixel, such as a transparency value, depth, colour, etc. which will ultimately contribute to the final pixel value to be generated within an image.
Within the main sequence it is normal to use an interleaved scan pattern for each block of pixel values, as in many cases this is an efficient way of traversing the data to be processed.
The pilot sequence may be selected to have one of a variety of different forms. Some forms are better matched to given patterns of data value accesses to be made within the main memory address space than others. It is desirable that the pilot sequence should be one which would trigger required cache fill operations in a wide variety of circumstances, independent of the particular block of pixel values being processed and what it represents. Examples of pilot sequences which may be used include: (i) a diagonal line through each said block of pixels; (ii) a line parallel to one of a row direction and a column direction within each said block of pixels; (iii) clusters of one or more pixels disposed at predetermined positions within an array of possible cluster positions within each said block of pixels, said array of cluster positions comprising cluster lines of adjacent cluster positions disposed parallel to one of a row direction and a column direction of said block of pixels, said array divisible into a plurality of adjacent parallel lines of cluster positions such that (a) within a given line each cluster is separated by three vacant cluster positions from any other nearest neighbour cluster within said given line and (b) each cluster in a neighbouring line adjacent said given line is positioned equidistant from any nearest neighbour cluster in said given line; and (iv) clusters of one or more pixels disposed at predetermined positions within an array of possible cluster positions within each said block of pixels, said clusters disposed within said array of cluster positions such that no cluster shares a cluster row, a cluster column or a cluster diagonal within said array of cluster positions.
As previously mentioned, each group of threads may correspond to the partial calculation of values needed to generate a block of pixels. A group of threads may correspond to a layer within a plurality of layers for processing that generates the block of pixel values.
The use of pilot threads ahead of the main thread to trigger early cache misses may be used independently of the grouping of threads and the association of groups of threads with blocks of pixels. In such general circumstances, the pilot threads need not be evenly distributed in time ahead of the main thread and may be arranged such that, as time separation from the main thread issue time increases, the density of the pilot threads decreases, such that a small number of pilot threads are issued very early and these are then followed by a larger number of pilot threads which are closer to the issue point in the main sequence of threads.
The issue controller may store issue queue data identifying the plurality of threads waiting within an issue queue to be executed and select threads for execution following both the main sequence and the pilot sequence in accordance with this issue queue data. At each time, a single thread may be selected for issue to the processing pipeline from either the main sequence or the pilot sequence. The main sequence is followed in order and the pilot sequence is followed in order. The overall order is different from the predetermined logical sequence.
In some embodiments the issue queue data will identify threads within the pilot sequence as having a high priority and threads within the main sequence as having a low priority. Furthermore, threads may be added to the issue queue in the predetermined logical sequence and the issue queue data may identify the time at which each thread was added to the issue queue.
Using a combination of time information and priority information within the issue queue data, the issue controller may select a next thread to issue in accordance with a hierarchy in which an oldest low priority thread exceeding a threshold waiting time in the issue queue is selected first, if present, followed by an oldest high priority thread waiting in the issue queue if less than a target number of high priority threads are currently in execution by the processing pipeline, if any, followed by an oldest low priority thread. Selecting in accordance with these rules has the effect of ensuring that not too many high priority threads are in progress simultaneously in a manner which would cause an excess to become stalled, and also that the main thread execution point does not drop too far behind the pilot thread execution point.
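A minimal sketch of the per-thread issue queue data used by this selection, with assumed field names (the patent does not name these fields):

    from dataclasses import dataclass

    @dataclass
    class IssueQueueEntry:
        thread_id: int       # identifies the queued thread
        high_priority: bool  # True for pilot-sequence threads, False for main
        enqueue_time: int    # time the thread was added, in logical-sequence order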
In some embodiments the target number of high priority threads to be kept in execution at any given time may be software programmable so as to match the particular data workload being executed at that time or the particular memory latency of a particular implementation.
Viewed from another aspect the present invention provides apparatus for processing data comprising: processing pipeline means for executing in parallel a plurality of threads within a predetermined logical sequence of threads to be executed; and issue control means for issuing threads to said processing pipeline means for execution; wherein said issue control means selects threads from said predetermined logical sequence for issue in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
Viewed from a further aspect the present invention provides a method of processing data comprising the steps of: executing in parallel within a processing pipeline a plurality of threads within a predetermined logical sequence of threads to be executed; and selecting threads from said predetermined logical sequence for issue to said processing pipeline in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which: Figure 1 schematically illustrates a data processing system including a processing pipeline and an issue controller for issuing threads in accordance with both a pilot sequence and a main sequence which differ from a predetermined logical sequence; Figure 2 schematically illustrates a predetermined logical sequence of thread issue; Figure 3 schematically illustrates issue in accordance with a pilot sequence and a main sequence; Figure 4 schematically illustrates an example of a main sequence order; Figures 5, 6 and 7 schematically illustrate examples of pilot sequence orders; and Figure 8 is a flow diagram schematically illustrating issue control.
Figure 1 schematically illustrates a data processing system 2 including a graphics processing unit 4 and a main memory 6. The graphics processing unit 4 includes a processing pipeline 8, a cache memory 10, a stall manager 12 and an issue controller 14. It will be appreciated that in practice the graphics processing unit 4 will typically include many further data processing elements, such as those which create the threads received by the issue controller 14 and queued therein prior to issue to the processing pipeline 8. Such additional circuit elements have been omitted from Figure 1 for the sake of clarity. When a thread (e.g. a sequence of program instructions executing to generate a particular value, such as a particular pixel fragment within an array of pixels) executing within the processing pipeline 8 accesses a data value, a check is made as to whether or not that data value is held within the cache memory 10. If the data value is not held within the cache memory 10, then a cache miss arises and a cache line including that data value is fetched from the main memory 6 to the cache memory 10. This fetch from the main memory 6 is relatively slow and has a memory latency time associated with it which may be several hundred times greater than the time normally taken to access a data value within the cache memory 10. A thread will circulate within the processing pipeline 8 with its successive instructions being executed until the thread has finished, at which point it will be retired from the processing pipeline 8, freeing up a slot into which another thread may be issued.
The processing pipeline 8 executes a plurality of threads in parallel. The threads are issued into the processing pipeline 8 by the issue controller 14 in dependence upon issue queue data 16 (priority values, time stamps etc.) associated with the queued threads. The issue controller 14 receives the threads in the predetermined logical sequence in which they are generated by the software and/or upstream hardware. The issue controller 14 issues the threads to the processing pipeline 8 following both a main sequence and a pilot sequence. Threads are selected from the main sequence in the main sequence order. Threads are selected from the pilot sequence in the pilot sequence order. Selection as to whether the next thread to be issued should be taken from the main sequence or the pilot sequence is made in accordance with the issue queue data 16, as will be described later. The issue controller 14 may be considered to hold two queues, namely a pilot sequence queue and a main sequence queue. Threads are issued from each of these queues in their respective order, and a selection is made as to from which queue the next thread is to be issued in dependence upon the issue queue data 16. The issue queue(s) may be provided for other reasons in addition to the above (e.g. forward pixel kill) and so support for the present techniques may be added with little extra overhead.
When a thread is progressing along the processing pipeline 8 and a cache miss occurs, the stall manager 12 coupled to the processing pipeline 8 serves to stall that thread until the data value which missed has been fetched to the cache memory 10, whereupon the thread is unstalled.
The thread (or at least the relevant stalled instruction of the thread) may be recirculated within the pipeline 8 while it is stalled and its partial processing state retained.
Figure 2 schematically illustrates the predetermined logical sequence in which threads are generated and received by the issue controller 14. In this example, the threads are formed into a sequence of groups of threads, with each group of threads corresponding to a group of pixels (e.g. 16×16) to be processed. As illustrated, the block "n" is encountered first within the logical sequence and is then followed by blocks "n+1", "n+2" and "n+3". Each of the groups of threads (one thread per position) corresponds to a block of values to be processed so as, for example, to form a layer associated with a block of pixels within an image to be generated. Each thread may effectively calculate a fragment contributing towards a pixel value within the block of pixels concerned. The predetermined logical sequence corresponds to threads which are to be executed.
Within the predetermined logical sequence illustrated in Figure 2, the group "n" will be logically intended to be issued to the processing pipeline 8 earliest.
Figure 3 schematically illustrates groups (blocks) of threads corresponding to those illustrated in Figure 2, but in this case with threads being issued both in accordance with a pilot sequence and a main sequence. In the example illustrated, the current next thread issue point within the pilot sequence is marked with an "x". The current next thread issue point from within the main sequence is marked with an "o". As illustrated, the pilot sequence extends more than one group ahead of the current next thread issue point of the main sequence. As the separation in time ahead of the main sequence thread issue point increases, the temporal spacing between threads which form part of the pilot sequence also increases. Accordingly, there are many more main threads to be issued from block "n+2" than there are pilot threads within block "n+1", and in turn many more pilot threads within block "n+1" than within block "n". The time gap between a given thread within the pilot sequence being issued and one of its neighbours within the logical sequence being issued as part of the main sequence is at least equal to the memory latency associated with a cache miss, and preferably exceeds this time.
Figure 4 schematically illustrates an interleaved main sequence in which main sequence threads are issued. It will be appreciated that some threads within the path illustrated in Figure 4 which have already been issued as part of the pilot sequence will be omitted from the main sequence. Accordingly, the main sequence can be considered to be the remainder of the predetermined logical sequence which has not already been issued as part of the pilot sequence.
Figure 5 schematically illustrates a diagonal pilot sequence within a group of threads corresponding to a block of pixels. Such a diagonal path of the pilot sequence through the threads, when these are considered in their spatial position corresponding to the block of pixels, has the result that one thread corresponding to each row and each column is included within the pilot sequence and accordingly will trigger any necessary cache miss for data values associated with the surrounding pixels.
Other possible pilot sequences include a horizontal pilot sequence and a vertical pilot sequence as illustrated by the dashed lines in Figure 5. Such horizontal and vertical pilot sequences may be suitable for some layouts of the data values within the memory address space, but not for others. Accordingly, for example, a vertical pilot sequence suitable for accessing one data value within each row of a sequence of data values set out in a horizontal raster scan order within the memory address space would not be suitable if that image were rotated through 90 degrees, such that the vertical pilot sequence then served to access data values within a single horizontal raster line as the data values are arranged within the memory address space.
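The diagonal and line pilot patterns are simple to express. A minimal sketch follows, assuming a 16×16 block with one thread per pixel position and (row, column) coordinates; the function names are illustrative assumptions:

    BLOCK = 16  # matches the 16x16 block example used in this description

    def diagonal_pilot(block=BLOCK):
        """One pilot position in every row and every column, on the diagonal."""
        return [(i, i) for i in range(block)]

    def horizontal_pilot(row, block=BLOCK):
        """A pilot line parallel to the row direction."""
        return [(row, x) for x in range(block)]

    def vertical_pilot(col, block=BLOCK):
        """A pilot line parallel to the column direction."""
        return [(y, col) for y in range(block)]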
Figure 6 illustrates another example of a pilot sequence, in this case a tiled sequence. As will be seen, each horizontal row within the pilot sequence contains two pilot threads with three vacant spaces therebetween. The pilot threads within adjacent rows are equidistant from the pilot threads within their neighbour rows. Also illustrated in Figure 6 is the idea of a cluster of pixels. In practice, threads can be issued in clusters corresponding to a cluster of four pixel values. These clusters of threads have corresponding cluster positions which may be arranged in lines corresponding to one of the rows or columns through the array of cluster positions.
It will be appreciated that the pattern of pilot threads illustrated in Figure 6 provides good coverage spatially distributed across the group of threads. The particular order in which the pilot threads are issued out of this pattern may vary whilst still giving the appropriate coverage. In practice, there may be a preference for issuing the pilot threads out of the pilot sequence positions illustrated in Figure 6 so as to correspond roughly to the order in which the main threads will be issued out of the main sequence, so as to increase the spacing in time of a pilot thread from its neighbours within the main sequence.
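A minimal sketch of generating such a tiled pattern, assuming an 8×8 array of cluster positions (a 16×16 pixel block of 2×2-pixel clusters); the period of four positions and the half-period offset between lines are read off the description of Figure 6 and are otherwise assumptions:

    def tiled_pilot_clusters(rows=8, cols=8, period=4):
        """Cluster positions with three vacant positions between clusters in
        a line, and alternate lines offset by half a period so each cluster
        is equidistant from its nearest neighbours in the adjacent line."""
        positions = []
        for r in range(rows):
            offset = (period // 2) * (r % 2)  # shift odd lines by two positions
            positions.extend((r, c) for c in range(offset, cols, period))
        return positions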
Figure 7 illustrates another pilot sequence. This pilot sequence corresponds to a solution of the eight queens problem from the field of chess. The eight queens problem is how to position eight queens on a chess board so that each queen shares neither a row, column nor diagonal with any other queen. The eight queens problem is analogous to the problem of triggering earlier prefetches with the pilot sequence, as it is desired to select the pilot threads forming part of the pilot sequence such that they provide good coverage among the different rows, columns and diagonals within the array of threads (pixels), but without unwanted redundancy.
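A minimal sketch of the eight queens property as a check over cluster positions; the example placement is one well-known solution, chosen arbitrarily here (the patent does not say which solution Figure 7 shows):

    def is_queens_placement(positions):
        """True if no two positions share a row, column or diagonal."""
        rows = [r for r, _ in positions]
        cols = [c for _, c in positions]
        diag_down = [r - c for r, c in positions]
        diag_up = [r + c for r, c in positions]
        return all(len(set(v)) == len(v)
                   for v in (rows, cols, diag_down, diag_up))

    # One known solution of the eight queens problem (columns 0,4,7,5,2,6,1,3):
    EXAMPLE_PILOT_CLUSTERS = [(0, 0), (1, 4), (2, 7), (3, 5),
                              (4, 2), (5, 6), (6, 1), (7, 3)]
    assert is_queens_placement(EXAMPLE_PILOT_CLUSTERS)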
Figure 8 is a flow diagram schematically illustrating operation of the issue controller 14 in controlling which thread to issue next. At step 18, processing waits until there is a slot available at the head of the processing pipeline 8 into which a thread may be issued (e.g. an existing thread is retired). Step 20 then determines whether there is any thread in the main sequence which is greater than a threshold age. This threshold age corresponds to a delay since that thread was added to the issue queue. Main sequence threads are given priority for issue if they are older than this threshold age. If there are any main sequence threads greater than the threshold age, then step 22 selects the oldest of these for issue from the main sequence.
If the determination at step 20 is that there are no such main sequence threads, then step 24 determines whether there are currently fewer than a target number of pilot threads in progress within the processing pipeline 8. If there are fewer than this target number of threads, then step 26 serves to issue a thread from the pilot sequence as the next thread.
If there are not fewer than this target number of threads, then processing again proceeds to step 22, where an oldest main sequence thread is issued. The processing illustrated in Figure 8 implements an issue hierarchy in which main sequence threads are given priority if they are greater than a threshold age. Following this, pilot threads are given priority if fewer than a target number of pilot threads are currently in execution. Following this, the oldest main sequence thread is given priority.
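Putting the Figure 8 hierarchy into a minimal sketch (queue entries as sketched earlier, oldest first; the age threshold and pilot target values below are assumed, programmable parameters, not figures from the patent):

    AGE_THRESHOLD = 512  # cycles a main thread may wait before taking priority (assumed)
    TARGET_PILOTS = 32   # target number of pilot threads in flight (assumed)

    def select_next(main_q, pilot_q, pilots_in_flight, now):
        """Implements the issue hierarchy of Figure 8. Queues are oldest-first."""
        # Steps 20/22: an over-age main-sequence thread is issued first.
        if main_q and now - main_q[0].enqueue_time > AGE_THRESHOLD:
            return main_q.pop(0)
        # Steps 24/26: otherwise issue a pilot thread while below the target.
        if pilot_q and pilots_in_flight < TARGET_PILOTS:
            return pilot_q.pop(0)
        # Step 22: otherwise fall back to the oldest main-sequence thread.
        return main_q.pop(0) if main_q else None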
The issue queue data 16 held by the issue controller 14 includes priority data indicating whether a given thread is a high priority thread (pilot sequence) or a low priority thread (main sequence). In addition, time data is associated with each thread indicating the time at which it was added to the queues of threads awaiting issue by the issue controller 14. In practice, the issue controller 14 can be considered to maintain both a high priority pilot thread queue and a low priority main thread queue. A software programmable target number of high priority threads to be kept in execution within the processing pipeline 8 is input to the issue controller 14. For example, this target number of threads may be 16, 32 or 48, depending upon circumstances, when using, for example, a processing pipeline capable of the parallel execution of 128 threads.

Claims (20)

  1. Apparatus for processing data comprising: a processing pipeline configured to execute in parallel a plurality of threads within a predetermined logical sequence of threads to be executed; and an issue controller configured to issue threads to said processing pipeline for execution; wherein said issue controller is configured to select threads from said predetermined logical sequence for issue in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
  2. Apparatus as claimed in claim 1, comprising: a cache memory coupled to said processing pipeline and configured to store data values fetched from a main memory, a cache miss within said cache memory triggering a fetch operation lasting a latency time to fetch a data value from said main memory to said cache memory; and a stall manager coupled to said processing pipeline and configured to stall a given processing thread executing in said processing pipeline upon detection of a miss within said cache memory for a data value to be accessed by said given thread and to unstall said given thread when said data value has been fetched to said cache memory.
  3. Apparatus as claimed in claim 2, wherein said delay time is greater than or equal to said latency time.
  4. Apparatus as claimed in any one of claims 1, 2 and 3, wherein said predetermined logical sequence comprises a sequence of groups of threads, each said group of threads comprising a plurality of threads adjacent within said predetermined logical sequence.
  5. Apparatus as claimed in claim 4, wherein said issue controller selects threads to issue from said pilot sequence and said main sequence such that a next pilot thread to be issued in accordance with said pilot sequence is within a group of threads at least one pilot group ahead of a next main thread to be issued in accordance with said main sequence.
  6. Apparatus as claimed in claim 5, wherein said pilot sequence extends through a plurality of pilot groups ahead of said next main thread, a number of pilot threads within each of said plurality of pilot groups ahead of said next main thread reducing as separation from said next main thread increases.
  7. Apparatus as claimed in any one of claims 4, 5 and 6, wherein each group of threads is associated with a block of pixel values within an image and each of said threads within a group of threads corresponds to processing associated with a pixel value within said block of pixel values.
  8. Apparatus as claimed in claim 7, wherein said main sequence corresponds to an interleaved scan pattern through each block of pixel values.
  9. Apparatus as claimed in any one of claims 7 and 8, wherein said pilot sequence corresponds to one of: (i) a diagonal line through each said block of pixels; (ii) a line parallel to one of a row direction and a column direction within each said block of pixels; (iii) clusters of one or more pixels disposed at predetermined positions within an array of possible cluster positions within each said block of pixels, said array of cluster positions comprising cluster lines of adjacent cluster positions disposed parallel to one of a row direction and a column direction of said block of pixels, said array divisible into a plurality of adjacent parallel lines of cluster positions such that (a) within a given line each cluster is separated by three vacant cluster positions from any other nearest neighbour cluster within said given line and (b) each cluster in a neighbouring line adjacent said given line is positioned equidistant from any nearest neighbour cluster in said given line; and (iv) clusters of one or more pixels disposed at predetermined positions within an array of possible cluster positions within each said block of pixels, said clusters disposed within said array of cluster positions such that no cluster shares a cluster row, a cluster column or a cluster diagonal within said array of cluster positions.
  10. Apparatus as claimed in any one of claims 7 to 9, wherein each group of threads corresponds to a layer within a plurality of layers of processing that generate said block of pixel values.
  11. Apparatus as claimed in claim 1, wherein said pilot sequence extends through said predetermined logical sequence ahead of a next main thread to be issued in accordance with said main sequence such that positions of pilot threads within said predetermined logical sequence increase in separation from each other as separation from said next main thread increases.
  12. Apparatus as claimed in any one of the preceding claims, wherein said issue controller stores issue queue data identifying a plurality of threads waiting within an issue queue to be executed and said issue controller selects threads to issue for execution by said processing pipeline following said main sequence and said pilot sequence in accordance with said issue queue data.
  13. Apparatus as claimed in claim 12, wherein said issue queue data identifies threads within said pilot sequence as having a high priority and threads within said main sequence as having a low priority.
  14. Apparatus as claimed in claim 13, wherein threads are added to said issue queue in said predetermined logical sequence and said issue queue data identifies a time at which a thread was added to said issue queue.
  15. Apparatus as claimed in claim 14, wherein said issue controller selects a next thread to issue in accordance with a hierarchy comprising: an oldest low priority thread exceeding a threshold time waiting in said issue queue; an oldest high priority thread waiting in said issue queue if less than a target number of high priority threads are in execution by said processing pipeline; and an oldest low priority thread.
  16. Apparatus as claimed in claim 15, wherein said target number is software programmable.
  17. Apparatus for processing data comprising: processing pipeline means for executing in parallel a plurality of threads within a predetermined logical sequence of threads to be executed; and issue control means for issuing threads to said processing pipeline means for execution; wherein said issue control means selects threads from said predetermined logical sequence for issue in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
  18. A method of processing data comprising the steps of: executing in parallel within a processing pipeline a plurality of threads within a predetermined logical sequence of threads to be executed; and selecting threads from said predetermined logical sequence for issue to said processing pipeline in accordance with both: (i) a pilot sequence being a proper subset of said predetermined logical sequence; and (ii) a main sequence trailing said pilot sequence through said predetermined logical sequence by a delay time and comprising those threads of said predetermined logical sequence not within said pilot sequence.
  19. Apparatus for processing data substantially as hereinbefore described with reference to the accompanying drawings.
  20. A method of processing data substantially as hereinbefore described with reference to the accompanying drawings.
GB1402259.4A 2014-02-10 2014-02-10 Thread issue control Active GB2522910B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1402259.4A GB2522910B (en) 2014-02-10 2014-02-10 Thread issue control
US14/596,948 US9753735B2 (en) 2014-02-10 2015-01-14 Thread issue control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1402259.4A GB2522910B (en) 2014-02-10 2014-02-10 Thread issue control

Publications (3)

Publication Number Publication Date
GB201402259D0 GB201402259D0 (en) 2014-03-26
GB2522910A true GB2522910A (en) 2015-08-12
GB2522910B GB2522910B (en) 2021-04-07

Family

ID=50390728

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1402259.4A Active GB2522910B (en) 2014-02-10 2014-02-10 Thread issue control

Country Status (2)

Country Link
US (1) US9753735B2 (en)
GB (1) GB2522910B (en)

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
KR20180037839A (en) * 2016-10-05 2018-04-13 삼성전자주식회사 Graphics processing apparatus and method for executing instruction
GB2583061B (en) * 2019-02-12 2023-03-15 Advanced Risc Mach Ltd Data processing systems

Citations (6)

Publication number Priority date Publication date Assignee Title
US20020055964A1 (en) * 2000-04-19 2002-05-09 Chi-Keung Luk Software controlled pre-execution in a multithreaded processor
US20040128489A1 (en) * 2002-12-31 2004-07-01 Hong Wang Transformation of single-threaded code to speculative precomputation enabled code
US20040148491A1 (en) * 2003-01-28 2004-07-29 Sun Microsystems, Inc. Sideband scout thread processor
US20090199170A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Helper Thread for Pre-Fetching Data
US20110231612A1 (en) * 2010-03-16 2011-09-22 Oracle International Corporation Pre-fetching for a sibling cache
US20110296431A1 (en) * 2010-05-25 2011-12-01 International Business Machines Corporation Method and apparatus for efficient helper thread state initialization using inter-thread register copy

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US9659339B2 (en) * 2003-10-29 2017-05-23 Nvidia Corporation Programmable graphics processor for multithreaded execution of programs
US8621478B2 (en) * 2010-01-15 2013-12-31 International Business Machines Corporation Multiprocessor system with multiple concurrent modes of execution
US9021493B2 (en) * 2012-09-14 2015-04-28 International Business Machines Corporation Management of resources within a computing environment
US9348599B2 (en) * 2013-01-15 2016-05-24 International Business Machines Corporation Confidence threshold-based opposing branch path execution for branch prediction

Also Published As

Publication number Publication date
US9753735B2 (en) 2017-09-05
US20150227376A1 (en) 2015-08-13
GB201402259D0 (en) 2014-03-26
GB2522910B (en) 2021-04-07
