GB2519103A - Decoding a complex program instruction corresponding to multiple micro-operations - Google Patents

Info

Publication number
GB2519103A
GB2519103A GB1317857.9A GB201317857A GB2519103A GB 2519103 A GB2519103 A GB 2519103A GB 201317857 A GB201317857 A GB 201317857A GB 2519103 A GB2519103 A GB 2519103A
Authority
GB
United Kingdom
Prior art keywords
micro
instruction
program
operations
fetch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1317857.9A
Other versions
GB2519103B (en)
GB201317857D0 (en)
Inventor
Rune Holm
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ARM Ltd
Original Assignee
ARM Ltd
Advanced Risc Machines Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ARM Ltd, Advanced Risc Machines Ltd filed Critical ARM Ltd
Priority to GB1317857.9A priority Critical patent/GB2519103B/en
Publication of GB201317857D0 publication Critical patent/GB201317857D0/en
Priority to US14/466,183 priority patent/US9934037B2/en
Priority to KR1020140127455A priority patent/KR102271986B1/en
Priority to CN201410521111.7A priority patent/CN104572016B/en
Publication of GB2519103A publication Critical patent/GB2519103A/en
Application granted granted Critical
Publication of GB2519103B publication Critical patent/GB2519103B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical
Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/22Microcontrol or microprogram arrangements
    • G06F9/26Address formation of the next micro-instruction ; Microprogram storage or retrieval arrangements
    • G06F9/262Arrangements for next microinstruction selection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/3017Runtime instruction translation, e.g. macros
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/52Binary to binary
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/321Program or instruction counter, e.g. incrementing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3818Decoding for concurrent execution

Abstract

A multi-threaded processor 2 uses a single instruction decoder 30, which is shared between the threads. The decoder generates micro-operations, which are executed by the processor's processing circuitry 4. Fetch units 8, each of which is associated with a particular thread, send requests to the decoder. The requests identify a specific micro-operation of a particular complex program instruction to be executed by the processing circuitry. Each fetch unit may place the micro-operations in a corresponding queue 6. A cache 20 may be used to store micro-operations already decoded by the decoder. The fetch units may maintain a program counter 12, a micro program counter 14 and a flag indicating that a micro-operation is the last one in a complex program instruction. An instruction buffer 42 may hold instructions which have been decoded. The processor may be a graphics processing unit (GPU).

Description

DECODING A COMPLEX PROGRAM INSTRUCTION CORRESPONDING TO
MULTIPLE MICRO-OPERATIONS
The present technique relates to the field of data processing. More particularly, it relates to decoding program instructions to generate micro-operations in a data processing apparatus supporting parallel threads of processing.
Some instruction sets include some complex program instructions which correspond to multiple micro-operations to be performed by the processing circuitry, each micro-operation representing part of the operation associated with the program instruction. Hence, an instruction decoder may decode program instructions to generate micro-operations to be performed by the processing circuitry.
Some processing apparatuses support multiple parallel threads of processing. Separate fetch units may be provided for respective threads to trigger fetches of micro-operations into the processing circuitry. A shared instruction decoder may be provided to generate the micro-operations required by the respective fetch units. Typical instruction decoders decode complex program instructions as a single entity so that, in response to a request for decoding of the complex program instruction from one of the fetch units, the instruction decoder will then generate all the micro-operations corresponding to that complex program instruction in successive cycles. However, this can be problematic in a system where the instruction decoder is shared between multiple fetch units corresponding to threads of processing. If one of the fetch units is stalled partway through fetching the micro-operations corresponding to a complex program instruction, so cannot accept further micro-operations for the same complex program instruction, then the shared decoder will also stall because it is committed to finishing all the micro-operations for the complex program instruction. This prevents other fetch units from receiving decoded micro-operations from the instruction decoder until the stall of the first fetch unit is resolved, even though those other fetch units could have accepted micro-operations. This causes reduced processing performance. The present technique seeks to address this problem.
Viewed from one aspect, the present invention provides a data processing apparatus comprising: processing circuitry configured to process a plurality of threads of processing in parallel; a shared instruction decoder configured to decode program instructions to generate micro-operations to be processed by the processing circuitry, the program instructions comprising at least one complex program instruction corresponding to a plurality of micro-operations; and a plurality of fetch units configured to fetch, for processing by the processing circuitry, the micro-operations generated by the shared instruction decoder, each fetch unit associated with at least one of the plurality of threads; wherein the shared instruction decoder is configured to generate each micro-operation in response to a decode request triggered by one of the plurality of fetch units; and the shared instruction decoder is configured to generate the plurality of micro-operations of a complex program instruction individually in response to separate decode requests each identifying which micro-operation of the complex program instruction is to be generated by the shared instruction decoder in response to the decode request.
The instruction decoder of the present technique generates the micro-operations of a complex program instruction individually in response to separate decode requests triggered by the fetch units. Each decode request may identify a particular micro-operation of the complex program instruction which is to be generated in response to the decode request. Hence, rather than generating all the micro-operations in response to a single request as in previous systems, after each successive decode request the instruction decoder may decode the requested micro-operation of the complex program instruction and then wait for a further decode request before decoding another micro-operation. By requiring each micro-operation to be requested individually, this allows the shared instruction decoder to switch which instruction is being decoded partway through generating the micro-operations for a complex program instruction.
Therefore, even if one fetch unit stalls after only some of the micro-operations required for a complex program instruction have been generated, the decoder can switch to generating micro-operations requested by another fetch unit and then return to generating the remaining micro-operations of the first program instruction when the fetch unit requesting these micro-operations has unstalled. This reduces the number of cycles in which the instruction decoder is inactive and hence improves processing performance of the data processing apparatus as a whole.
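The per-request decode scheme described above can be sketched as a small software model (a hypothetical illustration only: the instruction map, addresses and micro-op names are invented, not taken from the patent):

```python
# Hypothetical model of a shared decoder that generates exactly one
# micro-operation per decode request, identified by (program counter,
# micro program counter). Instruction contents are invented for illustration.
PROGRAM = {
    0x100: ["LDM.uop0", "LDM.uop1", "LDM.uop2"],  # complex: three micro-ops
    0x104: ["ADD.uop0"],                          # simple: one micro-op
}

def decode(pc, mpc):
    """Generate only the requested micro-operation, plus a flag indicating
    whether it is the last micro-operation of the instruction."""
    uops = PROGRAM[pc]
    return uops[mpc], mpc == len(uops) - 1

# Fetch unit 0 requests the first micro-op of the instruction at 0x100, then
# stalls; the decoder is free to serve fetch unit 1 before resuming 0x100.
trace = [decode(0x100, 0), decode(0x104, 0), decode(0x100, 1)]
```

Because each request names its micro-operation explicitly, the decoder keeps no commitment to finish an instruction before serving another fetch unit.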
In some examples, each thread to be processed by the processing circuitry may have its own fetch unit for fetching the micro-operations to be performed for that thread. In other examples, at least one of the fetch units may be shared between multiple threads.
In some examples, each fetch unit may send the fetched micro-operations directly to the processing circuitry for processing. On the other hand, micro-operation queues may be provided, each queue corresponding to one of the fetch units so that the micro-operations fetched by the fetch unit are queued in the corresponding queue. The queued micro-operations may then be issued for processing by the processing circuitry. If micro-operation queues are provided, then the fetch unit may for example request the next micro-operation when space becomes available in the queue.
The fetch unit may trigger the decode request in different ways. In some examples, the micro-operations generated by the instruction decoder may be passed directly to the corresponding fetch unit. In this case, the fetch unit may generate the decode request identifying a selected micro-operation which is to be decoded and fetched for processing by the processing circuitry. In response to the decode request from the fetch unit, the instruction decoder may generate the selected micro-operation and send it to the fetch unit.
In other examples, the fetch unit may indirectly trigger the decode request and need not generate the decode request itself. For example, a micro-operation cache may be provided to store the micro-operations generated by the shared instruction decoder. Often, the same micro-operation may be required multiple times within the same thread or within different threads, and so by caching the micro-operations generated by the decoder, energy efficiency can be improved since this avoids the need for the decoder to repeatedly generate the same micro-operation. If the micro-operation cache is provided, then the fetch circuitry may provide a fetch request to the micro-operation cache to request fetching of a selected micro-operation from the cache, and then the micro-operation cache may trigger the decode request to the instruction decoder if the selected micro-operation is not in the cache. If the selected micro-operation is already in the cache then a decode request may be unnecessary. The decode request triggered by the micro-operation cache may pass directly to the instruction decoder, or indirectly via another circuit element such as a higher level cache storing the program instructions to be decoded.
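A decode-on-miss micro-operation cache of this kind might be modelled as follows (an illustrative sketch; the decoder callback and the (PC, micro PC) keying scheme are assumptions for the example, not details from the patent):

```python
class MicroOpCache:
    """Caches decoded micro-operations keyed by (PC, micro PC); on a miss it
    triggers a decode request to the shared decoder."""
    def __init__(self, decoder):
        self.decoder = decoder       # callback standing in for a decode request
        self.store = {}
        self.decode_requests = 0

    def fetch(self, pc, mpc):
        key = (pc, mpc)
        if key not in self.store:    # miss: only now is the decoder invoked
            self.decode_requests += 1
            self.store[key] = self.decoder(pc, mpc)
        return self.store[key]       # hit: no decode request needed

cache = MicroOpCache(lambda pc, mpc: f"uop@{pc:#x}.{mpc}")
first = cache.fetch(0x200, 0)        # miss: decoded and cached
again = cache.fetch(0x200, 0)        # hit: served without a decode request
```

Repeated fetches of the same micro-operation cost only a cache lookup, which is the energy saving the passage describes.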
The micro-operation cache may support a greater number of requests per processing cycle than a number of decode requests per processing cycle supported by the shared instruction decoder. This means that the cache can provide an apparent instruction fetch bandwidth to the fetch unit which is greater than the shared instruction decoder can sustain. This is particularly useful when there are multiple fetch units corresponding to a single shared instruction decoder.
In embodiments where the shared instruction decoder can support two or more decode requests per processing cycle, so that multiple micro-operations can be generated in the same cycle, the two or more decode requests may be for micro-operations corresponding to different program instructions altogether, or for different micro-operations of the same program instruction.
Nevertheless, even where multiple micro-operations are generated in the same cycle, each micro-operation may still be generated in response to a separate decode request.
As well as generating the micro-operation itself, the shared instruction decoder may also generate a corresponding control flag indicating whether the generated micro-operation is the last micro-operation for the corresponding instruction. The fetch unit may maintain a program counter and a micro program counter for identifying the next micro-operation to be fetched. The program counter indicates the program instruction corresponding to the next micro-operation to be fetched and the micro program counter indicates which micro-operation of that instruction is the next micro-operation to be fetched. The control flag allows the fetch unit to determine whether to increment the micro program counter or the program counter when it receives the fetched micro-operation. If the control flag for a fetched micro-operation indicates that the fetched micro-operation is not the last micro-operation, then the micro program counter may be incremented to indicate that the following micro-operation for the same instruction should be fetched next. On the other hand, if the control flag indicates that the fetched micro-operation is the last micro-operation, then the program counter may be incremented to indicate the next program instruction. When incrementing the program counter, the micro program counter may also be reset to indicate the first micro-operation to be fetched for the next program instruction.
By generating the control flag using the instruction decoder when a micro-operation is decoded, the fetch unit does not need to keep track of how many micro-operations correspond to each program instruction or whether there are any further micro-operations to be received for the same instruction. This simplifies the configuration of the fetch unit.
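The counter update driven by the control flag can be sketched as follows (a fixed 4-byte instruction size is an assumption made purely for the example):

```python
INSTR_SIZE = 4  # assumed fixed instruction size, for illustration only

def advance(pc, mpc, last_flag):
    """Update the fetch unit's program counter (pc) and micro program
    counter (mpc) after receiving a micro-operation and its control flag."""
    if last_flag:
        return pc + INSTR_SIZE, 0   # move to the next instruction; reset micro PC
    return pc, mpc + 1              # same instruction; request the next micro-op
```

The fetch unit only inspects the flag; it never needs to know in advance how many micro-operations an instruction expands to.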
The apparatus may have an instruction buffer which stores one or more program instructions previously decoded by the shared instruction decoder. Since each micro-operation of a complex program instruction is decoded individually in response to separate requests, the same program instruction may be required for several successive processing cycles to allow the decoder to generate all the micro-operations for that instruction. By storing one or more recently decoded program instructions in the instruction buffer, performance and energy efficiency can be improved because this reduces the likelihood that the same instruction needs to be fetched multiple times from a higher level instruction data store, such as an instruction cache or memory.
When a decode request is received for a given program instruction, the decoder may check whether the instruction is in the instruction buffer, and if so fetch it from the instruction buffer. If the specified program instruction is not in the instruction buffer, then the decoder may obtain the specified program instruction from an instruction cache or memory. Typically, the buffer may store the one or more most recently decoded program instructions, although it could instead have a more complicated eviction scheme for determining which program instructions should be buffered and which should be evicted from the buffer. Also, while it is possible for the buffer to store more than one instruction, in many cases a significant performance improvement may be achieved with a buffer with capacity for only a single program instruction, and this will be more efficient to implement in hardware than a larger buffer. In embodiments where the buffer only stores one instruction and the most recently decoded instruction is placed in the buffer and then overwritten with the next instruction when the next instruction is decoded, the instruction decoder can determine whether a required instruction is in the buffer by checking whether the program counter for the current decode request is the same as the program counter for the preceding decode request. On the other hand, if the decoder supports multiple decode requests per cycle then it may be useful to provide a buffer capable of holding multiple instructions, in which case it may be required to match the program counter against address tags stored with each instruction in the buffer.
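A single-entry buffer that hits whenever consecutive decode requests share a program counter might look like this (a sketch under assumed interfaces; the fetch callback stands in for the instruction cache):

```python
class SingleEntryInstructionBuffer:
    """Holds the most recently decoded instruction; a decode request hits if
    its program counter matches that of the preceding request."""
    def __init__(self, instruction_fetch):
        self.instruction_fetch = instruction_fetch  # stand-in for the I-cache
        self.last_pc = None
        self.last_instr = None
        self.icache_fetches = 0

    def get(self, pc):
        if pc != self.last_pc:       # miss: fetch and overwrite the buffer
            self.icache_fetches += 1
            self.last_pc, self.last_instr = pc, self.instruction_fetch(pc)
        return self.last_instr       # hit: reuse the buffered instruction

buf = SingleEntryInstructionBuffer(lambda pc: f"instr@{pc:#x}")
for mpc in range(3):                  # three micro-ops of one instruction
    buf.get(0x100)                    # only the first call reaches the cache
```

Three successive decode requests for the same instruction cost a single instruction-cache fetch, which is the saving the paragraph describes.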
The processing circuitry may process the threads in parallel in different ways. In some cases, the processing circuitry may have multiple processing units which can each process at least one of the threads. On the other hand, other embodiments may perform time division multiplexing of threads using a common set of hardware, for example with each thread having an allocated time slot when the processing circuitry executes that thread. Hence, it is not essential for the processing circuitry to actually perform several operations in parallel at any one time; it is sufficient that the threads are active simultaneously but processed one at a time by time division multiplexing.
In some cases, for at least one of the threads the processing circuitry may execute in parallel multiple instances of the same block of micro-operations in lockstep with different operands for each instance. This approach is sometimes referred to as single instruction multiple threading (SIMT). This is particularly useful for processing where the same set of micro-operations needs to be performed on many sets of data values, which is particularly common in graphics processing for example. To support this, the processing circuitry may have a set of parallel arithmetic units for performing each instance of the micro-operations, with the arithmetic units being controlled by a common control unit using a shared program counter and micro program counter. In some cases, there may be multiple parallel SIMT groups, each SIMT group processing a plurality of instances of the micro-operations for a given thread in parallel with different operand values.
The present technique is particularly useful for systems in which the processing circuitry comprises a graphics processing unit (GPU). Typically, GPUs may require a large number of threads of processing. While conventional GPUs would not use instruction sets having complex program instructions corresponding to multiple micro-operations, and so the problem addressed by the present technique does not often arise in such GPUs, the present technique recognises that it is desirable to provide a GPU which can be controlled using a general purpose instruction set architecture which includes complex program instructions. By using a general purpose instruction set to control a GPU having many threads, this allows for compatibility of the GPU with code written for other devices such as a central processing unit (CPU) or other general purpose processing units, making programming simpler. By addressing the performance bottleneck caused by decoding of complex instructions in a many-threaded system in which a shared instruction decoder is shared between multiple fetch units, the present technique facilitates the use of general purpose instruction set architectures in GPUs. This is in contrast to many GPUs which use their own special instruction set which typically would not have any complex program instructions.
Viewed from another aspect, the present invention provides a data processing apparatus comprising: processing means for processing a plurality of threads of processing in parallel; shared instruction decoding means for decoding program instructions to generate micro-operations to be processed by the processing means, the program instructions comprising at least one complex program instruction corresponding to a plurality of micro-operations; and a plurality of fetch means for fetching, for processing by the processing means, the micro-operations generated by the shared instruction decoding means, each fetch means associated with at least one of the plurality of threads; wherein the shared instruction decoding means is configured to generate each micro-operation in response to a decode request triggered by one of the plurality of fetch means; and the shared instruction decoding means is configured to generate the plurality of micro-operations of a complex program instruction individually in response to separate decode requests each identifying which micro-operation of the complex program instruction is to be generated by the shared instruction decoding means in response to the decode request.
Viewed from a further aspect, the present invention provides a data processing method, comprising: decoding program instructions with a shared instruction decoder to generate micro-operations to be processed, the program instructions comprising at least one program instruction corresponding to a plurality of micro-operations; and fetching for processing the micro-operations generated by the shared instruction decoder, wherein the fetching is performed with a plurality of fetch units, each fetch unit associated with at least one of a plurality of threads processed in parallel; wherein each micro-operation is generated by the shared instruction decoder in response to a decode request triggered by one of the plurality of fetch units; and the shared instruction decoder generates the plurality of micro-operations of a complex program instruction individually in response to separate decode requests each identifying which micro-operation of the complex program instruction is to be generated in response to the decode request.
As discussed above, the ability to individually generate each micro-operation of a complex program instruction in response to separate decode requests is useful because it permits switching of decoding between instructions after generating only some of the micro-operations of the first instruction. In response to a first decode request, the decoder may decode a first program instruction to generate a first micro-operation of the first program instruction. In response to a second decode request identifying a micro-operation of a second program instruction, the second program instruction can be decoded to generate the identified micro-operation. The decoder can later return to decoding the first program instruction in response to a third decode request requesting generation of a second micro-operation of the first program instruction. Hence, the decoder can interrupt decoding of one instruction and generate a micro-operation of another instruction before returning to the original instruction, which is not possible with typical decoding mechanisms.
Further aspects, features and advantages of the present technique will be apparent from the following description, which is to be read in conjunction with the accompanying drawings in which: Figure 1 schematically illustrates a portion of a data processing apparatus; Figure 2 schematically illustrates parallel processing of multiple instances of the same set of micro-operations; Figure 3 illustrates time division multiplexing of different threads of processing; Figure 4 illustrates parallel processing of threads using respective processing units; Figure 5 illustrates a problem encountered in previous systems where a stall in one fetch unit causes a stall in the instruction decoder even if another fetch unit could accept a decoded micro-operation; Figure 6 illustrates how this problem can be solved by generating the micro-operations of a complex instruction individually in response to the separate decode requests; Figure 7 illustrates a method of decoding instructions to generate micro-operations; Figures 8A and 8B illustrate functions performed by a micro-operation cache for storing decoded micro-operations generated by the decoder; and Figure 9 illustrates a method of fetching micro-operations to be performed by the processing circuitry.
Figure 1 schematically illustrates a portion of a data processing apparatus 2 for processing data. The apparatus 2 has processing circuitry 4 which can perform multiple parallel threads of processing. The apparatus 2 executes an instruction set which includes complex program instructions corresponding to multiple micro-operations to be performed by the processing circuitry 4. An example of a complex program instruction is a load or store multiple instruction for loading multiple values from memory into registers of the processing circuitry 4 or storing multiple values from registers of the processing circuitry 4 to memory. The load/store multiple instruction may be decoded to generate multiple micro-operations each for loading/storing one of the multiple values. Another example of a complex program instruction is an instruction for performing a relatively complex arithmetic operation such as a square root operation or floating point arithmetic operation. The complex arithmetic instruction may be mapped to several simpler micro-operations to be performed by the processing circuitry 4.
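The cracking of a load-multiple instruction into per-register micro-operations might look like this in software (the register names, tuple encoding and 4-byte stride are assumptions made for the example):

```python
def crack_load_multiple(base_addr, regs):
    """Split a load-multiple instruction into one load micro-operation per
    destination register, at consecutive 4-byte addresses (assumed stride)."""
    return [("LOAD", reg, base_addr + 4 * i) for i, reg in enumerate(regs)]

# A load-multiple of three registers expands to three load micro-operations.
uops = crack_load_multiple(0x1000, ["r0", "r1", "r2"])
```

Each generated micro-operation performs a simple single-value load, which is the kind of operation the processing circuitry 4 executes directly.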
Hence, while the processing circuitry 4 executes micro-operations (µops), the apparatus receives complex instructions I which need to be decoded into micro-operations. The instruction front end for fetching program instructions, decoding them into micro-operations, and fetching the micro-operations for processing is shown in Figure 1. It will be appreciated that the data processing apparatus 2 may have many other elements that are not shown in Figure 1 for conciseness.
The apparatus 2 has several instruction queues 6, each queue 6 corresponding to at least one thread of processing to be performed by the processing circuitry 4. Each queue has a limited amount of space for storing micro-operations to be performed by the processing circuitry 4. In the example of Figure 1, each queue 6 has a depth of four micro-operations, although in other examples the queues 6 may store a greater or smaller number of micro-operations, and it is possible for different queues 6 to store different numbers of micro-operations. Each queue has a corresponding fetch unit 8 for fetching micro-operations into the corresponding queue 6. Micro-operations from the queue are issued for processing by issue circuitry 10.
As shown in Figures 2 to 4, the processing circuitry 4 may handle the parallel processing of the respective threads represented by queues 6 in different ways. Figure 3 shows an example of time division multiplexing the respective threads so that a single processing unit can be shared between the threads. Each thread Q0, Q1, Q2, Q3 is allocated a time slot for processing by the processing circuitry 4. In some examples, the processing circuitry 4 may cycle through executing each thread Q0, Q1, Q2, Q3 in order, while in other examples there may be a priority mechanism or similar scheme for selecting which threads are executed when. Alternatively, as shown in Figure 4 the processing circuitry 4 may have multiple processing units 4-0, 4-1, 4-2 which can each process a respective thread simultaneously, so that multiple threads are executed at the same time.
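The in-order time-slot scheme of Figure 3 amounts to a round-robin schedule, which can be sketched as (the slot count is arbitrary for the example):

```python
from itertools import cycle

# Round-robin time-division multiplexing: a single processing unit cycles
# through the four threads, giving each one a time slot in turn.
threads = ["Q0", "Q1", "Q2", "Q3"]
slots = cycle(threads)
schedule = [next(slots) for _ in range(6)]  # the first six time slots
```

After all four threads have had a slot, the schedule wraps around to Q0 again.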
As shown in Figure 2, with either of the approaches of Figures 3 and 4 it is possible for the same group of micro-operations for a particular thread (e.g. thread Q0) to be processed multiple times in parallel with different operands being used for each instance of the group of micro-operations. This approach is sometimes referred to as SIMT. A single program counter and micro program counter is maintained for each of the instances 0, 1, 2, ... N so that the instances proceed in lockstep with the same instructions executed for each instance. However, different operand values may be used for each instance. As shown in Figure 2, for example the values added in response to the ADD micro-operation µop0 are different for each instance and produce different results. It is not essential for every micro-operation in the common block of micro-operations to be performed by every instance. For example, as shown in Figure 2 in response to a branch instruction BR some instances may branch to omit certain micro-operations such as the multiply micro-operation µop2 in Figure 2. Nevertheless, as processing proceeds in lockstep, the instances which do not require the multiply micro-operation must wait until the program counter or micro program counter has reached micro-operation µop3 before proceeding with that micro-operation. This approach is useful when the same set of operations needs to be performed on a large set of data values, which is often useful in graphics processing in particular.
For example, the common set of micro-operations may implement a fragment shader which determines what colour should be rendered in a given pixel of an image. The same fragment shader program may be executed in parallel for a block of adjacent pixels with different operands for each pixel. This approach is most efficient when the parallel instances do not diverge significantly from each other in terms of the path they take through the program or the memory accesses made. Any of the threads corresponding to the queues Q0, Q1, Q2 etc. may use such SIMT processing. In some embodiments, all of the queues of micro-operations may be carried out as a SIMT group on multiple sets of data values.
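Purely as an illustrative sketch (not part of the patent disclosure; all names here are our own), the lockstep SIMT behaviour described above can be modelled in a few lines of Python, with each instance applying the same micro-operation sequence to its own operand values. Handling divergence after a branch would additionally require a per-instance active mask, which is omitted here for brevity:

```python
# Hypothetical model of SIMT lockstep execution: one shared micro program
# counter, N instances, same micro-operations, per-instance operands.
def run_simt(micro_ops, operands_per_instance):
    regs = [dict(ops) for ops in operands_per_instance]  # per-instance registers
    mpc = 0  # single micro program counter shared by all instances
    while mpc < len(micro_ops):
        op = micro_ops[mpc]
        for r in regs:
            op(r)  # same operation applied to different operand values
        mpc += 1   # all instances advance in lockstep
    return regs
```

For example, a single ADD micro-operation applied to two instances with operands (1, 2) and (10, 20) produces the different per-instance results 3 and 30, mirroring the behaviour described for Figure 2.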
As shown in Figure 1, each fetch unit 8 may maintain a program counter 12 (PC) and a micro program counter 14 (MPC) which together indicate the next micro-operation to be fetched into the corresponding queue 6. The program counter 12 is an indication of which program instruction I corresponds to the next micro-operation to be fetched, and the micro program counter 14 indicates which micro-operation within that program instruction should be fetched next.
When space becomes available in the corresponding queue 6, then the fetch unit 8 issues a fetch request 16 to a level 0 (L0) instruction cache 20 for caching micro-operations. The fetch request 16 specifies the current value of the program counter 12 and micro program counter 14. In response to the fetch request 16, the L0 instruction cache 20 (also referred to as a micro-operation cache) checks whether it currently stores the micro-operation indicated by the program counter and micro program counter in the fetch request 16, and if so, then the L0 instruction cache 20 sends the requested micro-operation 22 to the fetch unit 8 which issued the fetch request 16.
On the other hand, if the requested micro-operation is not in the L0 instruction cache 20 then a decode request 24 is issued to a shared instruction decoder 30 which is shared between the respective fetch units 8. In some embodiments, the shared instruction decoder 30 could be a pre-decoder in a two-level decoding scheme, with the rest of the decoding happening later in the pipeline. The processing circuitry 4 may have a separate decoder for decoding micro-operations.
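As an illustrative sketch only (the class and method names are ours, not taken from the patent), the hit/miss behaviour of the L0 micro-operation cache can be modelled as a mapping keyed by the (program counter, micro program counter) pair, forwarding a decode request to the shared decoder on a miss:

```python
# Hypothetical model of the L0 micro-operation cache (20): a hit serves the
# cached micro-operation; a miss triggers a decode request to the decoder (30).
class MicroOpCache:
    def __init__(self, decoder):
        self.decoder = decoder
        self.entries = {}  # (pc, mpc) -> cached micro-operation (+ control flag)

    def fetch(self, pc, mpc):
        key = (pc, mpc)
        if key not in self.entries:
            # miss: issue a decode request identifying this one micro-operation
            self.entries[key] = self.decoder.decode(pc, mpc)
        return self.entries[key]
```

Because each entry is tagged by both counters, two micro-operations of the same complex instruction occupy distinct entries and can be decoded and fetched independently of one another.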
The decode request 24 specifies the program counter and micro program counter values indicated in the fetch request 16 which triggered the decode request, so that the decoder 30 can identify the micro-operation to be generated. In response to the decode request 24 the decoder 30 decodes the program instruction I indicated by the program counter of the decode request 24 to generate the micro-operation indicated by the micro program counter of the decode request 24. Unlike previous instruction decoders, for a complex program instruction I corresponding to multiple micro-operations, the decoder 30 generates a single micro-operation in response to the decode request 24, with other micro-operations for the same instruction I being generated in response to separate decode requests 24 for those micro-operations. Hence, each micro-operation of a complex program instruction is generated individually in response to a separate decode request 24.
The decoder 30 outputs the generated micro-operation 32 and a corresponding control flag 34 to the L0 instruction cache 20, which caches the micro-operation and control flag. The control flag 34 indicates whether the generated micro-operation 32 was the last micro-operation for the corresponding program instruction I or whether there are further micro-operations to be generated for that instruction I. The control flag L is provided to the fetch unit 8 along with a fetched micro-operation. As discussed with respect to Figure 9 below, the control flag L controls whether the fetch unit 8 increments the program counter 12 or the micro program counter 14 to indicate the next micro-operation to be fetched.
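Again as a hypothetical sketch (the `program` mapping is an assumption introduced for illustration, not an element of the patent), a decoder that emits one micro-operation per decode request together with the last-micro-operation control flag L might look like:

```python
# Hypothetical model of the shared decoder (30): each decode request names one
# micro-operation of one instruction; the control flag L is 1 only for the
# last micro-operation of that instruction.
class SharedDecoder:
    def __init__(self, program):
        self.program = program  # pc -> list of micro-operations for instruction

    def decode(self, pc, mpc):
        micro_ops = self.program[pc]
        last_flag = 1 if mpc == len(micro_ops) - 1 else 0
        return micro_ops[mpc], last_flag
```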
The shared instruction decoder 30 obtains instructions to be decoded from a level 1 (L1) instruction cache 40 which caches instructions fetched from memory. In other examples the L1 instruction cache 40 may not be provided and instead the shared decoder 30 may obtain the instructions directly from memory. However, providing the L1 instruction cache 40 is advantageous to reduce the latency and energy overhead associated with fetching instructions I into the decoder 30.
Since the decoder 30 decodes each micro-operation of a complex instruction individually in response to a separate decode request 24, it is possible that the same instruction may need to be decoded in several successive cycles. To improve performance, an instruction buffer 42 is provided between the L1 instruction cache 40 and the shared instruction decoder 30 to store at least one recently decoded instruction. In this embodiment, the buffer 42 stores the previously decoded instruction, so that if the same instruction is required in the next cycle then it can be fetched more efficiently from the buffer 42 instead of the L1 instruction cache 40. Hence, if the program counter of the decode request 24 is the same as the program counter for the previous decode request 24, then the decoder 30 can use the instruction in the buffer 42, and if the program counter is different to the previously requested program counter then the instruction can be fetched from the L1 instruction cache 40. In other embodiments, the buffer 42 may store multiple instructions and the decoder 30 can determine based on the address associated with each buffered instruction whether the instruction corresponding to the program counter of the decode request 24 is in the buffer 42.
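A minimal sketch of the single-entry instruction buffer, under the assumption (ours, for illustration only) that the L1 cache behaves as a mapping from program counter to instruction:

```python
# Hypothetical model of the instruction buffer (42): successive decode requests
# with the same program counter reuse the buffered instruction instead of
# re-reading the L1 instruction cache (40).
class BufferedInstructionFetch:
    def __init__(self, l1_cache):
        self.l1 = l1_cache          # pc -> instruction
        self.buffered_pc = None
        self.buffered_insn = None

    def get(self, pc):
        if pc != self.buffered_pc:  # buffer miss: fetch from the L1 cache
            self.buffered_pc = pc
            self.buffered_insn = self.l1[pc]
        return self.buffered_insn   # buffer hit: no L1 access needed
```

This captures why the buffer pays off for complex instructions: decoding a six-micro-operation LDM one micro-operation at a time issues six decode requests with the same program counter, but only the first needs an L1 lookup.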
The micro-operation queues 6 shown in Figure 1 are optional, and in other examples the fetch unit 8 may output the fetched micro-operations directly to the issue circuitry 10. However, the queues 6 enable improved performance because while the micro-operations of one thread are being issued from one queue 6 to the processing circuitry 4, the micro-operations for another thread can be fetched into another queue 6, rather than having to wait for the issue stage 10 to be ready for issuing micro-operations before they can be fetched from the L0 cache 20. In some embodiments the fetch unit 8 and corresponding instruction queue 6 may be combined in a single unit.
The micro-operation cache 20 allows the same micro-operation to be decoded once and then fetched multiple times, improving performance and reducing energy consumption by avoiding repeated decoding of the same micro-operation. The micro-operation cache 20 also improves the apparent fetch bandwidth since it can support a greater number of fetch requests 16 per processing cycle than the decoder 30 can support decode requests 24 per processing cycle.
Nevertheless, the micro-operation cache 20 may be optional and in other embodiments the shared decoder 30 may provide the micro-operations directly to the fetch units 8. In this case, the fetch unit 8 may send the fetch request 16 directly to the shared decoder 30, so that the fetch request 16 also functions as the decode request 24.
Similarly, the instruction buffer 42 is optional and in other examples the shared decoder may obtain all the program instructions I from the L1 instruction cache 40 or a memory.
Figures 5 and 6 show an example of how decoding each micro-operation separately in response to individual decode requests can improve performance. Figure 5 shows a comparative example showing stalling of the decoder 30 which can arise if a complex program instruction I is decoded in its entirety in response to a single decode request, as in previous decoding techniques.
Figure 5 shows an example in which in processing cycle 0 the fetch unit 0 issues a fetch request 16 to request fetching of a complex load multiple (LDM) program instruction which loads, for instance, six different values from memory into registers of the processing circuitry 4. Hence, the LDM instruction is decoded into six separate micro-operations µop0 to µop5. In response to the fetch request, the micro-operation cache 20 determines that the required operations are not in the cache 20 and so issues a corresponding decode request 24 to the decoder 30. In response to the decode request, the decoder 30 begins to decode the micro-operations for the LDM instruction at cycle 1, and then continues to generate the other micro-operations for the LDM instruction in the following cycles 2, 3, 4. However, at cycle 4 the fetch unit 0 is stalled, for example because the corresponding queue Q0 cannot accept any further micro-operations. The decoder is committed to generating all the micro-operations for the load multiple instruction, and cannot interrupt decoding partially through an instruction, because if the decoder interrupted decoding of the load multiple instruction then it would not know where to start again later. Therefore, the decoder must also stop decoding micro-operations, and so the stall propagates back from the fetch unit 0 to the decoder 30. Therefore, during processing cycles 5, 6, 7, no micro-operations are generated.
The decoder 30 only starts decoding again once the fetch unit 0 has unstalled, and then completes the remaining micro-operations µop4, µop5. Once all the micro-operations for the LDM instruction have been generated, the decoder 30 can then switch to generating a micro-operation ADD for another fetch unit 1. However, the fetch/decode request for the ADD instruction was made in processing cycle 4, and fetch unit 1 was not stalled and so could have accepted the ADD micro-operation if it had been generated during one of the cycles 5, 6, 7 when the decoder 30 was stalled.
In contrast, Figure 6 shows how the stalling of the decoder can be avoided by decoding each micro-operation of a complex instruction separately in response to separate decode requests.
The fetch unit 8 provides separate fetch requests for each individual micro-operation. Hence, the fetch unit 0 which requires the LDM instruction to be performed issues fetch requests in cycles 0 to 3 corresponding to micro-operations µop0 to µop3. The L0 cache 20 does not contain these micro-operations and so triggers corresponding decode requests to the decoder 30. The decoder responds to each decode request by generating the corresponding micro-operation in cycles 1 to 4.
When the fetch unit 0 stalls in cycle 4, then the decoder 30 does not stall because it is not committed to finishing all the micro-operations for the LDM instruction, since it can generate the remaining micro-operations µop4, µop5 later in response to separate decode requests. Therefore, in cycle 5 the decoder 30 can instead generate the ADD micro-operation required by fetch unit 1.
Similarly, decode requests for other instructions or from other fetch units could be handled by the decoder 30 during cycles 6 and 7. When the fetch unit 0 has unstalled at cycle 7, then it begins issuing fetch requests for the remaining micro-operations µop4, µop5 and this triggers new decode requests to the decoder 30 which then generates these micro-operations in cycles 8, 9.
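The scheduling freedom of Figure 6 can be sketched, hypothetically (the patent does not prescribe any particular arbitration policy, and the function and parameter names are ours), as a decoder that each cycle serves one decode request from any unstalled fetch unit:

```python
# Hypothetical cycle-by-cycle model: because every decode request names a
# single micro-operation, the decoder can serve another fetch unit while one
# unit is stalled, instead of stalling mid-instruction as in Figure 5.
def schedule(requests, stalled):
    """requests: per-unit lists of (pc, mpc); stalled: cycle -> set of units."""
    issued = []
    cursors = [0] * len(requests)
    cycle = 0
    while any(c < len(r) for c, r in zip(cursors, requests)):
        for unit, reqs in enumerate(requests):
            if cursors[unit] < len(reqs) and unit not in stalled.get(cycle, set()):
                issued.append((cycle, unit, reqs[cursors[unit]]))
                cursors[unit] += 1
                break  # one decode request served per cycle
        cycle += 1
    return issued
```

With fetch unit 0 requesting the six LDM micro-operations and stalled during cycles 4 to 6, unit 1's ADD is decoded during the stall rather than after it, as in Figure 6.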
Therefore, the decoder 30 can now rapidly switch between decoding different instructions, even part-way through an instruction, allowing performance to be improved.
Figure 7 is a flow diagram illustrating an example of the operation of the shared instruction decoder 30. At step 50, the decoder 30 receives a decode request 24 specifying a program counter and micro program counter which together identify the micro-operation to be generated. At step 52, the decoder 30 determines whether the program counter specified in the decode request 24 is the same as the program counter for the last decode request. If so, then at step 54 the required instruction is fetched from the instruction buffer 42 which stores the most recently decoded instruction. If the program counter is not the same as the program counter of the last decode request, then at step 56 the decoder 30 fetches the required instruction from the L1 instruction cache 40. The fetched instruction from step 54 or step 56 is then decoded at step 58 to generate the micro-operation identified by the micro program counter of the decode request 24.
At this point, the decoder 30 generates only the micro-operation specified by the decode request. Other micro-operations of the same program instruction either will already have been generated, or will be generated later in response to other decode requests. At step 60, the shared instruction decoder 30 determines whether the newly generated micro-operation is the last micro-operation for the decoded program instruction. If the generated micro-operation is the last micro-operation then at step 62 the control flag L is set to 1, while if there is still at least one remaining micro-operation to be generated then the control flag is set to 0 at step 64. At step 66, the micro-operation and the control flag L are output to the L0 instruction cache 20. At step 68, the decoder 30 waits for the next decode request, when the method returns to step 50. While Figure 7 shows handling of a single decode request, in some embodiments the decoder 30 may be able to service multiple decode requests per processing cycle, and in this case the steps of Figure 7 would be performed for each received decode request.
Figures 8A and 8B illustrate functions performed by the L0 instruction cache 20. Figure 8A shows a method of storing micro-operations in the cache 20. At step 70, the L0 instruction cache 20 receives a micro-operation from the shared instruction decoder 30. At step 72, the L0 instruction cache 20 stores the micro-operation and the control flag L in the cache 20. The cache also stores the program counter and the micro program counter corresponding to the micro-operation so that it can identify which micro-operations are stored in the cache and respond to fetch requests 16 accordingly. For example, the program counter and micro program counter may act as a tag for locating the entry of the L0 instruction cache 20 storing a required micro-operation.
Figure 8B shows the functions performed by the L0 instruction cache 20 in response to a fetch request 16. At step 80, a fetch request is received from one of the fetch units 8. The fetch request 16 specifies the current values of the program counter 12 and micro program counter 14 for the corresponding fetch unit 8. At step 82, the cache 20 determines whether the requested micro-operation identified by the program counter and micro program counter is stored in the cache 20. If so, then at step 84 the cache 20 provides the requested micro-operation and the corresponding control flag L to the fetch unit 8 that sent the fetch request 16. If the requested micro-operation is not in the cache then at step 86 the cache 20 sends a decode request 24 to the instruction decoder 30. The decode request 24 includes the program counter and the micro program counter that were included in the fetch request 16 that triggered the decode request 24.
The method then proceeds to step 80 where the cache 20 awaits the next fetch request 16. Again, the L0 instruction cache 20 may handle multiple fetch requests in parallel in the same processing cycle, in which case the steps of Figure 8B would be performed for each fetch request.
Figure 9 is a flow diagram illustrating functions performed by the fetch unit 8. At step 90, the fetch unit 8 determines whether there is space in the corresponding micro-operation queue 6 for the next micro-operation to be fetched. If there is space, then at step 92 the fetch unit 8 sends a fetch request 16 to the L0 instruction cache 20, the fetch request 16 indicating the current values of the program counter 12 and micro program counter 14 maintained by that fetch unit 8.
At step 94, the fetch unit 8 receives the requested micro-operation as well as the control flag L corresponding to that micro-operation. Step 94 may occur relatively soon after the fetch request was issued at step 92 if the requested micro-operation is stored in the L0 cache 20, or there could be a delay if the L0 cache 20 has to obtain the micro-operation from the decoder 30 first. At step 96, the fetch unit 8 adds the received micro-operation to the queue 6.
At step 98, the fetch unit 8 determines the value of the control flag L for the fetched micro-operation. If the control flag has a value of 1 then the fetched micro-operation is the last micro-operation for the current program instruction, and so at step 100 the fetch unit 8 increments the program counter 12 to indicate the next program instruction and resets the micro program counter 14 to indicate the first micro-operation to be fetched for the new program instruction. On the other hand, if at step 98 the fetch unit 8 determines that the control flag L has a value of 0 then the micro-operation is not the last micro-operation, and so at step 102 the fetch unit 8 increments the micro program counter to indicate the next micro-operation to be fetched for the same program instruction, and the program counter 12 is not incremented. In this context, the term "increment" means that the program counter or micro program counter is set to the value required for the next micro-operation to be fetched. The incrementing need not be by the same amount each time. For example, the program counter may generally be incremented by a certain amount such as an interval between addresses of adjacent instructions, but sometimes there may need to be a different increment amount. For example, the fetch unit may include a branch predictor and if a branch is predicted taken then a non-sequential instruction fetch may be performed. Also, while Figures 7 and 9 show an example where the value of 1 of the control flag L indicates the last micro-operation of a complex instruction and a value of 0 of the control flag L indicates a micro-operation other than the last micro-operation, in other examples these values could be swapped or this information could be represented in a different way.
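The counter update of steps 98 to 102 can be sketched as follows (a simplification under our own assumptions: a fixed instruction size stands in for the general "increment", and branch prediction is ignored):

```python
# Hypothetical model of Figure 9, steps 98-102: advance the program counter
# (PC) or the micro program counter (MPC) depending on the control flag L.
def advance_counters(pc, mpc, last_flag, insn_size=4):
    if last_flag:
        # last micro-operation: next instruction, first micro-operation
        return pc + insn_size, 0
    # more micro-operations remain for the current instruction
    return pc, mpc + 1
```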
Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.

Claims (22)

  1. CLAIMS 1. A data processing apparatus comprising: processing circuitry configured to process a plurality of threads of processing in parallel; a shared instruction decoder configured to decode program instructions to generate micro-operations to be processed by the processing circuitry, the program instructions comprising at least one complex program instruction corresponding to a plurality of micro-operations; and a plurality of fetch units configured to fetch, for processing by the processing circuitry, the micro-operations generated by the shared instruction decoder, each fetch unit associated with at least one of the plurality of threads; wherein the shared instruction decoder is configured to generate each micro-operation in response to a decode request triggered by one of the plurality of fetch units; and the shared instruction decoder is configured to generate the plurality of micro-operations of a complex program instruction individually in response to separate decode requests each identifying which micro-operation of the complex program instruction is to be generated by the shared instruction decoder in response to the decode request.
  2. 2. The data processing apparatus according to claim 1, comprising a plurality of micro- operation queues each corresponding to one of the fetch units and configured to queue the micro-operations fetched by the corresponding fetch unit for processing by the processing circuitry.
  3. 3. The data processing apparatus according to any of claims 1 and 2, wherein the fetch unit is configured to generate the decode request identifying a selected micro-operation to be generated by the shared instruction decoder and to be fetched for processing by the processing circuitry.
  4. 4. The data processing apparatus according to any of claims 1 and 2, comprising a micro-operation cache configured to store the micro-operations generated by the shared instruction decoder.
  5. 5. The data processing apparatus according to claim 4, wherein the fetch circuitry is configured to provide a fetch request to the micro-operation cache to request fetching of a selected micro-operation from the micro-operation cache; and the micro-operation cache is configured to trigger a decode request for the selected micro-operation if the selected micro-operation is not stored in the micro-operation cache.
  6. 6. The data processing apparatus according to any of claims 4 and 5, wherein the micro-operation cache is configured to support a greater number of fetch requests per processing cycle than the number of decode requests per processing cycle supported by the shared instruction decoder.
  7. 7. The data processing apparatus according to any preceding claim, wherein for each micro-operation, the shared instruction decoder is configured to generate a corresponding control flag indicating whether the micro-operation is the last micro-operation for the corresponding program instruction.
  8. 8. The data processing apparatus according to any preceding claim, wherein each fetch unit is configured to maintain a program counter and micro program counter for identifying the next micro-operation to be fetched, the program counter indicating the program instruction corresponding to said next micro-operation and the micro program counter indicating which micro-operation of the corresponding program instruction is said next micro-operation.
  9. 9. The data processing apparatus according to claim 8 when dependent on claim 7, wherein each fetch unit is configured to: (i) increment the micro program counter if the control flag for a fetched micro-operation indicates that the fetched micro-operation is not the last micro-operation for the corresponding program instruction; and (ii) increment the program counter if the control flag for a fetched micro-operation indicates that the fetched micro-operation is the last micro-operation for the corresponding program instruction.
  10. 10. The data processing apparatus according to any preceding claim, comprising an instruction buffer configured to store one or more program instructions previously decoded by the shared instruction decoder.
  11. 11. The data processing apparatus according to claim 10, wherein in response to the decode request for a specified program instruction: (a) if the specified program instruction is stored in the instruction buffer, then the shared instruction decoder is configured to obtain the specified program instruction from the instruction buffer; and (b) if the specified program instruction is not stored in the instruction buffer, then the shared instruction decoder is configured to obtain the specified program instruction from an instruction cache or memory.
  12. 12. The data processing apparatus according to any of claims 10 and 11, wherein the instruction buffer is configured to store the one or more program instructions that were most recently decoded by the shared instruction decoder.
  13. 13. The data processing apparatus according to any of claims 10 to 12, wherein the instruction buffer is configured to store a single program instruction.
  14. 14. The data processing apparatus according to any preceding claim, wherein the processing circuitry comprises a plurality of processing units each configured to process at least one of the plurality of threads.
  15. 15. The data processing apparatus according to any preceding claim, wherein the processing circuitry is configured to execute in parallel, for at least one of the plurality of threads, a plurality of instances of the same one or more micro-operations in lockstep with different operands for each instance.
  16. 16. The data processing apparatus according to any preceding claim, wherein the processing circuitry is configured to perform time division multiplexing of at least some of the plurality of threads.
  17. 17. The data processing apparatus according to any preceding claim, wherein the processing circuitry comprises a graphics processing unit (GPU).
  18. 18. A data processing apparatus comprising: processing means for processing a plurality of threads of processing in parallel; shared instruction decoding means for decoding program instructions to generate micro-operations to be processed by the processing means, the program instructions comprising at least one complex program instruction corresponding to a plurality of micro-operations; and a plurality of fetch means for fetching, for processing by the processing means, the micro-operations generated by the shared instruction decoding means, each fetch means associated with at least one of the plurality of threads; wherein the shared instruction decoding means is configured to generate each micro-operation in response to a decode request triggered by one of the plurality of fetch means; and the shared instruction decoding means is configured to generate the plurality of micro-operations of a complex program instruction individually in response to separate decode requests each identifying which micro-operation of the complex program instruction is to be generated by the shared instruction decoding means in response to the decode request.
  19. 19. A data processing method, comprising: decoding program instructions with a shared instruction decoder to generate micro-operations to be processed, the program instructions comprising at least one program instruction corresponding to a plurality of micro-operations; and fetching for processing the micro-operations generated by the shared instruction decoder, wherein the fetching is performed with a plurality of fetch units, each fetch unit associated with at least one of a plurality of threads processed in parallel; wherein each micro-operation is generated by the shared instruction decoder in response to a decode request triggered by one of the plurality of fetch units; and the shared instruction decoder generates the plurality of micro-operations of a complex program instruction individually in response to separate decode requests each identifying which micro-operation of the complex program instruction is to be generated in response to the decode request.
  20. 20. The method of claim 19, comprising steps of: in response to a first decode request identifying a first micro-operation of a first complex program instruction, decoding the first program instruction to generate the first micro-operation; in response to a second decode request identifying a selected micro-operation of a second program instruction, decoding the second program instruction to generate the selected micro-operation; and in response to a third decode request identifying a second micro-operation of the first program instruction, decoding the first program instruction to generate the second micro-operation.
  21. 21. A data processing apparatus substantially as herein described with reference to the accompanying drawings.
  22. 22. A data processing method substantially as herein described with reference to the accompanying drawings.
GB1317857.9A 2013-10-09 2013-10-09 Decoding a complex program instruction corresponding to multiple micro-operations Expired - Fee Related GB2519103B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
GB1317857.9A GB2519103B (en) 2013-10-09 2013-10-09 Decoding a complex program instruction corresponding to multiple micro-operations
US14/466,183 US9934037B2 (en) 2013-10-09 2014-08-22 Decoding a complex program instruction corresponding to multiple micro-operations
KR1020140127455A KR102271986B1 (en) 2013-10-09 2014-09-24 Decoding a complex program instruction corresponding to multiple micro-operations
CN201410521111.7A CN104572016B (en) 2013-10-09 2014-09-30 The decoding that complicated process corresponding to multiple microoperations instructs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1317857.9A GB2519103B (en) 2013-10-09 2013-10-09 Decoding a complex program instruction corresponding to multiple micro-operations

Publications (3)

Publication Number Publication Date
GB201317857D0 GB201317857D0 (en) 2013-11-20
GB2519103A true GB2519103A (en) 2015-04-15
GB2519103B GB2519103B (en) 2020-05-06

Family

ID=49630421

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1317857.9A Expired - Fee Related GB2519103B (en) 2013-10-09 2013-10-09 Decoding a complex program instruction corresponding to multiple micro-operations

Country Status (4)

Country Link
US (1) US9934037B2 (en)
KR (1) KR102271986B1 (en)
CN (1) CN104572016B (en)
GB (1) GB2519103B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200410088A1 (en) * 2018-04-04 2020-12-31 Arm Limited Micro-instruction cache annotations to indicate speculative side-channel risk condition for read instructions

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10140129B2 (en) * 2012-12-28 2018-11-27 Intel Corporation Processing core having shared front end unit
US10496412B2 (en) * 2016-02-08 2019-12-03 International Business Machines Corporation Parallel dispatching of multi-operation instructions in a multi-slice computer processor
US10324730B2 (en) 2016-03-24 2019-06-18 Mediatek, Inc. Memory shuffle engine for efficient work execution in a parallel computing system
CN106066786A (en) * 2016-05-26 2016-11-02 上海兆芯集成电路有限公司 Processor and processor operational approach
US10838871B2 (en) * 2016-11-07 2020-11-17 International Business Machines Corporation Hardware processor architecture having a hint cache
US10606599B2 (en) * 2016-12-09 2020-03-31 Advanced Micro Devices, Inc. Operation cache
US10324726B1 (en) * 2017-02-10 2019-06-18 Apple Inc. Providing instruction characteristics to graphics scheduling circuitry based on decoded instructions
US10896044B2 (en) * 2018-06-21 2021-01-19 Advanced Micro Devices, Inc. Low latency synchronization for operation cache and instruction cache fetching and decoding instructions
US10884751B2 (en) * 2018-07-13 2021-01-05 Advanced Micro Devices, Inc. Method and apparatus for virtualizing the micro-op cache
CN110825436B (en) * 2018-08-10 2022-04-29 昆仑芯(北京)科技有限公司 Calculation method applied to artificial intelligence chip and artificial intelligence chip
CN109101276B (en) * 2018-08-14 2020-05-05 阿里巴巴集团控股有限公司 Method for executing instruction in CPU
GB2577738B (en) 2018-10-05 2021-02-24 Advanced Risc Mach Ltd An apparatus and method for providing decoded instructions
US11500803B2 (en) * 2019-09-03 2022-11-15 Qorvo Us, Inc. Programmable slave circuit on a communication bus
US11726783B2 (en) * 2020-04-23 2023-08-15 Advanced Micro Devices, Inc. Filtering micro-operations for a micro-operation cache in a processor
CN111679856B (en) * 2020-06-15 2023-09-08 上海兆芯集成电路股份有限公司 Microprocessor with high-efficiency complex instruction decoding
US20220100519A1 (en) * 2020-09-25 2022-03-31 Advanced Micro Devices, Inc. Processor with multiple fetch and decode pipelines
US11595154B1 (en) * 2021-09-24 2023-02-28 Apple Inc. Instruction-based multi-thread multi-mode PDCCH decoder for cellular data device
CN114201219B (en) * 2021-12-21 2023-03-17 海光信息技术股份有限公司 Instruction scheduling method, instruction scheduling device, processor and storage medium
CN115525343B (en) * 2022-10-31 2023-07-25 海光信息技术股份有限公司 Parallel decoding method, processor, chip and electronic equipment
CN115525344B (en) * 2022-10-31 2023-06-27 海光信息技术股份有限公司 Decoding method, processor, chip and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000067113A2 (en) * 1999-04-29 2000-11-09 Intel Corporation Method and apparatus for thread switching within a multithreaded processor
WO2001053935A1 (en) * 2000-01-21 2001-07-26 Intel Corporation Method and apparatus for pausing execution in a processor
JP2011113457A (en) * 2009-11-30 2011-06-09 Nec Corp Simultaneous multi-threading processor, control method, program, compiling method, and information processor

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5574883A (en) * 1993-11-30 1996-11-12 Unisys Corporation Single chip processing unit providing immediate availability of frequently used microcode instruction words
US7856633B1 (en) * 2000-03-24 2010-12-21 Intel Corporation LRU cache replacement for a partitioned set associative cache
US7353364B1 (en) * 2004-06-30 2008-04-01 Sun Microsystems, Inc. Apparatus and method for sharing a functional unit execution resource among a plurality of functional units
EP1622009A1 (en) * 2004-07-27 2006-02-01 Texas Instruments Incorporated JSM architecture and systems

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200410088A1 (en) * 2018-04-04 2020-12-31 Arm Limited Micro-instruction cache annotations to indicate speculative side-channel risk condition for read instructions

Also Published As

Publication number Publication date
US9934037B2 (en) 2018-04-03
CN104572016B (en) 2019-05-31
US20150100763A1 (en) 2015-04-09
CN104572016A (en) 2015-04-29
GB2519103B (en) 2020-05-06
KR20150041740A (en) 2015-04-17
KR102271986B1 (en) 2021-07-02
GB201317857D0 (en) 2013-11-20

Similar Documents

Publication Publication Date Title
US9934037B2 (en) Decoding a complex program instruction corresponding to multiple micro-operations
US9798548B2 (en) Methods and apparatus for scheduling instructions using pre-decode data
US9830156B2 (en) Temporal SIMT execution optimization through elimination of redundant operations
US7366878B1 (en) Scheduling instructions from multi-thread instruction buffer based on phase boundary qualifying rule for phases of math and data access operations with better caching
US9830158B2 (en) Speculative execution and rollback
US8407454B2 (en) Processing long-latency instructions in a pipelined processor
US7734897B2 (en) Allocation of memory access operations to memory access capable pipelines in a superscalar data processing apparatus and method having a plurality of execution threads
US7590830B2 (en) Method and structure for concurrent branch prediction in a processor
US8522000B2 (en) Trap handler architecture for a parallel processing unit
US8639882B2 (en) Methods and apparatus for source operand collector caching
US7526636B2 (en) Parallel multithread processor (PMT) with split contexts
US20190004810A1 (en) Instructions for remote atomic operations
US20130166882A1 (en) Methods and apparatus for scheduling instructions without instruction decode
WO2016113105A1 (en) Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
EP3716056A1 (en) Apparatus and method for program order queue (poq) to manage data dependencies in processor having multiple instruction queues
US9069664B2 (en) Unified streaming multiprocessor memory
US10140129B2 (en) Processing core having shared front end unit
US9626191B2 (en) Shaped register file reads
EP3575955A1 (en) Indirect memory fetcher
US10152329B2 (en) Pre-scheduled replays of divergent operations
US9442759B2 (en) Concurrent execution of independent streams in multi-channel time slice groups
KR101420592B1 (en) Computer system
US20040128476A1 (en) Scheme to simplify instruction buffer logic supporting multiple strands
KR20000010200A (en) Instruction decoder having reduced instruction decoding path

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20221009