US20190171462A1 - Processing core having shared front end unit - Google Patents
Processing core having shared front end unit
- Publication number
- US20190171462A1
- Authority
- US
- United States
- Prior art keywords
- processor
- threads
- processing
- units
- instructions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
- G06F9/3889—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
- G06F9/3891—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute organised in groups of units sharing resources, e.g. clusters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30123—Organisation of register space, e.g. banked or distributed register file according to context, e.g. thread buffers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3802—Instruction prefetching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3818—Decoding for concurrent execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
Definitions
- the field of invention pertains to the computing sciences generally, and, more specifically, to a processing core having a shared front end unit.
- FIG. 1 shows the architecture of an exemplary multi-core processor 100 .
- the processor includes: 1) multiple processing cores 101 _ 1 to 101 _N; 2) an interconnection network 102 ; 3) a last level caching system 103 ; 4) a memory controller 104 and an I/O hub 105 .
- Each of the processing cores contains one or more instruction execution pipelines for executing program code instructions.
- the interconnect network 102 serves to interconnect each of the cores 101 _ 1 to 101 _N to each other as well as the other components 103 , 104 , 105 .
- the last level caching system 103 serves as a last layer of cache in the processor before instructions and/or data are evicted to system memory 108 .
- the memory controller 104 reads/writes data and instructions from/to system memory 108 .
- the I/O hub 105 manages communication between the processor and “I/O” devices (e.g., non volatile storage devices and/or network interfaces).
- Port 106 stems from the interconnection network 102 to link multiple processors so that systems having more than N cores can be realized.
- Graphics processor 107 performs graphics computations.
- Power management circuitry (not shown) manages the performance and power states of the processor as a whole (“package level”) as well as aspects of the performance and power states of the individual units within the processor such as the individual cores 101 _ 1 to 101 _N, graphics processor 107 , etc.
- Other functional blocks of significance (e.g., phase locked loop (PLL) circuitry) are not depicted in FIG. 1 for convenience.
- PLL phase locked loop
- FIG. 2 shows an exemplary embodiment 200 of one of the processing cores of FIG. 1 .
- each core includes two instruction execution pipelines 250 , 260 .
- Each instruction execution pipeline 250 , 260 includes its own respective: i) instruction fetch stage 201 ; ii) data fetch stage 202 ; iii) instruction execution stage 203 ; and, iv) write back stage 204 .
- the instruction fetch stage 201 fetches “next” instructions in an instruction sequence from a cache, or, system memory (if the desired instructions are not within the cache). Instructions typically specify operand data and an operation to be performed on the operand data.
- the data fetch stage 202 fetches the operand data from local operand register space, a data cache or system memory.
- the instruction execution stage 203 contains a set of functional units, any one of which is called upon to perform the particular operation called out by any one instruction on the operand data that is specified by the instruction and fetched by the data fetch stage 202 .
- the write back stage 204 “commits” the result of the execution, typically by writing the result into local register space coupled to the respective pipeline.
- the respective data fetch stage 202 of pipelines 250 , 260 is enhanced to include data dependency logic 205 to recognize when an instruction does not have a dependency on an earlier in flight instruction, and, permit its issuance to the instruction execution stage 203 “ahead of”, e.g., an earlier instruction whose data has not yet been fetched.
- the write-back stage 204 is enhanced to include a re-order buffer 206 that re-orders the results of out-of-order executed instructions into their correct order, and, delays their retirement to the physical register file until a correctly ordered consecutive sequence of instruction execution results have retired.
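The in-order retirement attributed to the re-order buffer 206 can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the circuit of FIG. 2: the class name `ReorderBuffer`, the per-instruction sequence numbers and the `complete` method are hypothetical names introduced here for illustration.

```python
class ReorderBuffer:
    """Toy re-order buffer: results may complete in any order but retire in program order."""

    def __init__(self):
        self.pending = {}        # sequence number -> result (completed, not yet retired)
        self.next_to_retire = 0  # oldest instruction that has not yet retired
        self.retired = []        # results in the order they reached the register file

    def complete(self, seq, result):
        """Record an out-of-order result, then retire every consecutive ready result."""
        self.pending[seq] = result
        newly_retired = []
        # Retirement only advances while the next result in program order is present.
        while self.next_to_retire in self.pending:
            newly_retired.append(self.pending.pop(self.next_to_retire))
            self.next_to_retire += 1
        self.retired.extend(newly_retired)
        return newly_retired
```

In this sketch a result that finishes early simply waits in `pending` until every older instruction has completed, which is the delay the text describes.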
- the enhanced instruction execution pipeline is also observed to include instruction speculation logic 207 within the instruction fetch stage 201 .
- the speculation logic 207 guesses at what conditional branch direction or jump the instruction sequence will take and begins to fetch the instruction sequence that flows from that direction or jump. The speculative instructions are then processed by the remaining stages of the execution pipeline.
- FIG. 1 shows a processor (prior art).
- FIG. 2 shows an instruction execution pipeline (prior art).
- FIG. 3 shows a processing core having a shared front end unit.
- FIG. 4 shows a method performed by the processing core of FIG. 3.
- FIG. 5 shows a processor whose respective cores have a shared front end unit.
- FIG. 6 shows a computing system composed of processors whose respective cores have a shared front end unit.
- the number of logic transistors manufactured on a semiconductor chip can be viewed as the semiconductor chip's fixed resource for processing information.
- a characteristic of the processor and processing core architecture discussed above with respect to FIGS. 1 and 2 is that an emphasis is placed on reducing the latency of the instructions that are processed by the processor.
- the fixed resources of the processor design of FIGS. 1 and 2 such as the out-of-order execution enhancements made to each of the pipelines, have been devoted to running a thread through the pipeline with minimal delay.
- FIG. 3 shows an embodiment of an architecture of a processing core 300 that can be instantiated multiple times (e.g., once for each processing core) within a multi-core processor.
- the processing core architecture of FIG. 3 is designed with more execution units than is typical for a standard processing core so as to increase the overall throughput of the processing core (i.e., increase the number of threads that the processing core can simultaneously process).
- the processing core architecture includes a shared front end unit 301 coupled to a plurality of processing units 302 _ 1 to 302 _N.
- Each of the processing units 302 _ 1 to 302 _N contains at least one set of functional units (e.g., at least one set of functional units 303 ) capable of supporting an entire instruction set, such as an entire x86 instruction set or other general purpose instruction set (as opposed to a more limited specific purpose instruction set such as the typical instruction set of a digital signal processor (DSP) or accelerator).
- DSP digital signal processor
- the shared front end unit 301 fetches and receives the instructions to be processed by the processing core 300 , decodes the received instructions, and dispatches the decoded instructions to their appropriate processing unit.
- the shared front end unit fetches all instructions for all of the threads being executed by all of the general purpose processing units of the processing core.
- a particular thread is assigned to a particular processing unit, and, each processing unit, as described in more detail below, is multi-threaded (i.e., can simultaneously and/or concurrently process more than one thread).
- each processing unit can simultaneously/concurrently execute up to M hardware threads and there are N processing units
- the processing core can simultaneously/concurrently execute up to MN hardware threads.
- the product MN may be greater than the typical number of hardware threads that can be simultaneously executed in a typical processing core (e.g., greater than 8 or 16 at current densities).
- the shared front end unit contains program control logic circuitry 311 to identify and fetch appropriate “next” instructions for each thread.
- the program control logic circuitry 311 includes an instruction pointer 312 _ 1 to 312 _MN for each thread and instruction fetch circuitry 313 .
- FIG. 3 indicates that there are MN instruction pointers to reflect support for MN different hardware threads.
- the instruction fetch circuitry 313 looks first to an instruction cache 314 for the instruction identified within the thread's instruction pointer. If the sought for instruction is not found within the instruction cache 314 , it is fetched from program memory 315 .
- blocks of instructions may be stored and fetched from cache and/or memory on a per hardware thread basis.
- the individual hardware threads may be serviced by the instruction fetch circuitry 313 on a time-sliced basis (e.g., a fair round robin approach). Further still, the instruction fetch circuitry 313 may be parallelized into similar/same blocks that fetch instructions for different hardware threads in parallel (e.g., each parallel block of instruction fetch circuitry services a different subset of instruction pointers).
- because the individual hardware threads may be processed slower than on a traditional processor (e.g., because per thread latency reduction circuitry has not been instantiated in favor of more processing units as described above), it is conceivable that some implementations may not require parallel instruction fetch capability, or, at least include fewer than N parallel instruction fetch channels (e.g., N/2 parallel instruction fetch blocks). Accordingly, in any of these cases, certain components of the front end unit 301 are shared by at least two of the processing units 302 _ 1 to 302 _N.
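The time-sliced fetch behavior described above can be sketched in Python. This is a minimal model under stated assumptions rather than the circuitry of FIG. 3: a single fetch block services all MN instruction pointers in round robin order, addresses advance by one per instruction, the instruction cache and program memory are plain dictionaries, and address translation by the ITLB is not modeled. The names `SharedFetchUnit`, `start_thread` and `fetch_next` are hypothetical.

```python
from collections import deque


class SharedFetchUnit:
    """Toy model of shared fetch circuitry servicing M*N hardware threads."""

    def __init__(self, program_memory, num_units, threads_per_unit):
        self.program_memory = program_memory        # address -> instruction
        self.icache = {}                            # address -> instruction
        num_threads = num_units * threads_per_unit  # MN hardware threads
        self.instruction_pointers = [None] * num_threads
        self.order = deque(range(num_threads))      # round robin service order
        self.cache_hits = 0
        self.cache_misses = 0

    def start_thread(self, thread_id, entry_address):
        """Point a hardware thread's instruction pointer at its first instruction."""
        self.instruction_pointers[thread_id] = entry_address

    def fetch_next(self):
        """Service the next runnable thread; return (thread_id, instruction) or None."""
        for _ in range(len(self.order)):
            thread_id = self.order[0]
            self.order.rotate(-1)  # this thread moves to the back of the line
            address = self.instruction_pointers[thread_id]
            if address is None or address not in self.program_memory:
                continue  # idle thread, or thread that ran off its program
            if address in self.icache:
                self.cache_hits += 1
            else:
                # Instruction cache miss: fall back to program memory and fill the cache.
                self.cache_misses += 1
                self.icache[address] = self.program_memory[address]
            self.instruction_pointers[thread_id] = address + 1
            return thread_id, self.icache[address]
        return None
```

A parallelized variant, as the text suggests, could instantiate several such objects with each one owning a disjoint subset of the instruction pointers.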
- the program control logic circuitry 311 also includes an instruction translation look-aside buffer (ITLB) circuit 316 for each hardware thread.
- ITLB instruction translation look-aside buffer
- an ITLB translates the instruction addresses received from program memory 315 into actual addresses in physical memory where the instructions actually reside.
- After an instruction has been fetched, it is decoded by an instruction decoder 317 .
- there is an instruction decoder for each processing unit (i.e., there are N decoders). Again, e.g., where the number of processing units N has been increased at the expense of executing threads with lower latency, there may be more than one processing unit per instruction decoder. Conceivably there may even be one decoder for all the processing units.
- An instruction typically specifies: i) an operation to be performed in the form of an “opcode”; ii) the location where the input operands for the operation can be found (register and/or memory space); and, iii) the location where the resultant of the operation is to be stored (register and/or memory space).
- the instruction decoder 317 decodes an instruction not only by breaking the instruction down into its opcode and input operand/resultant storage locations, but also, converting the opcode into a sequence of micro-instructions.
- micro-instructions are akin to a small software program (microcode) that an execution unit will execute in order to perform the functionality of an instruction.
- an instruction opcode is converted to the microcode that corresponds to the functional operation of the instruction.
- the opcode is entered as a look-up parameter into a circuit 318 configured to behave like a look-up table (e.g., a read only memory (ROM) configured as a look-up table).
- ROM read only memory
- the look-up table circuit 318 responds to the input opcode with the microcode for the opcode's instruction.
- there is a ROM for each processing unit in the processing core (or, again, there is more than one processing unit per micro-code ROM because the per-thread latency of the processing units has been diluted compared to a traditional processor).
- microcode for a decoded instruction is then dispatched along with the decoded instruction's register/memory addresses of its input operands and resultants to the processing unit that has been assigned to the hardware thread that the decoded instruction is a component of. Note that the respective micro-code for two different instructions of two different hardware threads running on two different processing units may be simultaneously dispatched to their respective processing units.
- each processing unit 302 _ 1 to 302 _N can simultaneously and/or concurrently execute more than one hardware thread.
- each processing unit can concurrently (as opposed to simultaneously) execute multiple software threads.
- concurrent execution, as opposed to simultaneous execution, corresponds to the execution of multiple software threads within a period of time by alternating processing resources amongst the software threads supported by the processing unit (e.g., servicing each of the software threads in a round robin fashion).
- a single processing unit may concurrently execute multiple software threads by switching the software threads and their associated state information in/out of the processing unit as hardware threads of the processing unit.
- each processing unit has a microcode buffer 320 to store the microcode that has been dispatched from the instruction decoder 317 .
- the microcode buffer 320 may be partitioned so that separate FIFO queuing space exists for each hardware thread supported by the processing unit.
- the input operand and resultant addresses are also queued in an aligned fashion or otherwise associated with the respective microcode of their instruction.
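The decode, dispatch and buffering steps described above can be sketched as follows. This is an illustrative Python model under stated assumptions: the contents of `MICROCODE_ROM`, the `(opcode, destination, sources)` instruction tuple, and the names `ProcessingUnit`, `decode` and `dispatch` are all hypothetical and are not taken from the patent.

```python
from collections import deque

# Hypothetical look-up table standing in for circuit 318: opcode -> micro-instructions.
MICROCODE_ROM = {
    "ADD": ("read_src", "alu_add", "write_dst"),
    "LOAD": ("gen_addr", "mem_read", "write_dst"),
}


class ProcessingUnit:
    """Toy processing unit whose microcode buffer has one FIFO per hardware thread."""

    def __init__(self, hardware_thread_ids):
        self.microcode_buffer = {t: deque() for t in hardware_thread_ids}


def decode(instruction):
    """Split an instruction into its parts and look up the microcode for its opcode."""
    opcode, destination, sources = instruction
    return MICROCODE_ROM[opcode], destination, sources


def dispatch(thread_id, instruction, thread_to_unit):
    """Queue microcode plus operand/resultant addresses on the thread's assigned unit."""
    microcode, destination, sources = decode(instruction)
    unit = thread_to_unit[thread_id]
    # Addresses are stored alongside the microcode so the two stay associated.
    unit.microcode_buffer[thread_id].append((microcode, destination, sources))
```

Keeping the microcode and the addresses in a single queue entry is one simple way to keep them "aligned"; a hardware design could equally use separate queues advanced in lockstep.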
- Each processing unit includes register space 321 coupled to its internal functional unit set(s) 303 for keeping the operand/resultant data of the thread(s) the functional unit set(s) 303 are responsible for executing. If a single functional unit set is to concurrently execute multiple hardware threads, the register space 321 for the functional unit set 303 may be partitioned such that there is one register set partition for each hardware thread the functional unit set 303 is to concurrently execute. As such, the functional unit set 303 “operates out of” a specific register partition for each unique hardware thread that the functional unit set is concurrently executing.
- each processing unit 302 _ 1 to 302 _N includes register allocation logic 322 to allocate registers for the instructions of each of the respective hardware threads that the processing unit is concurrently and/or simultaneously executing.
- the register allocation logic circuitry 322 includes data fetch logic to fetch operands (that are called out by the instructions) from register space 321 associated with the functional unit that the operands' respective instructions are targeted to.
- the data fetch logic circuitry may be coupled to system memory 323 to fetch data operands from system memory 323 explicitly.
- each functional unit set 303 includes: i) an integer functional unit cluster that contains functional units for executing integer mathematical/logic instructions; ii) a floating point functional unit cluster containing functional units for executing floating point mathematical/logic instructions; iii) a SIMD functional unit cluster that contains functional units for executing SIMD mathematical/logic instructions; and, iv) a memory access functional unit cluster containing functional units for performing data memory accesses (for integer and/or floating point and/or SIMD operands and/or results).
- the memory access functional unit cluster may contain one or more data TLBs to perform virtual to physical address translation for its respective threads.
- Micro-code for a particular instruction issues from its respective microcode buffer 320 to the appropriate functional unit along with the operand data that was fetched for the instruction by the fetch circuitry associated with the register allocation logic 322 . Results of the execution of the functional units are written back to the register space 321 associated with the execution units.
- each processing unit contains a data cache 324 that is coupled to the functional units of the memory access cluster.
- the functional units of the memory access cluster are also coupled to system memory 323 so that they can fetch data from memory.
- each register file partition described above may be further partitioned into separate integer, floating point and SIMD register space that is coupled to the corresponding functional unit cluster.
- operating system and/or virtual machine monitor (VMM) software assigns specific software threads to a specific processing unit.
- the shared front end logic 301 and/or operating system/VMM is able to dynamically assign a software thread to a particular processing unit or functional unit set to activate the thread as a hardware thread.
- each processing unit includes “context switching” logic (not shown) so that each processing unit can be assigned more software threads than it can simultaneously or concurrently support as hardware threads. That is, the number of software threads assigned to the processing unit can exceed the number of “active” hardware threads the processing unit is capable of presently executing (either simultaneously or concurrently) as evidenced by the presence of context information of a thread within the register space of the processing unit.
- when a software thread becomes active as a hardware thread, its context information (e.g., the values of its various operands and control information) is located within the register space 321 that is coupled to the functional unit set 303 that is executing the thread's instructions. If a decision is made to transition the thread from an active to inactive state, the context information of the thread is read out of this register space 321 and stored elsewhere (e.g., system memory 323 ). With the register space of the thread now being “freed up”, the context information of another “inactive” software thread whose context information resides, e.g., in system memory 323 , can be written into the register space 321 . As a consequence, the other thread converts from “inactive” to “active” and its instructions are executed as a hardware thread going forward.
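The active/inactive transition described above can be sketched in Python. This is a toy model under stated assumptions: register space is a list of partitions (one per hardware thread), system memory is a dictionary of saved contexts, and the names `ProcessingUnitContexts`, `assign`, `activate` and `active_threads` are hypothetical. It assumes the thread being activated is currently inactive.

```python
class ProcessingUnitContexts:
    """Toy model of swapping software thread context in and out of register space."""

    def __init__(self, num_partitions):
        # Each partition holds (thread name, context) for an active thread, or None.
        self.register_space = [None] * num_partitions
        # Saved context of inactive software threads, keyed by thread name.
        self.system_memory = {}

    def assign(self, name, context):
        """Assign a software thread to this unit; it starts out inactive."""
        self.system_memory[name] = context

    def activate(self, name, partition):
        """Make an inactive thread active in a partition, evicting any occupant."""
        occupant = self.register_space[partition]
        if occupant is not None:
            old_name, old_context = occupant
            # Read the outgoing thread's context out of register space and save it.
            self.system_memory[old_name] = old_context
        # Write the incoming thread's context into the freed-up register space.
        self.register_space[partition] = (name, self.system_memory.pop(name))

    def active_threads(self):
        """Names of threads whose context is present in register space."""
        return [entry[0] for entry in self.register_space if entry is not None]
```

As in the text, whether a thread counts as active is evidenced here only by where its context resides.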
- any of the mechanisms and associated logic circuitry for “speeding-up” a hardware thread's execution may not be present in the shared front end or processing unit circuitry.
- Such eliminated blocks may include any one or more of: 1) speculation logic (e.g., branch prediction logic); 2) out-of-order execution logic (e.g., register renaming logic and/or a re-order buffer and/or data dependency logic); 3) superscalar logic to dynamically effect parallel instruction issuance for a single hardware thread.
- a multi-core processor built with multiple instances of the processing core architecture of FIG. 3 may include any/all of the surrounding features discussed above with respect to FIG. 1 .
- FIG. 4 shows a flow chart describing a methodology of the processing core described above.
- first and second instructions of different hardware threads are fetched 401 and decoded in a shared front-end unit.
- the instructions are decoded and respective microcode and operand/resultant addresses for the instructions are issued to different processing units from the shared front-end unit 402 .
- the respective processing units fetch data for their respective operands and issue the received microcode and respective operands to respective functional units 403 .
- the functional units then execute their respective instructions 404 .
- FIG. 5 shows an embodiment of a processor 500 having multiple processing cores 501 _ 1 through 501 _N each having a respective shared front end unit 511 _ 1 , 511 _ 2 , . . . 511 _N (with respective instruction TLB 516 _ 1 , 516 _ 2 , . . . 516 _N) and respective processing units having corresponding micro-code buffers (e.g., micro-code buffers 520 _ 1 , 520 _ 2 , etc. within the processing units of core 501 _ 1 ).
- Each core also includes one or more caching levels 550 _ 1 , 550 _ 2 , 550 _N to cache instructions and/or data of each processing unit individually and/or a respective core as a whole.
- the cores 501 _ 1 , 501 _ 2 , . . . 501 _N are coupled to one another through an interconnection network 502 that also couples the cores to one or more caching levels (e.g., last level cache 503 that caches instructions and/or data for the cores 501 _ 1 , 501 _ 2 . . . 501 _N) and a memory controller 504 that is coupled to, e.g., a “slice” of system memory.
- Other components such as any of the components of FIG. 1 may also be included in FIG. 5 .
- FIG. 6 shows an embodiment of a computing system, such as a computer, implemented with multiple processors 600 _ 1 through 600 _ z having the features discussed above in FIG. 5 .
- the multiple processors 600 _ 1 through 600 _ z are connected to each other through a network that also couples the processors to a plurality of system memory units 608 _ 1 , 608 _ 2 , a non volatile storage unit 610 (e.g., a disk drive) and an external (e.g., Internet) network interface 611 .
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Advance Control (AREA)
- Computer Hardware Design (AREA)
Abstract
Description
- The present patent application is a continuation application claiming priority from U.S. patent application Ser. No. 13/730,719, filed Dec. 28, 2012, and titled: “Processing Core Having Shared Front End Unit”, which is incorporated herein by reference in its entirety.
- The field of invention pertains to the computing sciences generally, and, more specifically, to a processing core having a shared front end unit.
-
FIG. 1 shows the architecture of an exemplarymulti-core processor 100. As observed inFIG. 1 , the processor includes: 1) multiple processing cores 101_1 to 101_N; 2) aninterconnection network 102; 3) a lastlevel caching system 103; 4) amemory controller 104 and an I/O hub 105. Each of the processing cores contain one or more instruction execution pipelines for executing program code instructions. Theinterconnect network 102 serves to interconnect each of the cores 101_1 to 101_N to each other as well as theother components level caching system 103 serves as a last layer of cache in the processor before instructions and/or data are evicted tosystem memory 108. - The
memory controller 104 reads/writes data and instructions from/tosystem memory 108. The I/O hub 105 manages communication between the processor and “I/O” devices (e.g., non volatile storage devices and/or network interfaces).Port 106 stems from theinterconnection network 102 to link multiple processors so that systems having more than N cores can be realized.Graphics processor 107 performs graphics computations. Power management circuitry (not shown) manages the performance and power states of the processor as a whole (“package level”) as well as aspects of the performance and power states of the individual units within the processor such as the individual cores 101_1 to 101_N,graphics processor 107, etc. Other functional blocks of significance (e.g., phase locked loop (PLL) circuitry) are not depicted inFIG. 1 for convenience. -
FIG. 2 shows anexemplary embodiment 200 of one of the processing cores ofFIG. 1 . As observed inFIG. 2 , each core includes twoinstruction execution pipelines instruction execution pipeline instruction fetch stage 201; ii)data fetch stage 202; iii)instruction execution stage 203; and, iv) write backstage 204. Theinstruction fetch stage 201 fetches “next” instructions in an instruction sequence from a cache, or, system memory (if the desired instructions are not within the cache). Instructions typically specify operand data and an operation to be performed on the operand data. Thedata fetch stage 202 fetches the operand data from local operand register space, a data cache or system memory. Theinstruction execution stage 203 contains a set of functional units, any one of which is called upon to perform the particular operation called out by any one instruction on the operand data that is specified by the instruction and fetched by thedata fetch stage 202. The writeback stage 204 “commits” the result of the execution, typically by writing the result into local register space coupled to the respective pipeline. - In order to avoid the unnecessary delay of an instruction that does not have any dependencies on earlier “in flight” instructions, many modern instruction execution pipelines have enhanced data fetch and write back stages to effect “out-of-order” execution. Here, the respective
data fetch stage 202 ofpipelines data dependency logic 205 to recognize when an instruction does not have a dependency on an earlier in flight instruction, and, permit its issuance to theinstruction execution stage 203 “ahead of”, e.g., an earlier instruction whose data has not yet been fetched. - Moreover, the write-
back stage 204 is enhanced to include are-order buffer 206 that re-orders the results of out-of-order executed instructions into their correct order, and, delays their retirement to the physical register file until a correctly ordered consecutive sequence of instruction execution results have retired. - The enhanced instruction execution pipeline is also observed to include
instruction speculation logic 207 within theinstruction fetch stage 201. Thespeculation logic 207 guesses at what conditional branch direction or jump the instruction sequence will take and begins to fetch the instruction sequence that flows from that direction or jump. The speculative instructions are then processed by the remaining stages of the execution pipeline. - The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
FIG. 1 shows a processor (prior art);
FIG. 2 shows an instruction execution pipeline (prior art);
FIG. 3 shows a processing core having a shared front end unit;
FIG. 4 shows a method performed by the processing core of FIG. 3;
FIG. 5 shows a processor whose respective cores have a shared front end unit;
FIG. 6 shows a computing system composed of processors whose respective cores have a shared front end unit.
The number of logic transistors manufactured on a semiconductor chip can be viewed as the semiconductor chip's fixed resource for processing information. A characteristic of the processor and processing core architecture discussed above with respect to FIGS. 1 and 2 is that an emphasis is placed on reducing the latency of the instructions that are processed by the processor. Said another way, the fixed resources of the processor design of FIGS. 1 and 2, such as the out-of-order execution enhancements made to each of the pipelines, have been devoted to running a thread through the pipeline with minimal delay.
The dedication of logic circuitry to the speed-up of currently active threads is achieved, however, at the expense of the total number of threads that the processor can simultaneously process at any instant of time. Said another way, if the logic circuitry units of a processor were emphasized differently, the processor might be able to simultaneously process more threads than the processor of FIG. 1 whose processing cores are designed according to the architecture of FIG. 2. For example, if the logic circuitry resources of the out-of-order execution enhancements were removed, the “freed up” logic circuitry could be re-utilized to instantiate more execution units within the processor. With more execution units, the processor could simultaneously execute more instructions and therefore more threads.
FIG. 3 shows an embodiment of an architecture of a processing core 300 that can be instantiated multiple times (e.g., once for each processing core) within a multi-core processor. The processing core architecture of FIG. 3 is designed with more execution units than is typical for a standard processing core so as to increase the overall throughput of the processing core (i.e., increase the number of threads that the processing core can simultaneously process). As observed in FIG. 3, the processing core architecture includes a shared front end unit 301 coupled to a plurality of processing units 302_1 to 302_N. Each of the processing units 302_1 to 302_N, in an embodiment, contains at least one set of functional units (e.g., at least one set of functional units 303) capable of supporting an entire instruction set, such as an entire x86 instruction set or other general purpose instruction set (as opposed to a more limited specific purpose instruction set such as the typical instruction set of a digital signal processor (DSP) or accelerator).
As observed in FIG. 3, the shared front end unit 301 fetches and receives the instructions to be processed by the processing core 300, decodes the received instructions, and dispatches the decoded instructions to their appropriate processing unit. In an embodiment, the shared front end unit fetches all instructions for all of the threads being executed by all of the general purpose processing units of the processing core.
A particular thread is assigned to a particular processing unit, and, each processing unit, as described in more detail below, is multi-threaded (i.e., can simultaneously and/or concurrently process more than one thread). Thus, if each processing unit can simultaneously/concurrently execute up to M hardware threads and there are N processing units, the processing core can simultaneously/concurrently execute up to MN hardware threads. Here, the product MN may be greater than the typical number of hardware threads that can be simultaneously executed in a typical processing core (e.g., greater than 8 or 16 at current densities).
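By way of illustration only, the thread-count arithmetic above can be sketched in Python. The function names and the numbering scheme (threads 0 to M-1 on the first processing unit, and so on) are assumptions made for this example; the specification does not prescribe how hardware threads are numbered.

```python
def core_thread_capacity(m_threads_per_unit: int, n_units: int) -> int:
    """Upper bound on hardware threads per core: M threads on each of N units."""
    if m_threads_per_unit < 1 or n_units < 1:
        raise ValueError("M and N must both be at least 1")
    return m_threads_per_unit * n_units


def locate_thread(thread_id: int, m_threads_per_unit: int, n_units: int) -> tuple:
    """Map a flat hardware-thread index to (processing unit, slot within unit).

    Hypothetical numbering: threads 0..M-1 sit on unit 0, M..2M-1 on unit 1, etc.
    """
    capacity = core_thread_capacity(m_threads_per_unit, n_units)
    if not 0 <= thread_id < capacity:
        raise IndexError(f"thread_id must be in [0, {capacity})")
    # divmod yields (quotient, remainder), i.e. (unit index, slot index).
    return divmod(thread_id, m_threads_per_unit)
```

For instance, with M=4 and N=8 the sketch reports a capacity of 32 hardware threads, and thread index 13 falls on unit 3, slot 1.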
Referring to the shared front end unit 301, the shared front end unit contains program control logic circuitry 311 to identify and fetch appropriate “next” instructions for each thread. Here, the program control logic circuitry 311 includes an instruction pointer 312_1 to 312_MN for each thread and instruction fetch circuitry 313. Note that FIG. 3 indicates that there are MN instruction pointers to reflect support for MN different hardware threads. For each hardware thread, the instruction fetch circuitry 313 looks first to an instruction cache 314 for the instruction identified within the thread's instruction pointer. If the sought for instruction is not found within the instruction cache 314 it is fetched from program memory 315. In various implementations, blocks of instructions may be stored and fetched from cache and/or memory on a per hardware thread basis.
The individual hardware threads may be serviced by the instruction fetch circuitry 313 on a time-sliced basis (e.g., a fair round robin approach). Further still, the instruction fetch circuitry 313 may be parallelized into similar/same blocks that fetch instructions for different hardware threads in parallel (e.g., each parallel block of instruction fetch circuitry services a different subset of instruction pointers).
Because, however, the individual hardware threads may be processed slower than a traditional processor (e.g., because per thread latency reduction circuitry has not been instantiated in favor of more processing units as described above), it is conceivable that some implementations may not require parallel instruction fetch capability, or, at least include less than N parallel instruction fetch channels (e.g., N/2 parallel instruction fetch blocks). Accordingly, in any of these cases, certain components of the front end unit 301 are shared by at least two of the processing units 302_1 to 302_N.
In a further embodiment, the program control logic circuitry 311 also includes an instruction translation look-aside buffer (ITLB) circuit 316 for each hardware thread. As is understood in the art, an ITLB translates the instruction addresses received from program memory 315 into actual addresses in physical memory where the instructions actually reside.
After an instruction has been fetched it is decoded by an instruction decoder 317. In an embodiment there is an instruction decoder for each processing unit (i.e., there are N decoders). Again, e.g., where the number of processing units N has been increased at the expense of executing threads with lower latency, there may be more than one processing unit per instruction decoder. Conceivably there may even be one decoder for all the processing units.
An instruction typically specifies: i) an operation to be performed in the form of an “opcode”; ii) the location where the input operands for the operation can be found (register and/or memory space); and, iii) the location where the resultant of the operation is to be stored (register and/or memory space). In an embodiment, the instruction decoder 317 decodes an instruction not only by breaking the instruction down into its opcode and input operand/resultant storage locations, but also by converting the opcode into a sequence of micro-instructions.
As is understood in the art, micro-instructions are akin to a small software program (microcode) that an execution unit will execute in order to perform the functionality of an instruction. Thus, an instruction opcode is converted to the microcode that corresponds to the functional operation of the instruction. Typically, the opcode is entered as a look-up parameter into a circuit 318 configured to behave like a look-up table (e.g., a read only memory (ROM) configured as a look-up table). The look-up table circuit 318 responds to the input opcode with the microcode for the opcode's instruction. Thus, in an embodiment, there is a ROM for each processing unit in the processing core (or, again, there is more than one processing unit per micro-code ROM because the per-thread latency of the processing units has been diluted compared to a traditional processor).
The microcode for a decoded instruction is then dispatched along with the decoded instruction's register/memory addresses of its input operands and resultants to the processing unit that has been assigned to the hardware thread that the decoded instruction is a component of. Note that the respective micro-code for two different instructions of two different hardware threads running on two different processing units may be simultaneously dispatched to their respective processing units.
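By way of illustration only, the fetch, decode and dispatch behavior of the shared front end described above can be modeled with a short Python sketch. It is a software toy, not the claimed circuitry. The class name, the two-entry microcode table, the instruction tuple layout (opcode, source locations, destination location) and the per-unit dispatch queues are all assumptions introduced for this example.

```python
from collections import deque

# Hypothetical microcode ROM: opcode -> sequence of micro-instructions.
MICROCODE_ROM = {
    "ADD": ("read_operands", "alu_add", "write_result"),
    "LOAD": ("compute_address", "read_memory", "write_result"),
}


class SharedFrontEnd:
    """Toy shared front end with one instruction pointer per hardware thread."""

    def __init__(self, program_memory, thread_to_unit, n_units):
        self.program_memory = program_memory      # thread id -> list of instructions
        self.thread_to_unit = thread_to_unit      # thread id -> processing unit index
        self.instruction_pointers = {t: 0 for t in program_memory}
        self.instruction_cache = {}               # (thread id, ip) -> instruction
        self.unit_queues = [deque() for _ in range(n_units)]

    def fetch(self, thread):
        """Look in the instruction cache first, then fall back to program memory."""
        ip = self.instruction_pointers[thread]
        program = self.program_memory[thread]
        if ip >= len(program):
            return None                           # thread has no more instructions
        key = (thread, ip)
        if key not in self.instruction_cache:
            self.instruction_cache[key] = program[ip]
        self.instruction_pointers[thread] = ip + 1
        return self.instruction_cache[key]

    def decode(self, instruction):
        """Split the instruction into fields and look up the opcode's microcode."""
        opcode, sources, destination = instruction
        return MICROCODE_ROM[opcode], sources, destination

    def step(self):
        """Service each hardware thread once, in round-robin order."""
        for thread in sorted(self.instruction_pointers):
            instruction = self.fetch(thread)
            if instruction is None:
                continue
            microcode, sources, destination = self.decode(instruction)
            unit = self.thread_to_unit[thread]
            self.unit_queues[unit].append((thread, microcode, sources, destination))
```

In this sketch a single decoder and a single look-up table serve every processing unit, which corresponds to the most heavily shared arrangement contemplated above; a per-unit decoder or ROM could be modeled by keeping one table per entry of `unit_queues`.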
In an embodiment, as discussed above, each processing unit 302_1 to 302_N can simultaneously and/or concurrently execute more than one hardware thread. For instance, each processing unit may have X sets of execution units (where X=1 or greater), where each set of execution units is capable of supporting an entire instruction set such as an entire x86 instruction set. Alternatively or in combination, each processing unit can concurrently (as opposed to simultaneously) execute multiple software threads. Here, concurrent execution, as opposed to simultaneous execution, corresponds to the execution of multiple software threads within a period of time by alternating processing resources amongst the software threads supported by the processing unit (e.g., servicing each of the software threads in a round robin fashion). Thus, in an embodiment, over a window of time, a single processing unit may concurrently execute multiple software threads by switching the software threads and their associated state information in/out of the processing unit as hardware threads of the processing unit.
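By way of illustration only, the distinction between simultaneous and concurrent execution can be sketched as a round-robin schedule in Python. The function name `schedule` and the notion of a discrete "cycle" are assumptions made for this example; the specification does not tie the alternation of threads to any particular time unit.

```python
from collections import deque


def schedule(thread_ids, x_unit_sets, cycles):
    """Return, per cycle, the threads issued on the X functional unit sets.

    Threads issued in the same cycle run simultaneously; threads that take
    turns across cycles run concurrently.
    """
    ready = deque(thread_ids)
    timeline = []
    for _ in range(cycles):
        issued = [ready.popleft() for _ in range(min(x_unit_sets, len(ready)))]
        timeline.append(issued)
        ready.extend(issued)  # rotate the issued threads to the back of the queue
    return timeline
```

With X=1, three threads simply take turns, one per cycle. With X=2, two threads execute simultaneously in each cycle while all three still make progress over a window of several cycles.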
As observed in FIG. 3, each processing unit has a microcode buffer 320 to store the microcode that has been dispatched from the instruction decoder 317. The microcode buffer 320 may be partitioned so that separate FIFO queuing space exists for each hardware thread supported by the processing unit. The input operand and resultant addresses are also queued in an aligned fashion or otherwise associated with the respective microcode of their instruction.
Each processing unit includes register space 321 coupled to its internal functional unit set(s) 303 for keeping the operand/resultant data of the thread(s) the functional unit set(s) 303 are responsible for executing. If a single functional unit set is to concurrently execute multiple hardware threads, the register space 321 for the functional unit set 303 may be partitioned such that there is one register set partition for each hardware thread the functional unit set 303 is to concurrently execute. As such, the functional unit set 303 “operates out of” a specific register partition for each unique hardware thread that the functional unit set is concurrently executing.
As observed in FIG. 3, each processing unit 302_1 to 302_N includes register allocation logic 322 to allocate registers for the instructions of each of the respective hardware threads that the processing unit is concurrently and/or simultaneously executing. Here, for implementations having more than one functional unit set per processing unit, there may be multiple instances of micro-code buffer circuitry 320 and register allocation circuitry 322 (e.g., one instance for each functional unit set of the processing unit), or, there may be one micro-code buffer and register allocation circuit that feeds more than one functional unit set (i.e., one micro-code buffer 320 and register allocation circuit 322 for two or more functional unit sets). The register allocation logic circuitry 322 includes data fetch logic to fetch operands (that are called out by the instructions) from register space 321 associated with the functional unit that the operands' respective instructions are targeted to. The data fetch logic circuitry may be coupled to system memory 323 to fetch data operands from system memory 323 explicitly.
In an embodiment, each functional unit set 303 includes: i) an integer functional unit cluster that contains functional units for executing integer mathematical/logic instructions; ii) a floating point functional unit cluster containing functional units for executing floating point mathematical/logic instructions; iii) a SIMD functional unit cluster that contains functional units for executing SIMD mathematical/logic instructions; and, iv) a memory access functional unit cluster containing functional units for performing data memory accesses (for integer and/or floating point and/or SIMD operands and/or results). The memory access functional unit cluster may contain one or more data TLBs to perform virtual to physical address translation for its respective threads.
Micro-code for a particular instruction issues from its respective microcode buffer 320 to the appropriate functional unit along with the operand data that was fetched for the instruction by the fetch circuitry associated with the register allocation logic 322. Results of the execution of the functional units are written back to the register space 321 associated with the execution units.
In a further embodiment, each processing unit contains a data cache 324 that is coupled to the functional units of the memory access cluster. The functional units of the memory access cluster are also coupled to system memory 323 so that they can fetch data from memory. Notably, each register file partition described above may be further partitioned into separate integer, floating point and SIMD register space that is coupled to the corresponding functional unit cluster.
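By way of illustration only, the issue and write-back path of a single processing unit can be sketched in Python as follows. The class name, the register names, and the two supported operations are assumptions made for this example. In particular, a single operation string stands in for the microcode sequence that would be queued in the microcode buffer.

```python
from collections import deque


class ProcessingUnit:
    """Toy processing unit with per-thread microcode FIFOs and register partitions."""

    def __init__(self, thread_ids, registers_per_thread=8):
        # One FIFO partition and one register partition per hardware thread.
        self.microcode_buffer = {t: deque() for t in thread_ids}
        self.register_space = {
            t: {f"r{i}": 0 for i in range(registers_per_thread)} for t in thread_ids
        }

    def enqueue(self, thread, operation, sources, destination):
        """Queue dispatched work in the FIFO partition of its hardware thread."""
        self.microcode_buffer[thread].append((operation, sources, destination))

    def issue(self, thread):
        """Issue the oldest queued entry for a thread and write back its result."""
        operation, sources, destination = self.microcode_buffer[thread].popleft()
        registers = self.register_space[thread]   # the thread's own partition
        operands = [registers[name] for name in sources]
        if operation == "add":
            result = sum(operands)
        elif operation == "mul":
            result = operands[0] * operands[1]
        else:
            raise ValueError(f"unsupported operation: {operation}")
        registers[destination] = result           # write back to the same partition
        return result
```

Because each hardware thread reads and writes only its own partition, two threads can use the same register names (e.g., `r1`) without interfering with one another, which mirrors the functional unit set “operating out of” a specific register partition per thread.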
According to one scenario, operating system and/or virtual machine monitor (VMM) software assigns specific software threads to a specific processing unit. The shared front end logic 301 and/or operating system/VMM is able to dynamically assign a software thread to a particular processing unit or functional unit set to activate the thread as a hardware thread. In various embodiments, each processing unit includes “context switching” logic (not shown) so that each processing unit can be assigned more software threads than it can simultaneously or concurrently support as hardware threads. That is, the number of software threads assigned to the processing unit can exceed the number of “active” hardware threads the processing unit is capable of presently executing (either simultaneously or concurrently) as evidenced by the presence of context information of a thread within the register space of the processing unit.
Here, for instance, when a software thread becomes activated as a hardware thread, its context information (e.g., the values of its various operands and control information) is located within the register space 321 that is coupled to the functional unit set 303 that is executing the thread's instructions. If a decision is made to transition the thread from an active to inactive state, the context information of the thread is read out of this register space 321 and stored elsewhere (e.g., system memory 323). With the register space of the thread now being “freed up”, the context information of another “inactive” software thread whose context information resides, e.g., in system memory 323, can be written into the register space 321. As a consequence, the other thread converts from “inactive” to “active” and its instructions are executed as a hardware thread going forward.
As discussed above, the “room” for the logic circuitry to entertain a large number of hardware threads may come at the expense of minimizing the latency of any particular thread. As such, any of the mechanisms and associated logic circuitry for “speeding-up” a hardware thread's execution may not be present in the shared front end or processing unit circuitry. Such eliminated blocks may include any one or more of: 1) speculation logic (e.g., branch prediction logic); 2) out-of-order execution logic (e.g., register renaming logic and/or a re-order buffer and/or data dependency logic); 3) superscalar logic to dynamically effect parallel instruction issuance for a single hardware thread.
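By way of illustration only, the activation and deactivation of software threads described above can be sketched in Python. The class and method names are assumptions made for this example, and the context switching logic itself is not shown in the figures. Requiring the caller to name the thread to evict is a simplification chosen here and is not drawn from the specification.

```python
class ContextSwitcher:
    """Toy model: a fixed number of register partitions, other threads parked in memory."""

    def __init__(self, hardware_slots):
        self.hardware_slots = hardware_slots
        self.register_space = {}   # active thread id -> context held in registers
        self.system_memory = {}    # inactive thread id -> saved context

    def assign(self, thread, context):
        """Assign a software thread; it starts inactive with its context in memory."""
        self.system_memory[thread] = dict(context)

    def activate(self, thread, evict=None):
        """Make a thread active, first saving an evicted thread's context if needed."""
        if thread in self.register_space:
            return
        if len(self.register_space) >= self.hardware_slots:
            if evict not in self.register_space:
                raise RuntimeError("no free register partition; name an active thread to evict")
            # Read the evicted thread's context out of register space into memory.
            self.system_memory[evict] = self.register_space.pop(evict)
        # Write the incoming thread's saved context into the freed register space.
        self.register_space[thread] = self.system_memory.pop(thread)

    def is_active(self, thread):
        """A thread is active when its context resides in register space."""
        return thread in self.register_space
```

The `is_active` check reflects the statement above that an active hardware thread is evidenced by the presence of its context information within the register space of the processing unit.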
A multi-core processor built with multiple instances of the processing core architecture of FIG. 3 may include any/all of the surrounding features discussed above with respect to FIG. 1.
FIG. 4 shows a flow chart describing a methodology of the processing core described above. According to the methodology of FIG. 4, first and second instructions of different hardware threads are fetched 401 and decoded in a shared front-end unit. The instructions are decoded and respective microcode and operand/resultant addresses for the instructions are issued to different processing units from the shared front-end unit 402. The respective processing units fetch data for their respective operands and issue the received microcode and respective operands to respective functional units 403. The functional units then execute their respective instructions 404.
FIG. 5 shows an embodiment of a processor 500 having multiple processing cores 501_1 through 501_N each having a respective shared front end unit 511_1, 511_2, . . . 511_N (with respective instruction TLB 516_1, 516_2, . . . 516_N) and respective processing units having a corresponding micro-code buffer (e.g., micro-code buffers 520_1, 520_2, etc. within the processing units of core 501_1). Each core also includes one or more caching levels 550_1, 550_2, . . . 550_N to cache instructions and/or data of each processing unit individually and/or a respective core as a whole. The cores 501_1, 501_2, . . . 501_N are coupled to one another through an interconnection network 502 that also couples the cores to one or more caching levels (e.g., last level cache 503 that caches instructions and/or data for the cores 501_1, 501_2, . . . 501_N) and a memory controller 504 that is coupled to, e.g., a “slice” of system memory. Other components such as any of the components of FIG. 1 may also be included in FIG. 5.
FIG. 6 shows an embodiment of a computing system, such as a computer, implemented with multiple processors 600_1 through 600_z having the features discussed above in FIG. 5. The multiple processors 600_1 through 600_z are connected to each other through a network that also couples the processors to a plurality of system memory units 608_1, 608_2, a non volatile storage unit 610 (e.g., a disk drive) and an external (e.g., Internet) network interface 611.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/200,203 US20190171462A1 (en) | 2012-12-28 | 2018-11-26 | Processing core having shared front end unit |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/730,719 US10140129B2 (en) | 2012-12-28 | 2012-12-28 | Processing core having shared front end unit |
US16/200,203 US20190171462A1 (en) | 2012-12-28 | 2018-11-26 | Processing core having shared front end unit |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/730,719 Continuation US10140129B2 (en) | 2012-12-28 | 2012-12-28 | Processing core having shared front end unit |
Publications (1)
Publication Number | Publication Date |
---|---|
US20190171462A1 true US20190171462A1 (en) | 2019-06-06 |
Family
ID=51018681
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/730,719 Active 2033-10-09 US10140129B2 (en) | 2012-12-28 | 2012-12-28 | Processing core having shared front end unit |
US16/200,203 Abandoned US20190171462A1 (en) | 2012-12-28 | 2018-11-26 | Processing core having shared front end unit |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/730,719 Active 2033-10-09 US10140129B2 (en) | 2012-12-28 | 2012-12-28 | Processing core having shared front end unit |
Country Status (3)
Country | Link |
---|---|
US (2) | US10140129B2 (en) |
CN (2) | CN110045988B (en) |
WO (1) | WO2014105207A1 (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9542193B2 (en) | 2012-12-28 | 2017-01-10 | Intel Corporation | Memory address collision detection of ordered parallel threads with bloom filters |
US10140129B2 (en) * | 2012-12-28 | 2018-11-27 | Intel Corporation | Processing core having shared front end unit |
US20160162290A1 (en) * | 2013-04-19 | 2016-06-09 | Institute Of Automation, Chinese Academy Of Sciences | Processor with Polymorphic Instruction Set Architecture |
US9747108B2 (en) | 2015-03-27 | 2017-08-29 | Intel Corporation | User-level fork and join processors, methods, systems, and instructions |
CN106250200A (en) * | 2016-08-02 | 2016-12-21 | 合肥奇也信息科技有限公司 | A kind of execution method dividing at least one software application section for computer |
US10489878B2 (en) * | 2017-05-15 | 2019-11-26 | Google Llc | Configurable and programmable image processor unit |
GB2563419B (en) * | 2017-06-15 | 2020-04-22 | Accelercomm Ltd | Polar decoder, communication unit, integrated circuit and method therefor |
US11893392B2 (en) | 2020-12-01 | 2024-02-06 | Electronics And Telecommunications Research Institute | Multi-processor system and method for processing floating point operation thereof |
US20230100586A1 (en) * | 2021-09-24 | 2023-03-30 | Intel Corporation | Circuitry and methods for accelerating streaming data-transformation operations |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7953933B1 (en) * | 2005-09-28 | 2011-05-31 | Oracle America, Inc. | Instruction cache, decoder circuit, basic block cache circuit and multi-block cache circuit |
US20140189300A1 (en) * | 2012-12-28 | 2014-07-03 | Name ILAN PARDO | Processing Core Having Shared Front End Unit |
US20150100763A1 (en) * | 2013-10-09 | 2015-04-09 | Arm Limited | Decoding a complex program instruction corresponding to multiple micro-operations |
Family Cites Families (138)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4943915A (en) | 1987-09-29 | 1990-07-24 | Digital Equipment Corporation | Apparatus and method for synchronization of a coprocessor unit in a pipelined central processing unit |
US4982402A (en) | 1989-02-03 | 1991-01-01 | Digital Equipment Corporation | Method and apparatus for detecting and correcting errors in a pipelined computer system |
US5329615A (en) | 1990-09-14 | 1994-07-12 | Hughes Aircraft Company | Concurrent general purpose and DMA processing in a graphics rendering processor |
CA2050658C (en) | 1990-09-14 | 1997-01-28 | John M. Peaslee | Dual hardware channels and hardware context switching in a graphics rendering processor |
US5276798A (en) | 1990-09-14 | 1994-01-04 | Hughes Aircraft Company | Multifunction high performance graphics rendering processor |
US5444853A (en) | 1992-03-31 | 1995-08-22 | Seiko Epson Corporation | System and method for transferring data between a plurality of virtual FIFO's and a peripheral via a hardware FIFO and selectively updating control information associated with the virtual FIFO's |
US5423025A (en) | 1992-09-29 | 1995-06-06 | Amdahl Corporation | Error handling mechanism for a controller having a plurality of servers |
US5430841A (en) | 1992-10-29 | 1995-07-04 | International Business Machines Corporation | Context management in a graphics system |
JPH07219774A (en) | 1994-02-07 | 1995-08-18 | Fujitsu Ltd | Data processor and exception processing method |
US5550988A (en) | 1994-03-01 | 1996-08-27 | Intel Corporation | Apparatus and method for performing error correction in a multi-processor system |
US6341324B1 (en) | 1995-10-06 | 2002-01-22 | Lsi Logic Corporation | Exception processing in superscalar microprocessor |
US5778211A (en) | 1996-02-15 | 1998-07-07 | Sun Microsystems, Inc. | Emulating a delayed exception on a digital computer having a corresponding precise exception mechanism |
US6061711A (en) | 1996-08-19 | 2000-05-09 | Samsung Electronics, Inc. | Efficient context saving and restoring in a multi-tasking computing system environment |
CN1147785C (en) * | 1996-08-27 | 2004-04-28 | 松下电器产业株式会社 | Multi-program-flow synchronous processor independently processing multiple instruction stream, soft controlling processing function of every instrunetion |
US6148326A (en) | 1996-09-30 | 2000-11-14 | Lsi Logic Corporation | Method and structure for independent disk and host transfer in a storage subsystem target device |
US6247040B1 (en) | 1996-09-30 | 2001-06-12 | Lsi Logic Corporation | Method and structure for automated switching between multiple contexts in a storage subsystem target device |
US6081849A (en) | 1996-10-01 | 2000-06-27 | Lsi Logic Corporation | Method and structure for switching multiple contexts in storage subsystem target device |
US6275497B1 (en) | 1997-02-10 | 2001-08-14 | Hybrid Networks, Inc. | Method and apparatus for controlling communication channels using contention and polling schemes |
US6075546A (en) | 1997-11-10 | 2000-06-13 | Silicon Grahphics, Inc. | Packetized command interface to graphics processor |
US6272522B1 (en) * | 1998-11-17 | 2001-08-07 | Sun Microsystems, Incorporated | Computer data packet switching and load balancing system using a general-purpose multiprocessor architecture |
US6397240B1 (en) | 1999-02-18 | 2002-05-28 | Agere Systems Guardian Corp. | Programmable accelerator for a programmable processor system |
GB2352066B (en) * | 1999-07-14 | 2003-11-05 | Element 14 Ltd | An instruction set for a computer |
CA2383526A1 (en) * | 1999-09-01 | 2001-03-15 | Intel Corporation | Branch instruction for multithreaded processor |
US6543026B1 (en) | 1999-09-10 | 2003-04-01 | Lsi Logic Corporation | Forward error correction apparatus and methods |
JP3621315B2 (en) | 1999-11-22 | 2005-02-16 | Necエレクトロニクス株式会社 | Microprocessor system |
US6820105B2 (en) | 2000-05-11 | 2004-11-16 | Cyberguard Corporation | Accelerated montgomery exponentiation using plural multipliers |
US6742104B2 (en) | 2000-08-21 | 2004-05-25 | Texas Instruments Incorporated | Master/slave processing system with shared translation lookaside buffer |
EP1182568A3 (en) | 2000-08-21 | 2004-07-21 | Texas Instruments Incorporated | TLB operation based on task-id |
EP1182569B8 (en) | 2000-08-21 | 2011-07-06 | Texas Instruments Incorporated | TLB lock and unlock operation |
JP3729087B2 (en) | 2001-05-23 | 2005-12-21 | 日本電気株式会社 | Multiprocessor system, data-dependent speculative execution control device and method thereof |
JP2003015900A (en) | 2001-06-28 | 2003-01-17 | Hitachi Ltd | Follow-up type multiplex system and data processing method capable of improving reliability by follow-up |
US20030028751A1 (en) | 2001-08-03 | 2003-02-06 | Mcdonald Robert G. | Modular accelerator framework |
US6901491B2 (en) | 2001-10-22 | 2005-05-31 | Sun Microsystems, Inc. | Method and apparatus for integration of communication links with a remote direct memory access protocol |
US7228401B2 (en) | 2001-11-13 | 2007-06-05 | Freescale Semiconductor, Inc. | Interfacing a processor to a coprocessor in which the processor selectively broadcasts to or selectively alters an execution mode of the coprocessor |
US20030126416A1 (en) | 2001-12-31 | 2003-07-03 | Marr Deborah T. | Suspending execution of a thread in a multi-threaded processor |
US20030135719A1 (en) | 2002-01-14 | 2003-07-17 | International Business Machines Corporation | Method and system using hardware assistance for tracing instruction disposition information |
US20030135718A1 (en) | 2002-01-14 | 2003-07-17 | International Business Machines Corporation | Method and system using hardware assistance for instruction tracing by revealing executed opcode or instruction |
US7313734B2 (en) | 2002-01-14 | 2007-12-25 | International Business Machines Corporation | Method and system for instruction tracing with enhanced interrupt avoidance |
US20040215444A1 (en) | 2002-03-25 | 2004-10-28 | Patel Mukesh K. | Hardware-translator-based custom method invocation system and method |
US6944746B2 (en) | 2002-04-01 | 2005-09-13 | Broadcom Corporation | RISC processor supporting one or more uninterruptible co-processors |
US7200735B2 (en) | 2002-04-10 | 2007-04-03 | Tensilica, Inc. | High-performance hybrid processor with configurable execution units |
GB2388447B (en) | 2002-05-09 | 2005-07-27 | Sun Microsystems Inc | A computer system method and program product for performing a data access from low-level code |
US6952214B2 (en) | 2002-07-12 | 2005-10-04 | Sun Microsystems, Inc. | Method for context switching a graphics accelerator comprising multiple rendering pipelines |
US7313797B2 (en) | 2002-09-18 | 2007-12-25 | Wind River Systems, Inc. | Uniprocessor operating system design facilitating fast context switching |
US20040111594A1 (en) | 2002-12-05 | 2004-06-10 | International Business Machines Corporation | Multithreading recycle and dispatch mechanism |
US7673304B2 (en) | 2003-02-18 | 2010-03-02 | Microsoft Corporation | Multithreaded kernel for graphics processing unit |
US7079147B2 (en) | 2003-05-14 | 2006-07-18 | Lsi Logic Corporation | System and method for cooperative operation of a processor and coprocessor |
US7714870B2 (en) | 2003-06-23 | 2010-05-11 | Intel Corporation | Apparatus and method for selectable hardware accelerators in a data driven architecture |
US7082508B2 (en) | 2003-06-24 | 2006-07-25 | Intel Corporation | Dynamic TLB locking based on page usage metric |
US7765388B2 (en) | 2003-09-17 | 2010-07-27 | Broadcom Corporation | Interrupt verification support mechanism |
US8566828B2 (en) | 2003-12-19 | 2013-10-22 | Stmicroelectronics, Inc. | Accelerator for multi-processing system and method |
US7302627B1 (en) | 2004-04-05 | 2007-11-27 | Mimar Tibet | Apparatus for efficient LFSR calculation in a SIMD processor |
US20050257186A1 (en) * | 2004-05-13 | 2005-11-17 | Michael Zilbershlag | Operation system for programmable hardware |
US7370243B1 (en) | 2004-06-30 | 2008-05-06 | Sun Microsystems, Inc. | Precise error handling in a fine grain multithreaded multicore processor |
US8190863B2 (en) | 2004-07-02 | 2012-05-29 | Intel Corporation | Apparatus and method for heterogeneous chip multiprocessors via resource allocation and restriction |
US7388588B2 (en) | 2004-09-09 | 2008-06-17 | International Business Machines Corporation | Programmable graphics processing engine |
US7437581B2 (en) | 2004-09-28 | 2008-10-14 | Intel Corporation | Method and apparatus for varying energy per instruction according to the amount of available parallelism |
US7809982B2 (en) | 2004-10-01 | 2010-10-05 | Lockheed Martin Corporation | Reconfigurable computing machine and related systems and methods |
US7350055B2 (en) | 2004-10-20 | 2008-03-25 | Arm Limited | Tightly coupled accelerator |
US7598958B1 (en) | 2004-11-17 | 2009-10-06 | Nvidia Corporation | Multi-chip graphics processing unit apparatus, system, and method |
US8788787B2 (en) | 2005-03-02 | 2014-07-22 | The Boeing Company | Systems, methods and architecture for facilitating software access to acceleration technology |
US20060288193A1 (en) | 2005-06-03 | 2006-12-21 | Silicon Integrated System Corp. | Register-collecting mechanism for multi-threaded processors and method using the same |
US7426626B2 (en) | 2005-08-23 | 2008-09-16 | Qualcomm Incorporated | TLB lock indicator |
US7583268B2 (en) | 2005-11-10 | 2009-09-01 | Via Technologies, Inc. | Graphics pipeline precise interrupt method and apparatus |
US7545381B2 (en) | 2005-11-10 | 2009-06-09 | Via Technologies, Inc. | Interruptible GPU and method for context saving and restoring |
US8212824B1 (en) | 2005-12-19 | 2012-07-03 | Nvidia Corporation | Apparatus and method for serial save and restore of graphics processing unit state information |
US7725624B2 (en) | 2005-12-30 | 2010-05-25 | Intel Corporation | System and method for cryptography processing units and multiplier |
US7509481B2 (en) * | 2006-03-03 | 2009-03-24 | Sun Microsystems, Inc. | Patchable and/or programmable pre-decode |
US7480838B1 (en) | 2006-03-23 | 2009-01-20 | Intel Corporation | Method, system and apparatus for detecting and recovering from timing errors |
US7746350B1 (en) | 2006-06-15 | 2010-06-29 | Nvidia Corporation | Cryptographic computations on general purpose graphics processing units |
US8041929B2 (en) * | 2006-06-16 | 2011-10-18 | Cisco Technology, Inc. | Techniques for hardware-assisted multi-threaded processing |
US7487341B2 (en) | 2006-06-29 | 2009-02-03 | Intel Corporation | Handling address translations and exceptions of a heterogeneous resource of a processor using another processor resource |
US8959311B2 (en) | 2006-08-25 | 2015-02-17 | Texas Instruments Incorporated | Methods and systems involving secure RAM |
US9478062B2 (en) | 2006-09-19 | 2016-10-25 | Imagination Technologies Limited | Memory allocation in distributed memories for multiprocessing |
US7949887B2 (en) | 2006-11-01 | 2011-05-24 | Intel Corporation | Independent power control of processing cores |
US8127113B1 (en) | 2006-12-01 | 2012-02-28 | Synopsys, Inc. | Generating hardware accelerators and processor offloads |
US7827383B2 (en) | 2007-03-09 | 2010-11-02 | Oracle America, Inc. | Efficient on-chip accelerator interfaces to reduce software overhead |
CN100489830C (en) * | 2007-03-19 | 2009-05-20 | 中国人民解放军国防科学技术大学 | 64 bit stream processor chip system structure oriented to scientific computing |
US8015368B2 (en) * | 2007-04-20 | 2011-09-06 | Siport, Inc. | Processor extensions for accelerating spectral band replication |
US7937568B2 (en) | 2007-07-11 | 2011-05-03 | International Business Machines Corporation | Adaptive execution cycle control method for enhanced instruction throughput |
US7743232B2 (en) * | 2007-07-18 | 2010-06-22 | Advanced Micro Devices, Inc. | Multiple-core processor with hierarchical microcode store |
US8345052B1 (en) | 2007-11-08 | 2013-01-01 | Nvidia Corporation | Method and system for using a GPU frame buffer in a multi-GPU system as cache memory |
US8339404B2 (en) | 2007-11-29 | 2012-12-25 | Accelereyes, Llc | System for improving utilization of GPU resources |
US8140823B2 (en) | 2007-12-03 | 2012-03-20 | Qualcomm Incorporated | Multithreaded processor with lock indicator |
GB2455344B (en) | 2007-12-06 | 2012-06-13 | Advanced Risc Mach Ltd | Recovering from control path errors |
US7865675B2 (en) | 2007-12-06 | 2011-01-04 | Arm Limited | Controlling cleaning of data values within a hardware accelerator |
US8780123B2 (en) | 2007-12-17 | 2014-07-15 | Nvidia Corporation | Interrupt handling techniques in the rasterizer of a GPU |
US7793080B2 (en) * | 2007-12-31 | 2010-09-07 | Globalfoundries Inc. | Processing pipeline having parallel dispatch and method thereof |
US8086825B2 (en) * | 2007-12-31 | 2011-12-27 | Advanced Micro Devices, Inc. | Processing pipeline having stage-specific thread selection and method thereof |
US7877582B2 (en) | 2008-01-31 | 2011-01-25 | International Business Machines Corporation | Multi-addressable register file |
US8055872B2 (en) | 2008-02-21 | 2011-11-08 | Arm Limited | Data processor with hardware accelerator, accelerator interface and shared memory management unit |
US8776077B2 (en) | 2008-04-02 | 2014-07-08 | Oracle America, Inc. | Method for multithreading an application using partitioning to allocate work to threads |
US8776030B2 (en) | 2008-04-09 | 2014-07-08 | Nvidia Corporation | Partitioning CUDA code for execution by a general purpose processor |
US8141102B2 (en) | 2008-09-04 | 2012-03-20 | International Business Machines Corporation | Data processing in a hybrid computing environment |
US8230442B2 (en) | 2008-09-05 | 2012-07-24 | International Business Machines Corporation | Executing an accelerator application program in a hybrid computing environment |
US8082426B2 (en) | 2008-11-06 | 2011-12-20 | Via Technologies, Inc. | Support of a plurality of graphic processing units |
US20100274972A1 (en) | 2008-11-24 | 2010-10-28 | Boris Babayan | Systems, methods, and apparatuses for parallel computing |
US7930519B2 (en) | 2008-12-17 | 2011-04-19 | Advanced Micro Devices, Inc. | Processor with coprocessor interfacing functional unit for forwarding result from coprocessor to retirement unit |
US8281185B2 (en) | 2009-06-30 | 2012-10-02 | Oracle America, Inc. | Advice-based feedback for transactional execution |
US20110040924A1 (en) | 2009-08-11 | 2011-02-17 | Selinger Robert D | Controller and Method for Detecting a Transmission Error Over a NAND Interface Using Error Detection Code |
US8458677B2 (en) | 2009-08-20 | 2013-06-04 | International Business Machines Corporation | Generating code adapted for interlinking legacy scalar code and extended vector code |
US8719547B2 (en) | 2009-09-18 | 2014-05-06 | Intel Corporation | Providing hardware support for shared virtual memory between local and remote physical memory |
US8405666B2 (en) | 2009-10-08 | 2013-03-26 | Advanced Micro Devices, Inc. | Saving, transferring and recreating GPU context information across heterogeneous GPUs during hot migration of a virtual machine |
US8244946B2 (en) | 2009-10-16 | 2012-08-14 | Brocade Communications Systems, Inc. | Interrupt moderation |
US8095824B2 (en) | 2009-12-15 | 2012-01-10 | Intel Corporation | Performing mode switching in an unbounded transactional memory (UTM) system |
US8166437B2 (en) * | 2009-12-15 | 2012-04-24 | Apple Inc. | Automated pad ring generation for programmable logic device implementation of integrated circuit design |
US8316194B2 (en) | 2009-12-15 | 2012-11-20 | Intel Corporation | Mechanisms to accelerate transactions using buffered stores |
US8970608B2 (en) | 2010-04-05 | 2015-03-03 | Nvidia Corporation | State objects for specifying dynamic state |
US9015443B2 (en) | 2010-04-30 | 2015-04-21 | International Business Machines Corporation | Reducing remote reads of memory in a hybrid computing environment |
JP4818450B1 (en) | 2010-06-30 | 2011-11-16 | 株式会社東芝 | Graphics processing unit and information processing apparatus |
US20120023314A1 (en) * | 2010-07-21 | 2012-01-26 | Crum Matthew M | Paired execution scheduling of dependent micro-operations |
US8667253B2 (en) | 2010-08-04 | 2014-03-04 | International Business Machines Corporation | Initiating assist thread upon asynchronous event for processing simultaneously with controlling thread and updating its running status in status register |
US9552206B2 (en) | 2010-11-18 | 2017-01-24 | Texas Instruments Incorporated | Integrated circuit with control node circuitry and processing circuitry |
EP2458510B1 (en) | 2010-11-29 | 2014-05-07 | NTT DoCoMo, Inc. | Method and apparatus for performing a cross-correlation |
US20120159090A1 (en) | 2010-12-16 | 2012-06-21 | Microsoft Corporation | Scalable multimedia computer system architecture with qos guarantees |
US20120166777A1 (en) * | 2010-12-22 | 2012-06-28 | Advanced Micro Devices, Inc. | Method and apparatus for switching threads |
CN102567556A (en) | 2010-12-27 | 2012-07-11 | 北京国睿中数科技股份有限公司 | Verifying method and verifying device for debugging-oriented processor |
CN102270166A (en) | 2011-02-22 | 2011-12-07 | 清华大学 | Simulator and method for injecting and tracking processor faults based on simulator |
US8683175B2 (en) | 2011-03-15 | 2014-03-25 | International Business Machines Corporation | Seamless interface for multi-threaded core accelerators |
US8892924B2 (en) | 2011-05-31 | 2014-11-18 | Intel Corporation | Reducing power consumption of uncore circuitry of a processor |
US8793515B2 (en) | 2011-06-27 | 2014-07-29 | Intel Corporation | Increasing power efficiency of turbo mode operation in a processor |
US9003102B2 (en) | 2011-08-26 | 2015-04-07 | Sandisk Technologies Inc. | Controller with extended status register and method of use therewith |
CN104011705A (en) | 2011-12-01 | 2014-08-27 | 新加坡国立大学 | Polymorphic heterogeneous multi-core architecture |
US20130159630A1 (en) | 2011-12-20 | 2013-06-20 | Ati Technologies Ulc | Selective cache for inter-operations in a processor-based environment |
US9436512B2 (en) | 2011-12-22 | 2016-09-06 | Board Of Supervisors Of Louisiana State University And Agricultural And Mechanical College | Energy efficient job scheduling in heterogeneous chip multiprocessors based on dynamic program behavior using prim model |
US9268596B2 (en) | 2012-02-02 | 2016-02-23 | Intel Corporation | Instruction and logic to test transactional execution status |
WO2013147885A1 (en) | 2012-03-30 | 2013-10-03 | Intel Corporation | Apparatus and method for accelerating operations in a processor which uses shared virtual memory |
EP2831721B1 (en) | 2012-03-30 | 2020-08-26 | Intel Corporation | Context switching mechanism for a processing core having a general purpose cpu core and a tightly coupled accelerator |
US20130332937A1 (en) | 2012-05-29 | 2013-12-12 | Advanced Micro Devices, Inc. | Heterogeneous Parallel Primitives Programming Model |
US9753778B2 (en) | 2012-07-20 | 2017-09-05 | Microsoft Technology Licensing, Llc | Domain-agnostic resource allocation framework |
US9123128B2 (en) | 2012-12-21 | 2015-09-01 | Nvidia Corporation | Graphics processing unit employing a standard processing unit and a method of constructing a graphics processing unit |
US9361116B2 (en) | 2012-12-28 | 2016-06-07 | Intel Corporation | Apparatus and method for low-latency invocation of accelerators |
US9417873B2 (en) | 2012-12-28 | 2016-08-16 | Intel Corporation | Apparatus and method for a hybrid latency-throughput processor |
US9053025B2 (en) | 2012-12-28 | 2015-06-09 | Intel Corporation | Apparatus and method for fast failure handling of instructions |
US20140189333A1 (en) | 2012-12-28 | 2014-07-03 | Oren Ben-Kiki | Apparatus and method for task-switchable synchronous hardware accelerators |
US9086813B2 (en) | 2013-03-15 | 2015-07-21 | Qualcomm Incorporated | Method and apparatus to save and restore system memory management unit (MMU) contexts |
US10031770B2 (en) | 2014-04-30 | 2018-07-24 | Intel Corporation | System and method of delayed context switching in processor registers |
US9703603B1 (en) | 2016-04-25 | 2017-07-11 | Nxp Usa, Inc. | System and method for executing accelerator call |
2012
- 2012-12-28: US application US13/730,719, publication US10140129B2 (status: Active)

2013
- 2013-06-28: WO application PCT/US2013/048694, publication WO2014105207A1 (status: Application Filing)
- 2013-06-28: CN application CN201811504065.4A, publication CN110045988B (status: Active)
- 2013-06-28: CN application CN201380060918.9A, publication CN105027075B (status: Active)

2018
- 2018-11-26: US application US16/200,203, publication US20190171462A1 (status: Abandoned)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7953933B1 (en) * | 2005-09-28 | 2011-05-31 | Oracle America, Inc. | Instruction cache, decoder circuit, basic block cache circuit and multi-block cache circuit |
US20140189300A1 (en) * | 2012-12-28 | 2014-07-03 | Name ILAN PARDO | Processing Core Having Shared Front End Unit |
US20150100763A1 (en) * | 2013-10-09 | 2015-04-09 | Arm Limited | Decoding a complex program instruction corresponding to multiple micro-operations |
Also Published As
Publication number | Publication date |
---|---|
CN110045988B (en) | 2023-08-15 |
CN105027075A (en) | 2015-11-04 |
WO2014105207A1 (en) | 2014-07-03 |
US10140129B2 (en) | 2018-11-27 |
US20140189300A1 (en) | 2014-07-03 |
CN110045988A (en) | 2019-07-23 |
CN105027075B (en) | 2019-01-29 |
Similar Documents
Publication | Title |
---|---|
US20190171462A1 (en) | Processing core having shared front end unit |
US11907105B2 (en) | Backward compatibility testing of software in a mode that disrupts timing |
US7818592B2 (en) | Token based power control mechanism |
US7818542B2 (en) | Method and apparatus for length decoding variable length instructions |
US8099566B2 (en) | Load/store ordering in a threaded out-of-order processor |
US8335911B2 (en) | Dynamic allocation of resources in a threaded, heterogeneous processor |
EP2176740B1 (en) | Method and apparatus for length decoding and identifying boundaries of variable length instructions |
US7702888B2 (en) | Branch predictor directed prefetch |
US11243775B2 (en) | System, apparatus and method for program order queue (POQ) to manage data dependencies in processor having multiple instruction queues |
JP5543366B2 (en) | System and method for performing locked operations |
US20180365053A1 (en) | Method and apparatus for dynamically balancing task processing while maintaining task order |
US20040034759A1 (en) | Multi-threaded pipeline with context issue rules |
US6581155B1 (en) | Pipelined, superscalar floating point unit having out-of-order execution capability and processor employing the same |
US20090006814A1 (en) | Immediate and Displacement Extraction and Decode Mechanism |
US7457932B2 (en) | Load mechanism |
CN112395000B (en) | Data preloading method and instruction processing device |
WO2021061626A1 (en) | Instruction executing method and apparatus |
CN112148106A (en) | System, apparatus and method for hybrid reservation station for processor |
CN113568663A (en) | Code prefetch instruction |
Legal Events
Code | Title | Description |
---|---|---|
STPP | Information on status: patent application and granting procedure in general | Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |