US20220035635A1 - Processor with multiple execution pipelines - Google Patents

Processor with multiple execution pipelines

Info

Publication number: US20220035635A1
Authority: US (United States)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: US 17/505,101
Inventors: Christian Wiencke, Shrey Bhatia
Current assignee: Texas Instruments Inc (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Original assignee: Texas Instruments Inc
Application filed by Texas Instruments Inc
Priority to US 17/505,101


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 — Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30181 — Instruction operation extension or modification
    • G06F 9/30196 — Instruction operation extension or modification using decoder, e.g. decoder per instruction set, adaptable or programmable decoders
    • G06F 9/38 — Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3818 — Decoding for concurrent execution
    • G06F 9/3822 — Parallel decoding, e.g. parallel decode units
    • G06F 9/3836 — Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution

Definitions

  • the dispatcher 104 includes dependency logic 120 that identifies dependencies (e.g., data dependencies) between instructions being executed, and causes the decode units 106 , 108 to resolve dependencies identified by the dependency logic 120 . For example, on identification of a dependency by the dependency logic 120 , the decode unit decoding the instruction subject to the dependency may delay transfer of the instruction to the execution control unit until the dependency has been resolved.
  • Each decode unit 106 , 108 passes decoded instructions to the corresponding execution control unit.
  • Full decode unit 106 passes instructions to full execution control unit 110 for execution
  • subset decode unit 108 passes instructions to subset execution control unit 112 for execution.
  • the full execution control unit 110 is capable of executing all instructions (i.e., the full and complete instruction set) executable by the processor 100.
  • the subset execution control unit 112 is capable of executing only a small subset of the instructions executable by the processor 100 (i.e., a small subset of the instructions decodable by the full decoder 106 ).
  • the subset execution control unit 112 may be capable of executing only selected instructions that are most frequently executed by the processor 100 .
  • Some embodiments of the subset execution control unit 112 may be capable of executing only instructions that apply relatively simple operand addressing (e.g., instructions applying only register or immediate addressing modes).
  • the full execution control unit 110 may include multiple execution stages 114 to provide a high instruction execution rate over the entire instruction set.
  • the subset execution control unit 112 may include fewer execution stages 114 than the full execution control unit 110 . For example, only a single execution stage 114 may be provided via the subset execution control unit 112 to execute the small subset of instructions executable by the subset execution control unit 112 .
  • the full decode unit 106 and full execution control unit 110 may decode and execute instructions of a complex instruction set (i.e., CISC instructions) and instructions of a reduced instruction set (i.e., RISC instructions) executable by the processor 100 , and the subset decode unit 108 and subset execution control unit 112 may decode and execute only the RISC instructions.
  • the subset decode unit 108 and subset execution control unit 112 may decode and execute only a subset of the RISC instructions executable by the processor 100 .
  • subset decode unit 108 and subset execution control unit 112 may decode and execute only RISC instructions that apply only the ALU 122 and/or that manipulate only operands stored in the register file 116 or provided in the instruction itself (i.e., apply only register or immediate addressing modes).
  • the execution units 118 include various function units (shift unit, multiply unit, etc.) applied by the execution control units 110 , 112 to manipulate data and perform other operations needed for instruction execution.
  • the full execution control unit 110 may have access to and apply any and all of the function units provided by the execution units 118 .
  • the subset execution control unit 112 may have access to and apply fewer function units of the execution units 118 than the full execution control unit 110 .
  • the subset execution control unit 112 may access and apply only the arithmetic logic unit (ALU) 122 in some embodiments.
  • Some embodiments of the execution units 118 may include more than one instance of a function unit to facilitate parallel instruction execution in the execution pipelines.
  • the execution units 118 may include more than one ALU 122 to allow parallel access to ALU 122 functionality by the full execution control unit 110 and the subset execution control unit 112 .
  • the register file 116 includes registers that store operands for access and manipulation by the dispatcher 104 , the full execution control unit 110 , the subset execution control unit 112 , and the execution units 118 .
  • the number and/or width of the registers included in the register file 116 may be different in different embodiments of the processor 100 .
  • the performance gained by inclusion of the subset decoder 108 and the subset execution control unit 112 in the processor 100 can approach that provided by conventional superscalar implementations, by providing parallel execution of the most frequently encountered instructions, while substantially reducing energy consumption relative to conventional superscalar implementations.
  • the circuitry added to the processor 100 to implement the subset decoder 108 and the subset execution control unit 112 is relatively small in comparison to the circuitry of the full decoder 106 and the full execution control unit 110 .
  • the energy consumed by the subset decoder 108 and the subset execution control unit 112 is relatively low in comparison to that consumed by the full decoder 106 and the full execution control unit 110 .
  • FIGS. 2 and 3 show diagrams of instruction execution in the processor 100 .
  • the fetch unit 102 provides, in fetch cycle 202 , instructions to be decoded and executed.
  • the full decode unit 106 decodes a first instruction in decode cycle 204
  • the subset decode unit 108 decodes a second instruction in decode cycle 206 , which is in parallel with decode cycle 204 .
  • Execution of the decoded instructions proceeds in parallel with full execution control unit 110 executing the first instruction in execution cycle 208 .
  • the second instruction is executed in parallel by the subset execution control unit 112 , which executes the second instruction in execution cycle 210 . Execution of the second instruction completes in a single cycle, while execution of the first instruction requires multiple cycles.
  • FIG. 3 shows a multi-instruction execution sequence in the processor 100 .
  • performance of the processor 100 is very similar to that achievable by a conventional superscalar architecture because the execution timing is constrained by data dependencies.
  • Executing selected instructions via the limited decoding and execution capabilities of the subset decode unit 108 and the subset execution control unit 112 can reduce the energy consumed by execution of the instruction sequence with little or no reduction in performance.
  • Instructions 1 and 2 are executed in parallel as explained with regard to FIG. 2 .
  • Instructions 3 and 4 are fetched in cycle 2, but a dependency 304 between the instructions causes the dispatcher 104 to delay execution of instruction 4 for one cycle. Accordingly, instruction 4 is decoded in cycle 4 in parallel with execution of instruction 3.
  • Execution of instructions 3 and 4 may be performed in either of the pipelines of processor 100 that provide suitable decoding and execution functionality. In some embodiments, execution by the subset decode unit 108 and subset execution control unit 112 may be selected to reduce energy consumption.
  • Instructions 5 and 6 are fetched in cycle 3. Decoding of instruction 5 is delayed until cycle 5 due to dependency 310 .
  • Instruction 5 is a complex instruction that requires multiple execution cycles in the full execution control unit 110 to complete.
  • Instruction 6 is also a complex instruction that must be executed in the full execution pipeline. Therefore, decoding of instruction 6 is delayed until cycle 6.
  • instruction 6 is the only instruction for which decoding and subsequent execution is delayed by the limited decoding and execution capabilities of the subset decode unit 108 and the subset execution control unit 112 when compared to execution by a conventional superscalar implementation.
  • Instructions 7 and 8 are fetched in cycle 4 and execution is delayed. Instruction 7 is decoded by the subset decode unit in cycle 6, and executed in the subset execution control unit in cycle 7. Instruction 8 is decoded by the subset decode unit 108 in cycle 7, and executed in the subset execution control unit 112 in cycle 8. In various embodiments, instruction 8 may be executed in either pipeline of the processor 100 .
  • FIGS. 4 and 5 show performance of conventional processors relative to the processor 100 .
  • FIG. 4 shows execution performance for a practical application exhibiting low instruction parallelism (e.g., due to a substantial number of instruction dependencies).
  • performance of the processor 100 and the conventional superscalar processor is only slightly better than that of the single-scalar processor.
  • the energy consumption of the processor 100 is slightly higher than that of the single-scalar processor, and the energy consumption of the conventional superscalar processor is substantially higher than that of the single-scalar processor and the processor 100.
  • FIG. 5 shows performance for a practical application exhibiting high instruction parallelism. Performance of both the conventional superscalar processor and the processor 100 is significantly higher than that of the single-scalar processor, with the conventional superscalar processor performing slightly better than the processor 100. However, the energy consumption of the conventional superscalar processor is substantially higher than that of the single-scalar processor, and the processor 100 consumes less energy than the single-scalar processor. Thus, as shown in FIGS. 4 and 5, the processor 100 can provide a substantial performance increase over a single-scalar processor while consuming much less energy than a conventional superscalar processor.
  • FIG. 6 shows a flow diagram for a method 600 for executing instructions in accordance with various embodiments. Though depicted sequentially as a matter of convenience, at least some of the actions shown can be performed in a different order and/or performed in parallel. Additionally, some embodiments may perform only some of the actions shown.
  • the fetch unit 102 fetches instructions from memory for execution and provides the fetched instructions to the dispatcher 104 .
  • the dispatcher 104 evaluates an instruction received from the fetch unit 102 , and determines whether the instruction can be executed by the subset execution pipeline (e.g., the subset decoder 108 and subset execution control unit 112 ).
  • the subset execution pipeline executes only a small subset of the full instruction set executable by the processor 100 .
  • the subset execution pipeline may execute only RISC instructions or a subset of the RISC instructions executable by the processor 100
  • the full instruction pipeline can execute any and all instructions (including CISC instructions) executable by the processor 100 .
  • the dispatcher 104 routes the instruction to the subset decoder 108 .
  • the subset decoder 108 decodes the instruction.
  • the subset decoder 108 passes the decoded instruction to the subset execution control unit 112 , and the subset execution control unit 112 applies the execution units 118 to execute the instruction. In some embodiments, the subset execution control unit 112 executes the instruction in a single cycle.
  • the dispatcher 104 routes the instruction to the full decoder 106 .
  • the full decoder 106 decodes the instruction.
  • the full decoder 106 passes the decoded instruction to the full execution control unit 110, and the full execution control unit 110 applies the execution units 118 to execute the instruction. In some embodiments, the full execution control unit 110 executes the instruction in multiple cycles.
  • While embodiments of the present disclosure have been described with reference to the processor 100, embodiments of the multi-pipeline arrangement disclosed herein may be applied to improve performance while minimizing energy consumption in a wide variety of instruction execution devices.
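The routing decision at the heart of method 600 can be sketched as follows. This is a minimal illustration only: the opcode names and the particular subset are invented, since the disclosure does not enumerate a specific instruction set.

```python
# Sketch of the flow shown in FIG. 6: fetch an instruction, test whether the
# subset execution pipeline can execute it, and route it accordingly.
# SUBSET_OPS is a hypothetical set of frequently executed, simple instructions.
SUBSET_OPS = {"add", "sub", "mov"}

def dispatch(instruction_stream):
    """Route each fetched instruction to the subset or full pipeline."""
    routing = []
    for op in instruction_stream:  # fetch stage provides the instructions
        # Instructions outside the subset must go to the full pipeline;
        # subset-executable instructions may use the smaller, cheaper pipeline.
        pipeline = "subset" if op in SUBSET_OPS else "full"
        routing.append((op, pipeline))  # decode and execution follow
    return routing

print(dispatch(["add", "div", "mov"]))
# [('add', 'subset'), ('div', 'full'), ('mov', 'subset')]
```

In a real implementation the test would be performed by the dispatcher's decode logic rather than a set lookup, but the control decision is the same.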

Abstract

An apparatus and method for increasing performance in a processor or other instruction execution device while minimizing energy consumption. A processor includes a first execution pipeline and a second execution pipeline. The first execution pipeline includes a first decode unit and a first execution control unit coupled to the first decode unit. The first execution control unit is configured to control execution of all instructions executable by the processor. The second execution pipeline includes a second decode unit, and a second execution control unit coupled to the second decode unit. The second execution control unit is configured to control execution of a subset of the instructions executable via the first execution control unit.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This application is a continuation of U.S. patent application Ser. No. 16/594,830, filed on Oct. 7, 2019, which is a continuation of U.S. patent application Ser. No. 14/554,709, filed on Nov. 26, 2014, now issued U.S. Pat. No. 10,437,596, each of which is incorporated herein by reference.
  • BACKGROUND
  • Processors and other instruction execution machines apply various techniques to increase performance. Pipelining is one technique employed to increase the performance of processing systems such as microprocessors. Pipelining divides the execution of an instruction (or operation) into a number of stages where each stage corresponds to one step in the execution of the instruction. As each stage completes processing of a given instruction, and processing of the given instruction passes to a subsequent stage, the stage becomes available to commence processing of the next instruction. Thus, pipelining increases the overall rate at which instructions can be executed by partitioning execution into a plurality of steps that allow a new instruction to begin execution before execution of a previous instruction is complete. A processor that includes a single instruction pipeline can execute instructions at a rate approaching one instruction per cycle.
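The throughput benefit of pipelining can be illustrated with a short calculation. This is a sketch only: the four-stage pipeline and the ideal no-stall behavior are simplifying assumptions, not details from the disclosure.

```python
# Toy timing model: in a pipeline, a new instruction enters the first stage
# every cycle, so instructions complete one per cycle after the pipeline fills.
def pipelined_cycles(num_instructions, num_stages):
    """Cycles to complete a run of instructions in an ideal (no-stall) pipeline."""
    if num_instructions == 0:
        return 0
    # The first instruction takes num_stages cycles; each subsequent
    # instruction completes one cycle after the previous one.
    return num_stages + (num_instructions - 1)

def unpipelined_cycles(num_instructions, num_stages):
    """Cycles if each instruction runs all stages to completion before the next starts."""
    return num_instructions * num_stages

# Throughput approaches one instruction per cycle as the run lengthens:
print(pipelined_cycles(1000, 4))    # 1003 cycles, i.e. ~1 instruction/cycle
print(unpipelined_cycles(1000, 4))  # 4000 cycles
```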
  • SUMMARY
  • An apparatus and method for increasing performance in a processor, or other instruction execution device, while minimizing energy consumption are disclosed herein. In one embodiment, a processor includes a first execution pipeline and a second execution pipeline. The first execution pipeline includes a first decode unit and a first execution control unit coupled to the first decode unit. The first execution control unit is configured to control execution of all instructions executable by the processor. The second execution pipeline includes a second decode unit, and a second execution control unit coupled to the second decode unit. The second execution control unit is configured to control execution of only a subset of the instructions executable via the first execution control unit.
  • In another embodiment, a method includes fetching an instruction to be executed by a processor. Whether the instruction is executable by a first execution control unit configured to execute only a subset of all instructions executable by the processor is determined. The instruction is directed to a second execution control unit configured to execute all instructions executable by the processor based on the first execution control unit not being configured to execute the instruction.
  • In a further embodiment, an instruction execution device includes a first execution pipeline and a second execution pipeline. The first execution pipeline includes a first execution control unit and a first decode unit. The first execution control unit is configured to control execution of all instructions executable by the device, and to apply all operand addressing modes supported by the device to access operands. The first decode unit is coupled to the first execution control unit, and is configured to decode all instructions executable by the device. The second execution pipeline includes a second execution control unit and a second decode unit. The second execution control unit is configured to control execution of only a subset of the instructions executable via the first execution control unit, and to apply only register and immediate addressing modes to access operands. The second decode unit is coupled to the second execution control unit, and is configured to decode only the subset of the instructions.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • For a detailed description of exemplary embodiments of the invention, reference will now be made to the accompanying drawings in which:
  • FIG. 1 shows a block diagram of a processor in accordance with various embodiments;
  • FIGS. 2 and 3 show diagrams of instruction execution in pipelines of a processor in accordance with various embodiments;
  • FIGS. 4 and 5 show performance of conventional processors relative to a multi-pipeline processor in accordance with various embodiments; and
  • FIG. 6 shows a flow diagram for a method for executing instructions in a multi-pipeline device in accordance with various embodiments.
  • NOTATION AND NOMENCLATURE
  • Certain terms are used throughout the following description and claims to refer to particular system components. As one skilled in the art will appreciate, companies may refer to a component by different names. This document does not intend to distinguish between components that differ in name but not function. In the following discussion and in the claims, the terms “including” and “comprising” are used in an open-ended fashion, and thus should be interpreted to mean “including, but not limited to . . . .” Also, the term “couple” or “couples” is intended to mean either an indirect or direct electrical connection. Thus, if a first device couples to a second device, that connection may be through a direct electrical connection, or through an indirect electrical connection via other devices and connections. The recitation “based on” is intended to mean “based at least in part on.” Therefore, if X is based on Y, X may be based on Y and any number of additional factors. The term “subset,” as used herein, means a “proper subset” that includes fewer than all the elements of a set from which the subset is derived.
  • DETAILED DESCRIPTION
  • The following discussion is directed to various embodiments of the invention. Although one or more of these embodiments may be preferred, the embodiments disclosed should not be interpreted, or otherwise used, as limiting the scope of the disclosure, including the claims. In addition, one skilled in the art will understand that the following description has broad application, and the discussion of any embodiment is meant only to be exemplary of that embodiment, and not intended to intimate that the scope of the disclosure, including the claims, is limited to that embodiment.
  • Superscalar processors include multiple instruction pipelines operating in parallel in order to provide execution of more than one instruction per cycle. In a superscalar processor, the fetch unit can provide more than one instruction per cycle, multiple instruction decoders determine which instructions can be executed in parallel, and multiple execution pipelines operate in parallel using redundant execution units. However, a superscalar processor generally has a much higher gate count and energy consumption than a single-scalar processor, and in real-world applications the performance increase provided by a superscalar processor may be much less than the full capacity of the execution pipelines. Consequently, the energy consumed per task by a superscalar processor can be significantly higher than the energy consumed by a single-scalar processor executing the same tasks. Increased per task energy consumption is one reason that superscalar processors are rarely applied in embedded systems that are directed to low energy consumption applications, such as applications in which long battery life is important.
  • Embodiments of the present disclosure include multiple execution pipelines arranged to increase the rate of instruction execution relative to single-scalar processors, and to reduce energy consumption in comparison to conventional superscalar processors. While conventional superscalar processors include multiple instruction decoders, each capable of decoding the full instruction set of the superscalar processor, embodiments disclosed herein include a single decoder capable of decoding the full instruction set, and one or more additional decoders capable of decoding only a small subset of the full instruction set. Similarly, embodiments of the present disclosure include a single execution pipeline capable of executing the full instruction set, and one or more additional execution pipelines capable of executing only the small subset of the full instruction set. The small subset of instructions executable by the additional execution pipeline(s) includes instructions most frequently executed in practical applications.
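The decoder arrangement described above can be pictured as set membership: every pipeline's decoder accepts the full instruction set except the additional decoder(s), which accept only a small, fixed subset. The opcode and unit names below are invented for illustration.

```python
# Hypothetical instruction set: the full decode unit handles everything,
# while the subset decode unit recognizes only a few frequently executed,
# simple instructions.
FULL_SET = {"add", "sub", "mov", "mul", "div", "load", "store", "branch", "shift"}
SUBSET = {"add", "sub", "mov"}

# Per the patent's usage of "subset", this must be a proper subset.
assert SUBSET < FULL_SET

def decoders_for(opcode):
    """Return the decode units that can accept this opcode."""
    units = []
    if opcode in FULL_SET:
        units.append("full_decode_unit")
    if opcode in SUBSET:
        units.append("subset_decode_unit")
    return units

print(decoders_for("mov"))  # accepted by both pipelines
print(decoders_for("div"))  # accepted only by the full pipeline
```

Because the subset decoder only needs logic for a few simple encodings, its gate count and energy cost are a small fraction of the full decoder's, which is the source of the energy savings claimed over a conventional superscalar design.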
  • FIG. 1 shows a block diagram of a processor 100 in accordance with various embodiments. The processor 100 includes a fetch unit 102, a dispatcher 104, a full execution control unit 110, a subset execution control unit 112, a register file 116, and execution units 118. The processor 100 may include various other components and subsystems that have been omitted from FIG. 1 in the interest of clarity. For example, embodiments of the processor 100 may include instruction and/or data caches, memory, communication devices, interrupt controllers, timers, clock circuitry, direct memory access controllers, and various other components and peripherals.
  • The fetch unit 102 retrieves instructions to be executed by the processor 100 from a storage device, such as a random access memory. The fetch unit 102 may include program counters that specify the location of instructions being retrieved, pre-fetching logic that retrieves and stores instructions for later execution, etc.
  • The dispatcher 104 assigns each instruction provided by the fetch unit 102 for execution to one of the multiple execution pipelines of the processor 100, where each execution pipeline includes a decode unit and an execution control unit. In the embodiment of FIG. 1, the processor 100 includes two execution pipelines. Other embodiments of the processor 100 may include more than two execution pipelines. The dispatcher 104 includes full decode unit 106 and subset decode unit 108. The decode units 106, 108 examine the instructions received from the fetch unit 102, and translate each instruction into controls suitable for operating the associated execution control units, processor registers, and other components of the processor 100 to perform operations that effectuate execution of the instructions.
  • The full decode unit 106 is capable of decoding all instructions (i.e., the full and complete instruction set) executable by the processor 100. The subset decode unit 108 is capable of decoding only a small subset of the instructions executable by the processor 100 (i.e., a small subset of the instructions decodable by the full decoder 106). For example, the subset decode unit 108 may be capable of decoding only the most frequently executed instructions, or selected ones of the most frequently executed instructions. Some embodiments of the subset decode unit 108 may be capable of decoding only instructions that apply relatively simple operand addressing (e.g., instructions applying only register or immediate addressing modes).
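The qualifying test implied above — whether an instruction falls within the subset decoder's limited capabilities — can be sketched as a simple predicate. The following Python sketch is illustrative only: the opcode list and the operand-mode representation are assumptions for the example, not taken from the disclosure.

```python
# Hypothetical sketch: an instruction qualifies for the subset decode unit 108
# only if its opcode is among the most frequently executed operations and it
# uses only register or immediate operand addressing. The opcode set and the
# instruction representation are assumptions for illustration.
FREQUENT_OPS = {"ADD", "SUB", "AND", "OR", "XOR", "MOV", "CMP"}
SIMPLE_MODES = {"register", "immediate"}

def subset_decodable(opcode, operand_modes):
    """True if the subset decode unit could handle this instruction."""
    return opcode in FREQUENT_OPS and all(m in SIMPLE_MODES for m in operand_modes)

# A register-to-register ADD qualifies; a memory-indirect load does not.
assert subset_decodable("ADD", ["register", "register"])
assert not subset_decodable("LOAD", ["register", "indirect"])
```

An actual implementation would make this decision combinationally from the instruction encoding rather than from symbolic names; the predicate form only illustrates the routing criterion.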
  • The dispatcher 104 includes dependency logic 120 that identifies dependencies (e.g., data dependencies) between instructions being executed, and causes the decode units 106, 108 to resolve dependencies identified by the dependency logic 120. For example, on identification of a dependency by the dependency logic 120, the decode unit decoding the instruction subject to the dependency may delay transfer of the instruction to the execution control unit until the dependency has been resolved.
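The check performed by the dependency logic 120 can be illustrated as a read-after-write comparison between an in-flight instruction and a candidate instruction; a minimal sketch, with hypothetical register names, follows.

```python
# Illustrative model of the dependency check: a later instruction must be
# delayed if it reads a register that an earlier, still-executing instruction
# writes (a read-after-write hazard). Register names are hypothetical.
def must_delay(earlier_writes, later_reads):
    """True if the later instruction depends on the earlier one."""
    return bool(set(earlier_writes) & set(later_reads))

assert must_delay(earlier_writes={"r1"}, later_reads={"r1", "r2"})      # stall one cycle
assert not must_delay(earlier_writes={"r3"}, later_reads={"r1", "r2"})  # issue in parallel
```

In hardware this comparison would be done with register-number comparators across the pipeline stages; the set intersection above captures only the logical condition.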
  • Each decode unit 106, 108 passes decoded instructions to the corresponding execution control unit. Full decode unit 106 passes instructions to full execution control unit 110 for execution, and subset decode unit 108 passes instructions to subset execution control unit 112 for execution. The full execution control unit 110 is capable of executing all instructions (i.e., the full and complete instruction set) executable by the processor 100. The subset execution control unit 112 is capable of executing only a small subset of the instructions executable by the processor 100 (i.e., a small subset of the instructions decodable by the full decoder 106). For example, the subset execution control unit 112 may be capable of executing only selected instructions that are most frequently executed by the processor 100. Some embodiments of the subset execution control unit 112 may be capable of executing only instructions that apply relatively simple operand addressing (e.g., instructions applying only register or immediate addressing modes).
  • The full execution control unit 110 may include multiple execution stages 114 to provide a high instruction execution rate over the entire instruction set. The subset execution control unit 112 may include fewer execution stages 114 than the full execution control unit 110. For example, only a single execution stage 114 may be provided in the subset execution control unit 112 to execute the small subset of instructions executable by the subset execution control unit 112.
  • In some embodiments of the processor 100, the full decode unit 106 and full execution control unit 110 may decode and execute instructions of a complex instruction set (i.e., CISC instructions) and instructions of a reduced instruction set (i.e., RISC instructions) executable by the processor 100, and the subset decode unit 108 and subset execution control unit 112 may decode and execute only the RISC instructions. In some embodiments of the processor 100, the subset decode unit 108 and subset execution control unit 112 may decode and execute only a subset of the RISC instructions executable by the processor 100. For example, the subset decode unit 108 and subset execution control unit 112 may decode and execute only RISC instructions that apply only the ALU 122 and/or that manipulate only operands stored in the register file 116 or provided in the instruction itself (i.e., apply only register or immediate addressing modes).
  • The execution units 118 include various function units (shift unit, multiply unit, etc.) applied by the execution control units 110, 112 to manipulate data and perform other operations needed for instruction execution. The full execution control unit 110 may have access to and apply any and all of the function units provided by the execution units 118. The subset execution control unit 112 may have access to and apply fewer function units of the execution units 118 than the full execution control unit 110. For example, the subset execution control unit 112 may access and apply only the arithmetic logic unit (ALU) 122 in some embodiments. Some embodiments of the execution units 118 may include more than one instance of a function unit to facilitate parallel instruction execution in the execution pipelines. For example, the execution units 118 may include more than one ALU 122 to allow parallel access to ALU 122 functionality by the full execution control unit 110 and the subset execution control unit 112.
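The asymmetric access to the execution units 118 described above can be modeled as a coverage check: a pipeline can execute an instruction only if the function units the instruction requires are among those the pipeline's execution control unit can apply. A brief sketch, with unit names beyond the ALU chosen for illustration:

```python
# Sketch of asymmetric function-unit access: the full execution control unit
# may apply every function unit, while the subset execution control unit
# reaches only the ALU in this example. Unit names other than the ALU are
# illustrative assumptions.
FULL_PIPELINE_UNITS = {"alu", "shift", "multiply"}
SUBSET_PIPELINE_UNITS = {"alu"}

def pipeline_can_execute(available_units, required_units):
    """True if a pipeline's reachable function units cover an instruction's needs."""
    return set(required_units) <= available_units

assert pipeline_can_execute(SUBSET_PIPELINE_UNITS, {"alu"})
assert not pipeline_can_execute(SUBSET_PIPELINE_UNITS, {"multiply"})
assert pipeline_can_execute(FULL_PIPELINE_UNITS, {"multiply"})
```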
  • The register file 116 includes registers that store operands for access and manipulation by the dispatcher 104, the full execution control unit 110, the subset execution control unit 112, and the execution units 118. The number and/or width of the registers included in the register file 116 may be different in different embodiments of the processor 100.
  • In practice, the performance gained by inclusion of the subset decoder 108 and the subset execution control unit 112 in the processor 100 can approach that provided by conventional superscalar implementations, by providing parallel execution of the most frequently encountered instructions, while substantially reducing energy consumption relative to conventional superscalar implementations. The circuitry added to the processor 100 to implement the subset decoder 108 and the subset execution control unit 112 is relatively small in comparison to the circuitry of the full decoder 106 and the full execution control unit 110. As a result, the energy consumed by the subset decoder 108 and the subset execution control unit 112 is relatively low in comparison to that consumed by the full decoder 106 and the full execution control unit 110.
  • FIGS. 2 and 3 show diagrams of instruction execution in the processor 100. In FIG. 2, the fetch unit 102 provides, in fetch cycle 202, instructions to be decoded and executed. The full decode unit 106 decodes a first instruction in decode cycle 204, and the subset decode unit 108 decodes a second instruction in decode cycle 206, which is in parallel with decode cycle 204. Execution of the decoded instructions proceeds in parallel with full execution control unit 110 executing the first instruction in execution cycle 208. The second instruction is executed in parallel by the subset execution control unit 112, which executes the second instruction in execution cycle 210. Execution of the second instruction completes in a single cycle, while execution of the first instruction requires multiple cycles.
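The timing relationship of FIG. 2 can be approximated numerically: both instructions are fetched together and decoded in parallel in the following cycle, after which the subset pipeline retires its instruction in a single execution cycle while the full pipeline needs several. The specific cycle counts below are assumptions for illustration.

```python
# Rough model of the FIG. 2 timing. Both instructions are fetched in the same
# cycle and decoded one cycle later, in parallel; execution lengths differ.
def completion_cycle(fetch_cycle, execute_cycles):
    decode_cycle = fetch_cycle + 1        # decode follows fetch
    return decode_cycle + execute_cycles  # cycle in which execution completes

first = completion_cycle(fetch_cycle=1, execute_cycles=3)   # full pipeline, multi-cycle
second = completion_cycle(fetch_cycle=1, execute_cycles=1)  # subset pipeline, single cycle
assert second == 3 and first == 5
assert second < first   # the subset pipeline retires its instruction sooner
```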
  • FIG. 3 shows a multi-instruction execution sequence in the processor 100. In the execution sequence of FIG. 3, performance of the processor 100 is very similar to that achievable by a conventional superscalar architecture because the execution timing is constrained by data dependencies. Executing selected instructions via the limited decoding and execution capabilities of the subset decode unit 108 and the subset execution control unit 112 can reduce the energy consumed by execution of the instruction sequence with little or no reduction in performance.
  • Instructions 1 and 2 are executed in parallel as explained with regard to FIG. 2. Instructions 3 and 4 are fetched in cycle 2, but a dependency 304 between the instructions causes the dispatcher 104 to delay execution of instruction 4 for one cycle. Accordingly, instruction 4 is decoded in cycle 4 in parallel with execution of instruction 3. Execution of instructions 3 and 4 may be performed in either of the pipelines of processor 100 that provide suitable decoding and execution functionality. In some embodiments, execution by the subset decode unit 108 and subset execution control unit 112 may be selected to reduce energy consumption.
  • Instructions 5 and 6 are fetched in cycle 3. Decoding of instruction 5 is delayed until cycle 5 due to dependency 310. Instruction 5 is a complex instruction that requires multiple execution cycles in the full execution control unit 110 to complete. Instruction 6 is also a complex instruction that must be executed in the full execution pipeline. Therefore, decoding of instruction 6 is delayed until cycle 6. In the instruction sequence of FIG. 3, instruction 6 is the only instruction for which decoding and subsequent execution is delayed by the limited decoding and execution capabilities of the subset decode unit 108 and the subset execution control unit 112 when compared to execution by a conventional superscalar implementation.
  • Instructions 7 and 8 are fetched in cycle 4 and execution is delayed. Instruction 7 is decoded by the subset decode unit 108 in cycle 6, and executed in the subset execution control unit 112 in cycle 7. Instruction 8 is decoded by the subset decode unit 108 in cycle 7, and executed in the subset execution control unit 112 in cycle 8. In various embodiments, instruction 8 may be executed in either pipeline of the processor 100.
  • FIGS. 4 and 5 show performance of conventional processors relative to the processor 100. In FIG. 4, execution performance for a practical application exhibiting low instruction parallelism (e.g., due to a substantial number of instruction dependencies) is shown. Because of the low instruction parallelism, performance of the processor 100 and the conventional superscalar processor are only slightly better than that of the single-scalar processor. The energy consumption of the processor 100 is slightly higher than that of the single-scalar processor, and the energy consumption of the conventional superscalar processor is substantially higher than that of the single-scalar processor and the processor 100.
  • FIG. 5 shows performance for a practical application exhibiting high instruction parallelism. Performance of both the conventional superscalar processor and the processor 100 is significantly higher than that of the single-scalar processor, with the conventional superscalar processor performing slightly better than the processor 100. However, the energy consumption of the conventional superscalar processor is substantially higher than that of the single-scalar processor, and the processor 100 consumes less energy than the single-scalar processor. Thus, as shown in FIGS. 4 and 5, the processor 100 can provide a substantial performance increase over a single-scalar processor while consuming much less energy than a conventional superscalar processor.
  • FIG. 6 shows a flow diagram for a method 600 for executing instructions in accordance with various embodiments. Though depicted sequentially as a matter of convenience, at least some of the actions shown can be performed in a different order and/or performed in parallel. Additionally, some embodiments may perform only some of the actions shown.
  • In block 602, the fetch unit 102 fetches instructions from memory for execution and provides the fetched instructions to the dispatcher 104.
  • In block 604, the dispatcher 104 evaluates an instruction received from the fetch unit 102, and determines whether the instruction can be executed by the subset execution pipeline (e.g., the subset decoder 108 and subset execution control unit 112). As explained above, the subset execution pipeline executes only a small subset of the full instruction set executable by the processor 100. For example, the subset execution pipeline may execute only RISC instructions or a subset of the RISC instructions executable by the processor 100, while the full instruction pipeline can execute any and all instructions (including CISC instructions) executable by the processor 100.
  • If, in block 606, the subset execution pipeline is deemed capable of executing the instruction evaluated by the dispatcher 104, then in block 608, the dispatcher 104 routes the instruction to the subset decoder 108. The subset decoder 108 decodes the instruction.
  • In block 610, the subset decoder 108 passes the decoded instruction to the subset execution control unit 112, and the subset execution control unit 112 applies the execution units 118 to execute the instruction. In some embodiments, the subset execution control unit 112 executes the instruction in a single cycle.
  • If, in block 606, the subset execution pipeline is deemed incapable of executing the instruction evaluated by the dispatcher 104, then in block 612, the dispatcher 104 routes the instruction to the full decoder 106. The full decoder 106 decodes the instruction.
  • In block 614, the full decoder 106 passes the decoded instruction to the full execution control unit 110, and the full execution control unit 110 applies the execution units 118 to execute the instruction. In some embodiments, the full execution control unit 110 executes the instruction in multiple cycles.
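The flow of method 600 can be summarized as a routing decision applied to each fetched instruction: route to the subset pipeline when it qualifies, otherwise to the full pipeline. In the minimal sketch below, the qualifying test (membership in a fixed set of simple opcodes) is an assumption standing in for the evaluation of block 604.

```python
# A minimal end-to-end sketch of method 600: each fetched instruction is
# routed to the subset pipeline when it qualifies (blocks 606/608/610) and
# to the full pipeline otherwise (blocks 612/614). The opcode set used as
# the qualifying test is an assumption for illustration only.
SUBSET_OPS = {"ADD", "SUB", "MOV", "CMP"}

def dispatch(fetched):
    """Return (instruction, pipeline) pairs, mirroring blocks 604-614."""
    return [(op, "subset" if op in SUBSET_OPS else "full") for op in fetched]

assert dispatch(["ADD", "MEMCOPY", "MOV"]) == [
    ("ADD", "subset"), ("MEMCOPY", "full"), ("MOV", "subset")]
```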
  • While embodiments of the present disclosure have been described with reference to the processor 100, embodiments of the multi-pipeline arrangement disclosed herein may be applied to improve performance while minimizing energy consumption in a wide variety of instruction execution devices.
  • The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

Claims (20)

What is claimed is:
1. A processing device comprising:
an instruction fetch unit;
a first decode unit coupled to the instruction fetch unit;
a first execution control unit coupled to the first decode unit;
a second decode unit coupled to the instruction fetch unit;
a second execution control unit coupled to the second decode unit; and
a set of execution units coupled to the first execution control unit and the second execution control unit wherein:
the first decode unit and the first execution control unit are configured to cause the set of execution units to execute any instruction of a first instruction set; and
the second decode unit and the second execution control unit are configured to cause the set of execution units to execute any instruction of a second instruction set that is a subset of the first instruction set such that a remainder of the first instruction set is not in the second instruction set.
2. The processing device of claim 1 further comprising a register file coupled to the set of execution units, wherein each instruction in the second instruction set utilizes a register operand mode, an immediate addressing mode, or a combination thereof.
3. The processing device of claim 1, wherein the second instruction set includes only reduced instruction set computer (RISC) instructions and the remainder of the first instruction set includes complex instruction set computer (CISC) instructions.
4. The processing device of claim 1, wherein the second instruction set includes only instructions that can be executed in one clock cycle and the remainder of the first instruction set includes instructions that are executed in more than one clock cycle.
5. The processing device of claim 1, wherein:
the set of execution units includes an arithmetic logic unit and a remainder of the set of execution units;
the first execution control unit and the second execution control unit are each configured to control the arithmetic logic unit;
the first execution control unit is configured to control the remainder of the set of execution units; and
the second execution control unit is not configured to control the remainder of the set of execution units.
6. The processing device of claim 5, wherein the remainder of the set of execution units includes at least one of: a shift unit, a multiply unit, a floating point unit, or a load/store unit.
7. The processing device of claim 1, wherein:
the first execution control unit is configured to cause the set of execution units to execute an instruction of the first instruction set in a first number of pipeline stages; and
the second execution control unit is configured to cause the set of execution units to execute an instruction of the second instruction set in a second number of pipeline stages that is less than the first number of pipeline stages.
8. The processing device of claim 1, wherein:
the instruction fetch unit is configured to receive a first instruction and provide the first instruction to either the first execution control unit or the second execution control unit based on whether the first instruction is in the second instruction set.
9. The processing device of claim 1, wherein the first decode unit and the second decode unit are coupled to operate in parallel.
10. The processing device of claim 8, wherein the first execution control unit is configured to cause the set of execution units to execute a first instruction and the second execution control unit is configured to cause the set of execution units to execute a second instruction in parallel with the first instruction.
11. A processing device comprising:
a first decode unit;
a second decode unit coupled in parallel with the first decode unit;
a first execution control unit coupled to the first decode unit;
a second execution control unit coupled to the second decode unit; and
a set of execution units coupled to the first execution control unit and the second execution control unit wherein:
the first decode unit and the first execution control unit are configured to cause the set of execution units to execute a first instruction having a first addressing mode and a second instruction having a second addressing mode that is different from the first addressing mode; and
the second decode unit and the second execution control unit are configured to cause the set of execution units to execute the first instruction having the first addressing mode and not configured to cause the set of execution units to execute the second instruction having the second addressing mode.
12. The processing device of claim 11, wherein the first addressing mode includes at least one of a register operand mode or an immediate addressing mode.
13. The processing device of claim 11, wherein the first instruction utilizes only a register operand mode, an immediate addressing mode, or a combination thereof.
14. The processing device of claim 11, wherein the second addressing mode is not a register operand mode or an immediate addressing mode.
15. The processing device of claim 11, wherein:
the first execution control unit includes a first number of execution stage circuits; and
the second execution control unit includes a second number of execution stage circuits that is less than the first number of execution stage circuits.
16. The processing device of claim 15, wherein:
the second decode unit and the second execution control unit are configured to cause the set of execution units to execute the first instruction in a single clock cycle; and
the first decode unit and the first execution control unit are configured to cause the set of execution units to execute the second instruction in more than one clock cycle.
17. The processing device of claim 13, wherein:
the set of execution units includes an arithmetic logic unit and a remainder of the set of execution units;
the first execution control unit and the second execution control unit are each configured to control the arithmetic logic unit; and
the first execution control unit but not the second execution control unit is configured to control the remainder of the set of execution units.
18. The processing device of claim 17, wherein the remainder of the set of execution units includes at least one of: a shift unit, a multiply unit, a floating point unit, or a load/store unit.
19. The processing device of claim 11 further comprising an instruction fetch unit, wherein the first decode unit and the second decode unit are coupled to the instruction fetch unit in parallel.
20. The processing device of claim 19, wherein the first execution control unit and the second execution control unit are configured to cause the set of execution units to execute instructions in parallel.
US17/505,101 2014-11-26 2021-10-19 Processor with multiple execution pipelines Pending US20220035635A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/505,101 US20220035635A1 (en) 2014-11-26 2021-10-19 Processor with multiple execution pipelines

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US14/554,709 US10437596B2 (en) 2014-11-26 2014-11-26 Processor with a full instruction set decoder and a partial instruction set decoder
US16/594,830 US11150906B2 (en) 2014-11-26 2019-10-07 Processor with a full instruction set decoder and a partial instruction set decoder
US17/505,101 US20220035635A1 (en) 2014-11-26 2021-10-19 Processor with multiple execution pipelines

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US16/594,830 Continuation US11150906B2 (en) 2014-11-26 2019-10-07 Processor with a full instruction set decoder and a partial instruction set decoder

Publications (1)

Publication Number Publication Date
US20220035635A1 true US20220035635A1 (en) 2022-02-03

Family

ID=56010275

Family Applications (3)

Application Number Title Priority Date Filing Date
US14/554,709 Active 2035-05-19 US10437596B2 (en) 2014-11-26 2014-11-26 Processor with a full instruction set decoder and a partial instruction set decoder
US16/594,830 Active US11150906B2 (en) 2014-11-26 2019-10-07 Processor with a full instruction set decoder and a partial instruction set decoder
US17/505,101 Pending US20220035635A1 (en) 2014-11-26 2021-10-19 Processor with multiple execution pipelines

Family Applications Before (2)

Application Number Title Priority Date Filing Date
US14/554,709 Active 2035-05-19 US10437596B2 (en) 2014-11-26 2014-11-26 Processor with a full instruction set decoder and a partial instruction set decoder
US16/594,830 Active US11150906B2 (en) 2014-11-26 2019-10-07 Processor with a full instruction set decoder and a partial instruction set decoder

Country Status (1)

Country Link
US (3) US10437596B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024025864A1 (en) * 2022-07-28 2024-02-01 Texas Instruments Incorporated Multiple instruction set architectures on a processing device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7409208B2 (en) * 2020-04-10 2024-01-09 富士通株式会社 arithmetic processing unit

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140173312A1 (en) * 2012-12-13 2014-06-19 Advanced Micro Devices, Inc. Dynamic re-configuration for low power in a data processor

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5283874A (en) * 1991-10-21 1994-02-01 Intel Corporation Cross coupling mechanisms for simultaneously completing consecutive pipeline instructions even if they begin to process at the same microprocessor of the issue fee
GB2263565B (en) * 1992-01-23 1995-08-30 Intel Corp Microprocessor with apparatus for parallel execution of instructions
US5416913A (en) * 1992-07-27 1995-05-16 Intel Corporation Method and apparatus for dependency checking in a multi-pipelined microprocessor
US5692167A (en) * 1992-07-31 1997-11-25 Intel Corporation Method for verifying the correct processing of pipelined instructions including branch instructions and self-modifying code in a microprocessor
US5630083A (en) * 1994-03-01 1997-05-13 Intel Corporation Decoder for decoding multiple instructions in parallel
US5574937A (en) * 1995-01-30 1996-11-12 Intel Corporation Method and apparatus for improving instruction tracing operations in a computer system
US5787026A (en) * 1995-12-20 1998-07-28 Intel Corporation Method and apparatus for providing memory access in a processor pipeline
US5881279A (en) * 1996-11-25 1999-03-09 Intel Corporation Method and apparatus for handling invalid opcode faults via execution of an event-signaling micro-operation
JP3805314B2 (en) * 2003-02-27 2006-08-02 Necエレクトロニクス株式会社 Processor
US20040205322A1 (en) * 2003-04-10 2004-10-14 Shelor Charles F. Low-power decode circuitry for a processor
US7958335B2 (en) * 2005-08-05 2011-06-07 Arm Limited Multiple instruction set decoding
GB2484489A (en) * 2010-10-12 2012-04-18 Advanced Risc Mach Ltd Instruction decoder using an instruction set identifier to determine the decode rules to use.


Also Published As

Publication number Publication date
US20200034149A1 (en) 2020-01-30
US11150906B2 (en) 2021-10-19
US10437596B2 (en) 2019-10-08
US20160147538A1 (en) 2016-05-26

Similar Documents

Publication Publication Date Title
US9495159B2 (en) Two level re-order buffer
US20160055004A1 (en) Method and apparatus for non-speculative fetch and execution of control-dependent blocks
US8386754B2 (en) Renaming wide register source operand with plural short register source operands for select instructions to detect dependency fast with existing mechanism
CN107077321B (en) Instruction and logic to perform fused single cycle increment-compare-jump
KR100900364B1 (en) System and method for reducing write traffic in processors
US20120060016A1 (en) Vector Loads from Scattered Memory Locations
US7565513B2 (en) Processor with power saving reconfigurable floating point unit decoding an instruction to single full bit operation or multiple reduced bit operations
US9459871B2 (en) System of improved loop detection and execution
US20130151822A1 (en) Efficient Enqueuing of Values in SIMD Engines with Permute Unit
US20180173534A1 (en) Branch Predictor with Branch Resolution Code Injection
US11188342B2 (en) Apparatus and method for speculative conditional move operation
KR101723711B1 (en) Converting conditional short forward branches to computationally equivalent predicated instructions
US20110314259A1 (en) Operating a stack of information in an information handling system
US20220035635A1 (en) Processor with multiple execution pipelines
US20160283247A1 (en) Apparatuses and methods to selectively execute a commit instruction
US20150277910A1 (en) Method and apparatus for executing instructions using a predicate register
US20180067731A1 (en) Apparatus and method for efficient call/return emulation using a dual return stack buffer
US20120191956A1 (en) Processor having increased performance and energy saving via operand remapping
US11451241B2 (en) Setting values of portions of registers based on bit values
US20220027162A1 (en) Retire queue compression
US20220091852A1 (en) Instruction Set Architecture and Microarchitecture for Early Pipeline Re-steering Using Load Address Prediction to Mitigate Branch Misprediction Penalties
US10963253B2 (en) Varying micro-operation composition based on estimated value of predicate value for predicated vector instruction
US20230401067A1 (en) Concurrently fetching instructions for multiple decode clusters
US20230195456A1 (en) System, apparatus and method for throttling fusion of micro-operations in a processor
Lozano et al. A deeply embedded processor for smart devices

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION