CN107832083B

CN107832083B - Microprocessor with conditional instruction and processing method thereof

Info

Publication number: CN107832083B
Application number: CN201711069237.5A
Authority: CN
Inventors: G. Glenn Henry; Terry Parks; Rodney E. Hooker
Original assignee: Via Technologies Inc
Current assignee: Via Technologies Inc
Priority date: 2011-04-07
Filing date: 2012-04-09
Publication date: 2020-06-12
Anticipated expiration: 2032-04-09
Also published as: CN103218203A; CN103218203B; CN102707988B; CN107832083A; CN102707988A

Abstract

A microprocessor has an instruction set architecture that defines an instruction having an immediate field with a first portion specifying a first value and a second portion specifying a second value. The instruction directs the microprocessor to perform an operation with a fixed value as one of the source operands, the fixed value being obtained by rotating/shifting the first value by a number of bits based on the second value. The microprocessor includes: an instruction translator for translating the instruction into at least one immediate ALU micro-instruction, wherein the immediate ALU micro-instruction is encoded in an instruction encoding manner different from that defined by the instruction set architecture; and an execution pipeline that executes the microinstructions generated by the instruction translator to generate results defined by the instruction set architecture. The instruction translator, rather than the execution pipeline, generates the fixed value from the first and second values as a source operand of the immediate ALU micro-instruction for execution by the execution pipeline.

Description

Microprocessor with conditional instruction and processing method thereof

The present application is a divisional application of the application entitled "Microprocessor with conditional instruction and processing method thereof", application number 201610126292.2, filed on April 9, 2012, which is itself a divisional of the original application number 201210102141.5, filed on April 9, 2012.

Technical Field

The present invention relates to the field of microprocessors, and more particularly to microprocessors having conditional instructions in the instruction set.

Background

The x86 processor architecture, developed by Intel Corporation of Santa Clara, California, and the Advanced RISC Machines (ARM) architecture, developed by ARM Ltd., are both well known in the computing field. Many computer systems that use ARM or x86 processors have emerged, and demand for such systems is growing rapidly. Today, ARM architecture processing cores dominate the low-power, low-cost segments of the computer market, such as mobile phones, handheld electronic products, tablet computers, network routers and hubs, set-top boxes, and the like. For example, the main processing power of the Apple iPhone and iPad is provided by an ARM architecture processing core. x86 architecture processors, on the other hand, dominate the higher-cost segments that require high performance, such as laptops, desktops, and servers. However, as the performance of ARM cores increases and the power consumption and cost of some x86 processors improve, the line between these low-cost and high-cost segments is blurring. In the mobile computing market, such as smartphones, the two architectures have already begun to compete fiercely. In the laptop, desktop, and server markets, the two architectures are expected to compete with increasing frequency.

This competition makes it difficult for computing device manufacturers and consumers to decide which architecture will dominate the market and, more precisely, for which architecture software developers will develop more software. For example, some entities that purchase large numbers of computer systems on a monthly or yearly basis prefer, for cost-efficiency reasons such as volume discounts and simplified system maintenance, to purchase systems with the same configuration. However, the users within such large entities often have diverse computing needs for these identically configured systems. Specifically, some users may wish to run programs on an ARM architecture processor, others may wish to run programs on an x86 architecture processor, and some may even wish to run programs on both. In addition, new, unanticipated computing needs may arise that call for the other architecture. In these cases, part of the large entity's investment is wasted. In another example, a user has an important application that runs only on the x86 architecture and therefore purchases an x86 architecture computer system (or vice versa). A later version of the application, however, is developed for the ARM architecture and outperforms the original x86 version. The user would like to switch architectures to run the new version but, unfortunately, has already invested heavily in the architecture that is no longer preferred.
Similarly, a user who originally invested in applications that run only on the ARM architecture may later also want to use applications developed for the x86 architecture that are unavailable on the ARM architecture or superior to their ARM counterparts, and vice versa. Note that although a small entity or an individual invests less than a large entity, the proportion of the investment lost may be even higher. Similar investment losses can occur in various computing markets whenever there is a switch from the x86 architecture to the ARM architecture or from the ARM architecture to the x86 architecture. Finally, computing device manufacturers, such as OEMs, that invest significant resources in developing new products also face this architecture dilemma. If a manufacturer develops and manufactures a large number of products based on the x86 or ARM architecture and user demand then shifts suddenly, valuable research and development resources are wasted.

It would therefore benefit both manufacturers and consumers of computing devices to be able to preserve their investment regardless of which of the two architectures prevails. A solution is thus needed that allows system manufacturers to develop computing devices on which users can run both x86 architecture and ARM architecture programs.

There has long been a need for systems that can execute programs of more than one instruction set, mainly because customers have invested heavily in software that runs on old hardware whose instruction set is often incompatible with new hardware. For example, the IBM System/360 Model 30 included features compatible with the IBM 1401 system to ease the pain of migrating from the 1401 to the higher-performance, feature-enhanced System/360. The Model 30 included both a System/360 and a 1401 Read Only Storage (ROS) control, which allowed it to be used in 1401 mode if the auxiliary storage was loaded with the needed information beforehand. Furthermore, where software is developed in a high-level language, the developers of new hardware have little control over software compiled for the old hardware, and the software developers may lack the motivation to recompile (re-build) the source code for the new hardware, particularly where the software developer and the hardware developer are different entities. Silberman and Ebcioglu, in Computer, June 1993, No. 6, disclose a technique for improving the execution efficiency of existing complex instruction set (CISC) architectures (for example, IBM S/390) by running them on reduced instruction set (RISC), superscalar, and very long instruction word (VLIW) architectures (hereinafter referred to as native architectures). The disclosed system includes a native engine that executes native code and a migrant engine that executes object code, and it switches between the two types of code as needed, depending on how effectively the translation software has translated the object code into native code. See U.S. Pat. No. 7,047,394, issued May 16, 2006, to Van Dyke et al., which discloses a processor having an execution pipeline that executes program instructions of a native RISC instruction set (Tapestry) and that translates x86 program instructions into instructions of the native RISC instruction set using a combination of hardware translation and software translation. 
Nakada et al. propose a heterogeneous simultaneous multithreading (SMT) processor having an ARM architecture front-end pipeline for irregular software programs (e.g., operating systems) and a Fujitsu FR-V (VLIW) architecture front-end pipeline for multimedia applications, with an added VLIW queue that holds instructions from the front-end pipelines and feeds the FR-V VLIW back-end pipeline. See "OROCHI: A Multiple Instruction Set SMT Processor," Proceedings of the First International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'08), Lake Como, Italy, November 2008 (in conjunction with MICRO-41), Buchty and Weib, eds., Universitatsverlag Karlsruhe, ISBN 978-3-86644-298-6. That approach is proposed to reduce the total system footprint in heterogeneous system-on-chip (SoC) devices, such as the Texas Instruments OMAP application processors, which have an ARM processor core plus one or more co-processors (e.g., the TMS320, various digital signal processors, or various graphics processing units (GPUs)). These co-processors do not share instruction execution resources; they are essentially distinct processing cores integrated on the same chip.

Software translators, also known as software emulators, dynamic binary translators, and the like, are also used to run software programs on a processor whose architecture differs from that of the programs. Popular commercial examples are the Motorola 68K-to-PowerPC emulator that accompanied Apple Macintosh computers, which allowed 68K programs to run on a Macintosh with a PowerPC processor, and the later PowerPC-to-x86 emulator, which allowed PowerPC programs to run on a Macintosh with an x86 processor. Transmeta Corporation of Santa Clara, California, combined very long instruction word (VLIW) core hardware with a "pure software instruction translator (i.e., Code Morphing Software) to dynamically compile or emulate the x86 code sequence" in order to execute x86 code; see the 2011 Wikipedia entry for Transmeta, < http:// en. See also U.S. Pat. No. 5,832,205, issued Nov. 3, 1998, to Kelly et al. IBM's DAISY (Dynamically Architected Instruction Set from Yorktown) system includes a VLIW machine and dynamic binary software translation to provide 100% software-compatible emulation of old architectures. DAISY includes a Virtual Machine Monitor residing in ROM that parallelizes the old-architecture code and saves the resulting VLIW primitives in a portion of main memory not visible to the old architecture, in the hope of avoiding re-translation of the same old-architecture code fragments on subsequent executions. DAISY also includes fast compiler optimization algorithms to improve performance. QEMU is a machine emulator that includes a software dynamic translator. QEMU emulates a number of central processors (e.g., x86, PowerPC, ARM, and SPARC) on various host systems (e.g., x86, PowerPC, ARM, SPARC, Alpha, and MIPS).
See "QEMU, a Fast and Portable Dynamic Translator," Fabrice Bellard, USENIX Association, FREENIX Track: 2005 USENIX Annual Technical Conference. As its developer describes it, a "dynamic translator performs a runtime conversion of the target CPU instructions into the host instruction set. The resulting binary code is stored in a translation cache so that it can be reused. … QEMU is much simpler [than other dynamic translators] because it just concatenates pieces of machine code generated off line by the GNU C Compiler." See also the thesis "ARM Instruction Set Simulation on Multi-core x86 Hardware," Lee Wang Hao, University of Adelaide, June 19, 2009. Although software translation-based solutions may provide enough performance for some computing needs, they fall short of what many users require.

Static binary translation is another technique with the potential for high performance. However, static binary translation raises technical problems (e.g., self-modifying code and indirect branches whose target values are known only at run time) as well as commercial and legal barriers (e.g., the hardware developer may need to establish channels for distributing the new programs, and there is a potential risk of license or copyright infringement against the original program distributors).

The ARM instruction set architecture (ISA) features conditional instruction execution. As stated in the ARM Architecture Reference Manual at page A4-3: "Most ARM instructions can be executed conditionally. This means that they only have their normal effect on the programmers' model operation, memory and coprocessors if the N, Z, C and V flags in the APSR satisfy a condition specified in the instruction. If the flags do not satisfy this condition, the instruction acts as a NOP, that is, execution advances to the next instruction as normal, including any relevant checks for exceptions being taken, but has no other effect."
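The conditional execution behavior quoted above can be illustrated with a minimal Python sketch. It assumes the standard 4-bit ARM condition field encoding (EQ = 0b0000, NE = 0b0001, and so on); the names `Flags`, `condition_passed`, and `execute_conditional_add` are hypothetical and chosen only for this illustration, and the sketch models architectural behavior, not the hardware described in this document.

```python
from dataclasses import dataclass


@dataclass
class Flags:
    """The N, Z, C and V condition flags (hypothetical container)."""
    n: bool
    z: bool
    c: bool
    v: bool


def condition_passed(cond: int, f: Flags) -> bool:
    """Evaluate a 4-bit ARM condition field against the flags."""
    base = (cond >> 1) & 0b111
    if base == 0b000:      # EQ / NE
        result = f.z
    elif base == 0b001:    # CS / CC
        result = f.c
    elif base == 0b010:    # MI / PL
        result = f.n
    elif base == 0b011:    # VS / VC
        result = f.v
    elif base == 0b100:    # HI / LS
        result = f.c and not f.z
    elif base == 0b101:    # GE / LT
        result = f.n == f.v
    elif base == 0b110:    # GT / LE
        result = (f.n == f.v) and not f.z
    else:                  # AL (always)
        result = True
    # The low bit inverts the test, except for the 0b1111 encoding.
    if (cond & 1) and cond != 0b1111:
        result = not result
    return result


def execute_conditional_add(cond, flags, regs, rd, rn, rm):
    """Perform rd = rn + rm only if the condition holds; otherwise act as a NOP."""
    if condition_passed(cond, flags):
        regs[rd] = (regs[rn] + regs[rm]) & 0xFFFFFFFF
    return regs
```

For example, with the Z flag set, an ADDEQ (condition 0b0000) writes its destination register, while an ADDNE (condition 0b0001) leaves the register file unchanged.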

Conditional execution can reduce code size and can improve performance by reducing the number of branch instructions and, with them, the performance penalties incurred when branches are mispredicted. How to execute conditional instructions efficiently, especially at high microprocessor clock rates, is therefore a problem to be solved.

Disclosure of Invention

One embodiment of the present invention provides a microprocessor that executes conditional non-branch instructions. Each of the conditional non-branch instructions specifies a condition and instructs the microprocessor to perform an operation when the condition flags of the microprocessor satisfy the condition and not to perform the operation when the condition flags do not satisfy the condition. The microprocessor includes: a predictor for providing a prediction for a conditional non-branch instruction; an instruction translator for translating the conditional non-branch instruction into a no-op microinstruction having a condition code when the prediction predicts that the condition will not be satisfied, wherein the no-op microinstruction having the condition code performs no operation other than causing an execution unit to check the prediction, and for translating the conditional non-branch instruction into a single operable microinstruction having a condition code, which unconditionally performs the operation, when the prediction predicts that the condition will be satisfied, wherein the instruction translator translates instructions of x86 instruction set architecture (ISA) programs and Advanced RISC Machines (ARM) ISA programs into microinstructions defined by a microinstruction set of the microprocessor, and the microinstructions are encoded differently from the instructions defined by the instruction sets of the x86 ISA and the ARM ISA; and an execution pipeline including an instruction issue unit and a plurality of execution units, wherein the instruction issue unit issues the single operable microinstruction having the condition code to a selected one of the plurality of execution units, and the selected execution unit executes the single operable microinstruction having the condition code.

Another embodiment of the present invention provides a method for executing conditional non-branch instructions using a microprocessor. The microprocessor has an instruction translator that translates instructions of x86 instruction set architecture (ISA) programs and Advanced RISC Machines (ARM) ISA programs into microinstructions defined by the microinstruction set of the microprocessor, wherein the microinstructions are encoded differently from the instructions defined by the instruction sets of the x86 ISA and the ARM ISA. Each of the conditional non-branch instructions specifies a condition and instructs the microprocessor to perform an operation when the condition flags of the microprocessor satisfy the condition and not to perform the operation when the condition flags do not satisfy the condition. The method comprises the following steps: providing a prediction for a conditional non-branch instruction; translating the conditional non-branch instruction into a no-op microinstruction having a condition code when the prediction predicts that the condition will not be satisfied, wherein the no-op microinstruction having the condition code performs no operation other than causing an execution unit to check the prediction; translating the conditional non-branch instruction into a single operable microinstruction having a condition code, which unconditionally performs the operation, when the prediction predicts that the condition will be satisfied; and issuing, by an instruction issue unit, the single operable microinstruction having the condition code to a selected one of a plurality of execution units, which executes the single operable microinstruction having the condition code. The instruction issue unit and the selected execution unit are part of a hardware execution pipeline of the microprocessor.
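The prediction-guided translation described in these embodiments can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the names `ConditionalInstr`, `MicroOp`, `translate_predicted`, and `execute_uop`, and the micro-op kinds "NOP_CC" and "OP_CC", are hypothetical, and what the hardware does after detecting a misprediction is not described in this section and is left out of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Flags = Dict[str, bool]


@dataclass(frozen=True)
class ConditionalInstr:
    """A conditional non-branch instruction (hypothetical model)."""
    mnemonic: str
    condition: Callable[[Flags], bool]  # True when the flags satisfy the condition


@dataclass(frozen=True)
class MicroOp:
    """A microinstruction that carries the condition code of its source instruction."""
    kind: str  # "NOP_CC" or "OP_CC" (names assumed for this sketch)
    source: ConditionalInstr


def translate_predicted(instr: ConditionalInstr, predicted_satisfied: bool) -> MicroOp:
    """Translator step: emit a single micro-op chosen by the prediction."""
    kind = "OP_CC" if predicted_satisfied else "NOP_CC"
    return MicroOp(kind, instr)


def execute_uop(uop: MicroOp, flags: Flags) -> Tuple[bool, bool]:
    """Execution-unit step: return (operation_performed, mispredicted)."""
    # OP_CC performs the operation unconditionally; NOP_CC performs nothing.
    performed = uop.kind == "OP_CC"
    # Either kind carries the condition code so the prediction can be checked.
    actually_satisfied = uop.source.condition(flags)
    mispredicted = performed != actually_satisfied
    return performed, mispredicted
```

In this model, both micro-op kinds reach an execution unit, so the prediction is always checked against the actual flags regardless of which way the predictor went.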

Another embodiment of the present invention provides a computer program product encoded on a computer-readable storage medium, comprising computer-readable program code for specifying a microprocessor that executes conditional non-branch instructions. Each conditional non-branch instruction specifies a condition and directs the microprocessor to perform an operation when the condition flags of the microprocessor satisfy the condition and not to perform the operation when the condition flags do not satisfy the condition. The computer-readable program code includes first program code for specifying a predictor that provides a prediction for a conditional non-branch instruction. The computer-readable program code also includes second program code for specifying an instruction translator that translates the conditional non-branch instruction into a no-operation microinstruction when the prediction predicts that the condition will not be satisfied, and into a group of one or more microinstructions that unconditionally perform the operation when the prediction predicts that the condition will be satisfied. The computer-readable program code also includes third program code for specifying an execution pipeline that executes the no-operation microinstruction or the group of microinstructions provided by the instruction translator.

One embodiment of the present invention provides a microprocessor having an instruction set architecture. The instruction set architecture defines an instruction that includes an immediate field having a first portion specifying a first value and a second portion specifying a second value, the instruction directing the microprocessor to perform an operation with a fixed value as one of the source operands, the fixed value being obtained by rotating/shifting the first value by a number of bits based on the second value. The microprocessor includes: an instruction translator for translating the instruction into at least one immediate ALU micro-instruction, wherein the immediate ALU micro-instruction is encoded in a different instruction encoding than that defined by the instruction set architecture; and an execution pipeline that executes the micro instructions generated by the instruction translator to generate results defined by the instruction set architecture. The instruction translator, rather than the execution pipeline, generates the fixed value from the first and second values as a source operand of the immediate ALU micro-instruction for execution by the execution pipeline.

Another embodiment of the invention provides a method performed by a microprocessor having an instruction set architecture. The instruction set architecture defines an instruction that includes an immediate field having a first portion specifying a first value and a second portion specifying a second value, the instruction directing the microprocessor to perform an operation with a fixed value as one of the source operands, the fixed value being obtained by rotating/shifting the first value by a number of bits based on the second value. The method comprises the following steps: translating the instruction into at least one immediate ALU microinstruction encoded in an instruction encoding manner different from that defined by the instruction set architecture, wherein the translating is performed by an instruction translator of the microprocessor; and executing the microinstructions generated by the instruction translator to generate a result defined by the instruction set architecture, wherein the executing is performed by an execution pipeline of the microprocessor. The fixed value is generated from the first and second values by the instruction translator, rather than by the execution pipeline, as a source operand of the immediate ALU micro-instruction for execution by the execution pipeline.
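As a worked illustration of generating the fixed value at translation time, the Python sketch below assumes the immediate field follows the ARM A32 modified-immediate layout: a 12-bit field whose low 8 bits are the first value and whose high 4 bits are the second value, with the first value rotated right by twice the second value. The text above only says the rotate/shift amount is "based on" the second value, so this layout is an assumption, and the function names and the tuple format of the micro-instruction are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def ror32(value: int, amount: int) -> int:
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def expand_immediate(imm12: int) -> int:
    """Compute the fixed value from the two portions of the immediate field."""
    first_value = imm12 & 0xFF           # assumed 8-bit first portion
    second_value = (imm12 >> 8) & 0xF    # assumed 4-bit second portion
    return ror32(first_value, 2 * second_value)


def translate_to_immediate_alu(op: str, rd: str, rn: str, imm12: int) -> tuple:
    """Translator step: the micro-instruction already carries the 32-bit constant,
    so the execution pipeline never has to perform the rotation itself."""
    return ("ALU_IMM", op, rd, rn, expand_immediate(imm12))
```

Under these assumptions, an immediate field of 0x4FF (first value 0xFF, second value 4) yields the constant 0xFF000000 inside the emitted micro-instruction.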

Another embodiment of the present invention provides a microprocessor having an instruction set architecture. The instruction set architecture defines an instruction that includes an immediate field having a first portion specifying a first value and a second portion specifying a second value, the instruction directing the microprocessor to perform an operation with a fixed value as one of the source operands, the fixed value being obtained by rotating/shifting the first value by a number of bits based on the second value. The microprocessor includes: an instruction translator for translating the instruction into micro instructions; and an execution pipeline that executes the microinstructions generated by the instruction translator to generate a result defined by the instruction set architecture. When a value of the immediate field falls within a predetermined subset of values: the instruction translator translates the instruction into at least one immediate ALU micro-instruction; the instruction translator, rather than the execution pipeline, generates the fixed value according to the first and second values; and the execution pipeline executes the immediate ALU micro instruction using the fixed value generated by the instruction translator as one of the source operands. When the value of the immediate field does not fall within the predetermined subset of values: the instruction translator translates the instruction into at least first and second microinstructions; the execution pipeline, rather than the instruction translator, generates the fixed value by executing the first micro instruction; and the execution pipeline executes the second micro instruction using the fixed value generated by the execution of the first micro instruction as one of the source operands.

Another embodiment of the present invention provides a method performed by a microprocessor having an instruction set architecture. The instruction set architecture defines an instruction that includes an immediate field having a first portion specifying a first value and a second portion specifying a second value, the instruction directing the microprocessor to perform an operation with a fixed value as one of the source operands, the fixed value being obtained by rotating/shifting the first value by a number of bits based on the second value, the microprocessor also including an instruction translator and an execution pipeline. The method comprises the following steps: determining, by the instruction translator, whether a value of the immediate field falls within a predetermined subset of values; when the value of the immediate field falls within the predetermined subset of values: translating the instruction into at least an immediate ALU micro-instruction using the instruction translator; generating the fixed value according to the first and second values using the instruction translator instead of the execution pipeline; and executing the immediate ALU microinstruction using the fixed value generated by the instruction translator as one of the source operands using the execution pipeline. And wherein when the value of the immediate field does not fall within the predetermined subset of values: translating the instruction into at least first and second microinstructions using the instruction translator; generating the fixed value by executing the first micro instruction using the execution pipeline instead of the instruction translator; and executing the second micro instruction using the execution pipeline by using the fixed value generated by the first micro instruction execution as one of the source operands.
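The two translation paths described above can be sketched in Python as follows. The sketch makes several assumptions that are not stated in this section: the immediate field is taken to be 12 bits (8-bit first value, 4-bit second value, rotate right by twice the second value), and the "predetermined subset" is arbitrarily chosen as the immediates whose second value is zero. All function names, the micro-op tuples, and the temporary register name "tmp" are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def ror32(value: int, amount: int) -> int:
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def in_translator_subset(imm12: int) -> bool:
    """Hypothetical subset: immediates whose rotate portion is zero."""
    return ((imm12 >> 8) & 0xF) == 0


def translate_imm(op: str, rd: str, rn: str, imm12: int) -> list:
    """Emit one immediate ALU micro-op, or a ROR micro-op followed by an ALU micro-op."""
    first_value = imm12 & 0xFF
    amount = 2 * ((imm12 >> 8) & 0xF)
    if in_translator_subset(imm12):
        # The translator computes the fixed value itself.
        return [("ALU_IMM", op, rd, rn, ror32(first_value, amount))]
    # Otherwise the pipeline computes it by executing the first micro-op.
    return [("ROR", "tmp", first_value, amount), ("ALU", op, rd, rn, "tmp")]


def run_pipeline(uops: list, regs: dict) -> dict:
    """Tiny stand-in for the execution pipeline."""
    ops = {"add": lambda a, b: (a + b) & MASK32, "and": lambda a, b: a & b}
    for uop in uops:
        if uop[0] == "ROR":
            _, dst, value, amount = uop
            regs[dst] = ror32(value, amount)
        elif uop[0] == "ALU_IMM":
            _, op, rd, rn, const = uop
            regs[rd] = ops[op](regs[rn], const)
        else:
            _, op, rd, rn, src = uop
            regs[rd] = ops[op](regs[rn], regs[src])
    return regs
```

Both paths produce the same architectural result; the difference is only where the rotation is performed and how many micro-ops the pipeline has to execute.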

In another embodiment, a computer program product encoded on at least one computer-readable storage medium for use in a computing device is provided. The computer program product includes computer readable program code encoded on the medium for specifying a microprocessor. The microprocessor has an instruction set architecture that defines at least one instruction. The instruction includes an immediate field having a first portion specifying a first value and a second portion specifying a second value. The instruction directs the microprocessor to perform an operation with a fixed value as one of the source operands. The fixed value is obtained by rotating/shifting the first value by a number of bits based on the second value. The computer readable program code has first program code specifying an instruction translator for translating the at least one instruction into one or more micro instructions, wherein the micro instructions are encoded in an instruction encoding scheme different from that defined by the instruction set architecture. The computer readable program code also has second program code for specifying an execution pipeline for executing the microinstructions generated by the instruction translator to generate a result defined by the instruction set architecture. The instruction translator, rather than the execution pipeline, generates the fixed value from the first and second values as a source operand of at least one of the micro instructions for execution by the execution pipeline.

The advantages and spirit of the present invention can be further understood by the following detailed description of the invention and the accompanying drawings.

Drawings

FIG. 1 is a block diagram of an embodiment of a microprocessor according to the present invention that executes x86 instruction set architecture and ARM instruction set architecture machine language programs.

FIG. 2 is a block diagram illustrating the hardware instruction translator of FIG. 1 in greater detail.

FIG. 3 is a block diagram illustrating the instruction formatter of FIG. 2 in greater detail.

FIG. 4 is a block diagram detailing the execution pipeline of FIG. 1.

FIG. 5 is a block diagram illustrating the register file of FIG. 1 in greater detail.

FIGS. 6A and 6B are a flowchart illustrating the steps of operating the microprocessor of FIG. 1.

FIG. 7 is a block diagram of a dual core microprocessor of the present invention.

FIG. 8 is a block diagram of another embodiment of a microprocessor according to the present invention that executes x86 ISA and ARM ISA machine language programs.

FIG. 9 is a block diagram showing a portion of the microprocessor of FIG. 1 in greater detail.

FIGS. 10A and 10B are a flowchart illustrating the operation of the hardware instruction translator of FIG. 1 in translating conditional ALU instructions.

FIG. 11 is a flowchart illustrating the operation of the execution unit of FIG. 4 in executing a shift microinstruction.

FIGS. 12A and 12B are a flowchart illustrating the operation of the execution unit of FIG. 4 in executing a conditional ALU micro-instruction.

FIG. 13 is a flowchart illustrating the operation of the execution unit of FIG. 4 in executing a conditional move microinstruction.

FIGS. 14-20 are block diagrams illustrating the operation of the execution pipeline 112 of FIG. 1 to execute various forms of conditional ALU instructions translated in accordance with the translation operations of FIG. 10.

FIGS. 21A and 21B are a flowchart illustrating the operation of the hardware instruction translator of FIG. 1 in translating a conditional ALU instruction that specifies one of the source registers as being the same as the destination register.

FIGS. 22-28 are block diagrams illustrating the operation of the execution pipeline 112 of FIG. 1 to execute various forms of conditional ALU instructions translated in accordance with the translation operations of FIG. 21.

FIG. 29 is a block diagram of the microprocessor 100 according to the present invention that makes predictions for conditional non-branch instructions.

FIG. 30 is a block diagram illustrating one embodiment of the translation of a conditional ALU instruction by the instruction translator of FIG. 29.

FIGS. 31A and 31B are flow diagrams illustrating an embodiment of the present invention where the microprocessor of FIG. 29 executes a conditional ALU instruction of FIG. 30.

FIG. 32 is a block diagram of a microprocessor that processes modified immediate constants during translation in accordance with an embodiment of the present invention.

FIG. 33 is a block diagram illustrating an embodiment of the present invention for selectively translating an immediate operand instruction into a ROR micro-instruction and an ALU micro-instruction or into an immediate ALU micro-instruction.

FIGS. 34A and 34B are a flowchart illustrating one embodiment of the operation of the microprocessor 100 of FIG. 32 to execute an immediate operand instruction of FIG. 33 according to the present invention.

[Description of the reference numerals of the main elements]

Microprocessor (processing core) 100

Instruction cache 102

Hardware instruction translator 104

Register file 106

Memory subsystem 108

Execution pipeline 112

Instruction fetch unit and branch predictor 114

ARM program counter (PC) register 116

x86 instruction pointer (IP) register 118

Configuration register 122

ISA instruction 124

Microinstruction 126

Results 128

Instruction mode indicator 132

Fetch address 134

Environment mode indicator 136

Instruction formatter 202 Simple Instruction Translator (SIT)204

Complex Instruction Translator (CIT)206 multiplexer (mux) 212

x86 simple instruction translator 222 ARM simple instruction translator 224

Micro-program counter (micro-PC)232

Microcode read-only memory 234

Microprogrammer (microsequencer) 236

Instruction Indirect Register (IIR) 235

Micro translator 237

Formatted ISA instructions 242

Execute microinstructions (implementing microinstructions) 244

Executing microinstructions 246

Selection input 248

Microcode address 252

Read only memory address 254

ISA instruction information 255

Pre-decoder (pre-decoder) 302

Instruction Byte Queue (IBQ)304

Length decoders (length decoders) and ripple logic gates (ripple logic) 306

Multiplexer Queue (MQ)308

Multiplexer 312

Formatted Instruction Queue (FIQ)314

ARM instruction set state 322

Micro instruction queue 401

Register Allocation Table (RAT)402

Instruction scheduler 404

Reservation station 406

Instruction issue unit (instruction issue unit) 408

Integer/branch unit 412

Media unit (media unit) 414

Load/store unit 416

Floating point unit 418

Reorder buffer (ROB)422

Execution unit 424 ARM specific registers 502

x86 specific registers 504 shared registers 506

Dual core microprocessor 700 micro instruction cache 892

Condition flag register 926 multiplexer 922

Flag bus 928 condition flag values 928/924

ISA Condition flags 902 Condition Satisfied (SAT) bit 904

Pre-shift carry (PSC) bits 906 USE shift carry (USE) bits 908

Dynamic predictor 2932 predictor selector 2934

Static predictor 2936 dynamic prediction 2982

Prediction selection 2984 static prediction 2986

History update 2974 misprediction 2976

ALU microinstruction 3044 conditional move microinstruction 3046

Conditional ALU microinstruction with condition codes 3045

No operation micro instruction with condition code 3047

Opcode fields a202, a212, a222, a252, a272

Condition code field a204, a224, a254, a274

Fields a206, a216, a256 of source registers 1 and 2

Destination register fields a208, a218, a232, a258

Field a226 of Source register 1 field a228 of Source register 2

Immediate operand 3266 ROR microinstruction 3344

ALU microinstruction 3346 immediate ALU microinstruction 3348

Opcode fields b202, b212, b222, b232

Fields b204, b214, b234 of source register 1

Field b235 of source register 2

Destination register fields b206, b216, b226, b236

Immediate field b207 immed_8 field b208, b228

rotate_imm field b209, b229 immediate-32 field b218

Detailed Description

Definition of terms

An instruction set defines the mapping between a set of binary encoded values (i.e., machine language instructions) and the operations the microprocessor performs. Machine language programs are typically encoded in binary, although other number systems may be used; for example, the machine language programs of some early IBM computers were encoded in decimal, even though they were ultimately represented by physical signals whose high and low voltages were sensed as binary values. Examples of operations that machine language instructions may instruct the microprocessor to perform are: add the operand in register 1 to the operand in register 2 and write the result to register 3; subtract the immediate operand specified in the instruction from the operand at memory address 0x12345678 and write the result to register 5; shift the value in register 6 by the number of bits specified in register 7; branch to the instruction 36 bytes after this instruction if the zero flag is set; and load the value at memory address 0xABCD0000 into register 8. Thus, the instruction set defines the binary encoded value each machine language instruction must have to cause the microprocessor to perform the desired operation. It should be understood that the fact that the instruction set defines a mapping of binary values to microprocessor operations does not imply that a single binary value maps to a single microprocessor operation. In particular, in some instruction sets multiple binary values may map to the same microprocessor operation.
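The many-to-one nature of this mapping can be sketched in a few lines. The 8-bit encodings below belong to a made-up toy instruction set, not to the x86 or ARM ISA, and the names `TOY_INSTRUCTION_SET` and `decode` are assumptions introduced only for illustration.

```python
# Toy illustration: an instruction set as a mapping from binary encoded
# values to operations. The encodings are invented for this sketch and do
# not belong to any real ISA.
TOY_INSTRUCTION_SET = {
    0b0000_0001: "add",
    0b0000_0011: "add",             # a second encoding of the same operation
    0b0010_1001: "sub",
    0b1101_0011: "shift",
    0b0111_0100: "branch_if_zero",
    0b1000_1011: "load",
}


def decode(encoded_value: int) -> str:
    """Return the operation that a binary encoded value maps to."""
    try:
        return TOY_INSTRUCTION_SET[encoded_value]
    except KeyError:
        raise ValueError(f"undefined encoding: {encoded_value:#010b}") from None


# Two different binary values may map to the same operation.
assert decode(0b0000_0001) == decode(0b0000_0011) == "add"
```

The dictionary is deliberately not invertible: looking up the encoding for "add" has more than one valid answer, which is the point made in the paragraph above.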

An instruction set architecture (ISA), in the context of a family of microprocessors, comprises: (1) the instruction set; (2) the set of resources that the instructions of the instruction set can access (e.g., registers and modes for addressing memory); and (3) the set of exceptions the microprocessor generates in response to processing the instructions of the instruction set (e.g., divide by zero, page fault, memory protection violation). Because a programmer, such as an assembler or compiler writer, who wants to generate a machine language program to run on a microprocessor family needs a definition of its ISA, the manufacturer of the microprocessor family typically defines the ISA in a programmer's manual. For example, the Intel 64 and IA-32 Architectures Software Developer's Manual, published in March 2009, defines the ISA of the Intel 64 and IA-32 processor architecture. The manual consists of five volumes: Volume 1, Basic Architecture; Volume 2A, Instruction Set Reference, A-M; Volume 2B, Instruction Set Reference, N-Z; Volume 3A, System Programming Guide; and Volume 3B, System Programming Guide, Part 2. This manual is incorporated herein by reference. This processor architecture is commonly referred to as the x86 architecture and is referred to herein as x86, x86 ISA, x86 ISA family, x86 family, or similar terms. As another example, the ARM Architecture Reference Manual, ARM v7-A and ARM v7-R edition Errata markup, published in 2010, defines the ISA of the ARM processor architecture. This reference manual is likewise incorporated herein by reference. The ISA of the ARM processor architecture is referred to herein as ARM, ARM ISA, ARM ISA family, ARM family, or similar terms.
Other well-known ISA families include IBM System/360/370/390 and z/Architecture, DEC VAX, Motorola 68k, MIPS, SPARC, PowerPC, and DEC Alpha. The definition of an ISA covers a family of processors because, over the life of the family, the manufacturer may enhance the ISA of the original processor by adding new instructions to the instruction set and/or new registers to the architectural register set. For example, as the x86 ISA evolved, the Intel Pentium III processor family introduced a set of 128-bit XMM registers as part of the Streaming SIMD Extensions (SSE) instruction set. x86 ISA machine language programs have been developed that use the XMM registers to increase performance, although x86 ISA machine language programs exist that do not use them. In addition, other manufacturers have designed and manufactured microprocessors that run x86 ISA machine language programs. For example, Advanced Micro Devices (AMD) and VIA Technologies have added new features to the x86 ISA, such as the AMD 3DNow! single instruction multiple data (SIMD) vector processing instructions and the random number generator and advanced cryptography engine of the VIA PadLock security engine, each of which is used by some x86 ISA machine language programs but is not implemented in existing Intel microprocessors. As another example, the ARM ISA originally defined the ARM instruction set state, which has 4-byte instructions.
However, as the ARM ISA evolved, other instruction set states were added, such as the Thumb instruction set state, which has 2-byte instructions to increase code density, and the Jazelle instruction set state, which accelerates Java bytecode programs. ARM ISA machine language programs have been developed that use some or all of these other instruction set states, although ARM ISA machine language programs exist that do not use them.

An instruction set architecture (ISA) machine language program comprises a sequence of ISA instructions, i.e., a sequence of binary encoded values that the ISA instruction set maps to the sequence of operations the programmer wants the program to perform. Thus, an x86 ISA machine language program comprises a sequence of x86 ISA instructions, and an ARM ISA machine language program comprises a sequence of ARM ISA instructions. The machine language program instructions reside in memory and are fetched and performed by the microprocessor.

A hardware instruction translator comprises an arrangement of transistors that receives an ISA machine language instruction (e.g., an x86 ISA or ARM ISA machine language instruction) as input and, in response, outputs one or more microinstructions to the execution pipeline of the microprocessor. The results of the execution of the one or more microinstructions by the execution pipeline are the results defined by the ISA instruction. Thus, the collective execution of the microinstructions by the execution pipeline "implements" the ISA instruction. That is, by collectively executing the implementing microinstructions output by the hardware instruction translator, the execution pipeline performs the operation specified by the ISA instruction on the inputs specified by the ISA instruction to produce the result defined by the ISA instruction. Thus, the hardware instruction translator is said to "translate" the ISA instruction into one or more implementing microinstructions. The microprocessor described in the embodiments has a hardware instruction translator that translates x86 ISA instructions and ARM ISA instructions into microinstructions. It should be understood, however, that the hardware instruction translator does not necessarily translate the entire instruction set defined by the x86 programmer's manual or the ARM programmer's manual, but rather may translate only a subset of those instructions, just as most x86 ISA and ARM ISA processors support only a subset of the instructions defined by their respective programmer's manuals. Specifically, the subset of instructions defined by the x86 programmer's manual that the hardware instruction translator translates does not necessarily correspond to that of any existing x86 ISA processor, and the subset of instructions defined by the ARM programmer's manual that the hardware instruction translator translates does not necessarily correspond to that of any existing ARM ISA processor.

The execution pipeline is a sequence of stages. Each stage has hardware logic and a hardware register. The hardware register holds the output signal of the hardware logic and provides it to the next stage of the sequence in accordance with a clock signal of the microprocessor. The execution pipeline may comprise a plurality of such sequences of stages, i.e., multiple pipelines. The execution pipeline receives microinstructions as input signals and, in response, performs the operations specified by the microinstructions to output results. The operations specified by the microinstructions and performed by the hardware logic of the execution pipeline include, but are not limited to, arithmetic, logic, memory load/store, compare, test, and branch resolution, and the data formats on which the operations are performed include, but are not limited to, integer, floating point, character, binary coded decimal (BCD), and packed formats. The execution pipeline executes the microinstructions that implement ISA instructions (e.g., x86 and ARM) to produce the results defined by the ISA instructions. The execution pipeline is distinct from the hardware instruction translator. Specifically, the hardware instruction translator generates the implementing microinstructions, whereas the execution pipeline executes them but does not generate them.
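The stage-by-stage handoff on each clock can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the `Stage` class, the `clock` function, and the three arithmetic stage functions are hypothetical names and do not model the actual stages of the execution pipeline.

```python
from typing import Callable, List, Optional


class Stage:
    """One pipeline stage: hardware logic followed by a register."""

    def __init__(self, logic: Callable[[int], int]):
        self.logic = logic
        self.register: Optional[int] = None  # value latched at the last clock


def clock(stages: List[Stage], new_input: Optional[int]) -> Optional[int]:
    """Advance every stage by one clock and return the last stage's register."""
    # Update back to front so that each stage consumes the value its
    # predecessor latched on the previous clock, not the new one.
    for i in range(len(stages) - 1, -1, -1):
        upstream = new_input if i == 0 else stages[i - 1].register
        stages[i].register = None if upstream is None else stages[i].logic(upstream)
    return stages[-1].register


# Three hypothetical stages; None models an empty pipeline slot.
stages = [Stage(lambda v: v + 1), Stage(lambda v: v * 2), Stage(lambda v: v - 3)]
outputs = [clock(stages, x) for x in (10, 20, None, None)]
assert outputs == [None, None, 19, 39]
```

The first result appears only on the third clock because the value must pass through three registers, while the second result follows one clock later, which is the overlap that pipelining provides.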

An instruction cache is a random access memory device within a microprocessor into which the microprocessor places instructions of an ISA machine language program (e.g., x86 ISA and ARM ISA machine language instructions) that were recently fetched from system memory and performed by the microprocessor in the course of running the ISA machine language program. More specifically, the ISA defines an instruction address register that holds the memory address of the next ISA instruction to be performed (e.g., the instruction pointer (IP) in the x86 ISA and the program counter (PC) in the ARM ISA), and the microprocessor updates the contents of the instruction address register as it runs the machine language program in order to control the flow of the program. The ISA instructions are cached so that, the next time the program flow causes the instruction address register to hold the memory address of an ISA instruction present in the instruction cache, the instruction can be fetched quickly from the instruction cache based on the contents of the instruction address register rather than from system memory. In particular, an instruction cache is accessed based on the memory address held in the instruction address register (e.g., IP or PC) rather than exclusively based on a memory address specified by a load or store instruction. Thus, a dedicated data cache that holds ISA instructions as data (such as may be present in the hardware portion of a system that employs a software translator) and that is accessed exclusively based on a load/store address rather than on the value of an instruction address register is not an instruction cache as the term is used herein. On the other hand, a unified cache that caches both instructions and data, i.e., that is accessed based on the value of the instruction address register as well as on load/store addresses, rather than exclusively on load/store addresses, is included in the definition of an instruction cache used herein. In this specification, a load instruction is an instruction that reads data from memory into the microprocessor, and a store instruction is an instruction that writes data from the microprocessor to memory.

A microinstruction set is a collection of instructions (microinstructions) that can be executed by the execution pipelines of a microprocessor.

Description of the embodiments

The microprocessor disclosed in the embodiments of the present invention runs both x86 ISA and ARM ISA machine language programs by translating, in hardware, their x86 ISA and ARM ISA instructions into microinstructions that are executed directly by the execution pipeline of the microprocessor. The microinstructions are defined by a microinstruction set of the microarchitecture of the microprocessor that is distinct from both the x86 ISA and the ARM ISA. As the microprocessor runs x86 and ARM machine language programs, its hardware instruction translator translates the x86 and ARM instructions into microinstructions and provides them to the execution pipeline, which executes them to implement the x86 and ARM instructions. Because the implementing microinstructions are provided directly from the hardware instruction translator to the execution pipeline, unlike a system that employs a software translator, which must store native (host) instructions to memory before the execution pipeline executes them, the disclosed microprocessor has the potential to run x86 and ARM machine language programs faster.

FIG. 1 is a block diagram illustrating an embodiment of a microprocessor 100 according to the present invention that runs machine language programs of the x86 ISA and the ARM ISA. The microprocessor 100 has an instruction cache 102; a hardware instruction translator 104 that receives x86 ISA instructions and ARM ISA instructions 124 from the instruction cache 102 and translates them into microinstructions 126; an execution pipeline 112 that executes the microinstructions 126 received from the hardware instruction translator 104 to generate microinstruction results 128, which are forwarded as operands back to the execution pipeline 112; a register file 106 and a memory subsystem 108, each of which provides operands to the execution pipeline 112 and receives microinstruction results 128 from the execution pipeline 112; an instruction fetch unit and branch predictor 114 that provides a fetch address 134 to the instruction cache 102; an ARM ISA-defined program counter register 116 and an x86 ISA-defined instruction pointer register 118 that are updated based on the microinstruction results 128 and provide their contents to the instruction fetch unit and branch predictor 114; and configuration registers 122 that provide an instruction mode indicator 132 and an environment mode indicator 136 to the hardware instruction translator 104 and the instruction fetch unit and branch predictor 114, and that are updated based on the microinstruction results 128.

As the microprocessor 100 runs x86 ISA and ARM ISA machine language instructions, it fetches the instructions from system memory (not shown) according to the flow of the program. The microprocessor 100 caches the most recently fetched x86 ISA and ARM ISA machine language instructions in the instruction cache 102. The instruction fetch unit 114 generates a fetch address 134 from which to fetch a block of x86 or ARM instruction bytes from system memory. If the fetch address 134 hits in the instruction cache 102, the instruction cache 102 provides the block of x86 or ARM instruction bytes at the fetch address 134 to the hardware instruction translator 104; otherwise, the ISA instructions 124 are fetched from system memory. The instruction fetch unit 114 generates the fetch address 134 based on the values of the ARM program counter 116 and the x86 instruction pointer 118. Specifically, the instruction fetch unit 114 maintains a fetch address in a fetch address register. Each time the instruction fetch unit 114 fetches a new block of ISA instruction bytes, it updates the fetch address by the size of the block and continues sequentially in this manner until a control flow event occurs. Control flow events include the generation of an exception, a prediction by the branch predictor 114 that a taken branch is present in the fetched block, and an update of the ARM program counter 116 or the x86 instruction pointer 118 by the execution pipeline 112 in response to the execution of a taken branch instruction that the branch predictor 114 did not predict as taken. In response to a control flow event, the instruction fetch unit 114 updates the fetch address to the exception handler address, the predicted target address, or the executed target address, respectively. In one embodiment, the instruction cache 102 is a unified cache that caches both ISA instructions 124 and data. It is noted that in the unified cache embodiment, although the unified cache may be accessed based on a load/store address to read or write data, when the microprocessor 100 fetches ISA instructions 124 from the unified cache, the unified cache is accessed based on the values of the ARM program counter 116 and the x86 instruction pointer 118 rather than on a load/store address. The instruction cache 102 is a random access memory (RAM) device.
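The fetch address update rule described above can be sketched as follows. This is an illustrative model only: the block size of 16 bytes, the `ControlFlowEvent` class, and the function name `next_fetch_address` are assumptions and are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE = 16  # bytes per fetched block; an assumed value for illustration

EVENT_KINDS = ("exception", "predicted_taken", "executed_taken")


@dataclass
class ControlFlowEvent:
    kind: str    # one of EVENT_KINDS
    target: int  # handler address, predicted target, or executed target


def next_fetch_address(fetch_address: int,
                       event: Optional[ControlFlowEvent] = None) -> int:
    """Return the fetch address to use after fetching one block."""
    if event is None:
        # No control flow event: continue sequentially by the block size.
        return fetch_address + BLOCK_SIZE
    if event.kind not in EVENT_KINDS:
        raise ValueError(f"unknown control flow event: {event.kind}")
    # Any control flow event redirects fetching to its target address.
    return event.target


assert next_fetch_address(0x1000) == 0x1010
assert next_fetch_address(0x1010, ControlFlowEvent("predicted_taken", 0x2000)) == 0x2000
```

All three event kinds have the same effect on the fetch address register; they differ only in which unit supplies the target address.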

The instruction mode indicator 132 is state that indicates whether the microprocessor 100 is currently fetching, formatting/decoding, and translating x86 ISA or ARM ISA instructions 124 into microinstructions 126. In addition, the execution pipeline 112 and the memory subsystem 108 receive the instruction mode indicator 132, which affects the manner in which the microinstructions 126 are executed, although only for a relatively small subset of the microinstruction set. The x86 instruction pointer register 118 holds the memory address of the next x86 ISA instruction 124 to be performed, and the ARM program counter register 116 holds the memory address of the next ARM ISA instruction 124 to be performed. To control program flow, the microprocessor 100 updates the x86 instruction pointer register 118 and the ARM program counter register 116 as it performs x86 and ARM machine language programs, respectively, to the address of the next sequential instruction, to the target address of a branch instruction, or to an exception handler address. As the microprocessor 100 performs the instructions of x86 and ARM ISA machine language programs, it fetches the ISA instructions of the machine language programs from system memory and places them in the instruction cache 102, replacing less recently fetched and performed instructions. The instruction fetch unit 114 generates the fetch address 134 based on the value of the x86 instruction pointer register 118 or the ARM program counter register 116, depending on whether the instruction mode indicator 132 indicates that the microprocessor 100 is currently fetching ISA instructions 124 in x86 or ARM mode.
In one embodiment, the x86 instruction pointer register 118 and the ARM program counter register 116 are implemented as a shared hardware instruction address register that provides its contents to the instruction fetch unit and branch predictor 114 and that is updated by the execution pipeline 112 according to x86 or ARM semantics, depending on whether the instruction mode indicator 132 indicates x86 or ARM.

The environment mode indicator 136 is state that indicates whether the microprocessor 100 applies x86 ISA or ARM ISA semantics to various aspects of its execution environment, such as virtual memory, exceptions, cache control, and global execution-time protection. Thus, the instruction mode indicator 132 and the environment mode indicator 136 together create multiple execution modes. In the first mode, the instruction mode indicator 132 and the environment mode indicator 136 both indicate the x86 ISA, and the microprocessor 100 operates as a normal x86 ISA processor. In the second mode, the instruction mode indicator 132 and the environment mode indicator 136 both indicate the ARM ISA, and the microprocessor 100 operates as a normal ARM ISA processor. In the third mode, the instruction mode indicator 132 indicates the x86 ISA whereas the environment mode indicator 136 indicates the ARM ISA, which facilitates performing user-mode x86 machine language programs under the control of an ARM operating system or hypervisor. Conversely, in the fourth mode, the instruction mode indicator 132 indicates the ARM ISA whereas the environment mode indicator 136 indicates the x86 ISA, which facilitates performing user-mode ARM machine language programs under the control of an x86 operating system or hypervisor. The values of the instruction mode indicator 132 and the environment mode indicator 136 are initially determined at reset. In one embodiment, the initial values are encoded as microcode constants but may be modified by blowing configuration fuses and/or by a microcode patch. In another embodiment, the initial values are provided to the microprocessor 100 by an external input. In one embodiment, the environment mode indicator 136 can be changed only by a reset performed through a reset-to-ARM instruction 124 or a reset-to-x86 instruction 124 (see FIGS. 6A and 6B below); that is, the environment mode indicator 136 does not change during normal operation of the microprocessor 100 unless the microprocessor 100 is reset, whether by a normal reset or by a reset-to-x86 or reset-to-ARM instruction 124.
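The four combinations of the two indicators can be enumerated in a short sketch. The `ISA` enumeration and the `describe_mode` function are hypothetical names used only to tabulate the modes described above; they do not represent the encoding of the configuration registers 122.

```python
from enum import Enum


class ISA(Enum):
    X86 = "x86"
    ARM = "ARM"


def describe_mode(instruction_mode: ISA, environment_mode: ISA) -> str:
    """Summarize the execution mode selected by the two indicators."""
    if instruction_mode is environment_mode:
        return f"operates as a normal {instruction_mode.value} ISA processor"
    return (f"performs user-mode {instruction_mode.value} programs under an "
            f"{environment_mode.value} operating system or hypervisor")


# Print all four modes in the order: both x86, both ARM, then the mixed modes.
for instr, env in ((ISA.X86, ISA.X86), (ISA.ARM, ISA.ARM),
                   (ISA.X86, ISA.ARM), (ISA.ARM, ISA.X86)):
    print(f"instruction={instr.value:3} environment={env.value:3}: "
          f"{describe_mode(instr, env)}")
```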

The hardware instruction translator 104 receives the x86 ISA and ARM ISA machine language instructions 124 as input and, in response to each, provides as output one or more microinstructions 126 that implement the x86 or ARM ISA instruction 124. The collective execution of the one or more implementing microinstructions 126 by the execution pipeline 112 implements the x86 or ARM ISA instruction 124. That is, the collective execution performs the operation specified by the x86 or ARM ISA instruction 124 on the inputs specified by the x86 or ARM ISA instruction 124 to produce the result defined by the x86 or ARM ISA instruction 124. Thus, the hardware instruction translator 104 translates the x86 or ARM ISA instruction 124 into one or more implementing microinstructions 126. The hardware instruction translator 104 comprises a set of transistors arranged in a predetermined manner to translate the x86 ISA and ARM ISA machine language instructions 124 into the implementing microinstructions 126. The hardware instruction translator 104 also comprises Boolean logic gates that generate the implementing microinstructions 126 (such as those of the simple instruction translator 204 shown in FIG. 2). In one embodiment, the hardware instruction translator 104 further comprises a microcode ROM (e.g., element 234 of the complex instruction translator 206 of FIG. 2) that the hardware instruction translator 104 uses to generate implementing microinstructions 126 for complex ISA instructions 124, as described further in conjunction with FIG. 2. Preferably, the hardware instruction translator 104 need not be capable of translating the entire set of ISA instructions 124 defined by the x86 programmer's manual or the ARM programmer's manual, but rather a subset of those instructions.
Specifically, the subset of ISA instructions 124 defined by the x86 programmer's manual that the hardware instruction translator 104 translates does not necessarily correspond to that of any existing x86 ISA processor developed by Intel, and the subset of ISA instructions 124 defined by the ARM programmer's manual that the hardware instruction translator 104 translates does not necessarily correspond to that of any existing ISA processor developed by ARM Ltd. The one or more implementing microinstructions 126 that implement an x86 or ARM ISA instruction 124 may be provided by the hardware instruction translator 104 to the execution pipeline 112 all at once or as a sequence. An advantage of this embodiment is that the hardware instruction translator 104 provides the implementing microinstructions 126 directly to the execution pipeline 112 for execution without requiring them to be stored in a memory between the two. In the embodiment of the microprocessor 100 of FIG. 1, as the microprocessor 100 runs an x86 or ARM machine language program, the hardware instruction translator 104 translates the x86 or ARM machine language instruction 124 into one or more microinstructions 126 each time the microprocessor 100 performs that x86 or ARM instruction 124. However, the embodiment of FIG. 8 employs a microinstruction cache to potentially avoid the repeated translation that would otherwise occur each time the microprocessor 100 performs an x86 or ARM ISA instruction 124. An embodiment of the hardware instruction translator 104 is illustrated in more detail in FIG. 2.
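The idea that the collective execution of several microinstructions produces the result defined by a single ISA instruction can be sketched with a toy model. Everything in this sketch is hypothetical: the tuple-based instruction format, the microinstruction names `load`, `sub_imm`, and `add`, and the temporary register `t0` are assumptions for illustration and are not the microinstruction set of the microprocessor 100.

```python
from typing import Dict, List

MASK32 = 0xFFFFFFFF  # results are kept to 32 bits in this sketch


def translate(isa_instruction: tuple) -> List[tuple]:
    """Translate one toy ISA instruction into implementing microinstructions."""
    op = isa_instruction[0]
    if op == "add_reg":        # (op, dst, src1, src2)
        _, dst, src1, src2 = isa_instruction
        return [("add", dst, src1, src2)]
    if op == "sub_mem_imm":    # (op, dst, memory address, immediate)
        _, dst, address, immediate = isa_instruction
        # One ISA instruction becomes two microinstructions that
        # communicate through a non-architectural temporary register.
        return [("load", "t0", address),
                ("sub_imm", dst, "t0", immediate)]
    raise NotImplementedError(f"untranslated toy instruction: {op}")


def execute(microinstructions: List[tuple],
            regs: Dict[str, int], mem: Dict[int, int]) -> None:
    """Execute microinstructions in order, updating the register state."""
    for mi in microinstructions:
        if mi[0] == "add":
            regs[mi[1]] = (regs[mi[2]] + regs[mi[3]]) & MASK32
        elif mi[0] == "load":
            regs[mi[1]] = mem[mi[2]]
        elif mi[0] == "sub_imm":
            regs[mi[1]] = (regs[mi[2]] - mi[3]) & MASK32
        else:
            raise ValueError(f"unknown microinstruction: {mi[0]}")


regs = {"r1": 2, "r2": 3}
mem = {0x12345678: 100}
execute(translate(("add_reg", "r3", "r1", "r2")), regs, mem)
execute(translate(("sub_mem_imm", "r5", 0x12345678, 58)), regs, mem)
assert regs["r3"] == 5 and regs["r5"] == 42
```

Note that `execute` never sees the toy ISA instructions themselves, which mirrors the separation between the translator and the execution pipeline.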

The execution pipeline 112 executes the implementing microinstructions 126 provided by the hardware instruction translator 104. Broadly speaking, the execution pipeline 112 is a general-purpose high-speed microinstruction processor. Although some functions described herein are performed by the execution pipeline 112 with x86/ARM-specific knowledge, most x86/ARM-specific functions are performed by other portions of the microprocessor 100, such as the hardware instruction translator 104. In one embodiment, the execution pipeline 112 performs register renaming, superscalar issue, and out-of-order execution of the implementing microinstructions 126 received from the hardware instruction translator 104. The execution pipeline 112 is illustrated in more detail in FIG. 4.

The microarchitecture of the microprocessor 100 includes: (1) the microinstruction set; (2) the set of resources accessible by the microinstructions 126 of the microinstruction set, which is a superset of the resources of the x86 ISA and the ARM ISA; and (3) the set of micro-exceptions the microprocessor 100 is defined to generate in response to executing the microinstructions 126, which is a superset of the exceptions of the x86 ISA and the ARM ISA. The microarchitecture is distinct from the x86 ISA and the ARM ISA. Specifically, the microinstruction set differs from the x86 ISA and ARM ISA instruction sets in several respects. First, there is not a one-to-one correspondence between the operations that the microinstructions of the microinstruction set instruct the execution pipeline 112 to perform and the operations that the instructions of the x86 ISA and ARM ISA instruction sets instruct the microprocessor to perform. Although many of the operations are the same, some operations specified by the microinstruction set are not specified by the x86 ISA and/or ARM ISA instruction sets. Conversely, some operations specified by the x86 ISA and/or ARM ISA instruction sets are not specified by the microinstruction set. Second, the microinstructions of the microinstruction set are encoded differently from the instructions of the x86 ISA and ARM ISA instruction sets. That is, although many of the same operations (e.g., add, shift, load, return) are specified by both the microinstruction set and the x86 and ARM ISA instruction sets, there is not a one-to-one correspondence between the binary opcode value-to-operation mappings of the microinstruction set and those of the x86 or ARM ISA instruction sets. Where a binary opcode value-to-operation mapping of the microinstruction set is the same as one of the x86 or ARM ISA instruction set, this is generally a coincidence, and there is nevertheless no one-to-one correspondence between them.
Third, the bit fields of the microinstructions of the microinstruction set do not correspond one-to-one to the bit fields of the instructions of the x86 or ARM ISA instruction set.

In general, the microprocessor 100 can run x86 ISA and ARM ISA machine language programs. However, the execution pipeline 112 itself cannot execute x86 or ARM ISA machine language instructions; rather, it executes the implementing microinstructions 126 of the microinstruction set of the microarchitecture of the microprocessor 100, into which the x86 ISA and ARM ISA instructions are translated. Nevertheless, although the microarchitecture is distinct from the x86 ISA and the ARM ISA, other embodiments are contemplated in which the microinstruction set and other microarchitecture-specific resources are exposed to the user. In these embodiments, the microarchitecture effectively operates as a third ISA, in addition to the x86 ISA and the ARM ISA, whose machine language programs the microprocessor can run.

The following table (Table 1) describes some bit fields of the microinstructions 126 of the microinstruction set of the present invention microprocessor 100.

The following table (Table 2) describes some of the micro instructions of the micro instruction set of one embodiment of the microprocessor 100 of the present invention.

The microprocessor 100 also includes microarchitecture-specific resources, such as microarchitecture-specific general purpose registers, media registers, and segment registers (e.g., registers used for renaming or by microcode), control registers that are not visible to the x86 or ARM ISA, and a private random access memory (PRAM). In addition, the microarchitecture can generate exceptions, namely the aforementioned micro-exceptions. These exceptions are not seen or specified by the x86 or ARM ISA and typically cause a microinstruction 126 and its dependent microinstructions 126 to be re-executed (replayed). Examples include: a load miss, in which the execution pipeline 112 assumes that the load hits and replays the load microinstruction 126 if it misses; a translation lookaside buffer (TLB) miss, in which the microinstruction 126 is replayed after the page table walk and the TLB fill; a floating-point microinstruction 126 that receives a denormal operand that was speculated to be normal, in which the microinstruction 126 is replayed after the execution pipeline 112 normalizes the operand; and a load microinstruction 126 that has been executed, after which an earlier store microinstruction 126 with a colliding address is detected, requiring the load microinstruction 126 to be replayed. It should be understood that the bit fields listed in Table 1, the microinstructions listed in Table 2, and the microarchitecture-specific resources and microarchitecture-specific exceptions are merely examples that illustrate the microarchitecture of the present invention and are not exhaustive of all possible embodiments of the present invention.
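The load miss case can be sketched as a replay loop. This is a simplified illustration only: the dictionaries standing in for the cache and memory, and the function name `execute_load`, are assumptions, and the sketch ignores the replay of dependent microinstructions.

```python
from typing import Dict, Tuple


def execute_load(address: int, cache: Dict[int, int],
                 memory: Dict[int, int]) -> Tuple[int, int]:
    """Run a load microinstruction; return (value, number of executions)."""
    executions = 0
    while True:
        executions += 1
        if address in cache:
            # Hit: the assumption of a hit was correct and the load completes.
            return cache[address], executions
        # Miss: this models the micro-exception. The data is filled from
        # memory and the load microinstruction is replayed.
        cache[address] = memory[address]


memory = {0x100: 7}
cache: Dict[int, int] = {}
assert execute_load(0x100, cache, memory) == (7, 2)  # miss, then one replay
assert execute_load(0x100, cache, memory) == (7, 1)  # hit on the next access
```

The replay is invisible to the caller apart from the execution count, in the same way that micro-exceptions are not seen by the x86 or ARM ISA.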

The register file 106 includes hardware registers used by the microinstructions 126 to hold source and/or destination operands. The execution pipeline 112 writes its results 128 to the register file 106 and receives operands for microinstructions 126 from the register file 106. The hardware registers instantiate both the x86 ISA-defined and the ARM ISA-defined registers, and some of them are shared between the two ISAs. For example, in one embodiment, register file 106 instantiates fifteen 32-bit registers that are shared by the ARM ISA registers R0 through R14 and the x86 ISA registers EAX through R14D. Thus, if a first microinstruction 126 writes a value into the ARM R2 register, a subsequent second microinstruction 126 that reads the x86 register sharing the same hardware register receives the value written by the first microinstruction 126, and vice versa. This feature enables fast register-based communication between x86 ISA and ARM ISA machine language programs. For example, assume that an ARM machine language program running under an ARM machine language operating system changes the instruction mode 132 to x86 and transfers control to an x86 machine language routine to perform a particular function; this may be advantageous because the x86 ISA may support instructions that perform the function faster than the ARM ISA can. The ARM program can pass the required data to the x86 routine through the shared registers of register file 106. Conversely, the x86 routine can place its results in the shared registers of the register file 106, where the ARM program can see them after the x86 routine returns. 
Similarly, an x86 machine language program running under an x86 machine language operating system may change the instruction mode 132 to ARM and transfer control to an ARM machine language routine; the x86 program can pass the required data to the ARM routine through the shared registers of the register file 106, and the ARM routine can return its results through the shared registers of the register file 106, where the x86 program can see them after the ARM routine returns. Because the ARM R15 register is the ARM program counter register 116, which is instantiated separately, the sixteenth 32-bit register, which instantiates the x86 R15D register, is not shared with the ARM R15 register. Additionally, in one embodiment, the 32-bit segments of the sixteen 128-bit x86 XMM0 through XMM15 registers and of the sixteen 128-bit ARM Advanced SIMD ("Neon") registers are shared with the thirty-two 32-bit ARM VFPv3 floating-point registers. Register file 106 also instantiates the flag registers (i.e., the x86 EFLAGS register and the ARM condition flags register), as well as the various control and status registers defined by the x86 ISA and the ARM ISA, including the x86 Model Specific Registers (MSRs) and the reserved ARM coprocessor (8-15) registers. The register file 106 also instantiates non-architectural registers, such as non-architectural general purpose registers used for register renaming or by microcode 234, as well as non-architectural x86 MSRs and implementation-defined (vendor-specific) ARM coprocessor registers. Register file 106 is further illustrated in FIG. 5.
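
By way of illustration only, the register sharing described above may be modeled in software as two sets of architectural names that resolve to the same fifteen storage slots. The sketch below is not the disclosed hardware: the class name, the method names, and in particular the pairing of each ARM register with a specific x86 register (here simply the listed order R0 through R14 against EAX through R14D in x86 encoding order) are assumptions made for the example.

```python
# Hypothetical software model of a shared register file (not the patented hardware).
# ASSUMPTION: ARM R0..R14 pair with the x86 registers in x86 encoding order.
ARM_SHARED = [f"R{i}" for i in range(15)]                      # R0 .. R14
X86_SHARED = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"] + \
             [f"R{i}D" for i in range(8, 15)]                  # ... R8D .. R14D


class SharedRegisterFile:
    """Fifteen 32-bit slots reachable under either an ARM or an x86 name."""

    def __init__(self):
        self.regs = [0] * len(ARM_SHARED)
        self.index = {}
        for slot, (arm, x86) in enumerate(zip(ARM_SHARED, X86_SHARED)):
            self.index[("ARM", arm)] = slot
            self.index[("x86", x86)] = slot
        # ARM R15 (program counter) and x86 R15D are deliberately absent:
        # the text states that they are not shared.

    def write(self, isa, name, value):
        self.regs[self.index[(isa, name)]] = value & 0xFFFFFFFF  # keep 32 bits

    def read(self, isa, name):
        return self.regs[self.index[(isa, name)]]


rf = SharedRegisterFile()
rf.write("ARM", "R2", 0x1234)                       # written under the ARM name
value_seen_by_x86 = rf.read("x86", X86_SHARED[2])   # read under the paired x86 name
```

In this model a value written under one ISA's register name is immediately visible under the paired name of the other ISA, which is the property that allows an ARM program and an x86 routine to exchange data without going through memory.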

Memory subsystem 108 includes a cache memory hierarchy (in one embodiment, the level-1 instruction cache 102, a level-1 data cache, and a level-2 unified cache). The memory subsystem 108 also includes various memory request queues, such as load, store, fill, snoop, and write-combine buffers. The memory subsystem 108 also includes a Memory Management Unit (MMU) having Translation Lookaside Buffers (TLBs), preferably separate instruction and data TLBs. The memory subsystem 108 also includes a table walk engine for obtaining virtual-to-physical address translations in response to a TLB miss. Although instruction cache 102 and memory subsystem 108 are shown as separate entities in FIG. 1, instruction cache 102 is logically part of memory subsystem 108. The memory subsystem 108 is configured such that x86 and ARM machine language programs share a common memory space, which enables them to communicate with each other easily through memory.

Memory subsystem 108 is aware of the instruction mode 132 and the environment mode 136, which enables it to perform various operations in the context of the appropriate ISA. For example, the memory subsystem 108 performs certain memory access violation checks (e.g., limit violation checks) depending on whether the instruction mode indicator 132 indicates the x86 or the ARM ISA. As another example, in one embodiment, the memory subsystem 108 flushes the TLBs in response to a change in the environment mode indicator 136; however, it does not flush the TLBs in response to a change in the instruction mode indicator 132, which provides better performance in the third and fourth modes, in which one of the instruction mode indicator 132 and the environment mode indicator 136 indicates x86 and the other indicates ARM. As another example, in response to a TLB miss, the table walk engine performs a page table walk to populate the TLB using either the x86 page tables or the ARM page tables, depending on whether the environment mode indicator 136 indicates the x86 or the ARM ISA. As another example, if the environment mode indicator 136 indicates the x86 ISA, the memory subsystem 108 examines the architectural state of the x86 ISA control registers that affect the cache policies (e.g., the CR0 CD and NW bits); if the environment mode indicator 136 indicates the ARM ISA, it examines the architectural state of the corresponding ARM ISA control registers (e.g., the SCTLR I and C bits). As another example, if the environment mode indicator 136 indicates the x86 ISA, the memory subsystem 108 examines the architectural state of the x86 ISA control registers that affect memory management (e.g., the CR0 PG bit); if the environment mode indicator 136 indicates the ARM ISA, it examines the architectural state of the corresponding ARM ISA control register (e.g., the SCTLR M bit). 
As another example, if the environment mode indicator 136 indicates the x86 ISA, the memory subsystem 108 examines the architectural state of the x86 ISA control registers that affect alignment checking (e.g., the CR0 AM bit); if the environment mode indicator 136 indicates the ARM ISA, it examines the architectural state of the corresponding ARM ISA control registers (e.g., the SCTLR A bit). As another example, if the environment mode indicator 136 indicates the x86 ISA, the memory subsystem 108 (as well as the hardware instruction translator 104, for privileged instructions) examines the architectural state of the x86 ISA control registers that specify the Current Privilege Level (CPL); if the environment mode indicator 136 indicates the ARM ISA, it examines the architectural state of the corresponding ARM ISA control registers that indicate user or privileged mode. However, in one embodiment, the x86 ISA and the ARM ISA share control bits/registers of the microprocessor 100 that have similar functions, rather than the microprocessor 100 instantiating separate control bits/registers for each instruction set architecture.
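
For illustration, the mode-dependent behavior of the memory subsystem 108 described above may be sketched as follows. This is a simplified software model under stated assumptions, not the disclosed circuitry: the class and method names are hypothetical, the TLB is reduced to a dictionary, and only the paging-enable example (CR0 PG versus SCTLR M) is shown.

```python
class MemorySubsystemModel:
    """Hypothetical model: the TLB is flushed only when the environment mode changes."""

    def __init__(self, instruction_mode="x86", environment_mode="x86"):
        self.instruction_mode = instruction_mode      # models indicator 132
        self.environment_mode = environment_mode      # models indicator 136
        self.tlb = {}                                 # virtual page -> physical page
        self.flush_count = 0

    def set_instruction_mode(self, mode):
        # A change of instruction mode alone leaves the cached translations intact.
        self.instruction_mode = mode

    def set_environment_mode(self, mode):
        # A change of environment mode switches page-table formats, so flush.
        if mode != self.environment_mode:
            self.environment_mode = mode
            self.tlb.clear()
            self.flush_count += 1

    def paging_control_bit(self):
        # Which architectural control bit governs memory management in this mode.
        return "CR0.PG" if self.environment_mode == "x86" else "SCTLR.M"


mem = MemorySubsystemModel()
mem.tlb[0x1000] = 0x8000
mem.set_instruction_mode("ARM")       # translations survive
mem.set_environment_mode("ARM")       # translations are discarded
```

Keeping the TLB contents across instruction mode changes is what avoids repeated refills when a program alternates between x86 and ARM code within the same environment.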

Although configuration registers 122 are shown as separate from register file 106, configuration registers 122 may be understood as part of register file 106. Configuration registers 122 include a global configuration register that controls various aspects of the operation of microprocessor 100 with respect to the x86 ISA and the ARM ISA, such as enabling or disabling features. The global configuration register may be used to disable the ability of the microprocessor 100 to execute ARM ISA machine language programs, i.e., to make the microprocessor 100 an x86-only microprocessor 100, including disabling other related ARM-specific capabilities, such as the launch-x86 and reset-to-x86 instructions 124 and the implementation-defined coprocessor registers described herein. The global configuration register may also be used to disable the ability of the microprocessor 100 to execute x86 ISA machine language programs, i.e., to make the microprocessor 100 an ARM-only microprocessor 100, including disabling other related capabilities, such as the launch-ARM and reset-to-ARM instructions 124 and the new non-architectural MSRs described herein. In one embodiment, microprocessor 100 is manufactured with default configuration settings, such as hard-coded values in microcode 234, which the microcode 234 uses at initialization time to configure the microprocessor 100, namely by writing the configuration registers 122. However, some of the configuration registers 122 are set by hardware rather than by microcode 234. In addition, microprocessor 100 has a plurality of fuses that can be read by microcode 234. These fuses may be blown to modify the default configuration values. In one embodiment, microcode 234 reads the fuse values, performs an exclusive-OR operation on the default value and the fuse value, and writes the result to the configuration registers 122. 
In addition, the effect of the fuse value modification can be reversed by a microcode 234 patch. In the case where the microprocessor 100 is capable of executing both x86 and ARM programs, the global configuration register may be used to determine whether the microprocessor 100 (or a particular core 100 of a multi-core part of the processor, as shown in FIG. 7) boots as an x86 microprocessor or as an ARM microprocessor upon reset, or in response to an x86-style INIT, as shown in FIGS. 6A and 6B. The global configuration register also has bits that provide initial default values for certain architectural control registers, such as the ARM ISA SCTLR and CPACR registers. In the multi-core embodiment shown in FIG. 7, there is only one global configuration register, although each core may be configured individually, for example to boot as an x86 core or as an ARM core, i.e., with the instruction mode indicator 132 and the environment mode indicator 136 both set to x86 or both set to ARM. In addition, the launch-ARM instruction 124 and the launch-x86 instruction 124 may be used to switch dynamically between the x86 and ARM instruction modes 132. In one embodiment, the global configuration register is readable by an x86 RDMSR instruction to a new non-architectural MSR, and some of its control bits are writable by an x86 WRMSR instruction to that new non-architectural MSR. The global configuration register is also readable by an ARM MRC/MRRC instruction to an ARM coprocessor register that corresponds to the new non-architectural MSR, and some of its control bits are writable by an ARM MCR/MCRR instruction to that ARM coprocessor register.
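
As a minimal sketch of the fuse mechanism described above, the exclusive-OR of the default value and the fuse value may be written as follows. The bit patterns and the function name are made up for the example; the actual width and layout of the configuration registers 122 are not specified here.

```python
def initial_configuration(default_value: int, fuse_value: int) -> int:
    """XOR the hard-coded default with the fuse value, as the microcode is described as doing."""
    return default_value ^ fuse_value


DEFAULT = 0b1010_0011   # hypothetical default hard-coded in microcode
FUSES = 0b0000_0110     # hypothetical fuse pattern: a blown fuse reads as 1

config = initial_configuration(DEFAULT, FUSES)   # bits 1 and 2 are inverted
```

Because exclusive-OR is its own inverse, each blown fuse toggles exactly one default bit, and applying the same pattern a second time restores the default, which is consistent with the statement that a microcode patch can reverse the effect of the fuses.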

Configuration registers 122 also include various control registers that control various aspects of the operation of microprocessor 100. These control registers, which are not x86 or ARM architectural registers, are referred to herein as global control registers, non-ISA control registers, non-x86/ARM control registers, generic control registers, and similar terms. In one embodiment, these control registers are accessible via x86 RDMSR/WRMSR instructions to non-architectural Model Specific Registers (MSRs) and via ARM MCR/MRC (or MCRR/MRRC) instructions to new implementation-defined coprocessor registers. For example, microprocessor 100 includes non-x86/ARM control registers that determine fine-grained cache control, i.e., cache control that is finer-grained than the x86 ISA and ARM ISA control registers provide.

In one embodiment, the microprocessor 100 provides ARM ISA machine language programs with access to the x86 ISA MSRs via implementation-defined ARM ISA coprocessor registers that map directly to the corresponding x86 MSRs. The MSR address is specified in the ARM ISA R1 register. The data is read from or written to the ARM ISA register specified by the MRC/MRRC/MCR/MCRR instruction. In one embodiment, a subset of the MSRs are password protected, i.e., an instruction attempting to access such an MSR must supply a password. In this embodiment, the password is specified in the ARM R7:R6 registers. If the access would cause an x86 general protection fault, microprocessor 100 instead generates an ARM ISA undefined instruction abort mode (UND) exception. In one embodiment, ARM coprocessor 4 (address: 0, 7, 15, 0) is used to access the corresponding x86 MSRs.

Microprocessor 100 also includes an interrupt controller (not shown) coupled to execution pipeline 112. In one embodiment, the interrupt controller is an x86-style Advanced Programmable Interrupt Controller (APIC). The interrupt controller maps x86 ISA interrupt events to ARM ISA interrupt events. In one embodiment, x86 INTR maps to the ARM IRQ interrupt; x86 NMI maps to the ARM IRQ interrupt; x86 INIT causes an INIT-reset sequence, from which the microprocessor 100 starts in whichever ISA (x86 or ARM) it originally started in after a hardware reset; x86 SMI maps to the ARM FIQ interrupt; and x86 STPCLK, A20, Thermal, PREQ, and Relay are not mapped to ARM interrupts. ARM machine language programs can access the functions of the APIC through new implementation-defined ARM coprocessor registers. In one embodiment, the APIC register address is specified in the ARM R0 register, and the APIC register addresses are the same as the x86 addresses. In one embodiment, ARM coprocessor 6 is used for privileged mode functions typically performed by the operating system, and the address of ARM coprocessor 6 is 0, 7, nn, 0, where nn is 15 to access the APIC and nn is 12 through 14 to access the bus interface unit in order to perform 8-bit, 16-bit, and 32-bit input/output cycles on the processor bus. Microprocessor 100 also includes a bus interface unit (not shown), coupled to memory subsystem 108 and execution pipeline 112, that interfaces microprocessor 100 to a processor bus. In one embodiment, the processor bus conforms to the specification of one of the microprocessor buses of the Intel Pentium microprocessor family. 
ARM machine language programs can access the functions of the bus interface unit through new implementation-defined ARM coprocessor registers in order to generate I/O cycles on the processor bus, i.e., input and output bus transfers to a specified address in the I/O space, which are needed to communicate with the system chipset; for example, an ARM machine language program can generate an SMI acknowledgement special cycle or the I/O cycles associated with C-state transitions. In one embodiment, the I/O address is specified in the ARM R0 register. In one embodiment, microprocessor 100 has power management capabilities, such as the well-known P-state and C-state management. ARM machine language programs can perform power management through new implementation-defined ARM coprocessor registers. In one embodiment, microprocessor 100 includes an encryption unit (not shown) located in execution pipeline 112. In one embodiment, the encryption unit is substantially similar to the encryption unit of VIA microprocessors that include the PadLock security technology. ARM machine language programs can access the functions of the encryption unit, such as encryption instructions, through new implementation-defined ARM coprocessor registers. In one embodiment, ARM coprocessor 5 is used for user mode functions typically performed by user mode applications, such as those that use the features of the encryption unit.
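
The interrupt correspondence of the embodiment described above may be summarized, purely as an illustration, by a small lookup. The table and function names are hypothetical, only signals named in the text are listed, and the string returned for INIT is merely a label for the INIT-reset sequence rather than an ARM interrupt.

```python
from typing import Optional

# x86 interrupt events that correspond to an ARM interrupt event in the embodiment.
ARM_EVENT_FOR_X86 = {"INTR": "IRQ", "NMI": "IRQ", "SMI": "FIQ"}

# x86 events described as having no ARM counterpart (subset named in the text).
UNMAPPED_X86_EVENTS = {"STPCLK", "A20", "Thermal", "PREQ"}


def map_x86_interrupt(name: str) -> Optional[str]:
    """Return the ARM event for an x86 event, or None if there is no counterpart."""
    if name == "INIT":
        return "INIT-reset sequence"      # handled as a reset, not as an interrupt
    if name in UNMAPPED_X86_EVENTS:
        return None
    return ARM_EVENT_FOR_X86[name]        # raises KeyError for unknown names
```

Note that in this embodiment both INTR and NMI arrive as IRQ on the ARM side, so an ARM handler cannot tell them apart from the interrupt type alone.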

When the microprocessor 100 executes both x86 ISA and ARM ISA machine language programs, the hardware instruction translator 104 performs the hardware translation each time the microprocessor 100 executes an x86 or ARM ISA instruction 124. In contrast, a system that employs software translation may be able to improve its performance by reusing a translation in multiple instances rather than re-translating a previously translated machine language instruction. In addition, the embodiment of FIG. 8 uses a microinstruction cache to potentially avoid the re-translation that would otherwise occur each time the microprocessor executes an x86 or ARM ISA instruction 124. Each of these approaches may offer performance advantages, depending on the characteristics of the program and the environment in which it runs.

The branch predictor 114 caches history information about previously executed x86 and ARM branch instructions. Based on the previously cached history, the branch predictor 114 predicts the presence and the target addresses of x86 and ARM branch instructions 124 within a cache line as it is fetched from the instruction cache 102. In one embodiment, the cached history includes the memory address of the branch instruction 124, the branch target address, a direction indicator, the type of branch instruction, the start byte of the branch instruction within the cache line, and an indicator of whether the instruction spans multiple cache lines. In one embodiment, the branch predictor 114 is enhanced to predict the direction of ARM ISA conditional non-branch instructions, as described in U.S. Provisional Application No. 61/473,067, filed April 7, 2011, entitled "APPARATUS AND METHOD FOR USING BRANCH PREDICTION TO EFFICIENT DETECTION CONDITION NON-BRANCH INSTRUCTIONS". In one embodiment, the hardware instruction translator 104 also includes a static branch predictor that predicts the direction and the branch target address of x86 and ARM branch instructions based on the opcode, the condition code type, whether the branch is backward or forward, and the like.

The present invention also contemplates embodiments that implement various combinations of the features defined by the x86 ISA and the ARM ISA. For example, in one embodiment, the microprocessor 100 implements the ARM, Thumb, and Jazelle instruction set states, but provides only a trivial implementation of the Jazelle extension; microprocessor 100 also implements the following instruction set extensions: Thumb-2, VFPv3-D32, Advanced SIMD ("Neon"), multiprocessing, and VMSA; but does not implement the following extensions: the security extensions, the fast context switch extension, ARM debug features (however, ARM programs can access the x86 debug functions via ARM MCR/MRC instructions to new implementation-defined coprocessor registers), and performance monitoring counters (however, ARM programs can access the x86 performance counters via new implementation-defined coprocessor registers). As another example, in one embodiment, microprocessor 100 treats the ARM SETEND instruction as a no-operation instruction (NOP) and supports only the little-endian data format. As another example, in one embodiment, microprocessor 100 does not implement the x86 SSE 4.2 functions.

The present invention contemplates embodiments in which the microprocessor 100 is an improvement of a commercially available microprocessor, namely the VIA Nano™ microprocessor manufactured by VIA Technologies, Inc., of Taipei, Taiwan. The Nano microprocessor is capable of executing x86 ISA machine language programs, but is incapable of executing ARM ISA machine language programs. The Nano microprocessor includes a high-performance register-renaming, superscalar, out-of-order execution pipeline and a hardware translator that translates x86 ISA instructions into microinstructions for execution by the execution pipeline. The Nano hardware instruction translator may be improved as described herein to translate ARM ISA machine language instructions, in addition to x86 machine language instructions, into microinstructions executable by the execution pipeline. The improvements to the hardware instruction translator include improvements to both the simple instruction translator and the complex instruction translator (including the microcode). In addition, new microinstructions may be added to the microinstruction set to support the translation of ARM ISA machine language instructions into microinstructions, and the execution pipeline may be improved to execute the new microinstructions. In addition, the Nano register file and memory subsystem may be improved to support the ARM ISA, including the sharing of certain registers. The branch prediction units may also be improved to predict ARM branch instructions in addition to x86 branch instructions. An advantage of this embodiment is that, because the execution pipeline of the Nano microprocessor is already largely ISA-agnostic, only relatively modest modifications to it are needed to support ARM ISA instructions. 
Improvements to the execution pipeline include the way condition code flags are generated and used, the semantics used to update and report the instruction pointer register, the access privilege protection method, and various memory management related functions, such as access violation checks, paging and TLB use, and cache policies. The foregoing is illustrative, and not restrictive, of the present invention, and some of these features are described in further detail below. Finally, as mentioned above, some features defined by the x86 ISA and the ARM ISA, such as x86 SSE 4.2 and the ARM security extensions, fast context switch extension, debug, and performance counter features, may not be supported by the embodiments that improve the Nano microprocessor; some of these are described further below. In addition, improving the Nano processor to support ARM ISA machine language programs is an example of an embodiment that pools design, testing, and manufacturing resources to realize a single integrated circuit product capable of running both x86 and ARM machine language programs, which together cover the vast majority of existing machine language programs in the market, and which is thus in line with current market trends. Embodiments of the microprocessor 100 described herein may be configured essentially as an x86 microprocessor, as an ARM microprocessor, or as a microprocessor capable of executing both x86 ISA and ARM ISA machine language programs. The ability to execute both x86 ISA and ARM ISA machine language programs may be achieved by dynamically switching between the x86 and ARM instruction modes 132 on a single microprocessor 100 (or core 100 of FIG. 7), by configuring one or more cores 100 of a multi-core microprocessor 100 (as shown in FIG. 7) as ARM cores and one or more cores as x86 cores, or by both, i.e., by dynamically switching between the x86 and ARM instruction modes 132 on each core 100 of the multi-core microprocessor 100. In addition, ARM ISA cores have traditionally been designed as intellectual property cores that various third-party vendors incorporate into their applications, such as systems-on-chip and/or embedded applications. Consequently, the ARM ISA does not specify a standard processor bus for interfacing an ARM core to the rest of the system (e.g., a chipset or other peripheral devices). Advantageously, the Nano processor already includes a high-speed x86-style processor bus that interfaces to memory and peripheral devices, as well as a memory coherency structure, which, in conjunction with the microprocessor 100, support the execution of ARM ISA machine language programs in an x86 computer system environment.

Referring now to FIG. 2, a block diagram illustrating the hardware instruction translator 104 of FIG. 1 in greater detail is shown. The hardware instruction translator 104 comprises hardware, more specifically a collection of transistors. The hardware instruction translator 104 includes: an instruction formatter 202 that receives the instruction mode indicator 132 and the blocks of x86 ISA and ARM ISA instruction bytes 124 from the instruction cache 102 of FIG. 1 and outputs formatted x86 ISA and ARM ISA instructions 242; a Simple Instruction Translator (SIT) 204 that receives the instruction mode indicator 132 and the environment mode indicator 136 and outputs execute microinstructions 244 and a microcode address 252; a Complex Instruction Translator (CIT) 206 (also referred to as a microcode unit) that receives the microcode address 252 and the environment mode indicator 136 and provides execute microinstructions 246; and a multiplexer 212 that receives the microinstructions 244 from the simple instruction translator 204 on one input and the microinstructions 246 from the complex instruction translator 206 on another input, and provides the execute microinstructions 126 to the execution pipeline 112 of FIG. 1. The instruction formatter 202 is illustrated in more detail in FIG. 3. The simple instruction translator 204 includes an x86 simple instruction translator 222 and an ARM simple instruction translator 224. The complex instruction translator 206 includes a micro-program counter (micro-PC) 232 that receives the microcode address 252, a microcode ROM 234 that receives a ROM address 254 from the micro-program counter 232, a micro-sequencer 236 that updates the micro-program counter 232, an Instruction Indirection Register (IIR) 235, and a micro-translator 237 that generates the execute microinstructions 246 output by the complex instruction translator 206. 
Both the execute microinstructions 244 generated by the simple instruction translator 204 and the execute microinstructions 246 generated by the complex instruction translator 206 are microinstructions 126 of the microinstruction set of the microarchitecture of the microprocessor 100 and are directly executable by the execution pipeline 112.

The multiplexer 212 is controlled by a select input 248. Normally, the multiplexer 212 selects the microinstructions 244 from the simple instruction translator 204; however, when the simple instruction translator 204 encounters a complex x86 or ARM ISA instruction 242 and transfers control to the complex instruction translator 206, or encounters a trap, the simple instruction translator 204 controls the select input 248 to cause the multiplexer 212 to select the microinstructions 246 from the complex instruction translator 206. When the Register Alias Table (RAT) 402 (see FIG. 4) encounters a microinstruction 126 with a specific bit set to indicate that it is the last microinstruction 126 in the sequence implementing the complex ISA instruction 242, the RAT 402 controls the select input 248 to cause the multiplexer 212 to return to selecting the microinstructions 244 from the simple instruction translator 204. In addition, when the reorder buffer 422 (see FIG. 4) is ready to retire a microinstruction 126 whose status requires that microinstructions be selected from the complex instruction translator 206, for example because the microinstruction 126 has caused an exception condition, the reorder buffer 422 controls the select input 248 to cause the multiplexer 212 to select the microinstructions 246 from the complex instruction translator 206.
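
The control of the select input 248 described above can be viewed as a two-state machine. The following sketch is a behavioral illustration only; the class name and the event method names are hypothetical, and timing, pipelining, and the encoding of the last-microinstruction bit are not modeled.

```python
class TranslatorMuxModel:
    """Hypothetical behavioral model of the select input 248 of multiplexer 212."""

    SIT = "SIT"   # simple instruction translator 204 selected
    CIT = "CIT"   # complex instruction translator 206 selected

    def __init__(self):
        self.select = self.SIT            # the normal case

    def sit_hands_off(self):
        # The SIT met a complex ISA instruction or a trap.
        self.select = self.CIT

    def rat_sees_last_microinstruction(self):
        # The RAT 402 saw the bit marking the end of the microcode sequence.
        self.select = self.SIT

    def rob_retires_exception(self):
        # The reorder buffer 422 is retiring a microinstruction that caused an exception.
        self.select = self.CIT

    def output(self, sit_microinstruction, cit_microinstruction):
        if self.select == self.SIT:
            return sit_microinstruction
        return cit_microinstruction


mux = TranslatorMuxModel()
mux.sit_hands_off()
selected_during_microcode = mux.output("uop244", "uop246")
mux.rat_sees_last_microinstruction()
selected_after_microcode = mux.output("uop244", "uop246")
```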

The simple instruction translator 204 receives the ISA instructions 242 and decodes them as x86 ISA instructions when the instruction mode indicator 132 indicates x86, and as ARM ISA instructions when the instruction mode indicator 132 indicates ARM. The simple instruction translator 204 also determines whether each ISA instruction 242 is a simple or a complex ISA instruction. A simple ISA instruction 242 is one for which the simple instruction translator 204 can output all of the execute microinstructions 126 that implement it; that is, the complex instruction translator 206 does not provide any of the execute microinstructions 126 for a simple ISA instruction 124. In contrast, a complex ISA instruction 124 requires the complex instruction translator 206 to provide at least some, if not all, of the execute microinstructions 126. In one embodiment, for a subset of the ARM and x86 ISA instructions 124, the simple instruction translator 204 outputs microinstructions 244 that implement a part of the x86/ARM ISA instruction 124 and then transfers control to the complex instruction translator 206, which outputs the remaining microinstructions 246 that implement the x86/ARM ISA instruction 124. The multiplexer 212 is controlled to first provide the execute microinstructions 244 from the simple instruction translator 204 as the microinstructions 126 supplied to the execution pipeline 112, and then to provide the execute microinstructions 246 from the complex instruction translator 206 as the microinstructions 126 supplied to the execution pipeline 112. 
The simple instruction translator 204 knows the starting microcode ROM 234 addresses of the various microcode routines that the hardware instruction translator 104 uses to generate the execute microinstructions 126 for the various complex ISA instructions 124; when the simple instruction translator 204 decodes a complex ISA instruction 242, it provides the corresponding microcode routine address 252 to the micro-program counter 232 of the complex instruction translator 206. The simple instruction translator 204 outputs all of the microinstructions 244 needed to implement a relatively large percentage of the instructions 124 of the ARM and x86 ISA instruction sets, particularly the ISA instructions 124 most frequently executed by x86 ISA and ARM ISA machine language programs; only a relatively small number of instructions 124 require the complex instruction translator 206 to provide execute microinstructions 246. According to one embodiment, examples of x86 instructions implemented primarily by the complex instruction translator 206 are the RDMSR/WRMSR, CPUID, complex mathematical (e.g., FSQRT and transcendental), and IRET instructions; examples of ARM instructions implemented primarily by the complex instruction translator 206 are the MCR, MRC, MSR, MRS, SRS, and RFE instructions. The foregoing lists are not intended to limit the present invention; they merely indicate the types of ISA instructions that may be implemented by the complex instruction translator 206.

When the instruction mode indicator 132 indicates x86, the x86 simple instruction translator 222 decodes the x86 ISA instruction 242 and translates it into execute microinstructions 244; when the instruction mode indicator 132 indicates ARM, the ARM simple instruction translator 224 decodes the ARM ISA instruction 242 and translates it into execute microinstructions 244. In one embodiment, the simple instruction translator 204 is a block of Boolean logic gates that can be synthesized using well-known synthesis tools. In one embodiment, the x86 simple instruction translator 222 and the ARM simple instruction translator 224 are separate blocks of Boolean logic gates; however, in another embodiment, the x86 simple instruction translator 222 and the ARM simple instruction translator 224 are located in the same block of Boolean logic gates. In one embodiment, the simple instruction translator 204 translates up to three ISA instructions 242 and provides up to six execute microinstructions 244 to the execution pipeline 112 in a single clock cycle. In one embodiment, the simple instruction translator 204 includes three sub-translators (not shown), each of which translates a single formatted ISA instruction 242, wherein the first sub-translator is capable of translating formatted ISA instructions 242 that require no more than three execute microinstructions 126; the second sub-translator is capable of translating formatted ISA instructions 242 that require no more than two execute microinstructions 126; and the third sub-translator is capable of translating formatted ISA instructions 242 that require no more than one execute microinstruction 126. In one embodiment, the simple instruction translator 204 includes a hardware state machine that enables it to output, over multiple clock cycles, multiple microinstructions 244 that implement an ISA instruction 242.
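
To make the three-sub-translator arrangement concrete, the following sketch counts how many formatted ISA instructions could be translated in one clock cycle given the number of microinstructions each one needs. The slot capacities (three, two, one) come from the text; the in-order assignment of instructions to sub-translators and the rule that translation stops at the first instruction that does not fit its slot are assumptions made for the example, as is the function name.

```python
SUB_TRANSLATOR_CAPACITY = (3, 2, 1)   # max microinstructions per sub-translator


def translate_in_one_clock(microinstructions_needed):
    """Return (ISA instructions translated, microinstructions emitted) for one clock.

    microinstructions_needed lists, in program order, how many microinstructions
    each pending formatted ISA instruction requires.
    ASSUMPTION: instruction i goes to sub-translator i, and an instruction that
    exceeds its slot's capacity waits, together with all younger ones.
    """
    translated = 0
    emitted = 0
    for capacity, needed in zip(SUB_TRANSLATOR_CAPACITY, microinstructions_needed):
        if needed > capacity:
            break
        translated += 1
        emitted += needed
    return translated, emitted


best_case = translate_in_one_clock([3, 2, 1])   # all three slots are filled
```

The capacities sum to six, which matches the stated maximum of three ISA instructions and six microinstructions per clock cycle.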

In one embodiment, the simple instruction translator 204 performs a number of different exception checks based on the instruction mode indicator 132 and/or the context mode indicator 136. For example, if the instruction mode indicator 132 indicates x86 and the x86 simple instruction translator 222 decodes an ISA instruction 124 that is invalid for the x86 ISA, the simple instruction translator 204 generates an x86 invalid opcode exception; similarly, if the instruction mode indicator 132 indicates ARM and the ARM simple instruction translator 224 decodes an ISA instruction 124 that is invalid for the ARM ISA, the simple instruction translator 204 generates an ARM undefined instruction exception. In another embodiment, if the context mode indicator 136 indicates the x86 ISA, the simple instruction translator 204 detects whether each x86 ISA instruction 242 it encounters requires a particular privilege level and, if so, checks whether the Current Privilege Level (CPL) satisfies the privilege level required by the x86 ISA instruction 242 and generates an exception if it does not; similarly, if the context mode indicator 136 indicates the ARM ISA, the simple instruction translator 204 detects whether each formatted ARM ISA instruction 242 is a privileged mode instruction and, if so, checks whether the current mode is a privileged mode and generates an exception if the current mode is user mode. The complex instruction translator 206 performs similar functions for certain complex ISA instructions 242.
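The checks described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the dictionary field names, and the strings used for the two privilege exceptions are hypothetical, since the text only says that "an exception" is generated in those cases.

```python
def detect_exception(instr_mode, context_mode, instr, state):
    """Return the exception to raise for one decoded instruction, or None.

    instr: {"valid": bool, "required_cpl": int or None, "privileged": bool}
    state: {"cpl": int, "arm_user_mode": bool}
    All field names are hypothetical and chosen only for this sketch.
    """
    # Validity is checked against the ISA selected by the instruction mode.
    if not instr["valid"]:
        if instr_mode == "x86":
            return "x86 invalid opcode"
        return "ARM undefined instruction"

    # Privilege is checked against the ISA selected by the context mode.
    if context_mode == "x86":
        required = instr.get("required_cpl")
        # x86 privilege levels are numeric: 0 is most privileged, 3 is least.
        if required is not None and state["cpl"] > required:
            return "x86 privilege exception"  # placeholder name
    else:
        if instr.get("privileged") and state["arm_user_mode"]:
            return "ARM privilege exception"  # placeholder name
    return None
```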

The complex instruction translator 206 outputs a series of execute microinstructions 246 to the multiplexer 212. The microcode ROM 234 stores the ROM instructions 247 of the microcode routines. The microcode ROM 234 outputs a ROM instruction 247 in response to the address, held by the micro program counter 232, of the next ROM instruction 247 to be fetched from the microcode ROM 234. Generally, the micro program counter 232 receives its start value 252 from the simple instruction translator 204 in response to the simple instruction translator 204 decoding a complex ISA instruction 242. In other cases, such as in response to a reset or an exception, the micro program counter 232 receives the address of the reset microcode routine or of the appropriate microcode exception handler, respectively. The micro-sequencer 236 generally advances the micro program counter 232 by the size of a ROM instruction 247 to step sequentially through a microcode routine; alternatively, it updates the micro program counter 232 to a target address generated by the execution pipeline 112 in response to execution of a control-type microinstruction 126 (e.g., a branch instruction), thereby effecting branches to non-sequential addresses within the microcode ROM 234. The microcode ROM 234 is fabricated within the semiconductor die of the microprocessor 100.
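The three ways in which the micro program counter 232 obtains its next value can be summarized in a small sketch. The function and parameter names are hypothetical, and the priority between a start address and a branch target is an assumption made only so that the example is well defined.

```python
def next_micro_pc(micro_pc, rom_instr_size=1, start_address=None, branch_target=None):
    """Return the next micro program counter value.

    start_address: routine address supplied on a trap, a reset, or an exception.
    branch_target: address produced by executing a control-type microinstruction.
    Otherwise the counter advances sequentially by the ROM instruction size.
    """
    if start_address is not None:
        return start_address
    if branch_target is not None:
        return branch_target
    return micro_pc + rom_instr_size
```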

In addition to the microinstructions 244 used to implement simple ISA instructions 124 or portions of complex ISA instructions 124, the simple instruction translator 204 also generates ISA instruction information 255 that is written to the indirect instruction register 235. The ISA instruction information 255 stored in the indirect instruction register 235 includes information about the ISA instruction 124 being translated, such as information identifying the source and destination registers specified by the ISA instruction 124 and the form of the ISA instruction 124, e.g., whether the ISA instruction 124 operates on an operand in memory or in an architectural register 106 of the microprocessor 100. This enables the microcode routines to be generic, i.e., a different microcode routine is not needed for each different source and/or destination architectural register 106. In particular, the simple instruction translator 204 knows the contents of the register file 106, including which registers are shared registers 506, and can translate the register information provided within the x86 ISA and ARM ISA instructions 124 to the appropriate registers within the register file 106 through the use of the ISA instruction information 255. The ISA instruction information 255 includes a shift field, an immediate field, a constant field, renaming information for each source operand and for the microinstruction 126 itself, information indicating the first and last microinstruction 126 in the series of microinstructions 126 used to implement the ISA instruction 124, and other bits that store useful information gathered by the hardware instruction translator 104 when translating the ISA instruction 124.

The micro-translator 237 receives the ROM instructions 247 from the microcode ROM 234 and the contents of the indirect instruction register 235, and generates the execute microinstructions 246 accordingly. The micro-translator 237 translates a given ROM instruction 247 into different sequences of microinstructions 246 depending on the information received from the indirect instruction register 235, such as the form of the ISA instruction 124 and the combination of source and/or destination architectural registers 106 it specifies. In some embodiments, much of the ISA instruction information 255 is merged with the ROM instruction 247 to produce the execute microinstructions 246. In one embodiment, each ROM instruction 247 is approximately 40 bits wide and each microinstruction 246 is approximately 200 bits wide. In one embodiment, the micro-translator 237 is capable of generating up to three microinstructions 246 from a single ROM instruction 247. The micro-translator 237 includes a plurality of Boolean logic gates that generate the execute microinstructions 246.

One advantage of using the micro-translator 237 is that, because the simple instruction translator 204 itself generates the ISA instruction information 255, the microcode ROM 234 does not need to store the ISA instruction information 255 provided by the indirect instruction register 235, which reduces the size of the microcode ROM 234. Furthermore, because the microcode ROM 234 does not need to provide a separate routine for each different ISA instruction form and for each combination of source and/or destination architectural registers 106, the routines in the microcode ROM 234 may contain fewer conditional branch instructions. For example, if the complex ISA instruction 124 is in memory form, the simple instruction translator 204 generates a prolog of microinstructions 244 that includes microinstructions 244 to load the source operand from memory into a temporary register 106, and the micro-translator 237 generates a microinstruction 246 to store the result from the temporary register 106 to memory; whereas if the complex ISA instruction 124 is in register form, the prolog moves the source operand from the source register specified by the ISA instruction 124 to the temporary register 106, and the micro-translator 237 generates a microinstruction 246 to move the result from the temporary register 106 to the architectural destination register 106 specified by the indirect instruction register 235. In one embodiment, the micro-translator 237 is similar in many respects to the micro-translator described in U.S. patent application No. 12/766,244, filed on April 23, 2010, which is incorporated by reference herein. However, the micro-translator 237 is modified to translate ARM ISA instructions 124 in addition to x86 ISA instructions 124.
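The memory-form versus register-form example above can be sketched as follows. This is an illustration only: the tuple encoding of microinstructions, the ROM instruction name `WRITE_RESULT`, the temporary register name `tmp`, and the dictionary standing in for the indirect instruction register 235 are all hypothetical.

```python
def simple_translator_prolog(iir):
    """Prolog emitted by the simple instruction translator (hypothetical encoding)."""
    if iir["form"] == "memory":
        # Memory form: load the source operand from memory into a temporary.
        return [("load", "tmp", iir["mem_addr"])]
    # Register form: move the source operand from the architectural register.
    return [("move", "tmp", iir["src_reg"])]


def micro_translate(rom_instr, iir):
    """Expand one generic ROM instruction using the indirect-register contents."""
    if rom_instr == "WRITE_RESULT":
        if iir["form"] == "memory":
            return [("store", iir["mem_addr"], "tmp")]
        return [("move", iir["dst_reg"], "tmp")]
    # Any other ROM instruction operates on the temporary register only.
    return [(rom_instr.lower(), "tmp", "tmp")]
```

In this sketch a single generic routine, for example `["SQRT", "WRITE_RESULT"]`, serves both instruction forms, because the form-specific details come from the register contents rather than from the routine itself.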

It is noted that the micro program counter 232 is different from the ARM program counter 116 and the x86 instruction pointer 118, i.e., the micro program counter 232 does not hold the address of the ISA instruction 124, nor does the micro program counter 232 hold an address that falls within the system memory address space. Furthermore, it is noted that the microinstructions 246 are generated by the hardware instruction translator 104 and provided directly to the execution pipeline 112 for execution, rather than as the execution results 128 of the execution pipeline 112.

Referring to FIG. 3, a block diagram illustrating the instruction formatter 202 of FIG. 2 is shown. The instruction formatter 202 receives blocks of x86 ISA and ARM ISA instruction bytes 124 from the instruction cache 102 of FIG. 1. Because x86 ISA instructions are variable in length, an x86 instruction 124 may begin at any byte within a block of instruction bytes 124. The task of determining the length and location of an x86 ISA instruction within a cache block is further complicated by the fact that the x86 ISA allows prefix bytes and that the instruction length may be affected by the current default address size and operand size. In addition, depending on the current ARM instruction set state 322 and the opcode of the ARM ISA instruction 124, ARM ISA instructions are either 2 bytes or 4 bytes long, and are accordingly either 2-byte aligned or 4-byte aligned. Thus, the instruction formatter 202 extracts the individual x86 ISA and ARM ISA instructions from the stream of instruction bytes 124 formed by the blocks received from the instruction cache 102. That is, the instruction formatter 202 formats the stream of x86 ISA and ARM ISA instruction bytes, thereby greatly simplifying the otherwise difficult task of decoding and translating the ISA instructions 124 that is performed by the simple instruction translator 204 of FIG. 2.

The instruction formatter 202 includes a pre-decoder 302 that pre-decodes the instruction bytes 124 as x86 instruction bytes to generate pre-decode information when the instruction mode pointer 132 indicates x86, and pre-decoder 302 pre-decodes the instruction bytes 124 as ARM instruction bytes to generate pre-decode information when the instruction mode pointer 132 indicates ARM. An Instruction Byte Queue (IBQ)304 receives blocks of ISA instruction bytes 124 and associated pre-decode information generated by the pre-decoder 302.

An array of length decoders and ripple logic 306 receives the contents of the bottom entry of the instruction byte queue 304, i.e., a block of ISA instruction bytes 124 and the associated pre-decode information. The length decoders and ripple logic 306 also receive the instruction mode indicator 132 and the ARM ISA instruction set state 322. In one embodiment, the ARM ISA instruction set state 322 includes the J and T bits of the ARM ISA CPSR register. In response to these inputs, the length decoders and ripple logic 306 generate decode information. The decode information includes the lengths of the x86 and ARM instructions within the block of ISA instruction bytes 124, the x86 prefix information, and an indicator for each ISA instruction byte 124 that indicates whether the byte is the start byte, the end byte, and/or a valid byte of an ISA instruction 124. A multiplexer queue 308 receives the block of ISA instruction bytes 124, the associated pre-decode information generated by the pre-decoder 302, and the associated decode information generated by the length decoders and ripple logic 306.

Control logic (not shown) examines the contents of the bottom entry of the multiplexer queue (MQ) 308 and controls the multiplexer 312 to extract the individual, or formatted, ISA instructions and the associated pre-decode and decode information, which are provided to a formatted instruction queue (FIQ) 314. The formatted instruction queue 314 buffers the formatted ISA instructions 242 and the related information that are provided to the simple instruction translator 204 of FIG. 2. In one embodiment, the multiplexer 312 extracts up to three formatted ISA instructions and their associated information per clock cycle.

In one embodiment, the instruction formatter 202 is similar in many respects to the XIBQ, instruction formatter, and FIQ disclosed in co-pending U.S. patent application Nos. 12/571,997, 12/572,002, 12/572,045, 12/572,024, 12/572,052, and 12/572,058, filed on October 1, 2009, which are incorporated herein by reference. However, the XIBQ, instruction formatter, and FIQ disclosed in those applications are modified to format ARM ISA instructions 124 in addition to x86 ISA instructions 124. The length decoder 306 is modified to decode ARM ISA instructions 124 and to generate their lengths and the start, end, and valid byte indicators. In particular, if the instruction mode indicator 132 indicates the ARM ISA, the length decoder 306 examines the current ARM instruction set state 322 and the opcode of the ARM ISA instruction 124 to determine whether the ARM instruction 124 is 2 bytes or 4 bytes long. In one embodiment, the length decoder 306 comprises a plurality of independent length decoders that generate the length data of the x86 ISA instructions 124 and the length data of the ARM ISA instructions 124, respectively, and the outputs of the independent length decoders are wire-ORed together to provide an output to the ripple logic 306. In one embodiment, the formatted instruction queue 314 comprises separate queues that hold separate portions of the formatted instructions 242. In one embodiment, the instruction formatter 202 provides the simple instruction translator 204 with up to three formatted ISA instructions 242 in a single clock cycle.
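To illustrate how the instruction set state and the opcode together determine the length of an ARM ISA instruction, the following sketch applies the published ARM encoding rule for the Thumb instruction set: a first halfword whose top five bits are 0b11101, 0b11110, or 0b11111 begins a 4-byte instruction, and any other first halfword is a complete 2-byte instruction. The function name is hypothetical, only the J = 0 states are modelled, and nothing here is asserted about how the length decoder 306 is actually built.

```python
def arm_instruction_length(j_bit, t_bit, first_halfword):
    """Return the length in bytes of an ARM ISA instruction.

    j_bit, t_bit: the J and T bits of the CPSR (instruction set state).
    first_halfword: the first 16 bits of the instruction (used in Thumb state).
    """
    if j_bit:
        # The J = 1 states are deliberately left out of this sketch.
        raise NotImplementedError("J = 1 states are not modelled here")
    if not t_bit:
        return 4  # ARM state: every instruction is 4 bytes long
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2
```

For example, in Thumb state the halfword 0x4770 (`BX LR`) yields a length of 2, whereas 0xF000 (the first halfword of a `BL`) yields a length of 4.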

Referring now to FIG. 4, a block diagram illustrating the execution pipeline 112 of FIG. 1 in greater detail is shown. The execution pipeline 112 is coupled to the hardware instruction translator 104 to receive execute microinstructions directly from the hardware instruction translator 104 of FIG. 2. The execution pipeline 112 includes a micro instruction queue 401 that receives the microinstructions 126; a register allocation table 402 that receives microinstructions from the micro instruction queue 401; an instruction scheduler 404 coupled to the register allocation table 402; a plurality of reservation stations 406 coupled to the instruction scheduler 404; an instruction issue unit 408 coupled to the reservation stations 406; a reorder buffer 422 coupled to the register allocation table 402, the instruction scheduler 404, and the reservation stations 406; and execution units 424 coupled to the reservation stations 406, the instruction issue unit 408, and the reorder buffer 422. The register allocation table 402 and the execution units 424 receive the instruction mode indicator 132.

The micro instruction queue 401 acts as a buffer when the hardware instruction translator 104 generates the execute microinstructions 126 at a different rate than the execution pipeline 112 executes them. In one embodiment, the micro instruction queue 401 includes an M-to-N compressible micro instruction queue. The compressible micro instruction queue enables the execution pipeline 112 to receive up to M (in one embodiment, M is six) microinstructions 126 from the hardware instruction translator 104 in a given clock cycle and to store them in a queue structure of width N (in one embodiment, N is three), so as to provide up to N microinstructions 126 per clock cycle to the register allocation table 402, which is capable of processing up to N microinstructions 126 per clock cycle. The micro instruction queue 401 is compressible in that it fills the empty entries of the queue sequentially with the microinstructions 126 issued by the hardware instruction translator 104, regardless of the particular clock cycle in which the microinstructions 126 are received, and thus leaves no holes among the queue entries. This approach advantageously enables high utilization of the execution units 424 (see FIG. 4) and offers advantages over a non-compressible queue of width N or of width M. Specifically, a non-compressible queue of width N may require the hardware instruction translator 104, and more particularly the simple instruction translator 204, to re-translate in a subsequent clock cycle one or more ISA instructions 124 that were already translated in a previous clock cycle, because a non-compressible queue of width N cannot receive more than N microinstructions 126 in the same clock cycle; this repeated translation wastes power. A non-compressible queue of width M, although it does not require repeated translation by the simple instruction translator 204, leaves holes among the queue entries and is therefore wasteful, requiring more rows of entries and thus a larger and more power-hungry queue to provide comparable buffering capacity.
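The behaviour of an M-to-N compressible queue can be sketched in a few lines. This is a simplified software model under stated assumptions: the class and method names are hypothetical, the queue is treated as unbounded in depth, and back-pressure toward the translator is not modelled.

```python
from collections import deque


class CompressibleQueue:
    """Accepts up to m microinstructions per clock and delivers up to n per clock."""

    def __init__(self, m=6, n=3):
        self.m = m
        self.n = n
        self.entries = deque()

    def push(self, uops):
        """Store the microinstructions received in one clock, packed densely."""
        if len(uops) > self.m:
            raise ValueError("more than M microinstructions in one clock")
        # Entries are appended back to back, so no holes are left behind
        # regardless of how many microinstructions arrived in each clock.
        self.entries.extend(uops)

    def pop(self):
        """Deliver up to n microinstructions to the register allocation table."""
        count = min(self.n, len(self.entries))
        return [self.entries.popleft() for _ in range(count)]
```

If four microinstructions arrive in one clock and one more arrives in the next, the model delivers three and then two, so that microinstructions received in different clocks share one delivery group.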

The register allocation table 402 receives the microinstructions 126 from the micro instruction queue 401 and generates dependency information regarding the microinstructions 126 in flight within the microprocessor 100. The register allocation table 402 also performs register renaming to increase the parallelism among the microinstructions, thereby taking advantage of the superscalar, out-of-order execution capability of the execution pipeline 112. If the ISA instructions 124 are x86 instructions, the register allocation table 402 generates the dependency information and performs the register renaming with respect to the x86 ISA registers 106 of the microprocessor 100; otherwise, if the ISA instructions 124 are ARM instructions, the register allocation table 402 generates the dependency information and performs the register renaming with respect to the ARM ISA registers 106 of the microprocessor 100; however, as previously described, some of the registers 106 may be shared by the x86 ISA and the ARM ISA. The register allocation table 402 also allocates an entry in the reorder buffer 422 for each microinstruction 126 in program order, so that the reorder buffer 422 can retire the microinstructions 126 and their associated x86 ISA and ARM ISA instructions 124 in program order even though the microinstructions 126 may execute out of order with respect to the x86 ISA and ARM ISA instructions 124 they implement. The reorder buffer 422 comprises a circular queue whose entries store information about the in-flight microinstructions 126, including, among other things, the execution status of the microinstruction 126, a tag identifying whether the microinstruction 126 was translated from an x86 ISA or an ARM ISA instruction 124, and storage for the result of the microinstruction 126.
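A minimal sketch of dependency generation and reorder buffer allocation follows. It is illustrative only: the class name, the representation of a dependency as the reorder buffer index of the newest producer, and the use of a plain list instead of a circular queue are assumptions.

```python
class RegisterAllocationTable:
    """Tracks the newest in-flight producer of each register (hypothetical model)."""

    def __init__(self):
        self.latest = {}  # register name -> reorder buffer index of newest producer
        self.rob = []     # reorder buffer entries, allocated in program order

    def rename(self, dst, srcs, isa):
        """Allocate a reorder buffer entry and return (index, dependencies).

        A dependency of None means the source value is already in the register
        file; otherwise it is the reorder buffer index of the producer.
        """
        deps = [self.latest.get(src) for src in srcs]
        index = len(self.rob)
        # Each entry carries its status and a tag naming the originating ISA.
        self.rob.append({"dst": dst, "isa": isa, "done": False, "result": None})
        if dst is not None:
            self.latest[dst] = index
        return index, deps
```

Because a shared register has a single entry in `latest`, a value produced by a microinstruction tagged as x86 can appear as a dependency of a later microinstruction tagged as ARM, which mirrors the register sharing described in the text.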

The instruction scheduler 404 receives the register-renamed microinstructions 126 and the dependency information from the register allocation table 402 and, based on the type of each instruction and the availability of the execution units 424, dispatches the microinstructions 126 and their dependency information to the reservation station 406 associated with the appropriate execution unit 424. The execution units 424 execute the microinstructions 126.

For each microinstruction 126 waiting in a reservation station 406, the instruction issue unit 408 issues the microinstruction 126 to the associated execution unit 424 for execution when it determines that the execution unit 424 is available and that the dependencies of the microinstruction 126 are satisfied (e.g., its source operands are available). As previously described, the microinstructions 126 issued by the instruction issue unit 408 may be executed out of program order and in a superscalar manner.

In one embodiment, the execution units 424 include an integer/branch unit 412, a media unit 414, a load/store unit 416, and a floating point unit 418. The execution units 424 execute the microinstructions 126 to generate results 128, which are provided to the reorder buffer 422. Although the execution units 424 are largely unaffected by whether the microinstructions 126 they execute were translated from x86 or ARM ISA instructions 124, they still use the instruction mode indicator 132 and the context mode indicator 136 to execute a relatively small subset of the microinstructions 126. For example, the execution pipeline 112 handles the generation of flags slightly differently depending on whether the instruction mode indicator 132 indicates the x86 ISA or the ARM ISA, and it updates either the x86 EFLAGS register or the ARM condition code flags in the Program Status Register (PSR) accordingly. In another example, the execution pipeline 112 samples the instruction mode indicator 132 to decide whether to update the x86 Instruction Pointer (IP) 118 or the ARM Program Counter (PC) 116, or to update a common instruction address register. In addition, the execution pipeline 112 may determine whether to use x86 or ARM semantics in performing the operations described above. Once a microinstruction 126 becomes the oldest completed microinstruction 126 in the microprocessor 100 (i.e., it is at the head of the reorder buffer 422 queue and is marked completed) and all other microinstructions 126 implementing the associated ISA instruction 124 have completed, the reorder buffer 422 retires the ISA instruction 124 and frees the entries associated with the microinstructions 126 that implement it. In one embodiment, the microprocessor 100 can retire up to three ISA instructions 124 per clock cycle.
Advantageously, the execution pipeline 112 is a high performance, general purpose execution engine that executes micro instructions 126 of the micro architecture of the microprocessor 100 that supports both the x86ISA and ARM ISA instructions 124.
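The in-order retirement rule described above can be illustrated with a short sketch. The function name and the entry fields are hypothetical, and the reorder buffer is modelled as a plain list in program order in which the microinstructions of one ISA instruction are contiguous.

```python
def retire(rob, max_isa_per_clock=3):
    """Retire completed ISA instructions from the head of the reorder buffer.

    rob: list of {"isa_id": int, "done": bool} in program order; entries that
    implement the same ISA instruction share an isa_id and are contiguous.
    Retired entries are removed from rob; the retired isa_ids are returned.
    """
    retired = []
    while rob and len(retired) < max_isa_per_clock:
        head_id = rob[0]["isa_id"]
        group = [entry for entry in rob if entry["isa_id"] == head_id]
        # An ISA instruction retires only when all of its microinstructions are done.
        if not all(entry["done"] for entry in group):
            break
        del rob[:len(group)]
        retired.append(head_id)
    return retired
```

A younger instruction that has already completed therefore waits behind an older, incomplete one, which is what keeps retirement in program order despite out-of-order execution.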

Referring to FIG. 5, a block diagram illustrating the register file 106 of FIG. 1 is shown. Preferably, the register file 106 is implemented as separate physical blocks of registers. In one embodiment, the general purpose registers are implemented in one physical register file having a plurality of read ports and write ports; the other registers may be physically located apart from the general purpose register file, close to the functional blocks that access them, and may have fewer read and write ports. In one embodiment, some of the non-general-purpose registers, particularly those that do not directly control the hardware of the microprocessor 100 but merely store values used by the microcode 234 (e.g., some x86 MSRs or ARM coprocessor registers), are implemented in a private random access memory (PRAM) accessible by the microcode 234. However, the PRAM is invisible to x86 ISA and ARM ISA programmers, i.e., it is not within the ISA system memory address space.

In general, as shown in FIG. 5, the register file 106 is logically divided into three categories, namely the ARM-specific registers 502, the x86-specific registers 504, and the shared registers 506. In one embodiment, the shared registers 506 include fifteen 32-bit registers shared by the ARM ISA registers R0-R14 and the x86 ISA registers EAX-R14D, and sixteen 128-bit registers shared by the x86 ISA XMM0-XMM15 registers and the ARM ISA Advanced SIMD (Neon) registers, a portion of which also overlap the thirty-two 32-bit ARM VFPv3 floating point registers. As described above with respect to FIG. 1, sharing the general purpose registers means that a value written to a shared register by an x86 ISA instruction 124 is seen by an ARM ISA instruction 124 that subsequently reads the shared register, and vice versa. This has the advantage of enabling x86 ISA and ARM ISA programs to communicate with each other through registers. In addition, as mentioned above, particular bits of the architectural control registers of the x86 ISA and ARM ISA may also be instantiated as shared registers 506. As described above, in one embodiment, the x86 model-specific registers (MSRs) are accessible by ARM ISA instructions 124 through implementation-defined coprocessor registers, and are thus shared by the x86 ISA and the ARM ISA. The shared registers 506 may also include non-architectural registers, such as non-architectural equivalents of the condition flags, that are likewise renamed by the register allocation table 402. The hardware instruction translator 104 knows which registers are shared by the x86 ISA and the ARM ISA, and therefore generates execute microinstructions 126 that access the correct registers.
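The effect of sharing the fifteen 32-bit general purpose registers can be sketched as a single array that is indexed by either ISA's register names. The pairing used here (ARM R0 with EAX, R1 with ECX, and so on, following the x86 register encoding order) is an assumption made for illustration; the text states only that R0-R14 and EAX-R14D are shared. The class and method names are likewise hypothetical.

```python
# Assumed pairing: x86 register encoding order against ARM register number.
X86_SHARED = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"] + [
    f"R{i}D" for i in range(8, 15)
]
ARM_SHARED = [f"R{i}" for i in range(15)]


class SharedRegisters:
    """Fifteen 32-bit registers visible under both x86 and ARM names."""

    def __init__(self):
        self.values = [0] * 15

    def _index(self, isa, name):
        names = X86_SHARED if isa == "x86" else ARM_SHARED
        return names.index(name)  # raises ValueError for a non-shared register

    def write(self, isa, name, value):
        self.values[self._index(isa, name)] = value & 0xFFFFFFFF

    def read(self, isa, name):
        return self.values[self._index(isa, name)]
```

A value written through one ISA's name is read back through the other's, which is the register-based communication between x86 and ARM programs that the paragraph above describes.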

The ARM-specific registers 502 include the other registers defined by the ARM ISA that are not included in the shared registers 506, and the x86-specific registers 504 include the other registers defined by the x86 ISA that are not included in the shared registers 506. For example, the ARM-specific registers 502 include the ARM program counter 116, CPSR, SCTRL, FPSCR, CPACR, the coprocessor registers, and the banked general purpose registers and saved program status registers (SPSRs) of the various exception modes. This list of ARM-specific registers 502 is not intended to limit the present invention, but is merely illustrative. Likewise, for example, the x86-specific registers 504 include the x86 instruction pointer (EIP or IP) 118, EFLAGS, R15D, the upper 32 bits of the 64-bit R0-R15 registers (i.e., the portion not included in the shared registers 506), the segment registers (SS, CS, DS, ES, FS, GS), the x87 FPU registers, the MMX registers, the control registers (e.g., CR0-CR3, CR8), and so forth. This list of x86-specific registers 504 is likewise not intended to limit the present invention, but is merely illustrative.

In one embodiment, the microprocessor 100 includes new implementation-defined ARM coprocessor registers that can be accessed, when the instruction mode indicator 132 indicates the ARM ISA, to perform x86 ISA-related operations. These operations include, but are not limited to: the ability to reset the microprocessor 100 to an x86 ISA processor (a reset-to-x86 instruction); the ability to initialize the microprocessor 100 to the x86-specific state, switch the instruction mode indicator 132 to x86, and begin fetching x86 instructions 124 at a specified x86 target address (a launch-x86 instruction); the ability to access the global configuration registers; and the ability to access x86-specific registers (e.g., EFLAGS), where the x86 register to be accessed is specified in the ARM R0 register, as well as access to power management (e.g., P-state and C-state transitions), processor bus functions (e.g., I/O cycles), the interrupt controller, and encryption acceleration functions. In addition, in one embodiment, the microprocessor 100 includes new x86 non-architectural model-specific registers (MSRs) that can be accessed, when the instruction mode indicator 132 indicates the x86 ISA, to perform ARM ISA-related operations. These operations include, but are not limited to: the ability to reset the microprocessor 100 to an ARM ISA processor (a reset-to-ARM instruction); the ability to initialize the microprocessor 100 to the ARM-specific state, switch the instruction mode indicator 132 to ARM, and begin fetching ARM instructions 124 at a specified ARM target address (a launch-ARM instruction); the ability to access the global configuration registers; and the ability to access ARM-specific registers (e.g., CPSR), where the ARM register to be accessed is specified in the EAX register.

Referring to FIGS. 6A and 6B, a flowchart illustrating a process for operating the microprocessor 100 of FIG. 1 is shown. The flow begins at step 602.

At step 602, the microprocessor 100 is reset. The reset may be signaled on a reset input of the microprocessor 100. In addition, in an embodiment in which the microprocessor bus is an x86-style processor bus, the reset may be signaled by an x86-style INIT command. In response to the reset, the reset routine of the microcode 234 is invoked. The reset microcode: (1) initializes the x86-specific state 504 to the default values specified by the x86 ISA; (2) initializes the ARM-specific state 502 to the default values specified by the ARM ISA; (3) initializes the non-ISA-specific state of the microprocessor 100 to the default values specified by the manufacturer of the microprocessor 100; (4) initializes the shared ISA state 506, such as the GPRs, to the default values specified by the x86 ISA; and (5) sets the instruction mode indicator 132 and the context mode indicator 136 to indicate the x86 ISA. In an alternative embodiment, instead of operations (4) and (5) above, the reset microcode initializes the shared ISA state 506 to the default values specified by the ARM ISA and sets the instruction mode indicator 132 and the context mode indicator 136 to indicate the ARM ISA. In that embodiment, the operations of steps 638 and 642 need not be performed, and before step 614 the reset microcode initializes the shared ISA state 506 to the default values specified by the x86 ISA and sets the instruction mode indicator 132 and the context mode indicator 136 to indicate the x86 ISA. Flow proceeds to step 604.

At step 604, the reset microcode determines whether the microprocessor 100 is configured to boot as an x86 processor or as an ARM processor. In one embodiment, as described above, the default ISA boot mode is hard-coded in the microcode, but may be modified by blowing configuration fuses or by a microcode patch. In another embodiment, the default ISA boot mode is provided as an external input to the microprocessor 100, such as an external input pin. Flow proceeds to step 606. At step 606, if the default ISA boot mode is x86, flow proceeds to step 614; otherwise, if the default ISA boot mode is ARM, flow proceeds to step 638.

At step 614, the reset microcode causes the microprocessor 100 to begin fetching x86 instructions 124 at the reset vector address specified by the x86 ISA. Flow proceeds to step 616.

At step 616, x86 system software (e.g., BIOS) configures the microprocessor 100 using, for example, x86 ISA RDMSR and WRMSR instructions 124. Flow proceeds to step 618.

At step 618, the x86 system software executes a reset-to-ARM instruction 124. The reset-to-ARM instruction causes the microprocessor 100 to reset and to leave the reset routine as an ARM processor. However, because the reset-to-ARM instruction 124 does not alter the x86-specific state 504 or the non-ISA-specific configuration state, the x86 system firmware can perform the initial configuration of the microprocessor 100 and the microprocessor 100 can then be rebooted as an ARM processor while the non-ARM configuration performed by the x86 system software remains intact. This makes it possible to boot an ARM operating system using a "small" micro boot code, without requiring the micro boot code to solve the complicated problem of how to configure the microprocessor 100. In one embodiment, the reset-to-ARM instruction is an x86 WRMSR instruction to a new non-architectural model-specific register. Flow proceeds to step 622.

At step 622, the simple instruction translator 204 traps to the reset microcode in response to the complex reset-to-ARM instruction 124. The reset microcode initializes the ARM-specific state 502 to the default values specified by the ARM ISA. However, the reset microcode does not modify the non-ISA-specific state of the microprocessor 100, which advantageously preserves the configuration performed at step 616. In addition, the reset microcode initializes the shared ISA state 506 to the default values specified by the ARM ISA. Finally, the reset microcode sets the instruction mode indicator 132 and the context mode indicator 136 to indicate the ARM ISA. Flow proceeds to step 624.

At step 624, the reset microcode causes the microprocessor 100 to begin fetching ARM instructions 124 at the address specified in the x86 ISA EDX:EAX registers. Flow ends at step 624.

At step 638, the reset microcode initializes the shared ISA state 506, such as the GPRs, to the default values specified by the ARM ISA. Flow proceeds to step 642.

At step 642, the reset microcode sets the instruction mode indicator 132 and the context mode indicator 136 to indicate the ARM ISA. Flow proceeds to step 644.

At step 644, the reset microcode causes the microprocessor 100 to begin fetching ARM instructions 124 at the reset vector address specified by the ARM ISA. The ARM ISA defines two reset vector addresses, between which an input selects. In one embodiment, the microprocessor 100 includes an external input to select between the two reset vector addresses defined by the ARM ISA. In another embodiment, the microcode 234 includes a default selection between the two reset vector addresses defined by the ARM ISA, which may be modified by blowing fuses and/or by a microcode patch. Flow proceeds to step 646.

In step 646, the ARM system software configures the microprocessor 100 using, for example, ARM ISA MCR and MRC instructions 124. Step 648 is next entered.

In step 648, the ARM system software executes a reset-to-x86 instruction 124, which causes the microprocessor 100 to reset and to come out of the reset as an x86 processor. However, because neither the ARM-specific state 502 nor the non-ISA-specific configuration state is altered by the reset-to-x86 instruction 126, the ARM system firmware can perform the initial configuration of the microprocessor 100 and then reboot the microprocessor 100 as an x86 processor, while keeping intact the non-x86 configuration of the microprocessor 100 performed by the ARM system software. Thus, "thin" micro boot code can boot an x86 operating system without having to deal with the complexity of configuring the microprocessor 100. In one embodiment, the reset-to-x86 instruction is an ARM MRC/MRCC instruction to a new implementation-defined coprocessor register. Step 652 is next entered.

In step 652, the simple instruction translator 204 traps to the reset microcode in response to the complex reset-to-x86 instruction 124. The reset microcode initializes the x86-specific state 504 to the default values specified by the x86 ISA. However, the reset microcode does not modify the non-ISA-specific state of the microprocessor 100, which advantageously preserves the configuration performed in step 646. In addition, the reset microcode initializes the shared ISA state 506 to the default values specified by the x86 ISA. Finally, the microcode sets the instruction mode pointer 132 and the context mode pointer 136 to indicate the x86 ISA. Step 654 is next entered.

At step 654, the reset microcode causes the microprocessor 100 to begin fetching x86 instructions 124 at the address specified by the ARM ISA R1:R0 registers. Flow ends at step 654.

Referring now to FIG. 7, a block diagram illustrating a dual-core microprocessor 700 according to the present invention is shown. The dual-core microprocessor 700 includes two processing cores 100, each of which includes the elements of the microprocessor 100 of FIG. 1, so that each core is capable of executing both x86 ISA and ARM ISA machine language programs. The cores 100 may be configured such that both cores 100 execute x86 ISA programs, both cores 100 execute ARM ISA programs, or one core 100 executes x86 ISA programs while the other core 100 executes ARM ISA programs. The mix of these three configurations may change dynamically during operation of the microprocessor 700. As described with respect to FIGS. 6A and 6B, each core 100 has a default value for its instruction mode pointer 132 and context mode pointer 136, which can be modified by fuses or by a microcode patch, so that each core 100 can independently come out of reset as an x86 or an ARM processor. Although the embodiment of FIG. 7 has only two cores 100, in other embodiments the microprocessor 700 may have more than two cores 100, each of which can execute both x86 ISA and ARM ISA machine language programs.

Referring now to FIG. 8, a block diagram illustrating a microprocessor 100 capable of executing x86 ISA and ARM ISA machine language programs in accordance with an alternate embodiment of the present invention is shown. The microprocessor 100 of FIG. 8 is similar to the microprocessor 100 of FIG. 1, and like reference numerals refer to like elements. However, the microprocessor 100 of FIG. 8 also includes a microinstruction cache 892, which caches microinstructions 126 generated by the hardware instruction translator 104 and provided directly to the execution pipeline 112. The microinstruction cache 892 is indexed by the fetch address 134 generated by the instruction fetch unit 114. If the fetch address 134 hits in the microinstruction cache 892, a multiplexer (not shown) within the execution pipeline 112 selects the microinstructions 126 from the microinstruction cache 892 rather than from the hardware instruction translator 104; otherwise, the multiplexer selects the microinstructions 126 provided directly by the hardware instruction translator 104. The operation of a microinstruction cache, also commonly referred to as a trace cache, is well known in the art of microprocessor design. An advantage of the microinstruction cache 892 is that fetching the microinstructions 126 from the microinstruction cache 892 generally takes less time than fetching the instructions 124 from the instruction cache 102 and translating them into the microinstructions 126 with the hardware instruction translator 104. In the embodiment of FIG. 8, when the microprocessor 100 executes x86 or ARM ISA machine language programs, the hardware instruction translator 104 need not perform hardware translation each time an x86 or ARM ISA instruction 124 is executed; that is, no hardware translation is required if the microinstructions 126 are already present in the microinstruction cache 892.
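The hit/miss selection described for FIG. 8 can be sketched as follows. The class and function names are hypothetical, the cache is modeled as a simple dictionary keyed by fetch address, and filling the cache on a miss is an assumed policy that the text does not spell out.

```python
class MicroinstructionCache:
    """Hypothetical stand-in for the microinstruction cache 892."""

    def __init__(self):
        self._lines = {}

    def lookup(self, fetch_address):
        # Returns the cached microinstructions, or None on a miss.
        return self._lines.get(fetch_address)

    def fill(self, fetch_address, microinstructions):
        self._lines[fetch_address] = list(microinstructions)


def select_microinstructions(fetch_address, ucache, translate):
    """Model of the multiplexer choice: cached microinstructions on a hit,
    freshly translated microinstructions on a miss.

    `translate` stands in for the hardware instruction translator 104.
    Returns (microinstructions, hit).
    """
    cached = ucache.lookup(fetch_address)
    if cached is not None:
        return cached, True
    translated = translate(fetch_address)
    ucache.fill(fetch_address, translated)  # assumed fill-on-miss policy
    return translated, False
```

On a second fetch of the same address the translator is not invoked at all, which is the time saving the paragraph above describes.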

An advantage of embodiments of the microprocessor described herein is that they are capable of executing both x86 ISA and ARM ISA machine language programs by means of a built-in hardware instruction translator that translates the x86 ISA and ARM ISA instructions into microinstructions of a microinstruction set that is different from the x86 ISA and ARM ISA instruction sets, the microinstructions being executable by a shared execution pipeline of the microprocessor. A further advantage is that, by using a largely ISA-independent execution pipeline to execute the microinstructions hardware-translated from the x86 ISA and ARM ISA instructions, the design and fabrication of the microprocessor require fewer resources than two independently designed microprocessors (i.e., one capable of executing x86 ISA machine language programs and one capable of executing ARM ISA machine language programs). In addition, embodiments of the microprocessor, particularly those employing a superscalar out-of-order execution pipeline, have the potential to provide higher performance than existing ARM ISA processors. These embodiments also offer the potential for higher performance in x86 and ARM execution than systems that employ software translators. Finally, because the microprocessor can execute both x86 ISA and ARM ISA machine language programs, it can advantageously be used to build a system that efficiently executes both x86 and ARM machine language programs.

Conditional arithmetic and logic unit (conditional ALU) instructions

It is desirable for a microprocessor to include in its instruction set the ability for instructions to be conditionally executed. A conditionally executed instruction specifies a condition (e.g., zero, negative, or greater than); the microprocessor executes the instruction if the condition flags satisfy the condition and does not execute it if they do not. As mentioned above, the ARM ISA provides this capability not only for branch instructions but also for a large portion of the instructions in its instruction set. A conditional instruction may specify source operands from general purpose registers to generate a result that is written to a general purpose register. U.S. Patent No. 7,647,480, assigned to ARM Limited of Cambridge, Great Britain, describes a data processing apparatus for processing conditional instructions. Generally, a pipelined processing unit executes a conditional instruction to generate a result data value. The result data value represents the result of the computation specified by the conditional instruction when the condition is satisfied, and represents the current data value stored in the destination register when the condition is not satisfied. Two possible solutions are described in the following paragraphs.

In the first solution, each conditional instruction in the instruction set is restricted so that the register it specifies is both a source register and the destination register. In this manner, a conditional instruction occupies only two read ports of the register file: one providing the current destination register value as a source operand and one providing the other source operand. The first solution thus reduces the minimum number of register file read ports required for the pipelined processing unit to execute conditional instructions.

The second solution removes the restriction of the first solution, so that a conditional instruction can specify separate destination and source registers. The second solution requires an additional read port on the register file in order to read, in a single cycle, all of the operand data values required by the conditional instruction (i.e., the source operands and the destination operand from the register file). The second solution not only costs an additional read port, but also requires a larger number of bits to specify the conditional instruction and a more complex data path. In particular, the data path must provide logic for three input paths from the register file, and may also need forwarding logic coupled to each of the three paths. For these reasons, U.S. Patent No. 7,647,480 adopts the first solution as its subject.

The embodiments presented herein are advantageous in that they enable conditional instructions to specify source operand registers that are different from the destination register, without requiring an additional read port on the register file. Generally, according to embodiments of the invention, the hardware instruction translator 104 of the microprocessor 100 of FIG. 1 translates a conditionally executed ISA instruction 124 into a sequence of one or more microinstructions 126 for execution by the execution pipeline 112. The execution unit 424 that executes the last microinstruction 126 of the sequence receives the original value of the destination register specified by the conditional instruction 124, together with a means of determining whether the condition is satisfied. A previous microinstruction 126, or the last microinstruction 126 itself, performs an operation on the source operands to generate a result. If the condition is not satisfied, the execution unit 424 executing the last microinstruction 126 of the sequence writes the original value back to the destination register instead of writing the result value to the destination register.

In one embodiment of the present invention, the conditional ALU instruction is an ISA instruction 124 that directs the microprocessor 100 to perform an arithmetic or logical operation on one or more source operands to generate a result and write the result to a destination register. Other types of conditional instructions 124 may also be supported by the ISA instruction set of the microprocessor 100, such as conditional branch instructions 124 or conditional load/store instructions 124, which are distinct from conditional ALU instructions 124.

The number and type of microinstructions 126 that the hardware instruction translator 104 issues in the sequence in response to an encountered conditional ALU instruction 124 depend mainly on two characteristics. The first characteristic is whether the conditional ALU instruction 124 specifies that one of the source operands is to be pre-shifted. In one embodiment, the pre-shift operations include, for example, those described on pages A8-10 through A8-12 of the ARM Architecture Reference Manual. If the conditional ALU instruction 124 specifies a pre-shift operation, the hardware instruction translator 104 generates a shift microinstruction 126 (labeled SHF in FIG. 10) as the first microinstruction 126 in the sequence. The shift microinstruction 126 performs the pre-shift operation to generate a shifted result that is written into a temporary register for use by subsequent microinstructions 126 in the sequence. The second characteristic is whether the destination register specified by the conditional ALU instruction 124 is also one of the source operand registers. If so, the hardware instruction translator 104 performs an optimization that translates the conditional ALU instruction 124 into fewer microinstructions 126 than would be produced for a conditional ALU instruction 124 whose destination register is not one of its source operand registers. This optimization is described primarily with respect to FIGS. 21-28.

In addition, the conditional ALU instruction 124 specifies a condition that the architectural condition flags must satisfy in order for the microprocessor 100 to execute the conditional ALU instruction 124. The conditional ALU instruction 124 may specify that the architectural condition flags are to be updated with the result of the ALU operation and/or with a carry flag generated by the pre-shift. However, if the condition is not satisfied, the architectural condition flags are not updated. This is complicated to achieve because the hardware instruction translator 104 translates the conditional ALU instruction 124 into a sequence of multiple microinstructions 126. Specifically, if the condition is satisfied, at least one of the microinstructions 126 must write the new condition flag values; however, the old values of the condition flags may still be needed by a microinstruction 126 in the sequence to determine whether the condition specified by the conditional ALU instruction 124 is satisfied and/or to perform the ALU operation. An advantage of these embodiments is that the microprocessor 100 employs techniques to ensure that the condition flags are not updated when the condition is not satisfied, and that the flags are updated with the correct values when the condition is satisfied, including updating with the pre-shift carry flag value.

In one embodiment of the microprocessor 100 of FIG. 1, the register file 106 that holds the general purpose registers has only enough read ports to provide at most two source operands to the execution units 424 that execute the microinstructions 126 implementing the conditional ALU instruction 124. As described above with respect to FIG. 1, embodiments are contemplated in which the microprocessor 100 is an enhancement of a commercially available microprocessor. The register file that holds the general purpose registers of the commercially available microprocessor has only enough read ports to provide at most two source operands to the execution units that execute what are referred to herein as the microinstructions 126 implementing the conditional ALU instruction 124. Thus, the embodiments described herein are particularly well suited to the microarchitecture of such a commercially available microprocessor. As also discussed with respect to FIG. 1, the commercially available microprocessor was originally designed for the x86 ISA, in which conditional execution of instructions is not a key feature; because the x86 ISA is accumulator-based and generally requires one of the source operands to also be the destination operand, the additional read port was not needed.

One advantage of the embodiments described herein is that, although in some cases there may be an execution latency of two clock cycles when the conditional ALU instruction 124 is translated into two microinstructions, and in some cases an execution latency of three clock cycles when the conditional ALU instruction 124 is translated into three microinstructions, the operation performed by each microinstruction is relatively simple, which allows a pipelined implementation that supports relatively high core clock frequencies.

Although the embodiments described herein illustrate the microprocessor 100 as being capable of executing both ARM ISA and x86 ISA instructions, the present invention is not limited thereto. Embodiments of the present invention are also applicable to microprocessors that execute instructions of only a single ISA. In addition, although the embodiments described herein translate ARM ISA conditional ALU instructions into microinstructions 126, the embodiments are also applicable to a microprocessor that executes instructions of an ISA other than the ARM ISA whose instruction set likewise includes conditional ALU instructions.

Referring now to FIG. 9, a block diagram showing portions of the microprocessor 100 of FIG. 1 in more detail is presented. The microprocessor 100 includes an architectural condition flags register 926 within the register file 106 of FIG. 1. The microprocessor 100 also includes the execution units 424 and the reorder buffer 422 of FIG. 4. The condition flags register 926 stores the architectural condition flags. In one embodiment, when the instruction mode pointer 132 indicates the ARM ISA, the condition flags register 926 stores values according to the semantics of the ARM ISA condition flags, and when the instruction mode pointer 132 indicates the x86 ISA, the condition flags register 926 stores values according to the semantics of the x86 ISA condition flags, i.e., the x86 EFLAGS. As described above with respect to FIG. 5, the register file 106 is preferably implemented as separate physical blocks of registers; in particular, the condition flags register 926 may, for example, be a physical register file separate from the register file of the general purpose registers. Thus, even though the condition flags are provided to the execution units 424 for execution of the microinstructions 126 as described below, the read ports of the condition flags register file may be different from the read ports of the general purpose register file.

The condition flags register 926 outputs its condition flag values to a first data input of a three-input multiplexer 922. A second data input of the multiplexer 922 receives the condition flag results from the appropriate entry of the reorder buffer 422. A third data input of the multiplexer 922 receives the condition flag results on a flag bus 928. The multiplexer 922 selects the appropriate data input to provide as its output 924 to the execution unit 424 that executes a microinstruction 126 that reads the condition flags. This process is described in more detail in the paragraphs below. Although only a single flag bus 928 is shown, in one embodiment of the present invention each execution unit 424 capable of generating condition flags has its own flag bus 928, and each execution unit 424 capable of reading condition flags has its own condition flag input 924. Thus, different execution units 424 can concurrently execute different microinstructions 126 that read and/or write the condition flags.
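The three-way selection performed by the multiplexer 922 can be sketched as follows. The selection policy shown (prefer a value currently on the flag bus, then a completed reorder buffer entry, then the architectural register) is an assumption modeled on conventional result forwarding; the text above only states that the appropriate input is selected. All names are hypothetical.

```python
def select_flags(producer_tag, flag_bus, rob, arch_flags):
    """Hypothetical model of the three-input multiplexer 922.

    producer_tag: reorder buffer index of the most recent older
                  microinstruction that writes the condition flags,
                  or None if no such microinstruction is in flight.
    flag_bus:     (tag, flags) currently driven on flag bus 928, or None.
    rob:          dict mapping reorder buffer index to completed flag results.
    arch_flags:   contents of the condition flags register 926.
    """
    if producer_tag is None:
        return arch_flags              # first input: architectural register
    if flag_bus is not None and flag_bus[0] == producer_tag:
        return flag_bus[1]             # third input: forwarded on the flag bus
    return rob[producer_tag]           # second input: reorder buffer entry
```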

The flag bus 928 is the portion of the result bus 128 of FIG. 1 that carries the condition flag results output by the execution units 424. The condition flag results are written into the reorder buffer 422, more specifically into the entry of the reorder buffer 422 allocated to the microinstruction 126 that is being executed by the execution unit 424 and whose results are being forwarded on the flag bus 928. The condition flag results on the flag bus 928 are also forwarded to the third data input of the multiplexer 922.

FIG. 9 also shows, in block diagram form, the condition flag values output by the execution units 424 on the flag bus 928 and the condition flag values 924 received by the execution units 424 from the multiplexer 922. The condition flag values 928/924 include the ISA condition flags 902, a condition satisfied (SAT) bit 904, a pre-shift carry (PSC) bit 906, and a use shift carry (USE) bit 908. When the instruction mode pointer 132 indicates the ARM ISA, the ISA condition flags 902 include the ARM carry flag (C), zero flag (Z), overflow flag (V), and negative flag (N). When the instruction mode pointer 132 indicates the x86 ISA, the ISA condition flags 902 include the x86 EFLAGS carry flag (CF), zero flag (ZF), overflow flag (OF), sign flag (SF), parity flag (PF), and auxiliary flag (AF). The condition flags register 926 includes storage for the ISA condition flags 902, the SAT bit 904, the PSC bit 906, and the USE bit 908. In one embodiment, the condition flags register 926 shares storage between the x86 ISA and ARM ISA carry flags, zero flags, overflow flags, and negative/sign flags.

Each microinstruction 126, in addition to its basic operation (e.g., add, load/store, shift, Boolean AND, branch), indicates whether it performs one or more of three additional operations: (1) reading the condition flags 926 (labeled RDFLAGS in FIG. 10 and subsequent figures), (2) writing the condition flags 926 (labeled WRFLAGS in FIG. 10 and subsequent figures), and (3) generating a carry flag value and writing it to the PSC bit 906 of the condition flags 926 (labeled WRCARRY in FIG. 10 and subsequent figures). In one embodiment, the microinstruction 126 includes respective bits to indicate the three additional operations. In another embodiment, the microinstruction 126 indicates the three additional operations through its opcode; that is, different opcodes exist for the different microinstruction 126 types according to the combination of the three additional operations that each type performs.

If an execution unit 424 executes a conditional ALU microinstruction 126 (denoted ALUOP CC, CUALUOP CC, and NCUALUOP CC in FIG. 10) that indicates it writes the condition flags 926 (denoted WRFLAGS), and the condition flags 924 read by the execution unit 424 satisfy the condition specified by the microinstruction 126, the execution unit 424 sets the SAT bit 904 to one; otherwise, the execution unit 424 clears the SAT bit 904 to zero. Furthermore, if an execution unit 424 executes any microinstruction 126 that indicates it writes the condition flags 926 and that is not a conditional ALU microinstruction 126, the execution unit 424 clears the SAT bit 904 to zero. Some conditional microinstructions 126 specify a condition based on the ISA condition flags 902 (denoted XMOV CC in FIG. 10 and subsequent figures), while others specify a condition based on the SAT bit 904 (denoted CMOV in FIG. 10 and subsequent figures), as described in further detail below.

If an execution unit 424 executes a shift microinstruction 126 that indicates it writes the carry flag (denoted WRCARRY), the execution unit 424 sets the USE bit 908 to one and writes the carry value generated by the shift microinstruction 126 into the PSC bit 906; otherwise, the execution unit 424 clears the USE bit 908 to zero. Furthermore, if an execution unit 424 executes any microinstruction 126 that indicates it writes the condition flags 926 and that is not a shift microinstruction 126, the execution unit 424 clears the USE bit 908 to zero. The USE bit 908 is used by a subsequent conditional ALU microinstruction 126 to determine whether to update the architectural carry flag 902 with the value of the PSC bit 906 or with the carry flag value generated by the ALU operation performed by the conditional ALU microinstruction 126. This operation is further described in the following paragraphs. In another embodiment, the USE bit 908 is not present; instead, the hardware instruction translator 104 directly generates a functional equivalent of the USE bit 908 as an indicator within the conditional ALU microinstruction 126.
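The update rules for the SAT bit 904, PSC bit 906, and USE bit 908 can be summarized in a short sketch. The `FlagState` type, the `kind` strings, and the function signature are hypothetical. The sketch covers only these three bits, not the ISA condition flags 902, and it assumes that a microinstruction that neither writes the flags nor writes the carry leaves the bits unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagState:
    sat: int = 0  # condition satisfied bit 904
    psc: int = 0  # pre-shift carry bit 906
    use: int = 0  # use shift carry bit 908


def update_status_bits(old, kind, wrflags=False, wrcarry=False,
                       condition_met=False, shift_carry=0):
    """Return the SAT/PSC/USE values produced by one microinstruction.

    kind is "shift", "cond_alu", or any other string for other types.
    """
    sat, psc, use = old.sat, old.psc, old.use

    if kind == "shift" and wrcarry:
        # A WRCARRY shift records its carry-out and marks it for later use.
        use, psc = 1, shift_carry & 1
    elif wrflags:
        # Any other flag-writing microinstruction clears USE.
        use = 0

    if wrflags:
        # Only a conditional ALU microinstruction whose condition held sets SAT.
        sat = 1 if (kind == "cond_alu" and condition_met) else 0

    return FlagState(sat, psc, use)
```

For example, a WRCARRY shift followed by a flag-writing conditional ALU microinstruction whose condition is met first yields USE = 1 with the shift carry in PSC, and then SAT = 1 with USE cleared.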

Referring now to FIG. 10 (comprising FIG. 10A and FIG. 10B), a flowchart illustrating operation of the hardware instruction translator 104 of FIG. 1 to translate the conditional ALU instruction 124 according to one embodiment of the present invention is shown. Generally, FIGS. 10A and 10B depict the manner in which the hardware instruction translator 104 decodes the conditional ALU instruction 124 to identify its type, in order to translate it into the appropriate sequence of microinstructions 126 for execution by the execution pipeline 112. Specifically, the hardware instruction translator 104 determines whether the conditional ALU instruction 124 updates the architectural condition flags 902, whether it performs a pre-shift operation on a source operand, whether it uses the carry flag as an input to the ALU operation, and whether the ALU operation is a carry-updating or non-carry-updating operation. As described further below, the last of these distinctions indicates whether the ALU operation updates only a subset of the architectural condition flags 902 or all of the architectural condition flags 902. The flow begins at step 1002.

At step 1002, the hardware instruction translator 104 encounters a conditional ALU instruction 124, decodes it, and translates it into the appropriate sequence of microinstructions 126, as depicted in steps 1024, 1026, 1034, 1036, 1044, 1054, and 1056. The conditional ALU instruction 124 directs the microprocessor 100 to perform an arithmetic or logical operation on one or more source operands to generate a result and to write the result to a destination register. Some types of ALU operations specified by the conditional ALU instruction 124 use the architectural carry flag 902 as an input (e.g., add with carry), although most types do not. The conditional ALU instruction 124 also specifies a condition with respect to the architectural condition flags 902 of the ISA. If the architectural condition flags 902 satisfy the specified condition, the microprocessor 100 executes the conditional ALU instruction 124, i.e., performs the ALU operation and writes the result to the destination register. Otherwise, the microprocessor 100 treats the conditional ALU instruction 124 as a no-op instruction; specifically, the microprocessor 100 does not change the value in the destination register. In addition, the conditional ALU instruction 124 may specify whether or not the architectural condition flags 902 are to be updated based on the result of the ALU operation. However, even if the conditional ALU instruction 124 specifies that the architectural condition flags 902 are to be updated, the microprocessor 100 does not change the values of the architectural condition flags 902 if they do not satisfy the specified condition. Finally, the conditional ALU instruction 124 may additionally specify that one of the source operands of the ALU operation is to be pre-shifted, as described with reference to step 1012. In one embodiment, the conditional ALU instructions 124 translated by the hardware instruction translator 104 are ARM ISA instructions.
Specifically, in one embodiment, ARM ISA data processing instructions and multiply instructions are translated by the hardware instruction translator 104 as shown in FIG. 10. In one embodiment, these instructions include, but are not limited to, the AND, EOR, SUB, RSB, ADD, ADC, SBC, RSC, TST, TEQ, CMP, CMN, ORR, ORN, MOV, LSL, LSR, ASR, RRX, ROR, BIC, MVN, MUL, MLA, and MLS instructions. In each of steps 1024, 1026, 1034, 1036, 1044, 1054, and 1056, the relevant type of ARM ISA conditional ALU instruction 124 is shown on the first line, and the microinstructions 126 into which the hardware instruction translator 104 translates the conditional ALU instruction 124 are shown on the subsequent lines, for illustrative purposes. The suffix "CC" indicates that the instruction 124 is a conditional instruction. Furthermore, the type of ALU operation is shown together with examples of the specified source and destination operands. A programmer may specify a destination register that is the same register as one providing a source operand; in this case, the hardware instruction translator 104 is configured to take advantage of this and to optimize the sequence of microinstructions 126 into which it translates the conditional ALU instruction 124. This feature is described with respect to FIG. 21. Step 1004 is next entered.

In step 1004, the hardware instruction translator 104 determines whether the conditional ALU instruction 124 specifies that the architectural condition flags 902 are to be updated by the conditional ALU instruction 124. That is, in some cases the programmer may choose a version of the conditional ALU instruction 124 that updates the architectural condition flags 902 based on the result of the ALU operation, while in other cases the programmer may choose a version of the conditional ALU instruction 124 that does not update the architectural condition flags 902 regardless of the result of the ALU operation. In ARM ISA assembly language, the instruction suffix "S" indicates that the architectural condition flags 902 are to be updated, and this convention is used in FIG. 10 and subsequent figures. For example, step 1044 denotes the ARM ISA conditional ALU instruction 124 as "ALUOP S" to indicate that the architectural condition flags 902 are to be updated, while step 1024 denotes the ARM ISA conditional ALU instruction 124 as "ALUOP" (i.e., without the "S") to indicate that the architectural condition flags 902 are not to be updated. If the conditional ALU instruction 124 specifies that the architectural condition flags 902 are to be updated, flow proceeds to step 1042; otherwise, flow proceeds to step 1012.

At step 1012, the hardware instruction translator 104 determines whether the conditional ALU instruction 124 is of a type that specifies a pre-shift operation on one of the ALU operation operands. The pre-shift operation may be performed on an immediate field to generate a constant source operand, or it may be performed on a source operand provided from a register. The pre-shift amount may be specified as a constant within the conditional ALU instruction 124. Alternatively, in the case of a register-shifted operand, the pre-shift amount may be specified by a value in a register. In the ARM ISA, an immediate value pre-shifted by an immediate shift amount to generate a constant source operand is referred to as a modified immediate constant. The pre-shift operation generates a carry flag value. For some types of ALU operations, the architectural carry flag 902 is updated with the carry flag value generated by the shift operation, whereas for other types of ALU operations, the architectural carry flag 902 is updated with the carry flag value generated by the ALU operation. However, the carry flag value generated by the pre-shift operation is not used to determine whether the condition specified by the conditional ALU instruction 124 is satisfied; rather, the current architectural carry flag 902 is used. It is noted that the ARM ISA MUL, ASR, LSL, LSR, ROR, and RRX instructions cannot specify a pre-shift operation and are processed as described with respect to steps 1024, 1026, or 1044. Additionally, the MOV and MVN instructions may specify a pre-shift operation when they specify a modified immediate constant operand, but they cannot specify a pre-shift operation when they do not specify a modified immediate constant operand (i.e., when they specify a register operand), in which case they are processed as described with respect to steps 1024, 1026, or 1044. As described above, the pre-shift operation may be performed on an immediate field to generate a constant source operand, or it may be performed on a source operand provided by a register. If the conditional ALU instruction 124 specifies a pre-shift operation, flow proceeds to step 1032; otherwise, flow proceeds to step 1022.
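As an illustration of a pre-shift performed on an immediate field, the following sketch expands an ARM modified immediate constant and produces the accompanying carry value. It follows the ARM encoding in which a 12-bit field holds a 4-bit rotation count above an 8-bit value, the value being rotated right by twice the count. The function name is hypothetical, and the sketch shows only the arithmetic, not where in the microprocessor 100 the computation takes place.

```python
def expand_modified_immediate(imm12, carry_in):
    """Expand a 12-bit ARM modified immediate into (value32, carry_out).

    imm12[11:8] is the rotation count and imm12[7:0] is the 8-bit value.
    The value is rotated right by 2 * count within 32 bits. With a zero
    rotation the incoming carry passes through unchanged; otherwise the
    carry-out is bit 31 of the rotated result.
    """
    imm8 = imm12 & 0xFF
    rotation = 2 * ((imm12 >> 8) & 0xF)
    if rotation == 0:
        return imm8, carry_in
    value = ((imm8 >> rotation) | (imm8 << (32 - rotation))) & 0xFFFFFFFF
    return value, value >> 31
```

For instance, the field 0x4FF (count 4, value 0xFF) expands to 0xFF000000 with a carry-out of one, while 0x0AB (count 0) expands to 0xAB and leaves the carry as it was.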

In step 1022, the hardware instruction translator 104 determines whether the conditional ALU instruction 124 specifies an ALU operation that uses the carry flag. ARM ISA instructions 124 that use the carry flag include, for example, the add with carry (ADC), reverse subtract with carry (RSC), and subtract with carry (SBC) instructions, as well as instructions that specify a shifted register operand and use the carry flag for the shift operation, i.e., instructions with the RRX shift type. If the conditional ALU instruction 124 specifies an ALU operation that uses the carry flag, flow proceeds to step 1026; otherwise, flow proceeds to step 1024.
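The decisions of steps 1004, 1012, and 1022 form a small decision tree, sketched below. The function name and arguments are hypothetical; the returned numbers are the step labels that flow proceeds to. Only the branches described so far are modeled, so the flag-updating and pre-shift paths simply return the label of the next decision step.

```python
# Operations named in the text as using the carry flag as an input.
CARRY_USING_OPS = {"ADC", "RSC", "SBC", "RRX"}


def next_step(op, updates_flags, has_pre_shift):
    """Return the FIG. 10 step label reached for a conditional ALU instruction.

    op:            ARM mnemonic without condition or S suffix, e.g. "ADD".
    updates_flags: True if the instruction carries the S suffix.
    has_pre_shift: True if one source operand is pre-shifted.
    """
    if updates_flags:          # step 1004
        return 1042
    if has_pre_shift:          # step 1012
        return 1032
    if op in CARRY_USING_OPS:  # step 1022
        return 1026
    return 1024
```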

At step 1024, the hardware instruction translator 104 translates the non-flag-updating, non-pre-shift, non-carry-using conditional ALU instruction 124 into first and second microinstructions 126, i.e., (1) an ALU operation microinstruction 126 (denoted ALUOP); and (2) a conditional move microinstruction 126 (denoted XMOV). In one example of step 1024, the conditional ALU instruction 124 specifies a first source register (R1) and a second source register (R2), an ALU operation (denoted ALUOP) to be performed on the two source registers to generate a result, and a destination register (RD) to which the result is conditionally written. The ALUOP microinstruction 126 specifies the same ALU operation and source operands as the conditional ALU instruction 124. The ALUOP microinstruction 126 performs the ALU operation on the two source operands and writes the result to a temporary register (denoted T2). The conditional move microinstruction 126 specifies the same condition as the conditional ALU instruction 124. The conditional move microinstruction 126 receives the value written to the temporary register by the ALUOP microinstruction 126 and receives the old, or current, value of the destination register (RD). The conditional move microinstruction 126 receives the condition flags 924 and determines whether they satisfy the condition. If the condition is satisfied, the conditional move microinstruction 126 writes the value of the temporary register into the destination register (RD); otherwise, it writes the old value of the destination register back into the destination register. It is noted that although the present example specifies two source register operands, the present invention is not so limited, and one of the source operands may be a constant operand specified in the immediate field of the conditional ALU instruction 124 rather than being provided by a register. The execution of the microinstructions 126 is further described with reference to FIG. 20.
Unless otherwise indicated, the term "old" as used in FIGS. 10A and 10B and the following figures refers to the flag or destination register value that is received by the execution unit 424 when it executes the microinstruction 126; it may equally be read as the current value. For the destination register, the old or current value is received from the forwarding result bus of FIG. 1, the reorder buffer 422, or the architectural register file 106. For flags, as described with respect to FIG. 9, the old or current value is received from the forwarding flag bus 928, the reorder buffer 422, or the architectural condition flag register 926. The flow ends at step 1024.
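The two-microinstruction sequence of step 1024 can be illustrated with a minimal Python sketch. The function names `aluop`, `xmov`, and `run_conditional_alu`, the `Flags` class, the 32-bit operand width, and the small tables of operations and conditions are assumptions made for illustration only; they do not describe the actual microinstruction encoding.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF  # assumed 32-bit register width


@dataclass(frozen=True)
class Flags:
    """Condition flags as seen by a microinstruction (illustrative model)."""
    n: bool = False
    z: bool = False
    c: bool = False
    v: bool = False


# Hypothetical subset of condition codes, keyed by mnemonic.
CONDITIONS = {
    "EQ": lambda f: f.z,
    "NE": lambda f: not f.z,
    "CS": lambda f: f.c,
    "CC": lambda f: not f.c,
    "AL": lambda f: True,
}

# Hypothetical subset of ALU operations that neither use nor update the carry.
ALU_OPS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "AND": lambda a, b: a & b,
    "ORR": lambda a, b: a | b,
}


def aluop(op, r1, r2):
    """ALUOP: unconditional ALU operation; the result goes to temporary T2."""
    return ALU_OPS[op](r1, r2) & MASK32


def xmov(cond, flags, t2, old_rd):
    """XMOV: write T2 to RD if the condition holds, else write back old RD."""
    return t2 if CONDITIONS[cond](flags) else old_rd


def run_conditional_alu(op, cond, flags, r1, r2, old_rd):
    """Model of the ALUOP + XMOV pair; returns the new value of RD."""
    t2 = aluop(op, r1, r2)
    return xmov(cond, flags, t2, old_rd)
```

In this sketch the ALU operation always executes, and only the final move depends on the condition, which mirrors the way the XMOV microinstruction writes back the old destination value when the condition is not satisfied.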

In step 1026, the hardware instruction translator 104 translates the non-flag-updating, non-pre-shift, carry-using conditional ALU instruction 124 into first and second microinstructions 126, i.e., (1) a carry-using ALU operation microinstruction 126 (denoted ALUOPUC); and (2) a conditional move microinstruction 126 (denoted XMOV). In one example of step 1026, the conditional ALU instruction 124 is similar to that described in step 1024, except that the specified ALU operation uses the carry flag. The two microinstructions 126 are also similar to those described in step 1024; however, the ALUOPUC microinstruction 126 also receives the condition flags 924 to obtain the current value of the carry flag, which it uses in the carry-using ALU operation. The execution of the microinstructions 126 is described in detail in FIG. 19. The flow ends at step 1026.

In step 1032, the hardware instruction translator 104 determines whether the conditional ALU instruction 124 specifies an ALU operation to use the carry flag. If the ALU operation uses the carry flag, flow proceeds to step 1036; otherwise, go to step 1034.

In step 1034, the hardware instruction translator 104 translates the non-flag-updating, pre-shift, non-carry-using conditional ALU instruction 124 into first, second, and third microinstructions 126, i.e., (1) a shift microinstruction 126 (denoted SHF); (2) an ALU operation microinstruction 126; and (3) a conditional move microinstruction 126. In one example of step 1034, the conditional ALU instruction 124 is similar to that described in step 1024; however, the conditional ALU instruction 124 also specifies a pre-shift operation on the second source operand (R2) by a shift amount that, in the example of step 1034, is stored in a third source register (R3) specified by the conditional ALU instruction 124. However, if the type of conditional ALU instruction 124 specifies the shift amount as a constant within the instruction 124, the third source register is not used. The pre-shift operations that the conditional ALU instruction 124 may specify include, but are not limited to, logical shift left (LSL), logical shift right (LSR), arithmetic shift right (ASR), rotate right (ROR), and rotate right with extend (RRX). In one embodiment, the hardware instruction translator 104 outputs the shift microinstruction 126 so as to ensure that the shifted value is generated according to the semantics of the ARM ISA, as described, for example, in the ARM architecture reference manual for the respective ARM instructions and at pages A8-10 through A8-12 and A5-10 through A5-11. The shift microinstruction 126 specifies the same pre-shift operation as the conditional ALU instruction 124, as well as the same second source operand R2 and third source operand R3. The shift microinstruction 126 performs the shift operation on the second source operand R2 by the shift amount and writes the result into a temporary register (denoted T3).
Although the condition flag value generated by the shift microinstruction 126 is not used in step 1034, because the conditional ALU instruction 124 specifies that the architectural condition flags 902 are not to be updated, the carry flag value generated by the shift microinstruction 126 is used in step 1056, for example, as described further below. In addition, some pre-shift operations require the old carry flag to be rotated into the shifted result value; for example, a rotate right with extend (RRX) pre-shift operation shifts the carry flag into the most significant bit of the result. In this case, although not shown in FIGS. 10A and 10B (except for step 1056), the shift microinstruction 126 also reads the condition flags 924 to obtain the current carry flag value. The ALUOP microinstruction 126 is similar to that described in step 1024; however, it receives the value of the temporary register T3 rather than the second source operand R2, and performs the ALU operation on the first source operand R1 and the temporary register T3 to generate a result that is written into the temporary register T2. The XMOV microinstruction 126 is similar to that described in step 1024. The execution of the microinstructions 126 is described in greater detail in FIG. 18. Flow ends at step 1034.

In step 1036, the hardware instruction translator 104 translates the non-flag-updating, pre-shift, carry-using conditional ALU instruction 124 into first, second, and third microinstructions 126, i.e., (1) a shift microinstruction 126; (2) a carry-using ALU operation microinstruction 126 (ALUOPUC); and (3) a conditional move microinstruction 126. In the example of step 1036, the conditional ALU instruction 124 is similar to that described in step 1034, except that the ALU operation specified by the instruction 124 uses the carry flag. The three microinstructions 126 are similar to those described in step 1034; however, the ALUOPUC microinstruction 126 also receives the condition flags 924 to obtain the current value of the carry flag for use in the carry-using ALU operation. Execution of the microinstructions 126 is described in greater detail in FIG. 17. The flow ends at step 1036.

At step 1042, the hardware instruction translator 104 determines whether the type of conditional ALU instruction 124 specifies a pre-shift on one of the ALU operation operands. If the conditional ALU instruction 124 specifies a pre-shift, flow proceeds to block 1052; otherwise, the flow proceeds to step 1044.

In step 1044, the hardware instruction translator 104 translates the flag-updating, non-pre-shift conditional ALU instruction 124 into first and second microinstructions 126, i.e., (1) a conditional ALU operation microinstruction 126 (denoted ALUOP CC); and (2) a conditional move microinstruction 126 (denoted CMOV). In the example of step 1044, the conditional ALU instruction 124 is similar to the conditional ALU instruction 124 of step 1024, except that it updates the architectural condition flags 902. The conditional ALU operation microinstruction 126 specifies the same condition and source operands as the conditional ALU instruction 124. The conditional ALU operation microinstruction 126 performs the ALU operation on the two source operands and writes the result to a temporary register (denoted T2). In addition, the conditional ALU operation microinstruction 126 receives the architectural condition flags 902 and determines whether they satisfy the condition. The conditional ALU operation microinstruction 126 also writes to the condition flag register 926. Specifically, it writes the SAT bit 904 to indicate whether the architectural condition flags 902 satisfy the condition. Further, if the condition is not satisfied, the conditional ALU operation microinstruction 126 writes the old condition flag values to the architectural condition flags 902; otherwise, if the condition is satisfied, it updates the architectural condition flags 902 based on the result of the ALU operation. The updated value of the architectural condition flags 902 depends on the type of ALU operation.
That is, for some types of ALU operations, all of the architectural condition flags 902 are updated with new values according to the result of the ALU operation; for other types of ALU operations, some of the architectural condition flags 902 (the Z and N flags in one embodiment) are updated with new values based on the result of the ALU operation, while the old values are retained for the other architectural condition flags 902 (the V and C flags in one embodiment). The updating of the architectural condition flags 902 is illustrated in more detail in FIG. 14. The conditional move (CMOV) microinstruction 126 receives the value written into the temporary register (T2) by the conditional ALU operation microinstruction 126 and receives the old or current value of the destination register (RD). The CMOV microinstruction 126 receives the condition flags 924 and checks the SAT bit 904 to determine whether the conditional ALU operation microinstruction 126 indicated that the architectural condition flags 902 satisfy the condition. If the condition is satisfied, the CMOV microinstruction 126 writes the value of the temporary register into the destination register; otherwise, it writes the old value of the destination register back into the destination register. The execution of the microinstructions 126 is described in greater detail in FIG. 14. It is noted that the conditional ALU operation microinstruction 126 generated in step 1044 (and in steps 1054 and 1056) may specify an ALU operation that uses the carry flag (similar to that described in steps 1026 and 1036); because the microinstruction 126 reads the flags (as indicated by RDFLAGS), the execution unit 424 has the carry flag available to perform the carry-using ALU operation. The flow ends at step 1044.

In step 1052, the hardware instruction translator 104 determines whether the conditional ALU instruction 124 specifies an ALU operation of a type that updates the architectural carry flag 902. The hardware instruction translator 104 must make this distinction because, if the ALU operation does not update the architectural carry flag 902, the carry flag value generated by the pre-shift operation, rather than a carry flag value generated by the ALU operation, must be used to update the architectural carry flag 902. In one embodiment, the ARM ISA instructions 124 that specify an ALU operation that does not update the architectural carry flag 902 but that specify a pre-shift operation include, but are not limited to, AND, BIC, EOR, ORN, ORR, TEQ, and TST, as well as MOV/MVN instructions 124 that specify a modified immediate constant with a non-zero rotation value. If the ALU operation updates the architectural carry flag 902, flow proceeds to step 1054; otherwise, flow proceeds to step 1056.

In step 1054, the hardware instruction translator 104 translates the flag-updating, pre-shift, carry-updating conditional ALU instruction 124 into first, second, and third microinstructions 126, i.e., (1) a shift microinstruction 126; (2) a conditional carry-updating ALU operation microinstruction 126 (denoted CU ALUOP CC); and (3) a conditional move microinstruction 126. In one example of step 1054, the conditional ALU instruction 124 is similar to that described in step 1034; however, this conditional ALU instruction 124 also specifies that the architectural condition flags 902 are to be updated. The shift microinstruction 126 is similar to that described in step 1034. The conditional carry-updating ALU operation microinstruction 126 specifies the same condition as the conditional ALU instruction 124. It performs the ALU operation on the first source operand R1 and the temporary register T3 and writes the result to a temporary register (denoted T2). In addition, the conditional carry-updating ALU operation microinstruction 126 receives the architectural condition flags 902 and determines whether they satisfy the condition. It also writes to the condition flag register 926. Specifically, it writes the SAT bit 904 to indicate whether the architectural condition flags 902 satisfy the condition. Further, if the condition is not satisfied, the conditional carry-updating ALU operation microinstruction 126 writes the old condition flag values to the architectural condition flags 902; conversely, if the condition is satisfied, it updates the architectural condition flags 902 according to the result of the ALU operation. The updating of the architectural condition flags 902 is described in more detail in FIG. 16.
Conditional Move (CMOV) microinstruction 126 is similar to that described in step 1044. The flow ends at step 1054.

In step 1056, the hardware instruction translator 104 translates the flag-updating, pre-shift, non-carry-updating conditional ALU instruction 124 into first, second, and third microinstructions 126, i.e., (1) a shift microinstruction 126; (2) a conditional non-carry-updating ALU operation microinstruction 126 (denoted NCUALUOP CC); and (3) a conditional move microinstruction 126. In the example of step 1056, the conditional ALU instruction 124 is similar to that described in step 1054; however, the conditional ALU instruction 124 specifies a non-carry-updating ALU operation. Thus, when the condition is satisfied, the architectural carry flag 902 is updated with the pre-shift carry flag value. The shift microinstruction 126 is similar to that described in step 1034; however, it reads and writes the condition flag register 926. Specifically, the shift microinstruction 126: (1) writes the carry flag value generated by the pre-shift operation to the PSC bit 906; (2) sets the USE bit 908 to indicate that the conditional non-carry-updating ALU operation microinstruction 126 is to use the PSC bit 906 to update the architectural carry flag 902; and (3) writes the old architectural condition flags 902 back to the condition flag register 926, so that the conditional non-carry-updating ALU operation microinstruction 126 can evaluate the old value of the architectural condition flags 902 to determine whether it satisfies the condition. The conditional non-carry-updating ALU operation microinstruction 126 specifies the same condition as the conditional ALU instruction 124. It performs the ALU operation on the source operand R1 and the temporary register T3 and writes the result to a temporary register (denoted T2). Further, the conditional non-carry-updating ALU operation microinstruction 126 receives the architectural condition flags 902 and determines whether they satisfy the condition.
In addition, the conditional non-carry-updating ALU operation microinstruction 126 writes to the condition flag register 926. Specifically, it writes the SAT bit 904 to indicate whether the architectural condition flags 902 satisfy the condition. Further, if the condition is not satisfied, the conditional non-carry-updating ALU operation microinstruction 126 writes the old condition flag values to the architectural condition flags 902; conversely, if the condition is satisfied, it updates the architectural condition flags 902 based on the result of the ALU operation. Specifically, the architectural overflow (V) flag 902 is written with the old overflow flag value. In addition, the architectural carry flag 902 is updated with the pre-shift carry flag value in the PSC bit 906 if so indicated by the USE bit 908, and is otherwise updated with the old carry flag value 924. The updating of the architectural condition flags 902 is described in more detail in FIG. 15. The CMOV microinstruction 126 is similar to that described in step 1044. In another embodiment, the USE bit 908 is not present, and the hardware instruction translator 104 instead generates a functional equivalent of the USE bit 908 as an indicator within the conditional non-carry-updating ALU operation microinstruction 126. The execution unit 424 checks this indicator to determine whether the architectural carry flag 902 is updated with the pre-shift carry flag value in the PSC bit 906 or with the old carry flag value 924. The flow ends at step 1056.
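The translation decisions of FIG. 10 described above can be summarized in a short Python sketch. The function `translate_conditional_alu` and its boolean parameters are hypothetical names introduced only for illustration; in particular, the flag-update test that precedes steps 1012 and 1042 is not described in this section and is represented here by the `updates_flags` parameter. The microinstruction mnemonics are written as they are denoted in the text.

```python
def translate_conditional_alu(updates_flags, pre_shift,
                              uses_carry=False, updates_carry=False):
    """Return (terminal step, microinstruction sequence) for one instruction.

    updates_flags: the instruction updates the architectural condition flags
    pre_shift:     the instruction specifies a pre-shift operation
    uses_carry:    the ALU operation uses the carry flag (steps 1022/1032)
    updates_carry: the ALU operation updates the carry flag (step 1052)
    """
    if not updates_flags:
        # Steps 1012 -> 1022/1032: unconditional ALU op plus XMOV.
        alu = "ALUOPUC" if uses_carry else "ALUOP"
        if pre_shift:
            step = 1036 if uses_carry else 1034
        else:
            step = 1026 if uses_carry else 1024
        sequence = [alu, "XMOV"]
    else:
        # Steps 1042 -> 1044/1052: conditional ALU op plus CMOV.
        if not pre_shift:
            step, alu = 1044, "ALUOP CC"
        elif updates_carry:
            step, alu = 1054, "CU ALUOP CC"
        else:
            step, alu = 1056, "NCUALUOP CC"
        sequence = [alu, "CMOV"]
    if pre_shift:
        # A pre-shift always contributes a leading shift microinstruction.
        sequence.insert(0, "SHF")
    return step, sequence
```

For example, a flag-updating instruction with a pre-shift and a logical ALU operation such as AND would map to `(1056, ["SHF", "NCUALUOP CC", "CMOV"])` under these assumptions.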

In one embodiment, the hardware instruction translator 104 itself generates and provides the modified immediate constant, rather than outputting a shift microinstruction 126 to do so. In this embodiment, the processing is similar to that described in steps 1024, 1026, and 1044, rather than steps 1034, 1036, and 1054/1056. Furthermore, in this embodiment, the hardware instruction translator 104 also generates and provides the carry flag value from the pre-shift operation for use by the conditional ALU operation microinstruction 126 to update the architectural carry flag 902.
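As an illustration of what generating the constant and its carry value might involve, the following Python sketch expands a modified immediate constant. It assumes the ARM (A32) encoding in which a 12-bit field holds a 4-bit rotation and an 8-bit value, and the constant is the 8-bit value rotated right by twice the rotation; this encoding detail is not stated in this section, and the function name `expand_modified_immediate` is hypothetical.

```python
def expand_modified_immediate(imm12, carry_in):
    """Expand an assumed A32-style modified immediate.

    imm12:    12-bit field; bits [11:8] are the rotation, bits [7:0] the value
    carry_in: current carry flag, passed through when no rotation occurs
    Returns (32-bit constant, carry flag value of the pre-shift).
    """
    rotate = (imm12 >> 8) & 0xF
    imm8 = imm12 & 0xFF
    amount = 2 * rotate
    if amount == 0:
        # No rotation: the constant is imm8 and the carry is unchanged.
        return imm8, carry_in
    value = ((imm8 >> amount) | (imm8 << (32 - amount))) & 0xFFFFFFFF
    # For a rotate right, the carry is the most significant result bit.
    return value, bool(value >> 31)
```

Under these assumptions the translator would supply both returned values: the constant as a source operand of the ALU microinstruction, and the carry value for updating the architectural carry flag.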

Referring now to FIG. 11, a flowchart illustrating operation of the execution unit 424 of FIG. 4 to execute a shift microinstruction 126 according to the present invention is shown. The flow begins at step 1102.

In step 1102, a shift microinstruction 126, such as that described in FIG. 10 and generated by the hardware instruction translator 104 in response to the encountered conditional ALU instruction 124, is received by one of the execution units 424 of FIG. 4. The execution unit 424 also receives source operands specified by the microinstructions 126, including condition flag values 924, which may or may not be used by the microinstructions 126. Proceed next to step 1104.

At step 1104, the execution unit 424 performs the shift operation specified by the shift microinstruction 126 on the operands specified by the shift microinstruction 126 to generate a result, which is output to the result bus 128. In one embodiment, the shift operations include, but are not limited to, logical shift left (LSL), logical shift right (LSR), arithmetic shift right (ASR), rotate right (ROR), and rotate right with extend (RRX). In addition, the execution unit 424 generates new condition flag values based on the result of the shift operation. Specifically, the execution unit 424 generates a carry flag value based on the result of the shift operation. In one embodiment, in the case of a logical shift left (LSL) operation, the carry flag value is bit N of an extended value formed by concatenating the operand with M least significant zero bits (i.e., the operand shifted left by M bits), where N is the number of bits of the original operand and M is the specified positive shift amount; in the case of a logical shift right (LSR) operation, the carry flag value is bit (M-1) of an extended value that is the original operand zero-extended to (M+N) bits, where M is the specified positive shift amount and N is the number of bits of the original operand; in the case of an arithmetic shift right (ASR) operation, the carry flag value is bit (M-1) of an extended value that is the original operand sign-extended to (M+N) bits, where M is the specified positive shift amount and N is the number of bits of the original operand; in the case of a rotate right (ROR) operation, the carry flag value is bit (N-1) of the result of rotating the operand right by the specified non-zero shift amount modulo N, where N is the number of bits of the original operand; in the case of a rotate right with extend (RRX) operation, the carry flag value is bit zero of the original operand. Flow proceeds to step 1106.
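The carry flag rules of step 1104 can be expressed as a small Python sketch. The function name `shift_with_carry`, the string mnemonics, and the fixed operand width N = 32 are assumptions for illustration; the sketch relies on Python's arbitrary-precision integers to model the extended values.

```python
N = 32                      # assumed operand width in bits
MASK = (1 << N) - 1


def shift_with_carry(kind, operand, amount=1, carry_in=False):
    """Return (result, carry flag value) for one shift operation.

    kind:     "LSL", "LSR", "ASR", "ROR" or "RRX"
    amount:   positive shift amount M (ignored for RRX)
    carry_in: old carry flag, rotated into the result by RRX
    """
    operand &= MASK
    if kind == "RRX":
        # Old carry enters bit N-1; bit zero of the operand is the carry.
        result = (int(carry_in) << (N - 1)) | (operand >> 1)
        return result, bool(operand & 1)
    if amount <= 0:
        raise ValueError("shift amount must be positive")
    if kind == "LSL":
        extended = operand << amount          # operand followed by M zeros
        return extended & MASK, bool((extended >> N) & 1)
    if kind == "LSR":
        # Zero extension is implicit for a non-negative Python integer.
        return operand >> amount, bool((operand >> (amount - 1)) & 1)
    if kind == "ASR":
        # Reinterpret as signed so that >> sign-extends.
        signed = operand - (1 << N) if operand >> (N - 1) else operand
        return (signed >> amount) & MASK, bool((signed >> (amount - 1)) & 1)
    if kind == "ROR":
        m = amount % N
        result = ((operand >> m) | (operand << (N - m))) & MASK
        return result, bool(result >> (N - 1))
    raise ValueError(f"unknown shift kind: {kind}")
```

For instance, `shift_with_carry("LSL", 0x80000001, 1)` yields a result of 2 with the carry set, because the bit shifted out of position 31 lands in bit N of the extended value.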

In step 1106, the execution unit 424 determines whether the shift microinstruction 126 output by the hardware instruction translator 104 indicates that the execution unit 424 should write the carry flag, as indicated by WRCARRY in step 1056 of FIG. 10B. Specifically, such a shift microinstruction 126 indicates that the PSC bit 906 on the flag bus output 928 should be written with the carry flag value generated by the shift operation, and that the USE bit 908 should be set so that the subsequent conditional non-carry-updating ALU operation microinstruction 126 conditionally writes the PSC bit 906 value to the architectural carry flag 902. If the execution unit 424 should write the carry flag, flow proceeds to step 1114; otherwise, flow proceeds to step 1108.

In step 1108, the execution unit 424 determines whether the shift microinstruction 126 output by the hardware instruction translator 104 indicates that the execution unit 424 should write the condition flags (denoted WRFLAGS). Although none of the shift microinstructions 126 of FIG. 10 indicates that the condition flags should be written without also indicating that the PSC bit 906 should be written (denoted WRCARRY), the hardware instruction translator 104 may generate such shift microinstructions 126 when translating other ISA instructions 124. If the execution unit 424 should write the condition flags, flow proceeds to step 1112; otherwise, the flow ends.

At step 1112, the execution unit 424 outputs values on the flag bus 928 that clear the PSC bit 906, the USE bit 908, and the SAT bit 904 to zero and that write the architectural condition flags 902 with the new values generated at step 1104. The flow ends at step 1112.

In step 1114, the execution unit 424 outputs values on the flag bus 928 that write the PSC bit 906 with the carry flag value generated in step 1104, set the USE bit 908 to one, clear the SAT bit 904 to zero, and write the architectural condition flags 902 with the old architectural condition flag 902 values received in step 1102. The flow ends at step 1114.
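The flag-bus outputs of the shift microinstruction in FIG. 11 can be sketched in Python as follows. The function name `shift_flag_output` and the dictionary representation of the flag bus are illustrative assumptions; `shift_carry` stands for the carry flag value produced by the shift operation, and `new_flags` and `old_flags` stand for the newly generated and previously received architectural flag values.

```python
def shift_flag_output(wrcarry, wrflags, shift_carry, new_flags, old_flags):
    """Return the values driven on the flag bus, or None if nothing is written.

    wrcarry: the microinstruction indicates WRCARRY (write the PSC bit)
    wrflags: the microinstruction indicates WRFLAGS (write condition flags)
    """
    if wrcarry:
        # Step 1114: latch the pre-shift carry and preserve the old flags.
        return {"PSC": shift_carry, "USE": True, "SAT": False,
                "flags": old_flags}
    if wrflags:
        # Step 1112: clear the non-architectural bits, write the new flags.
        return {"PSC": False, "USE": False, "SAT": False,
                "flags": new_flags}
    # Neither indication: the shift microinstruction leaves the flags alone.
    return None
```

Keeping the old flags in the WRCARRY case is what allows a following conditional ALU microinstruction to evaluate the condition against the unmodified architectural flags.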

Referring now to FIG. 12 (including FIGS. 12A and 12B), a flowchart illustrating operation of the execution unit 424 of FIG. 4 to execute a conditional ALU micro-instruction 126 in accordance with the present invention is shown. The flow begins at step 1202.

At step 1202, one of the execution units 424 of FIG. 4 receives a conditional ALU microinstruction 126, such as described in FIG. 10, generated by the hardware instruction translator 104 in response to encountering a conditional ALU instruction 124. The execution unit 424 also receives the source operands specified by the microinstruction 126, including the condition flag values 924, whether or not they are used by the microinstruction 126. It should be appreciated that the execution unit 424 also executes unconditional ALU microinstructions 126, which may likewise be generated by the hardware instruction translator 104 in response to encountering a conditional ALU instruction 124 as described in FIG. 10, according to a process similar to that described with respect to FIG. 12, with the exception of steps 1209, 1212, 1214, and 1216. Additionally, the execution unit 424 that executes the conditional ALU microinstruction 126 may be the same as or different from the execution unit 424 that executes the associated shift microinstruction 126 and/or XMOV/CMOV microinstruction 126. The flow then proceeds to step 1204.

At block 1204, the execution unit 424 performs the ALU operation specified by the conditional ALU microinstruction 126 on the operands specified by the conditional ALU microinstruction 126 to produce a result and outputs the result to the result bus 128. In addition, the execution unit 424 also generates a new architectural condition flag 902 value based on the result of the ALU operation. If the ALU operation uses the carry flag, the execution unit 424 uses the old value of the received architectural carry flag 924 instead of the new carry flag value generated by the ALU operation. Flow proceeds to block 1206.

In step 1206, the execution unit 424 determines whether the architectural condition flags 924 received in step 1202 satisfy the specified condition. This determination is used in subsequent steps 1212 and 1214. Flow then proceeds to step 1208.

In step 1208, the execution unit 424 determines whether the conditional ALU microinstruction 126 instructs the execution unit 424 to write the condition flag register 926, as indicated by WRFLAGS in many of the steps of FIGS. 10A and 10B. If so, flow proceeds to step 1214; otherwise, flow proceeds to step 1209.

In step 1209, if the determination made in step 1206 is that the condition is satisfied, flow proceeds to step 1211; otherwise, flow proceeds to step 1212.

In step 1211, because the condition is satisfied, the execution unit 424 outputs the result of step 1204 to the result bus 128. However, the conditional ALU microinstruction 126 does not update the condition flag register 926, because it is specified as not updating the architectural condition flags 902. As previously described, the results and condition flag values output by the execution units 424 on the result bus 128/928 are forwarded to the other execution units 424 of the execution pipeline 112 and written into the entry of the reorder buffer 422 associated with the conditional ALU microinstruction 126. It should be appreciated that even if the microinstruction 126 is specified as not updating the architectural condition flags 902, the execution unit 424 still outputs values to the flag result bus 928 for writing to the entry of the reorder buffer 422 associated with the conditional ALU microinstruction 126, but these values are not retired from the reorder buffer 422 to the destination register 106 and/or the condition flag register 926. That is, whether a value written to an entry of the reorder buffer 422 is eventually retired is determined by the retire unit of the execution pipeline 112 based on the type of microinstruction 126, the occurrence of an exception, a branch misprediction, or another invalidating event, rather than by the execution unit 424 itself. The flow ends at step 1211.

At block 1212, the execution unit 424 outputs the first source operand to the result bus 128. It is noted that the various conditional ALU microinstructions 126 described in FIGS. 10A and 10B do not use the output first source operand when the condition is not satisfied. Specifically, the XMOV and CMOV microinstructions 126 of FIGS. 10A and 10B write back the old destination register value instead of the value of the temporary register T2. However, in FIGS. 21A and 21B and the following description, for other types of conditional ALU instruction 124 translations, i.e., the same source destination conditional ALU instruction 124 (or other ISA instruction 124), the hardware instruction translator 104 generates the conditional ALU microinstruction 126, wherein the first source operand is also the destination register specified by the ISA instruction 124, to write back the value of the original destination register if the condition is not satisfied. As depicted in step 1211, the conditional ALU microinstruction 126 does not update the condition flag register 926 because the conditional ALU microinstruction 126 is designated to not update the architectural condition flags 902. Flow ends at block 1212.

In step 1214, if it is determined in step 1206 that the conditions are met, flow proceeds to step 1218; otherwise, flow proceeds to block 1216.

In step 1216, the execution unit 424 outputs the first source operand, clears the USE bit 908, the PSC bit 906, and the SAT bit 904 to zero, and outputs the old architectural condition flag 924 values received in step 1202 to the flag bus 928, so that the conditional ALU instruction 124 as a whole behaves as a no-operation instruction (i.e., the conditional ALU instruction 124 is not performed) and the value of the architectural condition flags 902 is left unchanged. The flow ends at step 1216.

In step 1218, the execution unit 424 determines whether the conditional ALU microinstruction 126 specifies a carry-updating ALU operation. In one embodiment, the execution unit 424 decodes the opcode of the conditional ALU microinstruction 126 to make this determination. In another embodiment, the hardware instruction translator 104 determines whether the ALU operation is a carry-updating operation in step 1052 of FIG. 10A and provides a corresponding indicator to the execution unit 424. In one embodiment, non-carry-updating ALU operations include, but are not limited to, the operations specified by the AND, BIC, EOR, ORN, ORR, TEQ, TST, MUL, MOV, MVN, ASR, LSL, LSR, ROR, and RRX ARM ISA instructions 124. If the ALU operation is a carry-updating operation, flow proceeds to step 1222; otherwise, flow proceeds to step 1224.

At block 1222, the execution unit 424 outputs the result of block 1204, clears the USE bit 908 and the PSC bit 906 to zero, sets the SAT bit 904 to one, and outputs the new architectural condition flag value generated at block 1204 to the flag bus 928. It is noted that the processing of the conditional ALU microinstruction 126 that does not update the overflow flag, but specifies a carry-update ALU operation (e.g., ASR, LSL, LSR, ROR, and RRX operations) is somewhat different than that described in step 1222. In particular, the execution unit 424 outputs the old V flag value instead of the new V flag value. The flow ends at step 1222.

In step 1224, the execution unit 424 checks the USE bit 908. If the USE bit 908 is set to one, flow proceeds to block 1228; otherwise, flow proceeds to block 1226. In another embodiment, the USE bit 908 is not present, and the execution unit 424 examines a pointer within the conditional ALU microinstruction 126 to determine whether to update the architectural carry flag 902 with a pre-shifted carry flag value within the PSC bit 906 or to USE the old carry flag value 924, as described above/below.

In step 1226, the execution unit 424 outputs the result of step 1204, clears the USE bit 908 and the PSC bit 906 to zero, sets the SAT bit 904 to one, and outputs the architectural condition flags to the flag bus 928 as follows: the C and V flags are written with the old C and V flag values received in step 1202, respectively; the N and Z flags are written with the new N and Z flag values generated in step 1204, respectively. The flow ends at step 1226.

In block 1228, the execution unit 424 outputs the result of block 1204, clears the USE bit 908 and the PSC bit 906 to zero, sets the SAT bit 904 to one, and outputs the architectural condition flags to the flag bus 928 as follows: the C flag is written with the value of the PSC bit 906 received in step 1202; the V flag is written with the old V flag value received in step 1202; the N and Z flags are written with the new N and Z flag values generated in step 1204, respectively. The flow ends at step 1228.

In one embodiment, the execution unit 424 executes the conditional ALU microinstruction 126 differently, with respect to the values output on the flag bus 928, depending on whether the instruction mode pointer 132 indicates x86 or ARM. Specifically, if the instruction mode pointer 132 indicates x86, the execution unit 424 does not distinguish between carry-updating and non-carry-updating ALU operations, disregards the USE bit 908, and updates the condition code flags according to x86 semantics.
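The ARM-mode decision flow of steps 1216 through 1228 can be summarized in a short Python sketch. This is only an illustrative model under stated assumptions: the `Flags` record, the function name, and the `updates_overflow` parameter are hypothetical names introduced here, and the x86-mode behavior is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flags:
    """Hypothetical record of the condition flags register 926 contents."""
    n: int = 0
    z: int = 0
    c: int = 0
    v: int = 0
    sat: int = 0  # SAT bit 904
    psc: int = 0  # PSC bit 906
    use: int = 0  # USE bit 908


def conditional_alu_output(old, new, first_src, result, cond_ok,
                           carry_updating, updates_overflow=True):
    """Return (value output, flags driven on the flag bus) for ARM mode."""
    if not cond_ok:
        # Step 1216: act as a no-op; old flags pass through, SAT/PSC/USE are cleared.
        return first_src, Flags(old.n, old.z, old.c, old.v)
    if carry_updating:
        # Step 1222: new flags; operations that do not update V keep the old V.
        v = new.v if updates_overflow else old.v
        return result, Flags(new.n, new.z, new.c, v, sat=1)
    if old.use:
        # Step 1228: C is taken from the PSC bit, V is preserved.
        return result, Flags(new.n, new.z, old.psc, old.v, sat=1)
    # Step 1226: C and V are preserved, N and Z are new.
    return result, Flags(new.n, new.z, old.c, old.v, sat=1)
```

In every satisfied path the SAT bit is set and the USE and PSC bits are cleared, which is what later allows a conditional move microinstruction to select the result without re-evaluating the condition.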

Referring now to FIG. 13, a flowchart illustrating operation of the execution unit 424 of FIG. 4 to execute a conditional move microinstruction 126 according to the present invention is shown. The flow begins at step 1302.

At block 1302, one of the execution units 424 of FIG. 4 receives a conditional move microinstruction 126 (denoted CMOV or XMOV), which the hardware instruction translator 104 generates in response to encountering a conditional ALU instruction 124, as described in FIG. 10. The execution unit 424 also receives the source operands specified by the microinstruction 126, including the condition flag values 924, whether or not they are used by the microinstruction 126. Flow proceeds to step 1304.

At 1304, the execution unit 424 decodes the microinstruction 126 to determine whether it is an XMOV microinstruction 126 or a CMOV microinstruction 126. If it is a CMOV microinstruction 126, flow proceeds to step 1308; otherwise, flow proceeds to step 1306.

In step 1306, the execution unit 424 examines the architectural condition flags 902 received in step 1302 and determines whether the condition is satisfied. Flow proceeds to step 1312.

At block 1308, the execution unit 424 checks the SAT bit 904 received from block 1302 and determines therefrom whether the condition is satisfied as previously determined by a corresponding conditional ALU microinstruction 126 writing to the SAT bit 904, as described in blocks 1044, 1054, and 1056 of FIG. 10. Flow then proceeds to block 1312.

At step 1312, if either step 1306 or step 1308 determined that the condition is satisfied, flow proceeds to step 1316; otherwise, flow proceeds to step 1314.

At 1314, the execution unit 424 outputs the value of the first source operand to the result bus 128. In the context of FIG. 10, the value of the first source operand is the old destination register value. This advantageously allows the conditional ALU instruction 124 as a whole to execute as a no-operation instruction (i.e., the conditional ALU instruction 124 is not performed) when the condition is not satisfied, leaving the value of the destination register unchanged. Flow ends at 1314.

In step 1316, the execution unit 424 outputs the value of the second source operand to the result bus 128. In the context of FIG. 10, the value of the second source operand is the temporary register value written by the associated conditional ALU microinstruction 126. This allows the conditional ALU instruction 124 to be executed by writing its result to the destination register when the specified condition is satisfied. Flow ends at 1316.
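The flow of FIG. 13 may be sketched as follows. The sketch assumes, consistent with the later discussion of FIG. 14 and FIG. 17, that the CMOV form consults the SAT bit (step 1308) while the XMOV form evaluates the condition against the flags (step 1306). The function names and the dictionary layout of the flags are hypothetical; the condition table follows the standard ARM condition codes.

```python
def arm_condition_met(cond, n, z, c, v):
    """Evaluate a standard ARM condition code against the N, Z, C, V flags."""
    table = {
        "EQ": z == 1, "NE": z == 0,
        "CS": c == 1, "CC": c == 0,
        "MI": n == 1, "PL": n == 0,
        "VS": v == 1, "VC": v == 0,
        "HI": c == 1 and z == 0, "LS": c == 0 or z == 1,
        "GE": n == v, "LT": n != v,
        "GT": z == 0 and n == v, "LE": z == 1 or n != v,
        "AL": True,
    }
    return table[cond]


def conditional_move(kind, cond, flags, old_dest, temp):
    """Return the value driven on the result bus by a CMOV or XMOV."""
    if kind == "CMOV":
        satisfied = flags["sat"] == 1  # step 1308: trust the SAT bit
    elif kind == "XMOV":
        # Step 1306: evaluate the condition on the architectural flags.
        satisfied = arm_condition_met(
            cond, flags["n"], flags["z"], flags["c"], flags["v"])
    else:
        raise ValueError(f"unknown conditional move kind: {kind}")
    # Step 1316 (second source operand) or step 1314 (first source operand).
    return temp if satisfied else old_dest
```

Here `old_dest` plays the role of the first source operand (the old destination register value) and `temp` the role of the second source operand (the temporary register written by the ALU microinstruction).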

Referring now to FIG. 14, a block diagram illustrating operation of the execution pipeline 112 of FIG. 1 to execute a conditional ALU instruction 124 according to the present invention is shown. Specifically, the conditional ALU instruction 124 is a flag-updating, non-pre-shifting conditional ALU operation ISA instruction 124. The hardware instruction translator 104 translates the instruction 124 into the microinstructions 126 of step 1044 of FIG. 10. The register allocation table 402 of FIG. 4 generates, among other things, dependency information for the CMOV microinstruction 126 on the values of the temporary register T2 and the condition flags register 926 written by the conditional ALUOP microinstruction 126. The instruction scheduler 404 dispatches the microinstructions 126 to the appropriate reservation stations 406 of FIG. 4. When all source operand values of a microinstruction 126 are available (whether from the forwarding bus 128, the reorder buffer (ROB) 422, or the register file 106), the instruction issue unit 408 determines that the microinstruction 126 is ready and issues it from its reservation station 406 to the corresponding execution unit 424. The microinstructions 126 are executed as described with respect to FIG. 12 (including FIG. 12A and FIG. 12B) and FIG. 13.

The execution unit 424 receives the conditional ALUOP microinstruction 126 generated in step 1044 from the reservation station 406, the source operand values from registers R1 and R2 of the register file 106 of FIG. 1, and the condition flags 924 from the condition flags register 926 of FIG. 9 (or from the forwarding bus 128 and/or the ROB 422), according to step 1202 of FIG. 12A. The execution unit 424 performs the ALU operation on registers R1 and R2 (and on the received C flag 902 if the ALU operation is a carry-using operation) to generate a result, which is written to the temporary register T2, according to step 1204. Additionally: (1) if the architectural condition flags 902 do not satisfy the specified condition (denoted NOT SATISFIED in FIG. 14), the execution unit 424 generates new condition flag 928 values to write to the condition flags register 926, according to step 1216 of FIG. 12B; (2) if the architectural condition flags 902 satisfy the specified condition and the ALU operation is a non-carry-updating operation (labeled NCUALUOP SAT in FIG. 14), the execution unit 424 generates new condition flag 928 values to write to the condition flags register 926, according to step 1226 of FIG. 12; and (3) if the architectural condition flags 902 satisfy the specified condition and the ALU operation is a carry-updating operation (labeled CU ALUOP SAT in FIG. 14), the execution unit 424 generates new condition flag 928 values to write to the condition flags register 926, according to step 1222 of FIG. 12.

The value of the temporary register T2 and the condition flags 928 are provided on the forwarding bus 128 for use by the CMOV microinstruction 126, are written to the entry of the reorder buffer 422 for use by the CMOV microinstruction 126 if it does not obtain them from the forwarding bus 128, and are eventually retired to the appropriate architectural state, barring an exception, branch misprediction, or other invalidating event, for use by the CMOV microinstruction 126 if it obtains them from neither the forwarding bus 128 nor the reorder buffer 422. In particular, the mux 922 of FIG. 9 operates to select the appropriate condition flags 924 for provision to the execution unit 424.

The execution unit 424 receives the CMOV microinstruction 126 of step 1044, its source operand values from the temporary register T2 and the destination register (RD), and the condition flags 924, according to step 1302 of FIG. 13. According to steps 1316 and 1314 of FIG. 13, the execution unit 424 outputs the value of the temporary register T2 source operand when the SAT bit 904 is set, and outputs the value of the destination register RD source operand when the SAT bit 904 is clear. The result value is provided on the forwarding bus 128 for use by subsequent microinstructions 126, is written to the entry of the reorder buffer 422, and is eventually retired to its appropriate architectural state, barring an exception, branch misprediction, or other invalidating event.

As noted with respect to block 1222, flag-updating conditional ALU instructions 124 that specify a carry-updating ALU operation but do not update the overflow flag, such as the ARM ISA ASR, LSL, LSR, ROR, and RRX instructions 124, are handled somewhat differently than depicted in FIG. 14. In particular, the execution unit 424 outputs the old V flag value instead of the new V flag value. Finally, as described above, the ARM ISA MUL and MOV/MVN (register) instructions 124 are non-carry-updating instructions and cannot specify a pre-shift operation, and are therefore processed as described in step 1044. This is more specifically illustrated in step 1226 of FIG. 12B.

As can be seen from the foregoing, the ALU operation microinstruction 126 advantageously indicates to the CMOV microinstruction 126, via the SAT bit 904, whether the old condition flags 902 satisfy the specified condition. This allows the ALU operation microinstruction 126, if the condition is satisfied, to overwrite the old values of the condition flags 902 with the appropriate values generated from the result of the ALU operation.
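The role of the SAT bit can be illustrated with a small Python sketch of a flag-updating add that executes only if the EQ condition (Z equal to one) held beforehand. This is a simplified model under stated assumptions: only the Z flag is tracked, the ALU operation is assumed to be a 32-bit add, and the "re-test" path is a hypothetical alternative shown purely for contrast.

```python
MASK = 0xFFFFFFFF


def add32(a, b):
    """32-bit add returning (result, Z flag)."""
    result = (a + b) & MASK
    return result, int(result == 0)


def conditional_add_eq(r1, r2, rd_old, old_z):
    """Model the ALUOP + CMOV pair for a flag-updating add conditioned on EQ."""
    # Conditional ALUOP: always computes T2 and records the condition outcome in SAT.
    t2, new_z = add32(r1, r2)
    sat = int(old_z == 1)
    z_after = new_z if sat else old_z  # flags are overwritten only when SAT is set
    # CMOV: selects on SAT, not on the Z flag that may already be overwritten.
    chosen_by_sat = t2 if sat else rd_old
    # Hypothetical naive alternative: re-test EQ against the flags after the ALUOP.
    chosen_by_retest = t2 if z_after == 1 else rd_old
    return chosen_by_sat, chosen_by_retest
```

With `r1 = 1`, `r2 = 2`, and the old Z flag set, the add produces a nonzero result and clears Z; selecting on SAT still yields the sum, whereas re-testing the overwritten Z flag would wrongly keep the old destination value.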

Referring now to FIG. 15 (including FIG. 15A and FIG. 15B), a block diagram illustrating the execution of a conditional ALU instruction 124 by the execution pipeline 112 of FIG. 1 according to the present invention is shown. Specifically, the conditional ALU instruction 124 is a flag-updating, pre-shifting, non-carry-updating conditional ALU operation ISA instruction 124, and the hardware instruction translator 104 translates the instruction 124 into the microinstructions 126 of step 1056 of FIG. 10B. The operation of FIG. 15 (including FIGS. 15A and 15B) is similar in many respects to that of FIG. 14; like operations are not repeated here, and only the differences are described. The register allocation table 402 of FIG. 4 generates, among other things, dependency information for the NCUALUOP microinstruction 126 on the values of the temporary register T3 and the condition flags register 926 written by the shift microinstruction 126. The microinstructions 126 are executed as described with respect to FIGS. 11, 12, and 13.

The execution unit 424 receives the shift microinstruction 126 generated in step 1056 from the reservation station 406, the source operand values from registers R2 and R3 of the register file 106, and the condition flags 924 from the condition flags register 926 (or from the forwarding bus 128 and/or the reorder buffer 422), according to step 1102 of FIG. 11. The execution unit 424 performs the shift operation on registers R2 and R3 (and on the received C flag 902 if the ALU operation is a carry-using operation) to generate a result, which is written to the temporary register T3, according to step 1104. In addition, the execution unit 424 generates new architectural condition flag 902 values according to step 1104 and writes new condition flag 928 values to the condition flags register 926 according to step 1114 of FIG. 11. The value of the temporary register T3 and the condition flags 928 are provided on the forwarding bus 128 for use by the NCUALUOP microinstruction 126, are written to the entry of the reorder buffer 422 for use by the NCUALUOP microinstruction 126 if it does not obtain them from the forwarding bus 128, and are eventually retired to their appropriate state, barring an exception, branch misprediction, or other invalidating event, for use by the NCUALUOP microinstruction 126 if it obtains them from neither the forwarding bus 128 nor the reorder buffer 422. In particular, the mux 922 of FIG. 9 operates to select the appropriate condition flags 924 to provide to the execution unit 424.

The execution unit 424 receives the NCUALUOP microinstruction 126 generated in step 1056 from the reservation station 406, the source operand values from register R1 and temporary register T3 of the register file 106, and the condition flags 924 from the condition flags register 926, according to step 1202. The execution unit 424 performs the ALU operation on register R1 and temporary register T3 (and on the received C flag 902 if the ALU operation is a carry-using operation) to generate a result, which is written to the temporary register T2, according to step 1204. Further: (1) if the architectural condition flags 902 do not satisfy the specified condition (labeled NOT SATISFIED in FIG. 15), the execution unit 424 generates new condition flag 928 values to write to the condition flags register 926, according to block 1216; (2) if the architectural condition flags 902 satisfy the specified condition and the USE bit 908 is clear (labeled SAT., USE = 0 in FIG. 15), the execution unit 424 generates new condition flag 928 values to write to the condition flags register 926, according to step 1226 of FIG. 12B; and (3) if the architectural condition flags 902 satisfy the specified condition and the USE bit 908 is set (labeled SAT., USE = 1 in FIG. 15), the execution unit 424 generates new condition flag 928 values to write to the condition flags register 926, according to step 1228 of FIG. 12. The CMOV microinstruction 126 of FIG. 15 executes similarly to that described with respect to FIG. 14. In another embodiment, as described above, the USE bit 908 is not present, and the execution unit 424 instead examines an indicator within the conditional ALU microinstruction 126 to determine whether to update the architectural carry flag 902 with the pre-shifted carry flag value in the PSC bit 906 or with the old carry flag value 924.

As can be seen from the foregoing, the shift microinstruction 126 does not overwrite the old value of the condition flags 902, but instead writes the old value of the condition flags 902 back to the condition flags register 926, so that the conditional ALU operation microinstruction 126, which receives the value of the condition flags register 926 written by the shift microinstruction 126, can determine whether the old condition flags 902 satisfy the condition specified by the ISA conditional ALU instruction 124. Conversely, if the shift microinstruction 126 were to overwrite the old carry flag 902 with the newly generated carry flag value, the conditional ALU operation microinstruction 126 would be unable to determine whether the old condition flags 902 satisfy the specified condition.
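A minimal Python sketch of this behavior follows. It is an illustration only: the shift is assumed to be an ARM-style LSL by 1 to 31 bits, the flags are held in a hypothetical dictionary, and setting the USE bit unconditionally is an assumption made here, since the exact rule of step 1114 is described with respect to FIG. 11.

```python
MASK = 0xFFFFFFFF


def lsl(value, amount):
    """ARM-style logical shift left for 1 <= amount <= 31: (result, carry_out)."""
    carry_out = (value >> (32 - amount)) & 1
    return (value << amount) & MASK, carry_out


def shift_uop(flags, value, amount):
    """Model the shift microinstruction: write T3 and park the carry in PSC."""
    t3, carry_out = lsl(value, amount)
    out = dict(flags)        # old N, Z, C, V are written back unchanged
    out["psc"] = carry_out   # pre-shift carry held in the PSC bit 906
    out["use"] = 1           # assumption: USE marks PSC as the carry to commit later
    return t3, out


def carry_set_condition(flags):
    """ARM CS condition, evaluated on the architectural C flag."""
    return flags["c"] == 1
```

Because the architectural C flag is preserved, a following conditional ALU microinstruction conditioned on CS still sees the carry as it was before the shift, while the carry produced by the shift remains available in PSC.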

Referring now to FIG. 16 (including FIG. 16A and FIG. 16B), a block diagram illustrating the execution of a conditional ALU instruction 124 by the execution pipeline 112 of FIG. 1 according to the present invention is shown. Specifically, the conditional ALU instruction 124 is a flag-updating, pre-shifting, carry-updating conditional ALU operation ISA instruction 124, and the hardware instruction translator 104 translates the instruction 124 into the microinstructions 126 of step 1054 of FIG. 10. The operation of FIG. 16 is similar in many respects to that of FIG. 15; like operations are not repeated here, and only the differences are described. The register allocation table 402 of FIG. 4 generates dependency information for the CUALUOP microinstruction 126 on the value of the temporary register T3 written by the shift microinstruction 126; however, because the shift microinstruction 126 does not write the condition flags register 926, the register allocation table 402 does not generate dependency information on the condition flags register 926 with respect to the shift microinstruction 126.

The execution unit 424 receives the shift microinstruction 126 generated in step 1054 from the reservation station 406 and the source operand values from registers R2 and R3 of the register file 106, according to step 1102, but does not receive the condition flags 924 (unless the ALU operation is a carry-using operation). The execution unit 424 performs the shift operation on registers R2 and R3 (and on the received C flag 902 if the ALU operation is a carry-using operation) to generate a result, which is written to the temporary register T3, according to step 1104. The value of the temporary register T3 is provided on the forwarding bus 128 for use by the CUALUOP microinstruction 126, is written to the entry of the reorder buffer 422 for use by the CUALUOP microinstruction 126 if it does not obtain the value from the forwarding bus 128, and is eventually retired to its appropriate state, barring an exception, branch misprediction, or other invalidating event, for use by the CUALUOP microinstruction 126 if it obtains the value from neither the forwarding bus 128 nor the reorder buffer 422.

The execution unit 424 receives the CUALUOP microinstruction 126 generated in step 1054 from the reservation station 406, the source operand values from register R1 and temporary register T3 of the register file 106, and the condition flags 924 from the condition flags register 926, according to step 1202. The execution unit 424 performs the ALU operation on register R1 and temporary register T3 (and on the received C flag 902 if the ALU operation is a carry-using operation) to generate a result, which is written to the temporary register T2, according to step 1204. Further: (1) if the architectural condition flags 902 do not satisfy the specified condition (labeled NOT SATISFIED in FIG. 16), the execution unit 424 generates new condition flag 928 values to write to the condition flags register 926, according to block 1216; and (2) if the architectural condition flags 902 satisfy the specified condition (labeled SATISFIED in FIG. 16), the execution unit 424 generates new condition flag 928 values to write to the condition flags register 926, according to step 1222 of FIG. 12. The CMOV microinstruction 126 of FIG. 16 executes similarly to that described with respect to FIG. 14.

Referring now to FIG. 17, a block diagram illustrating operation of the execution pipeline 112 of FIG. 1 to execute a conditional ALU instruction 124 according to the present invention is shown. Specifically, the conditional ALU instruction 124 is a non-flag-updating, pre-shifting, carry-using conditional ALU operation ISA instruction 124, and the hardware instruction translator 104 translates the instruction 124 into the microinstructions 126 of step 1036 of FIG. 10. The operation of FIG. 17 is similar in many respects to that of FIG. 16; like operations are not repeated here, and only the differences are described. The shift microinstruction 126 of FIG. 17 executes similarly to that described with respect to FIG. 16.

The execution unit 424 receives the ALUOPUC microinstruction 126 generated in step 1036 from the reservation station 406, the source operand values from register R1 and temporary register T3 of the register file 106, and the condition flags 924 from the condition flags register 926, according to step 1202. Because the ALU operation is a carry-using operation, the execution unit 424 performs the ALU operation on register R1, temporary register T3, and the received C flag 902 to generate a result, which is written to the temporary register T2, according to step 1204. The execution unit 424 does not write to the condition flags register 926.

The execution unit 424 receives the XMOV microinstruction 126 generated in step 1036, its source operand values from the temporary register T2 and the destination register RD, and the condition flags 924, according to step 1302 of FIG. 13. According to steps 1316 and 1314 of FIG. 13, the execution unit 424 outputs the value of the temporary register T2 source operand as its result when the condition flags 924 satisfy the specified condition, and outputs the value of the destination register RD source operand as its result when the condition flags 924 do not satisfy the specified condition. The result value is provided on the forwarding bus 128 for use by subsequent microinstructions 126, is written to the entry of the reorder buffer 422, and is eventually retired to its appropriate architectural state, barring an exception, branch misprediction, or other invalidating event.

Referring now to FIG. 18, a block diagram illustrating the execution of a conditional ALU instruction 124 by the execution pipeline 112 of FIG. 1 according to the present invention is shown. Specifically, the conditional ALU instruction 124 is a non-flag-updating, pre-shifting, non-carry-using conditional ALU operation ISA instruction 124, and the hardware instruction translator 104 translates the instruction 124 into the microinstructions 126 of step 1034 of FIG. 10. The operation of FIG. 18 is similar in many respects to that of FIG. 17; like operations are not repeated here, and only the differences are described. The shift microinstruction 126 of FIG. 18 executes similarly to that described with respect to FIG. 16. The execution of the ALUOP microinstruction 126 of FIG. 18 is similar to the execution of the ALUOPUC microinstruction 126 of FIG. 17, except that the ALUOP microinstruction 126 of FIG. 18 does not use the C flag 902 to generate its result. The execution of the XMOV microinstruction 126 of FIG. 18 is similar to the execution of the XMOV microinstruction 126 of FIG. 17.

Referring now to FIG. 19, a block diagram illustrating the execution of a conditional ALU instruction 124 by the execution pipeline 112 of FIG. 1 according to the present invention is shown. Specifically, the conditional ALU instruction 124 is a non-flag-updating, non-pre-shifting, carry-using conditional ALU operation ISA instruction 124, and the hardware instruction translator 104 translates the instruction 124 into the microinstructions 126 of step 1026 of FIG. 10. The operation of FIG. 19 is similar in many respects to that of FIG. 17; like operations are not repeated here, and only the differences are described. Because the conditional ALU instruction 124 is a non-pre-shifting instruction, its translation does not include a shift microinstruction 126.

The execution unit 424 receives the ALUOPUC microinstruction 126 generated in step 1026 from the reservation station 406, the source operand values from registers R1 and R2 of the register file 106, and the condition flags 924 from the condition flags register 926, according to step 1202. Because the ALU operation is a carry-using operation, the execution unit 424 performs the ALU operation on registers R1 and R2 and the received C flag 902 to generate a result, which is written to the temporary register T2, according to step 1204. The execution unit 424 does not write to the condition flags register 926. The execution of the XMOV microinstruction 126 of FIG. 19 is similar to the execution of the XMOV microinstruction 126 of FIG. 17.

Referring now to FIG. 20, a block diagram illustrating the execution of a conditional ALU instruction 124 by the execution pipeline 112 of FIG. 1 according to the present invention is shown. Specifically, the conditional ALU instruction 124 is a non-flag-updating, non-pre-shifting, non-carry-using conditional ALU operation ISA instruction 124, and the hardware instruction translator 104 translates the instruction 124 into the microinstructions 126 of step 1024 of FIG. 10. The operation of FIG. 20 is similar in many respects to that of FIG. 19; like operations are not repeated here, and only the differences are described. The execution of the ALUOP microinstruction 126 of FIG. 20 is similar to the execution of the ALUOPUC microinstruction 126 of FIG. 19, except that the ALUOP microinstruction 126 of FIG. 20 does not use the C flag 902 to generate its result. The execution of the XMOV microinstruction 126 of FIG. 20 is similar to the execution of the XMOV microinstruction 126 of FIG. 17.

As can be seen from the foregoing, embodiments of the present invention avoid the disadvantages that would result from allowing the microinstructions 126 to specify an additional source operand. These disadvantages include the following. First, an additional read port would be required in the general register file for each execution unit 424 that executes microinstructions 126 with the additional source operand. Second, an additional read port would be required in the reorder buffer 422 for each execution unit 424 that executes microinstructions 126 with the additional source operand. Third, more wires would be required on the forwarding bus 128 for each execution unit 424 that executes microinstructions 126 with the additional source operand. Fourth, an additional relatively large multiplexer would be required for each execution unit 424 that executes microinstructions 126 with the additional source operand. Fifth, Q additional tag comparators would be required, where:

$$Q = \sum_{i=1}^{n} \left( R[i] \times P[i] \times J[i] \right)$$

where n is the number of execution units 424, R[i] is the number of reservation station 406 entries for the i-th execution unit 424, P[i] is the maximum number of source operands that can be specified by a microinstruction executed by the i-th execution unit 424, and J[i] is the number of execution units 424 that can forward to the i-th execution unit 424. Sixth, additional rename lookups would be required in the register allocation table 402 for the additional source operand. Seventh, the reservation stations 406 would need to be expanded to handle the additional source operand. These additional costs in speed, power, and area are undesirable.
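The comparator count can be worked through numerically. The following Python sketch evaluates the sum for a purely hypothetical configuration (three execution units, each with 12 reservation station entries and three forwarding sources); these numbers are illustrative assumptions and are not taken from any embodiment.

```python
def tag_comparators(units):
    """Sum R[i] * P[i] * J[i] over execution units given as (R, P, J) tuples."""
    return sum(r * p * j for r, p, j in units)


# Hypothetical configuration: three execution units, 12 reservation station
# entries each, three forwarding sources each.
two_operand = [(12, 2, 3)] * 3    # microinstructions limited to two source operands
three_operand = [(12, 3, 3)] * 3  # the same units with one additional source operand

extra = tag_comparators(three_operand) - tag_comparators(two_operand)
```

Under these assumed numbers the two-operand design needs 216 comparators and the three-operand design 324, so the additional operand costs 108 more, which is R[i] times J[i] summed over the units.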

Same-Source-Destination Optimization Embodiments

Referring now to FIG. 21, a flowchart illustrating operation of the hardware instruction translator 104 of FIG. 1 to translate a conditional ALU instruction 124 according to the present invention is shown. The operation of the hardware instruction translator 104 according to FIG. 21 is similar in many respects to its operation according to FIG. 10, particularly with respect to the various decision steps, and like steps are therefore given the same reference numbers.

Referring to FIG. 21, step 1002 of FIG. 10 is replaced with step 2102. In step 2102, the conditional ALU instruction 124 encountered by the hardware instruction translator 104 differs from that encountered in step 1002 in that it specifies one of its source registers as its destination register. The hardware instruction translator 104 is configured to recognize this condition and to optimize the microinstructions 126 that it emits. In particular, the hardware instruction translator 104 decodes and translates the same-source-destination conditional ALU instruction 124 into a sequence of microinstructions 126 that differs from those described in steps 1024, 1026, 1034, 1036, 1044, 1054, and 1056 (steps 10XX) of FIG. 10. The different microinstruction 126 sequences are described in FIG. 21 at steps 2124, 2126, 2134, 2136, 2144, 2154, and 2156 (steps 21XX), which replace their corresponding steps 10XX. In particular, the microinstruction 126 sequence of each step 21XX has one fewer microinstruction 126 than the microinstruction 126 sequence of its corresponding step 10XX. Specifically, the sequences of steps 21XX do not include a CMOV or XMOV microinstruction 126; instead, the selective writing of either the original destination register value or the result value is performed by the conditional ALU microinstruction 126 at the end of the sequence. This is described in more detail below.
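The relationship between the steps 10XX sequences and the steps 21XX sequences may be sketched as a small selection function in Python. This is an illustrative model only: the function name and the single `carry` parameter are hypothetical, the mnemonic strings are labels rather than actual encodings, and the "CC" labels used for the flag-updating pre-shift cases are assumed by analogy with the other cases.

```python
def translate(flag_update, pre_shift, carry, same_src_dest):
    """Return the microinstruction sequence for a conditional ALU instruction.

    `carry` means carry-using for non-flag-updating forms and
    carry-updating for flag-updating forms.
    """
    seq = ["SHIFT"] if pre_shift else []
    if not flag_update:
        alu, move = ("ALUOPUC" if carry else "ALUOP"), "XMOV"
    elif not pre_shift:
        alu, move = "ALUOP", "CMOV"
    else:
        alu, move = ("CUALUOP" if carry else "NCUALUOP"), "CMOV"
    if same_src_dest:
        # Steps 21XX: the final ALU microinstruction itself selects between the
        # old destination value and the result, so no conditional move is emitted.
        seq.append(alu + " CC")
    else:
        # Steps 10XX: the ALU result goes to a temporary register and a
        # conditional move selects the value written to the destination.
        seq += [alu, move]
    return seq
```

For every combination of inputs, the same-source-destination sequence is one microinstruction shorter than its general counterpart.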

In step 2124, the hardware instruction translator 104 translates the same-source-destination, non-flag-updating, non-pre-shifting, non-carry-using conditional ALU instruction 124 into a single microinstruction 126, namely a conditional ALU operation microinstruction 126 (denoted ALUOP CC). In the example of step 2124, the conditional ALU instruction 124 is similar to that described with respect to step 1024, except that the first source operand is the destination register (RD). Thus, the conditional ALU instruction 124 specifies a first source register (RD) and a second source register (R2), an ALU operation (denoted ALUOP) to be performed on the first source register RD and the second source register R2 to generate a result, and a destination register (RD), which is the same as the first source register and to which the result is conditionally written. The conditional ALUOP microinstruction 126 specifies the same ALU operation and condition as the conditional ALU instruction 124. The execution unit 424 that executes the conditional ALUOP microinstruction 126 receives the old, or current, value of the destination register (RD) and the value of the second source operand R2, according to block 1202, and performs the ALU operation on the two source operands to generate a result, according to block 1204. The execution unit 424 also receives the condition flags 924 and examines them to determine whether they satisfy the specified condition, according to step 1204. If so, the execution unit 424 outputs the result according to step 1211; otherwise, it outputs the old destination register value according to step 1212. The execution of the conditional ALUOP microinstruction 126 is shown in block diagram form in FIG. 28. The flow ends at step 2124.

In step 2126, the hardware instruction translator 104 translates the same-source-destination, non-flag-updating, non-pre-shifting, carry-using conditional ALU instruction 124 into a single microinstruction 126, namely a carry-using conditional ALU operation microinstruction 126 (denoted ALUOPUC CC). In the example of block 2126, the conditional ALU instruction 124 is similar to that described in block 2124, except that it specifies an ALU operation that uses the carry flag, and is also similar to that described in block 1026, except that the first source operand is the destination register (RD). The conditional ALUOPUC microinstruction 126 is similar to the microinstruction described in step 2124; however, the ALU operation it specifies uses the carry flag. The execution of the conditional ALUOPUC microinstruction 126, shown in block diagram form in FIG. 27, is similar to the execution of the conditional ALUOP microinstruction 126 of step 2124, except that the execution unit 424 uses the carry flag to perform the ALU operation. The flow ends at step 2126.
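The single-microinstruction forms of steps 2124 and 2126 can be modeled in a few lines of Python. This is a sketch under assumptions: the ALU operations are represented by a 32-bit ADD (non-carry-using) and ADC (carry-using), chosen here only as familiar examples, and the function name is hypothetical.

```python
MASK = 0xFFFFFFFF


def aluop_cc(op, rd_old, r2, carry_in, cond_ok):
    """Model ALUOP CC / ALUOPUC CC: RD is both first source and destination."""
    if op == "ADD":      # non-carry-using example (step 2124)
        result = (rd_old + r2) & MASK
    elif op == "ADC":    # carry-using example (step 2126)
        result = (rd_old + r2 + (carry_in & 1)) & MASK
    else:
        raise ValueError(f"unsupported example operation: {op}")
    # The old RD value is already a source operand, so the microinstruction
    # can output it directly when the condition is not satisfied.
    return result if cond_ok else rd_old
```

Because the old destination value arrives as the first source operand, no third source operand and no separate conditional move are needed.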

In block 2134, the hardware instruction translator 104 translates the same source non-flag-update, pre-shift, non-carry-used conditional ALU instruction 124 into first and second microinstructions 126, i.e.: (1) a shift microinstruction 126; and (2) an ALUOP micro instruction 126. In the example of block 2134, the conditional ALU instruction 124 is similar to that described in block 1034, except that the Register (RD) for the first source operand family is also present, and, in addition, the instruction is similar to that described in block 2124, except that the conditional ALU instruction 124 also specifies a pre-shift operation on the second source operand (R2) with a shift amount, which is stored in the third source register (R3) specified by the conditional ALU instruction 124 in the example of block 2134. However, if the conditional ALU instruction 124 is of a type that specifies the shift amount as a constant within the instruction 124, the third source register is not used. The shift microinstruction 126 is similar to that described in connection with block 1034, and the execution unit 424 executes the shift microinstruction 126 in a manner similar to that described in connection with block 1034 and FIG. 18. Although the carry flag value generated by the shift microinstruction 126 is not used in block 2134 because the conditional ALU instruction 124 indicates that the architectural condition flags 902 are not to be updated, the carry flag value generated by the shift microinstruction 126 is used in block 2156. In addition, the pre-shift operation requires the rotation of the old carry flag to the shifted result value; for example, the RRX pre-shift operation shifts the carry flag to the most significant bit of the result. In this case, although not shown in FIG. 21 (except for block 2156), when the execution unit 424 executes the shift microinstruction 126, it also reads the condition flags 924 to obtain the current carry flag value. 
The conditional ALUOP microinstruction 126 and its execution are similar to those described in block 2124; however, the microinstruction receives the value of the temporary register T3 instead of the value of the register R2, and performs the ALU operation on the register R1 and the temporary register T3 to generate the result to be written to the destination register. The execution of the shift microinstruction 126 and the conditional ALUOP microinstruction 126 is shown in FIG. 26. Flow ends at block 2134.
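
The two-microinstruction sequence of block 2134 can be sketched as follows. This is only an illustrative model, assuming 32-bit registers, an RRX pre-shift, and an AND as the ALU operation; the names `rrx` and `cond_aluop` are hypothetical and do not appear in the described embodiments.

```python
MASK32 = 0xFFFFFFFF

def rrx(value, carry_in):
    """Rotate right with extend: the old carry becomes bit 31, bit 0 is the carry out."""
    result = ((carry_in & 1) << 31) | ((value & MASK32) >> 1)
    carry_out = value & 1
    return result, carry_out

def cond_aluop(op, src1, t3, condition_met):
    """Conditional ALUOP: return op(src1, T3) if the condition holds, else the old value."""
    return op(src1, t3) & MASK32 if condition_met else src1

# Microinstruction 1: shift R2 into temporary register T3 (reads the current carry flag).
t3, shift_carry = rrx(0x00000003, carry_in=1)
# Microinstruction 2: conditional ALU operation on the first source and T3 (assumed AND).
rd = cond_aluop(lambda a, b: a & b, 0xFFFF0000, t3, condition_met=True)
```

When the condition is not met, `cond_aluop` simply returns the old first-source value, which in the same-source-destination case is the old destination register value.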

In block 2136, the hardware instruction translator 104 translates the same-source-destination, non-flag-updating, pre-shifting, carry-using conditional ALU instruction 124 into first and second microinstructions 126, namely: (1) a shift microinstruction 126; and (2) a carry-using conditional ALU operation microinstruction 126 (denoted ALUOPUC CC). In the example of block 2136, the conditional ALU instruction 124 is similar to that described with respect to block 2134, except that the ALU operation it specifies uses the carry flag, and is similar to that described with respect to block 1036, except that the first source operand is the destination register (RD). The two microinstructions 126 and their execution are similar to those described in block 2134; however, the ALUOPUC microinstruction 126 also receives the condition flags 924 to obtain the current value of the carry flag for the carry-using ALU operation. Execution of the shift microinstruction 126 and the conditional ALUOPUC microinstruction 126, shown in FIG. 25, is similar to the execution of the shift microinstruction 126 and the conditional ALUOP microinstruction 126 of block 2134, except that the execution unit 424 uses the carry flag to perform the ALU operation. Flow ends at block 2136.

In block 2144, the hardware instruction translator 104 translates the same-source-destination, flag-updating, non-pre-shifting conditional ALU instruction 124 into a single microinstruction 126, namely a conditional ALU operation microinstruction 126 (denoted ALUOP CC). In the example of block 2144, the conditional ALU instruction 124 is similar to the conditional ALU instruction 124 of block 2124, except that it updates the architectural condition flags 902, and is similar to that described in block 1044, except that the first source operand is the destination register (RD). The conditional ALU operation microinstruction 126 of block 2144 and its execution are similar to those described in block 2124, except that the ALU operation microinstruction 126 of block 2144 also updates the architectural condition flags 902, and are similar to the conditional ALU microinstruction 126 of block 1044, except that its first operand is the destination register (RD) rather than register R1 and its destination register is RD rather than temporary register T2. The execution unit 424 executing the conditional ALU microinstruction 126 receives the destination register RD and the register R2 as source operands, per block 1202, and performs the specified ALU operation on the two source operands to generate a result, per block 1204. The execution unit 424 also receives the architectural condition flags 902 and determines whether they satisfy the specified condition, per block 1206. If so, the execution unit 424 outputs the result of the ALU operation to be written to the destination register RD, per block 1222 or 1226 depending on whether the ALU operation is a carry-updating operation; otherwise, it outputs the old value of the destination register RD, per block 1216. In addition, the execution unit 424 writes the condition flags register 926, per block 1216, 1222, or 1226, depending on whether the condition is satisfied and whether the ALU operation is a carry-updating operation.
If the condition is not satisfied, the execution unit 424 writes the old condition flag values to the architectural condition flags 902, per block 1216; otherwise, if the condition is satisfied, the execution unit 424 updates the architectural condition flags 902 based on the result of the ALU operation, per block 1222 if the ALU operation is a carry-updating operation and per block 1226 if it is a non-carry-updating operation. The execution of the conditional ALUOP microinstruction 126 is shown in FIG. 22. It is noted that the ALU operation performed by the conditional ALU operation microinstruction 126 generated in block 2144 (and in blocks 1054 and 1056) may be an ALU operation that uses the carry flag (similar to those described in blocks 1026 and 1036), and since the microinstruction 126 reads the flags (as indicated by RDFLAGS), the execution unit 424 has the carry flag available to perform the carry-using ALU operation. Flow ends at block 2144.

In block 2154, the hardware instruction translator 104 translates the same-source-destination, flag-updating, pre-shifting, carry-updating conditional ALU instruction 124 into first and second microinstructions 126, namely: (1) a shift microinstruction 126; and (2) a conditional carry-updating ALU operation microinstruction 126 (denoted CU ALUOP CC). In the example of block 2154, the conditional ALU instruction 124 is similar to that described with respect to block 2134, except that the conditional ALU instruction 124 also specifies that the architectural condition flags 902 are to be updated, and is similar to that described with respect to block 1054, except that the first source operand is the destination register (RD). The shift microinstruction 126 is similar to that described in connection with block 1034, and the execution unit 424 executes the shift microinstruction 126 in a manner similar to that described in connection with block 1034 and FIG. 18. The CU ALUOP microinstruction 126 and its execution are similar to the conditional ALU microinstruction 126 of block 2124, except that the CU ALUOP microinstruction 126 of block 2154 also updates the architectural condition flags 902, and are similar to the conditional ALU microinstruction 126 of block 1054, except that its first operand is the destination register (RD) rather than register R1, and its destination register is RD rather than temporary register T2. The execution unit 424 executing the CU ALUOP microinstruction 126 receives the destination register RD and the temporary register T3 as source operands, per block 1202, and performs the specified ALU operation on the destination register RD and the temporary register T3 to generate a result, per block 1204. In addition, the execution unit 424 receives the architectural condition flags 902, per block 1202, and determines whether they satisfy the specified condition, per block 1206.
In addition, the execution unit 424 updates the condition flags register 926 per block 1216 or 1222, depending on whether the condition is satisfied. If the condition is not satisfied, the execution unit 424 writes the old condition flag values to the architectural condition flags 902; otherwise, if the condition is satisfied, the execution unit 424 updates the architectural condition flags 902 based on the result of the ALU operation. The execution of the shift microinstruction 126 and the conditional ALUOP microinstruction 126 is shown in FIG. 24. Flow ends at block 2154.

In block 2156, the hardware instruction translator 104 translates the same-source-destination, flag-updating, pre-shifting, non-carry-updating conditional ALU instruction 124 into first and second microinstructions 126, namely: (1) a shift microinstruction 126; and (2) a conditional non-carry-updating ALU operation microinstruction 126 (denoted NCUALUOP CC). In the example of block 2156, the conditional ALU instruction 124 is similar to that described with respect to block 2154, except that the conditional ALU instruction 124 specifies a non-carry-updating ALU operation, and is similar to that described with respect to block 1056, except that the first source operand is the destination register (RD). Thus, when the condition is satisfied, the architectural carry flag 902 is updated with the pre-shift carry flag value. The shift microinstruction 126 is similar to that described in block 2134; however, the shift microinstruction 126 both reads and writes the condition flags register 926. Specifically, the execution unit 424 executing the shift microinstruction 126: (1) writes the carry flag value generated by the pre-shift operation to the PSC bit 906; (2) sets the USE bit 908 to indicate that the conditional NCUALUOP microinstruction 126 is to use the PSC bit 906 to update the architectural carry flag 902; and (3) writes the old architectural condition flags 902 back to the condition flags register 926, per block 1114, so that the NCUALUOP microinstruction 126 can evaluate the old value of the architectural condition flags 902 to determine whether it satisfies the specified condition. The NCUALUOP microinstruction 126 specifies the same condition as the conditional ALU instruction 124. The execution unit 424 executing the NCUALUOP microinstruction 126 performs the ALU operation on the destination register RD and the temporary register T3 to generate a result, per block 1204.
In addition, the execution unit 424 receives the architectural condition flags 902 and determines whether they satisfy the condition, per block 1206. The execution unit 424 also writes the condition flags register 926 per block 1216, 1226, or 1228, depending on whether the condition is satisfied and whether the USE bit 908 is set. Specifically, if the condition is not satisfied, the execution unit 424 writes the old condition flag values to the architectural condition flags 902, per block 1216; if the condition is satisfied, the execution unit 424 updates the architectural condition flags 902 based on the result of the ALU operation, per block 1226 or 1228 depending on whether the USE bit 908 is set. Specifically, the architectural overflow (V) flag 902 is written with the old overflow flag value 924, and the N and Z flags are written with new values generated from the result. The architectural carry flag 902 is updated with the pre-shift carry flag value held in the PSC bit 906, per block 1228, if the USE bit 908 so indicates, and otherwise with the old carry flag value 924, per block 1226. The execution of the shift microinstruction 126 and the NCUALUOP microinstruction 126 is shown in FIG. 23. Flow ends at block 2156.
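
The flag handling of the NCUALUOP microinstruction described for blocks 1216, 1226, and 1228 can be sketched as follows. This is a simplified illustration, assuming 32-bit results; the `Flags` tuple, the function name `ncu_aluop`, and passing the PSC and USE bits as plain arguments are hypothetical modeling choices, not the described hardware.

```python
from typing import NamedTuple

class Flags(NamedTuple):
    n: int  # negative
    z: int  # zero
    c: int  # carry
    v: int  # overflow

def ncu_aluop(op, rd_old, t3, flags, psc, use, condition_met):
    """Return (new RD value, new flags) for a conditional non-carry-updating ALU op."""
    if not condition_met:
        # Block 1216: keep the old destination value and the old flag values.
        return rd_old, flags
    result = op(rd_old, t3) & 0xFFFFFFFF
    # Block 1228 if USE is set (carry comes from PSC), else block 1226 (old carry kept).
    carry = psc if use else flags.c
    # N and Z come from the result; V keeps its old value.
    return result, Flags(n=result >> 31, z=int(result == 0), c=carry, v=flags.v)

old = Flags(n=0, z=0, c=0, v=1)
rd, new_flags = ncu_aluop(lambda a, b: a & b, 0xF0, 0x0F, old, psc=1, use=1, condition_met=True)
```

In this run the AND result is zero, so Z is set, the carry is taken from the PSC value, and V is unchanged.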

An advantage of this approach is that the hardware instruction translator 104 can reduce the sequence of microinstructions 126 it generates by one microinstruction 126 when the conditional ALU instruction 124 specifies a destination register that is the same as one of its source registers. First, this may increase the lookahead capability of the microprocessor 100, which exploits the instruction-level parallelism of the program being executed to increase the utilization of the execution units 424. The lookahead capability may be enhanced because the reduced number of microinstructions 126 leaves more free slots in the reorder buffer 422 for additional microinstructions 126, thereby creating a larger pool of microinstructions 126 ready to be issued for execution. Second, the hardware instruction translator 104 can output microinstructions 126 to only a predetermined number of slots per clock cycle and, in at least one embodiment, must output all of the microinstructions 126 required to implement a given ISA instruction 124 in the same clock cycle. Reducing the number of microinstructions 126 generated for a conditional ALU instruction 124 therefore also reduces the average number of empty microinstruction 126 slots per clock cycle, which likewise tends to increase the lookahead capability of the microprocessor 100 and the utilization of the execution units 424.

Conditional non-branch instruction prediction

The above embodiments describe techniques for translating a conditional non-branch instruction, referred to herein as a conditional ALU instruction, into microinstructions in a read-port-limited pipelined microprocessor. The first microinstruction performs an ALU operation and writes the result to a temporary register. The second microinstruction receives the result from the temporary register and the current value of the destination register, and writes the result to the destination register if the condition is satisfied or writes the current value back to the destination register if the condition is not satisfied. Similarly, the embodiments described in U.S. provisional application 61/473,062 translate a conditional non-branch instruction, referred to herein as a conditional load instruction, into microinstructions in a read-port-limited pipelined microprocessor. The instruction translator translates the conditional load instruction into two microinstructions: (1) a load microinstruction that receives the condition code and flags and, if the condition is not satisfied, does not update architectural state (e.g., it produces no memory-write side effect from a page table walk and generates no exception) and loads a dummy value into the temporary register, but, if the condition is satisfied, loads the actual value from memory into the temporary register; and (2) a conditional move microinstruction that receives the current value of the destination register and moves that current value back to the destination register if the condition is not satisfied, or moves the value from the temporary register to the destination register if the condition is satisfied.

Although this solution is an improvement over conventional techniques, it incurs additional costs. First, there is the delay associated with the second microinstruction, which depends on the first microinstruction. Second, the second microinstruction occupies instruction slots in other structures of the microprocessor, such as the microinstruction queue, reorder buffer, reservation stations, and execution units. In addition, the presence of the second microinstruction reduces the average number of instructions per clock cycle that are emitted by the instruction translator, issued by the instruction issue unit, and retired by the instruction retirement unit, thereby limiting the throughput of the microprocessor.

The present invention provides a more efficient solution that incorporates a prediction mechanism, similar to branch prediction, to predict the direction of a conditional non-branch instruction, i.e., to predict whether the condition will be satisfied and therefore whether the conditional non-branch instruction needs to be executed. This solution allows the instruction translator to issue a single microinstruction, rather than multiple microinstructions, based on the prediction. The microprocessor also has a mechanism to recover from mispredictions.

Embodiments with both static and dynamic prediction mechanisms are described below. The static prediction mechanism is similar to static branch prediction. The dynamic (or history-based) prediction mechanism examines the program counter/instruction pointer value of a conditional non-branch instruction as it is fetched from the instruction cache, similar to a Branch Target Address Cache (BTAC).

In the static prediction mechanism, the static predictor examines the operation and/or the condition code specified by the conditional non-branch instruction (e.g., the ALU operation is ADD and the condition code is EQUAL) and predicts, based on profiling data, whether the operation will be executed. For example, if empirical data show that a conditional non-branch instruction with a given combination of operation and condition code is executed a sufficiently large percentage of the time, the static predictor predicts that the instruction will be executed, and the instruction translator issues a single unconditional microinstruction, such as:

addcc dst,src1,src2

The condition code and flags are provided to the microinstruction (i.e., addcc) so that the execution unit can determine whether the prediction was correct and generate a misprediction indication if it was not.

Conversely, if the combination of operation and condition code and the empirical data indicate that the conditional non-branch instruction is not executed a sufficiently large percentage of the time, the static predictor predicts that the instruction will not be executed, and the instruction translator issues a single no-operation (nop) microinstruction, such as:

nopcc

Similarly, the condition code and flags are provided to the microinstruction (i.e., nopcc) so that it can generate a misprediction indication if necessary.

If the executed/not-executed ratio is not large enough to justify a static prediction, the instruction translator reverts to the less efficient multiple-microinstruction solution described above; for example, the translator issues two microinstructions:

add tmp,src1,src2

movcc dst,src-dst,tmp // src-dst is the current value of the dst register
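
The three translation outcomes above can be summarized in a short sketch. It assumes a conditional ADD and uses the E/NE/NP prediction values as plain strings; the function name `translate_conditional_add` and the string form of the microinstructions are hypothetical illustrations only.

```python
def translate_conditional_add(prediction, dst, src1, src2, tmp="tmp"):
    """Return the microinstruction sequence chosen for a conditional ADD."""
    if prediction == "E":
        # Predicted executed: one ALU microinstruction that still carries the condition code.
        return [f"addcc {dst},{src1},{src2}"]
    if prediction == "NE":
        # Predicted not executed: one no-op that still carries the condition code.
        return ["nopcc"]
    # No prediction: fall back to the two-microinstruction sequence.
    return [f"add {tmp},{src1},{src2}", f"movcc {dst},{dst},{tmp}"]

sequence = translate_conditional_add("NP", "r0", "r1", "r2")
```

In the fallback case the second operand of `movcc` is the current value of the destination register, which is why `dst` appears twice.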

In the dynamic prediction mechanism, a BTAC-like structure, referred to herein as a Conditional ALU Direction Cache (CADC), stores the history of the directions of previously executed conditional non-branch instructions together with their program counter/instruction pointer values, and uses the history in the CADC entry hit by a fetch address value to predict the direction of a subsequently fetched conditional non-branch instruction. The CADC provides its prediction to the instruction translator. The instruction translator issues microinstructions based on the prediction in the same manner as described above for the static predictor.

The recovery mechanism flushes from the pipeline the conditional non-branch instruction and all subsequent instructions (more precisely, the microinstructions translated from them), or at least all instructions that depend directly or indirectly on the conditional non-branch instruction, and then replays all of the flushed instructions. When the conditional non-branch instruction is replayed, the translator preferably issues the multiple-microinstruction sequence.

One embodiment of the present invention uses both static and dynamic predictors and keeps a history, for each program counter/instruction pointer value, of which predictor has been more accurate. As in conventional two-level hybrid branch prediction, this history is used to dynamically select one of the two predictors to provide the final prediction.

It is noted that mispredictions of conditional non-branch instructions incur a penalty (i.e., flushing the pipeline and replaying the conditional non-branch instruction and the instructions that follow it, or at least those that depend on it directly or indirectly), which varies as a function of the application code and/or data set. Therefore, predicting conditional non-branch instructions may be the less efficient solution for certain mixes of application code and/or data sets.

A non-branch instruction is defined herein as an instruction that does not write the program counter of the microprocessor, so that the microprocessor fetches and executes the instruction that follows the non-branch instruction. Program counter is the term used in the ARM architecture; other architectures use different terms. For example, the x86 ISA uses an instruction pointer, while other ISAs use an instruction address register. A non-branch instruction is thus distinguished from a branch instruction, which writes an address to the program counter/instruction pointer to redirect the microprocessor to that address. The microprocessor then begins fetching instructions at the address written to the program counter/instruction pointer by the branch instruction and executes the fetched instructions. This differs from fetching and executing the instructions that follow the branch instruction, which is the default operation of the microprocessor and is the operation performed when a non-branch instruction is encountered. Examples of conditional non-branch instructions include conditional ALU instructions and conditional load/store instructions.

Referring now to FIG. 29, a block diagram illustrating a microprocessor 100 according to the present invention that makes predictions for conditional non-branch instructions is shown. The microprocessor 100 of FIG. 29 is similar to the microprocessor 100 of FIG. 1, and includes elements similar to those of FIGS. 1 and 4, namely an instruction cache 102, an instruction translator 104, an allocation register 122, a register allocation table 402, an instruction issue unit 408, an execution unit 424, and a reorder buffer 422. The execution units 424 include one or more units that execute the microinstructions 126 described herein. Additionally, the execution unit 424 executes the no-operation microinstruction 126. The no-operation microinstruction 126 directs the execution unit 424 to perform no operation. Further, the no-operation microinstruction 126, as referred to herein, includes the condition, or condition code, specified by the conditional ALU instruction 124 from which the no-operation microinstruction 126 was translated. The no-operation microinstruction 126 is described in further detail below. The microprocessor 100 also includes architectural registers, temporary registers 106, and the flags 926 of FIG. 9.

The microprocessor 100 of FIG. 29 also includes a dynamic predictor 2932, a static predictor 2936, and a predictor selector 2934. These elements are coupled to the instruction translator 104 and are used to predict the direction (whether executed or not) of a conditional ALU instruction 124 (of fig. 2). The fetch address 134 of FIG. 1 is also provided to the dynamic predictor 2932 and the predictor selector 2934.

The dynamic predictor 2932 and the predictor selector 2934 each include a cache having a plurality of entries. Each entry caches the memory address of a previously executed ARM conditional ALU instruction 124. That is, when the microprocessor 100 retires a conditional ALU instruction 124, the dynamic predictor 2932 and the predictor selector 2934 are examined to determine whether they contain an entry having the address of the conditional ALU instruction 124. If so, the entry is updated according to the correct direction of the conditional ALU instruction 124 indicated by a history update indicator 2974; if not, an entry is allocated for the conditional ALU instruction 124 in each of the dynamic predictor 2932 and the predictor selector 2934. Although the dynamic predictor 2932 and the predictor selector 2934 are shown as separate elements in FIG. 29, in one embodiment these two elements are integrated into a single cache array. That is, each entry of the single array includes both the direction prediction field of the dynamic predictor 2932 and the selector field of the predictor selector 2934, as described in further detail below.

Each entry of the dynamic predictor 2932 stores the address of a conditional ALU instruction 124 and has a field that stores the direction prediction for that conditional ALU instruction 124. The direction prediction is updated with the correct direction when the conditional ALU instruction 124 at that address retires. The direction prediction may take a variety of forms. For example, it may be a single bit indicating executed or not executed; the bit is set to one value if the correct direction was executed and to the other value if it was not executed. As another example, the direction prediction may be a multi-bit counter that is incremented, saturating at its maximum, when the correct direction is executed and decremented, saturating at its minimum, when it is not executed. A counter value greater than the midpoint predicts executed, and a value less than the midpoint predicts not executed.
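
The multi-bit counter form of the direction prediction can be sketched as below. This is a minimal illustration; the class name `DirectionCounter`, the 2-bit default width, and the initial value are assumptions and are not specified by the described embodiments.

```python
class DirectionCounter:
    """Saturating multi-bit counter used as a direction prediction."""

    def __init__(self, bits=2, initial=None):
        self.max_value = (1 << bits) - 1
        # Assumption: start just below the midpoint unless an initial value is given.
        self.value = self.max_value // 2 if initial is None else initial

    def update(self, executed):
        """Move toward the maximum on executed, toward zero on not executed."""
        if executed:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)

    def predict(self):
        """Above the midpoint predicts executed (E), below predicts not executed (NE)."""
        return "E" if self.value > self.max_value / 2 else "NE"

counter = DirectionCounter(bits=2)
counter.update(True)  # one retirement in the executed direction
```

Because the counter range has an even number of values, no value sits exactly on the midpoint, so every counter state maps to either E or NE.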

The fetch address 134 is provided to the dynamic predictor 2932 each time an instruction block is fetched from the instruction cache 102. The dynamic predictor 2932 examines the fetch address 134 to determine whether it matches a valid tag in its cache array, i.e., whether it hits or misses. If the fetch address 134 misses, the dynamic predictor 2932 outputs a value representing no prediction (NP) on its dynamic prediction output 2982. If the fetch address 134 hits, the dynamic predictor 2932 outputs a value on its dynamic prediction output 2982 indicating either an executed (E) direction or a not-executed (NE) direction, based on the direction prediction field value stored in the matching entry. In one embodiment, the dynamic predictor 2932 may output a value indicating no prediction (NP) on its dynamic prediction output 2982 even if the fetch address 134 hits, for example when the history indicates that the conditional ALU instruction 124 is nearly equally likely to be executed or not executed, i.e., that the condition is nearly equally likely to be satisfied or not satisfied. The dynamic prediction 2982 is provided to the instruction translator 104.

Each entry of the predictor selector 2934 likewise includes a field that stores a selector for the conditional ALU instruction 124 whose address is stored in the entry; the selector indicates whether the dynamic predictor 2932 or the static predictor 2936 is more likely to correctly predict the direction of the conditional ALU instruction 124. The selector is updated when the conditional ALU instruction 124 at that address retires, based on the correct direction and on information carried by the history update indicator 2974 (indicating the predictions made by the dynamic predictor 2932 and the static predictor 2936). The selector may take a variety of forms. For example, the selector may be a single bit that designates either the dynamic predictor 2932 or the static predictor 2936. This bit is set to one value when the dynamic predictor 2932 correctly predicts the direction and to the other value when the static predictor 2936 correctly predicts the direction; if both predict correctly, the previously selected predictor is retained. As another example, the selector may be a multi-bit counter that is incremented, saturating at its maximum, when the dynamic predictor 2932 correctly predicts the direction, decremented, saturating at its minimum, when the static predictor 2936 correctly predicts the direction, and left unchanged if both predict correctly. A counter value greater than the midpoint predicts that the dynamic predictor 2932 will predict the direction correctly, and a value less than the midpoint predicts that the static predictor 2936 will predict the direction correctly.
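
The multi-bit selector update can be sketched as follows. This is an illustrative model assuming a 2-bit counter; the function names are hypothetical, and leaving the counter unchanged when both predictors are wrong is an assumption, since the text only specifies the case where both are correct.

```python
def update_selector(counter, dynamic_correct, static_correct, max_value=3):
    """Saturating update: toward dynamic when only it was right, toward static when only it was."""
    if dynamic_correct and not static_correct:
        return min(counter + 1, max_value)
    if static_correct and not dynamic_correct:
        return max(counter - 1, 0)
    # Both correct (per the text) or both wrong (assumption): leave the counter unchanged.
    return counter

def select_predictor(counter, max_value=3):
    """Above the midpoint selects the dynamic predictor, below selects the static one."""
    return "dynamic" if counter > max_value / 2 else "static"

selector = update_selector(1, dynamic_correct=True, static_correct=False)
choice = select_predictor(selector)
```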

Each time an instruction block is fetched from the instruction cache 102, the fetch address 134 is provided to the predictor selector 2934, which examines the fetch address 134 to determine whether it matches a valid tag in its cache array, i.e., whether it hits or misses. If the fetch address 134 misses, the predictor selector 2934 outputs a value representing no prediction (NP) on its prediction select output 2984. If the fetch address 134 hits, the predictor selector 2934 outputs a value on its prediction select output 2984 indicating either the dynamic predictor 2932 (D) or the static predictor 2936, based on the selector field value stored in the matching entry. In one embodiment, the predictor selector 2934 may output a value indicating no prediction on its prediction select output 2984 even if the fetch address 134 hits, for example when the history indicates that neither the dynamic predictor 2932 nor the static predictor 2936 is likely to predict correctly. The prediction select 2984 is provided to the instruction translator 104.

The static predictor 2936 receives an instruction 124 fetched from the instruction cache 102 and analyzes the condition code of the instruction 124 and/or the particular ALU function it specifies to predict the direction of the conditional ALU instruction 124. The static predictor 2936 essentially comprises a lookup table that holds an E, NE, or NP indication for each possible condition code/ALU function combination. In one embodiment, the E, NE, and NP indications are configured within the static predictor 2936 based on empirical data from the execution of programs written for the ARM ISA. The static prediction 2986 is provided to the instruction translator 104. In one embodiment, the static predictor 2936 is integrated within the instruction translator 104.
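
The lookup-table form of the static predictor can be sketched as follows. The table entries here are placeholders chosen purely for illustration and are not empirical data; the names `STATIC_TABLE` and `static_predict`, and the choice to return NP for a combination missing from the table, are likewise assumptions.

```python
# Placeholder entries only: (ALU function, condition code) -> E, NE, or NP.
STATIC_TABLE = {
    ("ADD", "EQ"): "E",
    ("SUB", "NE"): "NE",
    ("AND", "GT"): "NP",
}

def static_predict(alu_function, condition_code):
    """Look up the static prediction; unknown combinations yield no prediction (assumption)."""
    return STATIC_TABLE.get((alu_function, condition_code), "NP")

prediction = static_predict("ADD", "EQ")
```

A real table would hold one entry per possible combination, filled in from profiling, so the default branch would not be needed.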

The instruction translator 104 uses the predictions 2982/2984/2986 to translate the conditional ALU instruction 124 into microinstructions 126, as described in more detail below with respect to FIGS. 30 and 31. The predictions 2982/2984/2986 are carried down the pipeline of the microprocessor 100 along with the conditional ALU instruction 124 for use by the execution unit 424 in determining whether the respective predictor 2932/2934/2936 correctly predicted the direction of the conditional ALU instruction 124. In one embodiment, because the instruction block fetched from the instruction cache 102 each clock cycle may include a plurality of conditional ALU instructions 124, the dynamic predictor 2932, the predictor selector 2934, and the static predictor 2936 each generate a plurality of predictions 2982/2984/2986 each clock cycle.

In one embodiment, the micro-architecture of the microprocessor 100 is similar in many respects to that of the VIA Nano™ processor produced by VIA Technologies of Taiwan, with the microprocessor 100 of the present embodiment modified to support the ARM instruction set architecture. The VIA Nano™ processor micro-architecture is a high-performance, out-of-order execution, superscalar micro-architecture that supports the x86 instruction set architecture. The processor is modified as described herein to additionally support the ARM instruction set architecture and, in particular, the ARM conditional ALU instruction 124 of FIG. 2, as described in detail below.

The register allocation table 402 indicates that a conditional move microinstruction 3046 (described in detail with reference to FIG. 30) depends on the result of an ALU microinstruction 3044 (described in detail with reference to FIG. 30), both of which are issued by the instruction translator 104 when it translates the conditional ALU instruction 124 under certain conditions. As described below, these conditions arise when no prediction is available for a conditional ALU instruction 124, or when a conditional ALU instruction 124 has been mispredicted and is replayed.

The temporary register 106 stores the non-architectural state of the microprocessor 100. The temporary register 106 may be utilized by the micro-architecture to temporarily store intermediate values needed to execute instructions of the instruction set architecture. Further, the micro instructions issued by the instruction translator 104 may specify the temporary registers 106 as source and/or destination operand locations. In particular, the ALU micro instruction 3044 of FIG. 30 may specify a temporary register 106 as its destination register, and the associated conditional move micro instruction 3046 specifies the same temporary register 106 as one of the source registers. This will be further explained in the following text.

At least one execution unit 424 has an arithmetic logic unit (ALU) (not shown) for executing various microinstructions, including the ALU microinstruction 3044 and the unconditional ALU microinstruction 3045 with condition codes (CC) shown in FIG. 30. In addition, at least one execution unit 424 is configured to execute the conditional move microinstruction 3046 and the no-operation microinstruction 3047 with condition codes (CC) shown in FIG. 30. For the conditional move microinstruction 3046, the unconditional ALU microinstruction 3045, or the no-operation microinstruction 3047 of FIG. 30, the execution unit 424 receives as input the condition code value a224, a254, or a274 (see FIG. 30) and the current value of the flags 926. The execution unit 424 determines whether the value of the flags 926 satisfies the condition specified by the condition code a224, a254, or a274. The execution unit 424 can thereby determine the correct direction of the conditional ALU instruction 124 and whether the dynamic predictor 2932 and/or the static predictor 2936 mispredicted that direction, which it indicates by a misprediction indication 2976 provided to the reorder buffer (ROB) 422. In addition, the execution unit 424 determines whether the predictor 2932/2936 selected by the predictor selector 2934 correctly predicted the direction, and the result of this determination is used to update the dynamic predictor 2932 and the predictor selector 2934. For the conditional move microinstruction 3046 of FIG. 30, if the condition is satisfied, the execution unit 424 moves the value of the temporary register 106 specified by the source register 1 field a226 to the architectural register 106 specified by the destination register field a232 of FIG. 30. If the condition is not satisfied, the execution unit 424 moves the value of the architectural register 106 specified by the source register 2 field a228, i.e., the original value of the destination register, to the architectural register 106 specified by the destination register field a232.

The reorder buffer 422 receives results from the execution unit 424, including an indication of whether the direction of the conditional ALU instruction 124 was mispredicted. If the direction was not mispredicted, the reorder buffer 422 updates the architectural state of the microprocessor 100 with the result of the conditional ALU instruction 124, i.e., the result of performing the ALU operation specified by the opcode a202 on the source operands specified by the source register 1 and source register 2 fields a206. The update is applied to the flags 926 and to the architectural register 106 specified by the destination register field a208 of the conditional ALU instruction 124, as reflected in the destination register field a232 of the conditional move microinstruction 3046 and the destination register field a258 of the unconditional ALU microinstruction with condition codes 3045. However, if the direction was mispredicted, the reorder buffer 422 generates a true value on the misprediction indication 2976. The misprediction indication 2976 is provided to the instruction translator 104 so that, when the mispredicted conditional ALU instruction 124 is replayed, the instruction translator 104 knows that it must fall back to the multiple-microinstruction translation of the no-prediction (NP) regime. The misprediction indication 2976 is also provided to other relevant pipeline units, such as the register allocation table 402 and the instruction issue unit 408, so that they can flush microinstructions if necessary. The reorder buffer 422 also generates history update values 2974 to update the dynamic predictor 2932 and the predictor selector 2934 based on the outcome of the conditional ALU instruction 124, i.e., the result of the direction prediction.

Referring now to FIG. 30, a block diagram illustrating translation of a conditional ALU instruction 124 by the instruction translator 104 of FIG. 29 is shown. As described herein, the instruction translator 104 of FIG. 29 may translate the conditional ALU instruction 124 into three different sets of microinstructions, depending on the circumstances under which it translates the conditional ALU instruction 124, namely whether the conditional ALU instruction 124 is predicted to be executed (E), predicted not to be executed (NE), or not predicted (NP), as shown in FIG. 30. In one embodiment, the conditional ALU instruction 124 is a conditional ALU instruction defined by the ARM instruction set architecture.

The conditional ALU instruction 124 includes an opcode field a202, a condition code field a204, source register 1 and source register 2 fields a206, and a destination register field a208. The opcode field a202 contains a value that distinguishes this conditional ALU instruction from other instructions within the instruction set architecture.

The condition code field a204 specifies a condition under which the destination register is selectively updated with the result of the ALU microinstruction 3044 described below, depending on whether the current value of the flags 926 satisfies the condition. In one embodiment compatible with the ARM instruction set architecture, the condition code field a204 occupies the upper four bits (i.e., bits [31:28]) of the conditional ALU instruction 124, allowing sixteen different possible values to be encoded according to Table 3 below. The value 0b1111 is architecture-version dependent: in one architecture version the instruction is unpredictable, whereas in other architecture versions the value indicates the unconditional instruction extension space.

Table 3.
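As an illustration only, the evaluation of the condition code field against the flags might be sketched as follows. The sketch assumes the standard ARM condition encodings (EQ, NE, CS, and so on) and the four N, Z, C, V flags; the names `Flags`, `condition_field`, and `condition_met` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class Flags:
    n: bool  # negative
    z: bool  # zero
    c: bool  # carry
    v: bool  # overflow


# Standard ARM condition encodings for values 0b0000 through 0b1110.
CONDITIONS = {
    0b0000: ("EQ", lambda f: f.z),
    0b0001: ("NE", lambda f: not f.z),
    0b0010: ("CS", lambda f: f.c),
    0b0011: ("CC", lambda f: not f.c),
    0b0100: ("MI", lambda f: f.n),
    0b0101: ("PL", lambda f: not f.n),
    0b0110: ("VS", lambda f: f.v),
    0b0111: ("VC", lambda f: not f.v),
    0b1000: ("HI", lambda f: f.c and not f.z),
    0b1001: ("LS", lambda f: (not f.c) or f.z),
    0b1010: ("GE", lambda f: f.n == f.v),
    0b1011: ("LT", lambda f: f.n != f.v),
    0b1100: ("GT", lambda f: (not f.z) and f.n == f.v),
    0b1101: ("LE", lambda f: f.z or f.n != f.v),
    0b1110: ("AL", lambda f: True),
}


def condition_field(instruction: int) -> int:
    """Extract bits [31:28] of a 32-bit instruction word."""
    return (instruction >> 28) & 0xF


def condition_met(cond: int, flags: Flags) -> bool:
    """Return True if the flags satisfy the encoded condition."""
    if cond == 0b1111:
        # Architecture-version dependent; deliberately not modelled here.
        raise NotImplementedError("0b1111 is architecture-version dependent")
    _, predicate = CONDITIONS[cond]
    return predicate(flags)
```

For example, an instruction word whose upper four bits are 0b0000 (EQ) passes the check only when the zero flag is set.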

The source register 1 and source register 2 fields a206 specify immediate values and architectural registers 106 that hold the input operands on which the ALU operation specified by the opcode a202 (e.g., add, subtract, multiply, divide, and/or the like) is performed to produce a result. If the condition is satisfied, this result is loaded into the architectural register 106 specified by the destination register field a208.

In the absence of prediction (NP), the instruction translator 104 translates the conditional ALU instruction 124 into an ALU micro-instruction 3044 and a conditional move micro-instruction 3046 for execution by the execution units 424.

The ALU microinstruction 3044 includes an opcode field a212, source register 1 and source register 2 fields a216, and a destination register field a218. The opcode field a212 contains a value that distinguishes the ALU microinstruction 3044 from other microinstructions of the microinstruction set architecture of the microprocessor 100. The ALU function specified by the opcode a202 of the conditional ALU instruction 124 is carried over into the opcode field a212 of the ALU microinstruction 3044. The source register 1 and source register 2 fields a216 specify immediate values and architectural registers 106 that hold the operands. The ALU operation specified by the opcode a212 is performed on the operands to generate a result that is loaded into the architectural or temporary register 106 specified by the destination register field a218. In the no-prediction case, when the instruction translator 104 translates the conditional ALU instruction 124, it fills the source register 1 and source register 2 fields a216 of the ALU microinstruction 3044 with the same values as the source register 1 and source register 2 fields a206 of the conditional ALU instruction 124, and it fills the destination register field a218 to specify a temporary register 106 that receives the result of the ALU operation.

The conditional move microinstruction 3046 includes an opcode field a222, a condition code field a224, a source register 1 field a226, a source register 2 field a228, and a destination register field a232. The opcode field a222 contains a value that distinguishes the conditional move microinstruction 3046 from other microinstructions of the microinstruction set architecture of the microprocessor 100. The condition code field a224 specifies the condition under which the move operation is selectively performed, depending on whether the current value of the flags 926 satisfies it; this is the same condition as that of the condition code field a204 of the conditional ALU instruction 124. In fact, when translating the conditional ALU instruction 124, the instruction translator 104 populates the condition code field a224 of the conditional move microinstruction 3046 with the same value as the condition code field a204 of the conditional ALU instruction 124. The source register 1 field a226 specifies the architectural or temporary register 106 that provides the first source operand of the conditional move microinstruction 3046, and the source register 2 field a228 specifies the architectural or temporary register 106 that provides its second source operand. When the instruction translator 104 translates the conditional ALU instruction 124, it fills the source register 1 field a226 with the same value with which it fills the destination register field a218 of the ALU microinstruction 3044. The instruction translator 104 also fills the source register 2 field a228 with the same value as the destination register field a208 of the conditional ALU instruction 124. That is, the source register 2 field a228 causes the conditional move microinstruction 3046 to receive the current value of the destination register, so that the current value is written back to the destination register when the condition is not satisfied. The instruction translator 104 fills the destination register field a232 with the same value as the destination register field a208 of the conditional ALU instruction 124. Consequently, the destination register specified by the conditional ALU instruction 124 is loaded either with its own current value, if the condition is not satisfied, or with the value of the temporary register holding the result of the ALU microinstruction 3044, if the condition is satisfied.
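The field wiring of the no-prediction pair can be illustrated with a small sketch. It is a simplified model under stated assumptions: registers are named by strings, only 32-bit ADD and SUB are modelled, flag updates are ignored, and `Uop`, `TEMP`, `translate_np`, and `execute` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uop:
    op: str    # "ADD", "SUB", or "CMOV"
    src1: str
    src2: str
    dest: str


TEMP = "T0"  # hypothetical name for a temporary (non-architectural) register

ALU_OPS = {
    "ADD": lambda a, b: (a + b) & 0xFFFFFFFF,
    "SUB": lambda a, b: (a - b) & 0xFFFFFFFF,
}


def translate_np(op, src1, src2, dest):
    """No-prediction translation: an ALU uop followed by a conditional move."""
    alu = Uop(op, src1, src2, TEMP)        # result goes to the temporary
    cmov = Uop("CMOV", TEMP, dest, dest)   # src1 = temporary, src2 = old dest
    return [alu, cmov]


def execute(uops, regs, condition_met):
    """Execute the pair against a register dictionary and return it."""
    for u in uops:
        if u.op == "CMOV":
            # Condition satisfied: take the ALU result; otherwise rewrite
            # the destination with its own original value.
            regs[u.dest] = regs[u.src1] if condition_met else regs[u.src2]
        else:
            regs[u.dest] = ALU_OPS[u.op](regs[u.src1], regs[u.src2])
    return regs
```

With R1 = 7, R2 = 2, and R3 = 3, translating a conditional add of R2 and R3 into R1 leaves R1 at 5 when the condition is satisfied and at 7 when it is not; in both cases the destination register is written.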

In one embodiment, in the no-prediction (NP) case, the instruction translator 104 translates the conditional ALU instruction 124 into the microinstructions 126 described with respect to FIGS. 10-28. As described above, the set of microinstructions 126 varies with the characteristics of the conditional ALU instruction 124, such as whether one of the source registers is the destination register, whether it is a flag-updating instruction, whether a pre-shift is specified, whether the current carry flag value is used, and, in the flag-updating pre-shift case, whether the ALU operation updates the carry flag. In particular, in some pre-shifting cases of the conditional ALU instruction 124, the set of microinstructions includes three microinstructions 126 as shown in FIG. 10, rather than two microinstructions 126 as shown in FIG. 30. Second, in the case where the conditional ALU instruction 124 specifies one of the source registers as the destination register, the set of microinstructions includes fewer microinstructions 126, as can be seen by comparing FIG. 21 with FIG. 10. In that case the set does not include a conditional move microinstruction 126; instead, a conditional ALU microinstruction 126 provides the conditional move function. As a result, in some cases the set of microinstructions includes only a single microinstruction 126 as shown in FIG. 21, rather than two microinstructions 126 as shown in FIG. 30. Furthermore, in the case of a flag-updating conditional ALU instruction 124, the set of microinstructions includes a conditional move microinstruction 126 that differs slightly from the conditional move microinstruction 126 shown in FIG. 30. In particular, to determine whether the condition is satisfied, the conditional move microinstruction (CMOV) 126 of steps 1054 and 1056 of FIG. 10 checks a non-architectural flag that a previous microinstruction 126 within the set updates based on whether the architectural flags satisfy the condition, whereas the conditional move microinstruction 126 of FIG. 30 checks the architectural flags directly. Finally, while the ALU microinstruction 126 of FIG. 30 is an unconditional ALU microinstruction 126, the ALU microinstructions 126 of FIGS. 10 and 21 may in some cases be conditional ALU microinstructions 126.

In the predicted-executed (E) case, the instruction translator 104 translates the conditional ALU instruction 124 into an unconditional ALU microinstruction with condition codes 3045 for execution by the execution unit 424. The unconditional ALU microinstruction with condition codes 3045 includes an opcode field a252, a condition code field a254, source register 1 and source register 2 fields a256, and a destination register field a258. The opcode field a252 contains a value that distinguishes the unconditional ALU microinstruction with condition codes 3045 from other microinstructions within the microinstruction set architecture of the microprocessor 100. The ALU function specified by the opcode a202 of the conditional ALU instruction 124 is carried over into the opcode field a252 of the unconditional ALU microinstruction with condition codes 3045. The source register 1 and source register 2 fields a256 specify immediate values and architectural registers 106 that hold the operands on which the ALU operation specified by the opcode a252 is performed to produce a result. The result is loaded into the architectural or temporary register 106 specified by the destination register field a258. In the predicted-executed case, the instruction translator 104 populates the source register 1 and source register 2 fields a256 of the unconditional ALU microinstruction with condition codes 3045 with the same values as the source register 1 and source register 2 fields a206 of the conditional ALU instruction 124. When translating the conditional ALU instruction 124, the instruction translator 104 populates the condition code field a254 of the unconditional ALU microinstruction with condition codes 3045 with the same value as the condition code field a204 of the conditional ALU instruction 124. The condition code a254 is used by the execution unit 424 to determine whether the direction of the associated conditional ALU instruction 124 was mispredicted. When translating the conditional ALU instruction 124, the instruction translator 104 fills the destination register field a258 with the same value as the destination register field a208 of the conditional ALU instruction 124. Thus, the unconditional ALU microinstruction with condition codes 3045 is an unconditional microinstruction that is executed regardless of whether the condition is satisfied, because the associated conditional ALU instruction 124 has been predicted to be executed. However, the unconditional ALU microinstruction with condition codes 3045 is speculative in the same way as a predicted branch instruction, because the predicted-executed prediction must still be checked. If a misprediction is found, the architectural register 106 specified by the destination register field a258 is not updated with the ALU result; instead, the microinstruction is flushed and the associated conditional ALU instruction 124 is replayed, this time without prediction. Conversely, if the predicted-executed prediction is correct, the architectural register 106 specified by the destination register field a258 is updated with the ALU result. In one embodiment, when the conditional ALU instruction 124 specifies a pre-shift operation as described with respect to FIGS. 10-28, the instruction translator 104 generates a shift microinstruction 126 for the conditional ALU instruction 124 in addition to the unconditional ALU microinstruction 126 with condition codes of FIG. 30, and the shift microinstruction 126 precedes the unconditional ALU microinstruction 126 with condition codes. For example, this shift microinstruction 126 is similar to the shift microinstruction described in step 1034 of FIG. 10, and the unconditional ALU microinstruction 126 with condition codes of FIG. 30 is modified to specify as its source operand register a temporary register, which is the destination register of the shift microinstruction 126. In the event of a misprediction, this shift microinstruction 126 is flushed in step 3134 of FIG. 31 (described below), along with the unconditional ALU microinstruction 126 with condition codes.

In the predicted-not-executed (NE) case, the instruction translator 104 translates the conditional ALU instruction 124 into a no-operation microinstruction with condition codes 3047 for execution by the execution unit 424. The no-operation microinstruction with condition codes 3047 includes an opcode field a272 and a condition code field a274. The opcode field a272 contains a value that distinguishes the no-operation microinstruction with condition codes 3047 from other microinstructions within the microinstruction set architecture of the microprocessor 100. When translating the conditional ALU instruction 124, the instruction translator 104 populates the condition code field a274 of the no-operation microinstruction with condition codes 3047 with the same value as the condition code field a204 of the conditional ALU instruction 124. The condition code a274 is used by the execution unit 424 to determine whether the direction of the associated conditional ALU instruction 124 was mispredicted. The no-operation microinstruction with condition codes 3047 performs no operation other than causing the execution unit 424 to check the direction prediction of the conditional ALU instruction 124.
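The three translation regimes of FIG. 30 can be summarized in a short sketch. It is illustrative only: the `Prediction` enumeration, the `Uop` tuple, the ".CC" and "NOP.CC" mnemonics, and the temporary register name "T0" are assumptions made for the example, not encodings taken from the specification.

```python
from enum import Enum
from typing import NamedTuple, Optional, Tuple


class Prediction(Enum):
    EXECUTED = "E"
    NOT_EXECUTED = "NE"
    NONE = "NP"


class Uop(NamedTuple):
    kind: str
    cond: Optional[str] = None
    srcs: Tuple[str, ...] = ()
    dest: Optional[str] = None


def translate(opcode, cond, src1, src2, dest, prediction, temp="T0"):
    """Return the microinstruction list for one conditional ALU instruction."""
    if prediction is Prediction.EXECUTED:
        # Single unconditional ALU uop; the condition code is carried along
        # only so that the execution unit can check the prediction.
        return [Uop(opcode + ".CC", cond, (src1, src2), dest)]
    if prediction is Prediction.NOT_EXECUTED:
        # Single no-operation uop that only checks the prediction.
        return [Uop("NOP.CC", cond)]
    # No prediction: ALU uop into a temporary, then a conditional move.
    return [
        Uop(opcode, None, (src1, src2), temp),
        Uop("CMOV", cond, (temp, dest), dest),
    ]
```

Either predicted regime yields one microinstruction, whereas the no-prediction regime yields two, which is the source of the resource savings discussed later in the description.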

Referring now to FIG. 31 (including FIGS. 31A and 31B), a flowchart illustrating operation of the microprocessor 100 of FIG. 29 in executing the conditional ALU instruction 124 of FIG. 30 according to an embodiment of the present invention is shown. Flow begins at steps 3102, 3104, and 3106.

At step 3102, an instruction block containing the conditional ALU instruction 124 of FIG. 30 is fetched from the instruction cache 102 of FIG. 29 at the fetch address 134. Flow proceeds to step 3108.

At step 3104, the dynamic predictor 2932 examines the fetch address 134 and provides the dynamic prediction 2982 to the instruction translator 104 of FIG. 29. Flow proceeds to step 3108.

At step 3106, the predictor selector 2934 examines the fetch address 134 and provides the predictor selection 2984 to the instruction translator 104 of FIG. 29. Flow proceeds to step 3108.

At step 3108, the static predictor 2936 receives the conditional ALU instruction 124, evaluates it, and provides the static prediction 2986 to the instruction translator 104 of FIG. 29. Flow proceeds to step 3112.

At step 3112, the instruction translator 104 encounters the conditional ALU instruction 124 and receives the predictions 2982/2984/2986 from the dynamic predictor 2932, the predictor selector 2934, and the static predictor 2936, from which it generates a direction prediction for the conditional ALU instruction 124. Flow proceeds to step 3114.

At decision step 3114, the instruction translator 104 determines whether it predicted at step 3112 that the conditional ALU instruction 124 will be executed. If so, flow proceeds to step 3116; otherwise, flow proceeds to decision step 3118.

At step 3116, the instruction translator 104 issues the unconditional ALU microinstruction with condition codes 3045 of FIG. 30, in accordance with the predicted-executed prediction. Flow proceeds to step 3126.

At decision step 3118, the instruction translator 104 determines whether it predicted at step 3112 that the conditional ALU instruction 124 will not be executed. If so, flow proceeds to step 3122; otherwise, flow proceeds to step 3124.

At step 3122, the instruction translator 104 issues the no-operation microinstruction with condition codes 3047 of FIG. 30, in accordance with the predicted-not-executed prediction. Flow proceeds to step 3126.

At step 3124, in the absence of a prediction, the instruction translator 104 issues the ALU microinstruction 3044 and the conditional move microinstruction 3046 of FIG. 30. Flow proceeds to step 3126.

At step 3126, the execution unit 424 executes the microinstructions 126 issued by the instruction translator 104 at step 3116, 3122, or 3124.

In the no-prediction case, the execution unit 424 executes the ALU microinstruction 3044 by performing the ALU function specified by the opcode field a212 on the source operands specified by field a216 to generate a result, which is output on the result bus 128 and written into the reorder buffer 422 entry allocated to the ALU microinstruction 3044, for subsequent writing to the temporary register 106 specified by field a218. Once the result of the ALU microinstruction 3044 is available, the conditional move microinstruction 3046 can be issued to the execution unit 424, which determines whether the flags 926 satisfy the condition specified by the condition code a224. If so, the result of the ALU microinstruction 3044 (from either the forwarding bus or the temporary register 106) is output on the result bus 128 and written into the reorder buffer 422 entry allocated to the conditional move microinstruction 3046, for subsequent writing to the architectural register 106 specified by field a232. However, if the condition is not satisfied, the original value of the architectural register 106 specified by the source register 2 field a228, i.e., the architectural register specified by the destination register field a208 of the conditional ALU instruction 124, is output on the result bus 128 and written into the reorder buffer 422 entry allocated to the conditional move microinstruction 3046, for subsequent writing to the architectural register 106 specified by field a232. The execution unit 424 also indicates a correct prediction to the reorder buffer 422, because the instruction translator 104 generated the ALU microinstruction 3044 and the conditional move microinstruction 3046 in response to the absence of a prediction; with no prediction, there can be no misprediction.

In the predicted-executed case, the execution unit 424 executes the unconditional ALU microinstruction with condition codes 3045 by performing the ALU function specified by the opcode field a252 on the source operands specified by field a256 to generate a result, which is output on the result bus 128 and written into the reorder buffer 422 entry allocated to the unconditional ALU microinstruction with condition codes 3045, for subsequent writing to the architectural register 106 specified by field a258. The execution unit 424 also determines whether the flags 926 satisfy the condition specified by the condition code a254 and provides an indication to the reorder buffer 422 accordingly. Specifically, because the instruction translator 104 generated the unconditional ALU microinstruction with condition codes 3045 in response to a predicted-executed prediction, the execution unit 424 indicates a misprediction to the reorder buffer 422 only if the flags 926 do not satisfy the condition specified by the condition code a254; otherwise, it indicates a correct prediction.

In the predicted-not-executed case, the execution unit 424 performs no operation in executing the no-operation microinstruction with condition codes 3047 other than determining whether the flags 926 satisfy the condition specified by the condition code a274 and providing an indication to the reorder buffer 422 accordingly. Specifically, because the instruction translator 104 generated the no-operation microinstruction with condition codes 3047 in response to a predicted-not-executed prediction, the execution unit 424 indicates a misprediction to the reorder buffer 422 only if the flags 926 satisfy the condition specified by the condition code a274; otherwise, it indicates a correct prediction. Flow proceeds to decision step 3128.
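The misprediction check performed by the execution unit at step 3126 reduces to a small truth table, sketched below. The string labels "E", "NE", and "NP" and the function name `mispredicted` are illustrative assumptions.

```python
def mispredicted(regime: str, condition_met: bool) -> bool:
    """Return True if the direction prediction turned out to be wrong.

    regime: "E" (predicted executed), "NE" (predicted not executed),
            or "NP" (no prediction was made).
    condition_met: whether the flags satisfy the instruction's condition.
    """
    if regime == "NP":
        return False              # nothing was predicted, nothing can be wrong
    if regime == "E":
        return not condition_met  # predicted executed, but condition failed
    if regime == "NE":
        return condition_met      # predicted not executed, but condition held
    raise ValueError(f"unknown regime: {regime!r}")
```

Only two of the six combinations signal a misprediction: predicted executed with the condition failing, and predicted not executed with the condition holding.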

At decision step 3128, the reorder buffer 422 determines whether the direction of the conditional ALU instruction 124 was mispredicted, based on the misprediction indication 2976 received from the execution unit 424. If so, flow proceeds to step 3134; otherwise, flow proceeds to step 3132.

At step 3132, the reorder buffer 422 updates the architectural state of the microprocessor 100, i.e., the architectural registers 106 and the flags 926, with the result of the conditional ALU instruction 124. Specifically, because the reorder buffer 422 must retire instructions in program order, it updates the architectural state when the conditional move microinstruction 3046 (in the no-prediction case), the unconditional ALU microinstruction with condition codes 3045 (in the predicted-executed case), or the no-operation microinstruction with condition codes 3047 (in the predicted-not-executed case) becomes the oldest microinstruction in the microprocessor 100. Flow proceeds to step 3136.

At step 3134, the reorder buffer 422 generates a true value on the misprediction indication 2976, causing the microinstructions generated from the translation of the conditional ALU instruction 124 and all microinstructions associated therewith to be flushed. Additionally, generating a true value on the misprediction indication 2976 causes the conditional ALU instruction 124 to be replayed. That is, the instruction translator 104 translates the conditional ALU instruction 124 again, this time according to the no-prediction regime of step 3124. According to another embodiment, when the conditional ALU instruction 124 is replayed, the instruction translator 104 inverts the incorrect prediction and translates based on the inverted prediction. That is, if the predicted-executed prediction was wrong, the instruction translator 104 translates according to the predicted-not-executed regime, and if the predicted-not-executed prediction was wrong, it translates according to the predicted-executed regime. It is noted, however, that this embodiment may be susceptible to livelock.

At step 3136, the reorder buffer 422 provides the history update values 2974 to the dynamic predictor 2932 and the predictor selector 2934, which are updated based on the correct direction evaluated by the execution unit 424 and the prediction information 2982/2984/2986 that flows down the pipeline.
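The retire-or-replay behaviour of steps 3128 through 3134 can be sketched as a loop. The sketch models only the first embodiment, in which a replay always uses the no-prediction regime; the function name `run_conditional_alu` and the "retire" and "flush" labels are hypothetical.

```python
def run_conditional_alu(prediction: str, condition_met: bool):
    """Return the list of (regime, outcome) attempts for one instruction.

    prediction is "E", "NE", or "NP"; a mispredicted attempt is flushed
    and the instruction is replayed under the no-prediction regime.
    """
    attempts = []
    regime = prediction
    while True:
        wrong = (regime == "E" and not condition_met) or (
            regime == "NE" and condition_met
        )
        if not wrong:
            attempts.append((regime, "retire"))  # step 3132
            return attempts
        attempts.append((regime, "flush"))       # step 3134
        regime = "NP"                            # replay without prediction
```

Because the no-prediction regime cannot be mispredicted, the loop ends after at most two attempts in this model.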

As can be seen from the foregoing, the present invention provides significant advantages by translating a conditional ALU instruction 124 into a single microinstruction rather than multiple microinstructions when its direction is predicted.

First, the present invention may eliminate one or more microinstructions that would otherwise occupy additional instruction slots in the resources of the out-of-order execution microprocessor 100. These resources include the register allocation table 402, the reorder buffer 422, the reservation stations (not shown), and the execution units 424. The invention can therefore reduce and simplify the resources required, and can reduce the power those resources consume.

Second, the average number of instruction set architecture (e.g., ARM instruction) program instructions that the instruction translator 104 can translate per clock cycle may be increased. Assume that the instruction translator 104 can translate up to three ARM instructions per clock cycle, but only issue up to three micro instructions per clock cycle, and includes the additional limitation that the instruction translator 104 must issue all micro instructions associated with the ARM instruction in the same clock cycle, i.e., the instruction translator 104 cannot issue a micro instruction associated with an ARM instruction in a first clock cycle and issue a second micro instruction associated with the ARM instruction in a next clock cycle. Assume the following ARM instruction sequence, where CAI is a conditional ALU instruction 124 and the "Rx" value is a general purpose register:

CAI EQ R1,R2,R3

CAI NE R4,R5,R6

CAI CS R7,R8,R9

In a processor that does not have the predictors 2932/2934/2936 (or that has them but makes no prediction), the instruction translator 104 must take three clock cycles to translate the three CAI instructions. However, in a processor in which the predictors 2932/2934/2936 make predictions, the instruction translator 104 may translate all three CAI instructions in the same clock cycle. Furthermore, this advantage still holds when the CAI instructions are mixed with non-CAI instructions, i.e., other ARM instructions. For example, assume that an ARM instruction D, which is translated into two microinstructions, is followed by a CAI instruction whose direction is predicted by the predictors 2932/2934/2936, which is followed by an ARM instruction E, which is translated into two microinstructions, which is followed by an ARM instruction F, which is translated into a single microinstruction. In this case, the instruction translator 104 may translate the ARM instruction D and the CAI instruction in the same clock cycle, and then translate the ARM instructions E and F in the next clock cycle. That is, four ARM instructions are translated in two clock cycles. In contrast, without the functionality provided by the present embodiment, the instruction translator 104 would require three clock cycles to translate the four instructions. Similar advantages may also be obtained in the instruction issue unit 408 and the reorder buffer 422.
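The cycle counts in this example can be checked with a short sketch. It assumes, as stated above, at most three ARM instructions and at most three microinstructions per clock cycle, in-order translation, and that all microinstructions of one instruction are issued in the same cycle; the function name `cycles_to_translate` is hypothetical.

```python
def cycles_to_translate(uops_per_instr, max_instrs=3, max_uops=3):
    """Count clock cycles needed to translate instructions in order.

    uops_per_instr: number of microinstructions each instruction yields.
    """
    cycles = 0
    i = 0
    while i < len(uops_per_instr):
        instrs = uops = 0
        # Pack instructions into this cycle until a limit would be exceeded.
        while (
            i < len(uops_per_instr)
            and instrs < max_instrs
            and uops + uops_per_instr[i] <= max_uops
        ):
            uops += uops_per_instr[i]
            instrs += 1
            i += 1
        if instrs == 0:
            raise ValueError("instruction exceeds the per-cycle limit")
        cycles += 1
    return cycles


# Three CAI instructions: one uop each when predicted, two each when not.
print(cycles_to_translate([1, 1, 1]))     # 1
print(cycles_to_translate([2, 2, 2]))     # 3
# D, CAI, E, F: CAI predicted (one uop) versus not predicted (two uops).
print(cycles_to_translate([2, 1, 2, 1]))  # 2
print(cycles_to_translate([2, 2, 2, 1]))  # 3
```

The four results reproduce the figures given in the text: one cycle instead of three for the three CAI instructions, and two cycles instead of three for the mixed sequence.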

Third, the latency of the conditional ALU instruction 124 may be reduced when its direction is predicted by the predictors 2932/2934/2936, since the instruction translator 104 need only issue a single microinstruction.

Fourth, the absence of the additional microinstruction in the reorder buffer and the reservation stations may improve the look-ahead capability of the microprocessor, thereby increasing the instruction-level parallelism it can exploit in the programs being executed, improving utilization of the execution units 424, and thus improving the throughput of the microprocessor 100. Specifically, omitting the second microinstruction leaves more space in the reorder buffer for other microinstructions. This is advantageous because it creates a larger pool of microinstructions from which microinstructions can be issued to the execution units 424 for execution. A microinstruction cannot be issued for execution until it is "ready", i.e., until all of its source operands from previous microinstructions are available. Thus, the larger the pool of microinstructions in which the microprocessor 100 looks for ready microinstructions, the greater the chance of finding one, and therefore the greater the chance that the execution units 424 are utilized. This is commonly referred to as the look-ahead capability of a microprocessor, i.e., its ability to exploit the instruction-level parallelism of the program being executed. Greater look-ahead capability typically increases utilization of the execution units 424. Thus, the microprocessor 100 of the present invention has the potential to increase its look-ahead capability by translating the conditional ALU instruction 124 into a single microinstruction rather than multiple microinstructions.

Although the above-described embodiments support conditional ALU instructions of the ARM instruction set architecture in a micro-architecture that also supports the x86 instruction set architecture, it should be noted that the present invention is also applicable to other embodiments that support conditional ALU instructions of instruction set architectures other than the ARM instruction set architecture. Furthermore, it should be noted that the present invention is also applicable to cases where there is no pre-existing micro-architecture or where the pre-existing micro-architecture supports an instruction set architecture other than the x86 instruction set architecture. In addition, it is noted that the present invention is described herein in the general context of a processor that supports conditional ALU instructions of an instruction set architecture by predicting the direction of the conditional ALU instructions early in the pipeline, prior to instruction execution. In one embodiment, in a manner similar to branch prediction techniques applied to the fetched instruction stream, a different microinstruction sequence is issued depending on the presence or absence of a direction prediction. In addition, although the embodiments described herein include both dynamic and static predictors, the present invention is also applicable to embodiments having only static predictors or only dynamic predictors. Furthermore, the present invention is also applicable to embodiments having multiple dynamic and/or static predictors, wherein the predictor selector selects from among the multiple dynamic and static predictors. Furthermore, the present invention is also applicable to embodiments in which the dynamic predictor is integrated into a branch prediction array, such as a branch target address cache. A disadvantage of this embodiment is that, for conditional ALU instructions, the space in each entry for storing the target address of a branch instruction is wasted, since no target address needs to be predicted for conditional ALU instructions. Although interference or contention may occur between branch instructions and conditional ALU instructions depending on the instruction mix in the program, this embodiment may still have the following advantages: the storage space of the integrated cache is utilized more efficiently, and the integrated array may have more entries than the sum of the entries of the individual arrays.

Although the embodiments described above are directed to conditional non-branch instructions that are conditional ALU instructions, the present invention may also use predictors to predict other types of conditional non-branch instructions. For example, a conditional load instruction may be predicted. If the instruction is predicted to execute, the instruction translator generates an unconditional load microinstruction having condition codes. This microinstruction includes the condition specified by the conditional load instruction, enabling the execution pipeline to detect a misprediction. If the execution pipeline detects a misprediction, it avoids performing any operation that updates architectural state, such as a page table walk that updates memory when the load causes a translation lookaside buffer (TLB) miss, or an architectural exception when the load generates an exception condition. Additionally, if the load misses in the cache, the execution pipeline may avoid generating a transaction on the processor bus to fill the missing cache line. If the instruction is not predicted, the instruction translator generates a set of microinstructions to conditionally execute the load operation. In one embodiment, if the instruction is not predicted, the set of microinstructions may be executed in a manner similar to that described in U.S. provisional patent application 61/473,062.

Although the above embodiments are directed to ARM ISA conditional non-branch instructions, the present invention may also use predictors to predict conditional non-branch instructions of other ISAs. For example, conditional non-branch instructions of the x86 ISA, such as CMOVcc and SETcc, may be predicted.

Application of modified immediate values to instruction translation

The ARM instruction set architecture defines a set of data processing instructions that may specify an immediate source operand, referred to herein as "immediate operand instructions". The immediate source operand is a 32-bit value generated by rotating an 8-bit value to the right by twice a 4-bit value. The 8-bit value is specified in the instruction field labeled immed_8, and the 4-bit value is specified in the instruction field labeled rotate_imm. Thus:

immediate operand value = immed_8 >> (2 × rotate_imm), where >> denotes rotation to the right within a 32-bit word
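The rotation above can be illustrated with a minimal Python sketch. The function name `arm_expand_imm` is a hypothetical name chosen for illustration and does not appear in the original; the sketch only assumes the rule stated above (an 8-bit value rotated right by twice a 4-bit value within 32 bits).

```python
MASK32 = 0xFFFFFFFF


def arm_expand_imm(immed_8: int, rotate_imm: int) -> int:
    """Rotate the 8-bit value right by 2 * rotate_imm within a 32-bit word.

    Hypothetical helper for illustration only.
    """
    if not 0 <= immed_8 <= 0xFF:
        raise ValueError("immed_8 must fit in 8 bits")
    if not 0 <= rotate_imm <= 0xF:
        raise ValueError("rotate_imm must fit in 4 bits")
    amount = 2 * rotate_imm  # rotation amount is always even, 0..30
    # Bits shifted out on the right re-enter on the left of the 32-bit word.
    return ((immed_8 >> amount) | (immed_8 << (32 - amount))) & MASK32


if __name__ == "__main__":
    # immed_8 = 0xFF with rotate_imm = 4 rotates right by 8 bits.
    print(hex(arm_expand_imm(0xFF, 4)))
```

Because the rotation amount is always even, only a limited set of 32-bit constants can be expressed in the 12-bit immediate field.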

One method for processing an immediate operand instruction in an existing micro-architecture has the instruction translator generate two microinstructions. The first microinstruction rotates the immed_8 value by twice the rotate_imm value to generate a result, and the second microinstruction receives the result of the first microinstruction as a source operand and performs the ALU function specified by the immediate operand instruction. Such an embodiment is described with reference to FIGS. 10 and 21. For example, in step 1034 of FIG. 10, the instruction translator generates the SHF microinstruction to perform a shift operation (a rotate operation in this case) whose result is written into a temporary register, and the subsequent ALUOP microinstruction uses the shifted result from the temporary register. This shift operation may be performed on an immediate value specified in an immediate operand instruction (e.g., corresponding to steps 1012 and 1024 of FIG. 10). However, compared to a method in which the instruction translator generates only a single microinstruction, this method may have the following disadvantages when applied to a non-sequential execution processor.

First, the additional microinstructions occupy an additional instruction slot in the non-sequential execution processor resources, such as register allocation tables, reorder buffers, reservation stations, and additional instruction slots or entries in the execution units, requiring larger, more complex resources and higher power consumption.

Second, some functional units are limited in the maximum number of instructions they can process per clock cycle. For example, according to one embodiment, the instruction translator is limited in the number of microinstructions it can issue per clock cycle (e.g., three microinstructions per clock cycle), the issue unit is limited in the number of microinstructions it can issue to the execution units per clock cycle (e.g., four microinstructions per clock cycle), and the retirement unit is limited in the number of microinstructions it can retire per clock cycle (e.g., three microinstructions per clock cycle). Therefore, generating additional microinstructions reduces the average number of instructions that these functional units can translate, issue, or retire per clock cycle, thereby limiting the performance of the processor.

Third, the immediate operand instruction cannot be retired until all of its constituent microinstructions have completed execution. Because the second microinstruction depends on the result of the first microinstruction, it cannot be dispatched to the execution unit until the first microinstruction produces its result. This dependency adds delay to the overall execution time of the immediate operand instruction.

Fourth, the presence of additional microinstructions in the reorder buffer and/or the reservation station may reduce the processor's lookahead capability, thereby reducing the processor's ability to execute programs using instruction level parallelism, reducing the utilization of execution units, and reducing the overall processor performance.

The embodiments described herein have the potential to perform better when executing immediate operand instructions. The immed_8 field and the rotate_imm field are herein referred to collectively as the "immediate field". In particular, the instruction translator knows a predetermined subset of immediate field values and the 32-bit immediate operand value associated with each of them. When the instruction translator encounters an immediate operand instruction, it determines whether the specified immediate field value falls within the predetermined subset. If so, the instruction translator issues the correct 32-bit immediate operand on the immediate operand bus together with a single microinstruction, and the operand is forwarded down the pipeline with that microinstruction for execution. If the immediate field value does not fall within the predetermined subset, the instruction translator takes the lower-performance approach, i.e., issues two microinstructions. The relative frequency of different immediate field values may be determined by executing application software and observing the values used; a few of the most frequently observed immediate field values are then selected as the predetermined subset, so as to keep the size, power consumption, and complexity of the instruction translator within given limits.
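As a rough software analogy of how such a predetermined subset could be chosen and consulted, the following Python sketch builds a small table from an observed trace of immediate field values. The names `build_subset` and `lookup`, the trace, and the table size are assumptions made for illustration; the original does not specify how the subset is stored in hardware.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def expand(field12: int) -> int:
    """Compute the 32-bit operand for a 12-bit immediate field
    (rotate_imm in bits 11:8, immed_8 in bits 7:0)."""
    immed_8 = field12 & 0xFF
    amount = 2 * ((field12 >> 8) & 0xF)
    return ((immed_8 >> amount) | (immed_8 << (32 - amount))) & MASK32


def build_subset(observed_fields, size):
    """Keep only the `size` most frequently observed field values,
    each paired with its precomputed 32-bit operand (hypothetical helper)."""
    counts = Counter(observed_fields)
    return {field: expand(field) for field, _ in counts.most_common(size)}


def lookup(subset, field12):
    """Return the precomputed operand, or None if the translator
    would have to fall back to the two-microinstruction path."""
    return subset.get(field12)


if __name__ == "__main__":
    # Made-up trace: 0x001 seen five times, 0x0FF three times, 0x4FF once.
    trace = [0x001] * 5 + [0x0FF] * 3 + [0x4FF]
    subset = build_subset(trace, size=2)
    print(sorted(hex(f) for f in subset))
```

Limiting `size` mirrors the trade-off described above: a larger table covers more instructions but increases the size and complexity of the translator.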

Referring now to FIG. 32, a block diagram illustrating a microprocessor 100 according to the present invention processing modified immediate constants during instruction translation is shown. The microprocessor 100 of FIG. 32 is similar to the microprocessor of FIG. 1, and includes elements similar to those shown in FIGS. 1-4, including an instruction cache 102, an instruction translator 104, a configuration register 122, a register allocation table 402, an instruction issue unit 408, and an execution unit 424. The execution unit 424 includes one or more units to execute the microinstructions 126 described below. Further, the execution units 424 include one or more units to execute a rotate right (ROR) microinstruction 3344 (also referred to herein as a shift microinstruction), an ALU microinstruction 3346, and an immediate ALU microinstruction 3348, as depicted in FIG. 33. Microprocessor 100 also includes architectural and temporary registers 106 and flags 926 as shown in FIG. 33. The instruction cache 102 fetches the immediate operand instruction 124 shown in FIG. 33.

In one embodiment, the micro-architecture of the microprocessor 100 is similar in many respects to that of the VIA Nano™ processor produced by VIA Technologies, Inc. of Taiwan, and the microprocessor 100 of the present embodiment is modified to support the ARM instruction set architecture. The micro-architecture of the VIA Nano™ processor is a high-performance non-sequential execution superscalar micro-architecture that supports the x86 instruction set architecture, and is modified as described herein to additionally support the ARM instruction set architecture, and more particularly the ARM immediate operand instruction 124 described in greater detail in connection with FIG. 33. Further, the instruction translator 104 issues an immediate operand 3366 on an immediate operand bus when it encounters an immediate operand instruction 124 whose immediate field b207 (FIG. 33) specifies a value that falls within a predetermined subset of values known to the instruction translator 104. The immediate operand 3366 is passed down the stages of the microprocessor 100 pipeline until it reaches the execution unit 424.

The register allocation table 402 receives the micro instructions 164 from the instruction translator 104 and generates dependency information for each of the micro instructions 164. In particular, when the instruction translator 104 translates an immediate operand instruction whose immediate field b207 value is not within the predetermined subset, and therefore issues both the ROR micro instruction 3344 and the ALU micro instruction 3346 (see FIG. 33), the register allocation table 402 indicates that the ALU micro instruction 3346 depends on the result of the ROR micro instruction 3344. In addition, as shown in FIG. 34 (including FIGS. 34A and 34B), in the case where the instruction translator 104 additionally issues a conditional move microinstruction 126 (such as that described in FIG. 10), the register allocation table 402 indicates that the conditional move microinstruction 126 depends on the result of the ALU microinstruction 3346.

The temporary registers 106 store the non-architectural state of the microprocessor 100 and are used by the micro-architecture to temporarily store intermediate values required to execute the instruction 124 of the instruction set architecture. Further, the micro instruction 126 issued by the instruction translator 104 specifies the temporary register 106 as a source and/or destination operand location. The ROR microinstruction 3344 of fig. 33 specifies a temporary register 106 as its destination register, while the ALU microinstruction 3346 specifies the same temporary register 106 as its source register. This will be explained in more detail in the following text.

The at least one execution unit 424 includes an arithmetic logic unit (not shown) for executing various micro instructions. These micro instructions include the ROR micro instruction 3344, the ALU micro instruction 3346, and the immediate ALU micro instruction 3348 shown in FIG. 33. In the case of the immediate ALU microinstruction 3348, the execution unit 424 receives as an input the value of the immediate operand 3366 from the instruction translator 104. The execution unit 424 performs the ALU function specified by the opcode field b212, which is the same as the ALU function specified by the immediate operand instruction 124, on the immediate operand 3366 and a second source operand. In the case of the ALU microinstruction 3346, the execution unit 424 performs the ALU function specified by the opcode field b232, which is the same as the ALU function specified by the immediate operand instruction 124, on two source operands, one of which comes from the temporary register 106 into which the associated ROR microinstruction 3344 wrote its result. In the case of the ROR micro instruction 3344, the execution unit 424 rotates an 8-bit value to the right by twice a 4-bit value to generate a 32-bit immediate value and writes it to a temporary register 106 for use by the subsequent dependent ALU micro instruction 3346. The 8-bit value is the same as the value specified by the immed_8 field b208 of the immediate operand instruction 124, and the 4-bit value is the same as the value specified by the rotate_imm field b209 of the immediate operand instruction 124.

Referring now to FIG. 33, a block diagram is presented illustrating an embodiment of the present invention that selectively translates an immediate operand instruction 124 either into a ROR micro-instruction 3344 and an ALU micro-instruction 3346, or into an immediate ALU micro-instruction 3348. As described herein, when the value specified by the immediate field b207 falls within the predetermined subset known to the instruction translator 104, the instruction translator 104 translates the immediate operand instruction 124 into an immediate ALU micro-instruction 3348 for execution by the execution unit 424 and issues the corresponding precomputed immediate operand value 3366. As shown in FIG. 32, when the value specified by the immediate field b207 does not fall within the predetermined subset, the instruction translator 104 translates the immediate operand instruction 124 into a ROR micro-instruction 3344 followed by an ALU micro-instruction 3346 for execution by the execution unit 424. In one embodiment, the immediate operand instruction 124 is an immediate operand instruction defined by the ARM instruction set architecture, which in ARM terminology is an instruction having the "data-processing immediate" encoding.

The immediate operand instruction 124 includes an opcode field b202, a source register 1 field b204, a destination register field b206, an immed_8 field b208, and a rotate_imm field b209. As shown in FIG. 33, the combination of the immed_8 field b208 and the rotate_imm field b209 constitutes the immediate field b207. The opcode field b202 contains a value that distinguishes the immediate operand instruction 124 from other instructions of the instruction set architecture and specifies an ALU function to be performed on the source operands. For an ARM immediate operand instruction 124, this ALU function may include, for example, add (ADD), add with carry (ADC), logical AND (AND), bit clear (BIC), compare negative (CMN), compare (CMP), logical exclusive-OR (EOR), move (MOV), move NOT (MVN), logical OR (ORR), reverse subtract (RSB), reverse subtract with carry (RSC), subtract with carry (SBC), subtract (SUB), test equivalence (TEQ), and test (TST). The source register 1 field b204 specifies an architectural register 106 or a temporary register 106 from which a source operand is provided to the execution unit 424. The destination register field b206 specifies an architectural register 106 or a temporary register 106 to which the result is written. The immed_8 field b208 holds an 8-bit constant that is rotated to the right by twice the value of the 4-bit rotate_imm field b209 to produce an immediate source operand. As described above with respect to the embodiments of FIGS. 9-28, the immediate operand instruction 124 may comprise a conditional ALU instruction. For example, the immediate operand instruction 124 may be an ARM NCUALUOP instruction 124 as described in step 1056, which specifies a modified immediate constant, rather than a register, as its source operand.
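For orientation, the fields listed above can be extracted from a 32-bit ARM data-processing immediate instruction word as in the following Python sketch. The bit positions used are those of the classic 32-bit ARM encoding (condition in bits 31:28, opcode in 24:21, Rn in 19:16, Rd in 15:12, rotate_imm in 11:8, immed_8 in 7:0); the function name and the mapping of dictionary keys to the patent's field labels are assumptions made for illustration.

```python
def decode_dp_immediate(insn: int) -> dict:
    """Extract the fields of a 32-bit ARM data-processing immediate word.

    Hypothetical helper; it does not check that the word really is a
    data-processing immediate instruction.
    """
    return {
        "cond": (insn >> 28) & 0xF,
        "opcode": (insn >> 21) & 0xF,        # corresponds to opcode field b202
        "s": (insn >> 20) & 0x1,
        "rn": (insn >> 16) & 0xF,            # source register 1 field b204
        "rd": (insn >> 12) & 0xF,            # destination register field b206
        "rotate_imm": (insn >> 8) & 0xF,     # rotate_imm field b209
        "immed_8": insn & 0xFF,              # immed_8 field b208
        "immediate_field": insn & 0xFFF,     # lower 12 bits, field b207
    }


if __name__ == "__main__":
    # 0xE28114FF encodes ADD R1, R1, #0xFF000000 (condition AL).
    for name, value in decode_dp_immediate(0xE28114FF).items():
        print(f"{name:16s} {value:#x}")
```

The 12-bit `immediate_field` is the value that the instruction translator would compare against its predetermined subset.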

The ROR micro instruction 3344 includes an opcode field b222, a destination register field b226, and two source operand fields, shown in FIG. 33 as the immed_8 field b228 and the rotate_imm field b229, which specify the source operands used in executing the immediate operand instruction 124. The opcode field b222 contains a value that distinguishes the ROR micro instruction 3344 from other micro instructions of the micro instruction set architecture of the microprocessor 100. The destination register field b226 specifies an architectural register 106 or a temporary register 106 into which the result of the ROR micro instruction 3344 is written. When the instruction translator 104 translates the immediate operand instruction 124 and the value specified by the immediate field b207 does not fall within the predetermined subset, the instruction translator 104 populates the immed_8 field b228 and the rotate_imm field b229 with the corresponding values of the immed_8 field b208 and the rotate_imm field b209 of the immediate operand instruction 124, and populates the destination register field b226 to specify a temporary register 106 to receive the result of the rotation, which is subsequently used by the ALU micro instruction 3346 as a source operand. In addition, the ROR micro instruction 3344 may be the shift micro instruction 126 (labeled SHF in FIG. 10) that specifies a modified immediate constant, as described in more detail with respect to FIGS. 10 and 11. For example, if the translated immediate operand instruction 124 is the ARM NCUALUOP instruction 124 specifying a modified immediate constant as described in step 1056, the ROR micro instruction 3344 may be the SHF micro instruction 126 of step 1056.

The ALU micro instruction 3346 includes an opcode field b232, a source register 1 field b234, a source register 2 field b235, and a destination register field b236. The opcode field b232 contains a value that distinguishes the ALU micro-instruction 3346 from other micro-instructions of the micro-instruction set architecture of the microprocessor 100 and specifies the ALU function to be performed on the source operands, which is the same as that specified by the immediate operand instruction 124 from which it was translated. The source register 1 field b234 specifies an architectural register 106 or a temporary register 106 from which the first source operand is provided to the ALU micro instruction 3346; the source register 2 field b235 specifies an architectural register 106 or a temporary register 106 from which the second source operand is provided to the ALU micro instruction 3346; and the destination register field b236 specifies an architectural register 106 or a temporary register 106 to which the result of the ALU micro instruction 3346 is written. When the instruction translator 104 translates the immediate operand instruction 124 and the value specified by the immediate field b207 does not fall within the predetermined subset, the instruction translator 104 populates the source register 1 field b234 to specify the same register as the source register 1 field b204 of the immediate operand instruction 124, populates the destination register field b236 to specify the same register as the destination register field b206 of the immediate operand instruction 124, and populates the source register 2 field b235 to specify the same temporary register 106 as the destination register field b226 of the ROR micro instruction 3344. As previously mentioned, the ALU microinstruction 3346 may be any of the ALU operation microinstructions 126, designated ALUOP, ALUOPUC, CALUOP, and NCUALUOP, including the conditional versions of the microinstructions described in detail in FIGS. 10 and 12. For example, if the translated immediate operand instruction 124 is the ARM NCUALUOP instruction 124 of step 1056, and the modified immediate constant it specifies does not fall within the predetermined subset, the ALU micro-instruction 3346 may be the NCUALUOP micro-instruction 126 of step 1056.

The immediate ALU micro instruction 3348 includes an opcode field b212, a source register 1 field b214, a destination register field b216, and an immediate-32 field b218. In one embodiment, the immediate-32 field b218 is the immediate operand 3366 received by the execution unit 424 that executes the immediate ALU microinstruction 3348. That is, an operand multiplexer (not shown) selects the immediate operand 3366 to be provided to the execution unit 424 that receives the immediate ALU microinstruction 3348. The opcode field b212 contains a value that distinguishes the immediate ALU microinstruction 3348 from other microinstructions of the microinstruction set architecture of the microprocessor 100 and specifies the ALU function to be performed on the source operands, which is the same as that specified by the immediate operand instruction 124 from which it was translated. The source register 1 field b214 specifies an architectural register 106 or a temporary register 106 from which a first source operand is provided to the immediate ALU micro instruction 3348, and the destination register field b216 specifies an architectural register 106 or a temporary register 106 into which the result of the immediate ALU micro instruction 3348 is written. When the instruction translator 104 translates the immediate operand instruction 124 and the value specified by the immediate field b207 falls within the predetermined subset, the instruction translator 104 populates the source register 1 field b214 to specify the same register as the source register 1 field b204 of the immediate operand instruction 124, and populates the destination register field b216 to specify the same register as the destination register field b206 of the immediate operand instruction 124. As previously described, the immediate ALU microinstruction 3348 may be any of the ALU operation microinstructions 126, designated ALUOP, ALUOPUC, CALUOP, and NCUALUOP, including the conditional versions of the microinstructions detailed in FIGS. 10 and 12, that specify an immediate source operand. For example, if the translated immediate operand instruction 124 is the ARM NCUALUOP instruction 124 of step 1056 and the modified immediate constant it specifies falls within the predetermined subset, the immediate ALU micro-instruction 3348 may be the NCUALUOP micro-instruction 126 of step 1056, and the instruction translator 104 need not issue the SHF micro-instruction 126 of step 1056, which provides the advantages described above of having the instruction translator 104 process the modified immediate constant.
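The field population rules for the three microinstruction types can be summarized in a small Python model. This is only a sketch under stated assumptions: the `MicroOp` record, the `translate` function, the string register names, and the temporary register name `"T0"` are all hypothetical and merely mirror the fields named above; they are not the encoding used by the microprocessor 100.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MicroOp:
    """Illustrative stand-in for a microinstruction (not the real encoding)."""
    kind: str                         # "ROR", "ALU", or "IMM_ALU"
    opcode: Optional[str] = None      # ALU function, e.g. "ADD"
    src1: Optional[str] = None        # source register 1
    src2: Optional[str] = None        # source register 2 (ALU only)
    dest: Optional[str] = None        # destination register
    immed_8: Optional[int] = None     # ROR only
    rotate_imm: Optional[int] = None  # ROR only
    imm32: Optional[int] = None       # IMM_ALU only: precomputed operand


def translate(opcode: str, rn: str, rd: str, immed_8: int, rotate_imm: int,
              subset: Dict[int, int], temp: str = "T0") -> List[MicroOp]:
    """Emit one IMM_ALU micro-op if the 12-bit immediate field is in the
    predetermined subset, otherwise a ROR micro-op followed by an ALU micro-op."""
    field = (rotate_imm << 8) | immed_8
    if field in subset:
        return [MicroOp("IMM_ALU", opcode=opcode, src1=rn, dest=rd,
                        imm32=subset[field])]
    return [
        # The ROR result goes to a temporary register ...
        MicroOp("ROR", dest=temp, immed_8=immed_8, rotate_imm=rotate_imm),
        # ... which the dependent ALU micro-op names as its second source.
        MicroOp("ALU", opcode=opcode, src1=rn, src2=temp, dest=rd),
    ]


if __name__ == "__main__":
    subset = {0x0FF: 0x000000FF}  # made-up one-entry subset
    print(translate("ADD", "R1", "R1", 0xFF, 0, subset))
    print(translate("ADD", "R1", "R1", 0xFF, 4, subset))
```

In the two-microinstruction case the only link between the micro-ops is the temporary register, which is the dependency that the register allocation table 402 records.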

Referring now to FIG. 34, a flowchart illustrating operation of the microprocessor 100 of FIG. 32 when executing an immediate operand instruction of FIG. 33 according to one embodiment of the present invention is shown. The flow begins at step 3402.

In step 3402, the instruction translator 104 encounters an immediate operand instruction 124 of FIG. 33 and compares the immediate field b207 (for an ARM immediate operand instruction 124, the lower 12 bits) against the predetermined subset of values. Decision step 3404 is next entered.

At decision step 3404, the instruction translator 104 determines whether the value of the immediate field b207 falls within the predetermined subset of values. If yes, go to step 3406; otherwise proceed to step 3414.

At step 3406, the instruction translator 104 issues a single immediate ALU micro-instruction 3348, as shown in FIG. 33, in response to the immediate operand instruction 124. In one embodiment, if the immediate operand instruction 124 is a conditional ALU instruction 124 specifying a source-destination shared register, the immediate ALU micro-instruction 3348 is one of the ALU micro-instructions 126 described in steps 2134, 2136, 2154, and 2156 of FIG. 21, but the SHF micro-instruction described above is not issued. If the conditional ALU instruction 124 does not specify a source-destination shared register, the instruction translator 104 issues the immediate ALU microinstruction 3348 and the conditional move microinstructions 126 (XMOV and CMOV) described in steps 1034, 1036, 1054, and 1056 of FIG. 10, but does not issue the SHF microinstruction. In this case, the dependency information for the conditional move microinstruction 126 generated by the register allocation table 402 indicates that the conditional move microinstruction 126 is dependent upon the result of the immediate ALU microinstruction 3348. Step 3408 is next entered.

At step 3408, the instruction issue unit 408 issues the immediate ALU microinstruction 3348 to the execution unit 424. Step 3412 is next entered.

At step 3412, the execution unit 424 receives the value of the 32-bit immediate operand 3366 transmitted down the pipeline on the immediate operand bus, along with the source operand specified by the source register 1 field b214. The execution unit 424 executes the immediate ALU micro instruction 3348 by performing the ALU function specified in the opcode field b212 on the 32-bit immediate operand 3366 and the other source operand to generate a result on the result bus 128 for subsequent retirement to the architectural register 106 specified in the destination register field b216, which is the same architectural register 106 as that specified in the destination register field b206 of the immediate operand instruction 124. If at step 3406 the instruction translator 104 issued a conditional move microinstruction 126, the result of the immediate ALU microinstruction 3348 is destined for a temporary register 106 rather than the destination register 106 specified by the immediate operand instruction 124; in response to the execution unit 424 completing the immediate ALU microinstruction 3348 at step 3412, the instruction issue unit 408 issues the conditional move microinstruction 126 to the execution unit 424, and the execution unit 424 executes the conditional move microinstruction 126 to generate the result of the immediate operand instruction 124, as described above, particularly with respect to FIGS. 10-20. Flow ends at step 3412.

At step 3414, the instruction translator 104 issues two micro instructions, a ROR micro instruction 3344 and an ALU micro instruction 3346 of FIG. 33, in response to the immediate operand instruction 124. In one embodiment, if the immediate operand instruction 124 is a conditional ALU instruction 124 specifying a modified immediate constant, the ROR microinstruction 3344 is the SHF microinstruction 126 described in steps 1034, 1054, and 1056 of FIG. 10 or in steps 2134, 2136, 2154, and 2156 of FIG. 21. For example, if the translated immediate operand instruction 124 is the ARM NCUALUOP instruction 124 of step 1056, which specifies a modified immediate constant that does not fall within the predetermined subset, the ROR micro instruction 3344 may be the SHF micro instruction 126 of step 1056. In one embodiment, if the immediate operand instruction 124 is a conditional ALU instruction 124 specifying a source-destination shared register, the ALU microinstruction 3346 may be one of the ALU microinstructions 126 described in steps 2134, 2136, 2154, and 2156 of FIG. 21. If the immediate operand conditional ALU instruction 124 does not specify a source-destination shared register, the instruction translator 104 issues the ALU microinstruction 3346 and the conditional move microinstructions 126 (XMOV and CMOV) described in steps 1034, 1036, 1054, and 1056 of FIG. 10. Step 3416 is next entered.

At step 3416, the register allocation table 402 generates dependency information for the ALU micro instruction 3346 indicating that the ALU micro instruction 3346 is dependent upon the result of the ROR micro instruction 3344. If at step 3414 the instruction translator 104 issued a conditional move microinstruction 126, the register allocation table 402 also generates dependency information for the conditional move microinstruction 126 indicating that the conditional move microinstruction 126 is dependent upon the result of the ALU microinstruction 3346. Step 3418 is next entered.

At step 3418, the instruction issue unit 408 issues the ROR microinstruction 3344 to the execution unit 424. The execution unit 424 thus receives the values of the immed_8 field b208 and the rotate_imm field b209 specified by the immediate operand instruction 124. Step 3422 is next entered.

At step 3422, the execution unit 424 executes the ROR microinstruction 3344 to generate an immediate operand result, which is written to the temporary register 106 specified by the destination register field b226. Step 3424 is next entered.

At step 3424, the instruction issue unit 408 issues the ALU microinstruction 3346 to the execution unit 424 in response to the execution unit 424 completing the ROR microinstruction 3344 at step 3422. The execution unit 424 (integer unit 124) thus receives the result of the ROR micro instruction 3344 generated at step 3422 and the operand from the register specified by the source register 1 field b234 of the ALU micro instruction 3346, which is the same architectural register 106 as that specified by the source register 1 field b204 of the immediate operand instruction 124. Step 3426 is next entered.

At step 3426, the execution unit 424 executes the ALU micro instruction 3346 by performing the ALU function specified in the opcode field b232 on the two source operands to generate a result on the result bus 128 for subsequent retirement to the architectural register 106 specified in the destination register field b236, which is the same architectural register 106 as that specified in the destination register field b206 of the immediate operand instruction 124. If at step 3414 the instruction translator 104 issued a conditional move microinstruction 126, the result of the ALU microinstruction 3346 is destined for a temporary register 106 rather than the destination register 106 specified by the immediate operand instruction 124, and the instruction issue unit 408 issues the conditional move microinstruction 126 to the execution unit 424 in response to the execution unit 424 completing the ALU microinstruction 3346 at step 3426. As previously described with respect to FIGS. 10-20, the execution unit 424 executes the conditional move microinstruction 126 to generate the result of the immediate operand instruction 124. The flow ends at step 3426.
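The two paths of the flowchart should produce the same architectural result; they differ only in the number of microinstructions and in the use of a temporary register. The following Python sketch illustrates this for an unconditional instruction. The register-file dictionary, the function names, the temporary register name `"T0"`, and the small set of ALU functions are illustrative assumptions, and the conditional move case is omitted.

```python
MASK32 = 0xFFFFFFFF

# A few ALU functions, chosen only for illustration.
ALU_FUNCS = {
    "ADD": lambda a, b: (a + b) & MASK32,
    "AND": lambda a, b: a & b,
    "ORR": lambda a, b: a | b,
    "EOR": lambda a, b: a ^ b,
}


def ror32(value: int, amount: int) -> int:
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def run_two_uop_path(regs, opcode, rn, rd, immed_8, rotate_imm):
    """Steps 3414-3426: ROR writes a temporary register, then ALU reads it."""
    regs = dict(regs)
    regs["T0"] = ror32(immed_8, 2 * rotate_imm)          # ROR micro-op
    regs[rd] = ALU_FUNCS[opcode](regs[rn], regs["T0"])   # dependent ALU micro-op
    return regs


def run_single_uop_path(regs, opcode, rn, rd, imm32):
    """Steps 3406-3412: one immediate ALU micro-op with a precomputed operand."""
    regs = dict(regs)
    regs[rd] = ALU_FUNCS[opcode](regs[rn], imm32)
    return regs


if __name__ == "__main__":
    regs = {"R1": 0x10}
    slow = run_two_uop_path(regs, "ADD", "R1", "R1", 0xFF, 4)
    fast = run_single_uop_path(regs, "ADD", "R1", "R1", 0xFF000000)
    print(hex(slow["R1"]), hex(fast["R1"]))
```

In this model the single-microinstruction path never touches `"T0"`, which corresponds to the saved temporary register write and the removed dependency.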

As can be seen from the foregoing, the microprocessor 100 of the present invention translates the immediate operand instruction 124 into a single immediate ALU microinstruction 3348, rather than multiple microinstructions, under certain circumstances. That is, when the immediate field b207 value falls within the predetermined subset of values, the instruction translator 104 directly issues the corresponding precomputed immediate operand 3366 value, which may provide the following advantages.

First, the present invention reduces the number of instruction slots or entries that microinstructions occupy within the resources of the non-sequential execution processor, such as the register allocation table 402, reorder buffer 422, reservation stations 406, and execution units 424, thereby potentially reducing the size and power consumption of those resources.

Second, the average number of instructions of an instruction set architecture program (e.g., ARM instructions) that the instruction translator 104 translates per clock cycle can be increased. For example, assume that the instruction translator 104 can translate up to three ARM instructions per clock cycle but can issue only up to three microinstructions per clock cycle, and must also observe the restriction that all microinstructions associated with a given ARM instruction issue in the same clock cycle; i.e., the instruction translator 104 cannot issue one microinstruction associated with an ARM instruction in a first clock cycle and a second microinstruction associated with the same ARM instruction in the next clock cycle. Assume the following ARM instruction sequence, where IOI is an immediate operand instruction 124, such as a conditional ALU instruction, that specifies a destination register that is also a source register, and "Rx" denotes a general purpose register:

IOI R1, R1, immediate field value A

IOI R3, R3, immediate field value B

IOI R5, R5, immediate field value C

In the case where immediate field values A, B and C do not fall within the predetermined subset, the instruction translator 104 must take three clock cycles to translate the three IOI instructions, because each IOI instruction is translated into two microinstructions and at most three microinstructions may issue per clock cycle. However, in the case where immediate field values A, B and C fall within the predetermined subset, the instruction translator 104 may need only one clock cycle to translate the three IOI instructions. Furthermore, this advantage may also be obtained when IOI instructions are mixed with other ARM instructions. For example, assume an ARM instruction D that is translated into two microinstructions, followed by an IOI instruction specifying an immediate field value that falls within the predetermined subset, followed by an ARM instruction E that is translated into two microinstructions, followed by an ARM instruction F that is translated into a single microinstruction. In this case, the instruction translator 104 may translate ARM instruction D and the IOI instruction in a single clock cycle and then translate ARM instructions E and F in the next clock cycle, i.e., the four ARM instructions complete translation in two clock cycles. In contrast, without the functionality described in this embodiment, the instruction translator 104 would require three clock cycles to translate the four instructions. Similar advantages exist with respect to the instruction issue unit 408 and the retirement unit 422. A similar advantage also arises for a four-instruction-wide instruction translator and conditional ALU instructions whose destination register is not also a source register: two such instructions may be translated in the same clock cycle, whereas two clock cycles would be required without the functionality described in this embodiment.
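The cycle counts in the examples above can be checked with a minimal Python sketch. It is an illustration only, not part of the described embodiment: the function name `translation_cycles` and its parameters are hypothetical, and it assumes the translator packs instructions greedily in program order, subject to the per-cycle instruction and microinstruction limits and to the rule that all microinstructions of one instruction issue in the same clock cycle. For the four-wide case it further assumes the microinstruction limit is also four.

```python
def translation_cycles(uops_per_insn, max_insns=3, max_uops=3):
    """Count clock cycles needed to translate a sequence of instructions.

    uops_per_insn: microinstructions produced by each instruction, in program order.
    max_insns:     instructions the translator can translate per clock cycle.
    max_uops:      microinstructions the translator can issue per clock cycle.

    Assumption (hypothetical model): instructions are packed greedily in order,
    and all microinstructions of one instruction must issue in the same cycle.
    """
    cycles = 0
    i = 0
    n = len(uops_per_insn)
    while i < n:
        if uops_per_insn[i] > max_uops:
            raise ValueError("instruction exceeds the per-cycle microinstruction limit")
        insns = uops = 0
        # Add instructions to this cycle until either limit would be exceeded.
        while i < n and insns < max_insns and uops + uops_per_insn[i] <= max_uops:
            uops += uops_per_insn[i]
            insns += 1
            i += 1
        cycles += 1
    return cycles


# Three IOI instructions, immediates outside the subset (two microinstructions each).
print(translation_cycles([2, 2, 2]))        # 3
# Three IOI instructions, immediates inside the subset (one microinstruction each).
print(translation_cycles([1, 1, 1]))        # 1
# D, IOI, E, F with the IOI immediate inside the subset.
print(translation_cycles([2, 1, 2, 1]))     # 2
# D, IOI, E, F with the IOI translated into two microinstructions.
print(translation_cycles([2, 2, 2, 1]))     # 3
# Four-wide translator, two conditional ALU instructions: two vs. three microinstructions each.
print(translation_cycles([2, 2], 4, 4))     # 1
print(translation_cycles([3, 3], 4, 4))     # 2
```

Under these assumptions the sketch reproduces the counts stated in the text: three cycles versus one for the three IOI instructions, two cycles versus three for the mixed sequence, and one cycle versus two for the four-wide case.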

Third, when the value of the immediate field b207 falls within the predetermined subset and the instruction translator 104 can therefore issue a single microinstruction (or two microinstructions instead of three), the latency of the immediate operand instruction 124 may be reduced because the second (or third) microinstruction is eliminated.

Fourth, the absence of the additional microinstruction from the reorder buffer and/or the reservation stations may increase the lookahead capability of the microprocessor 100, thereby increasing its ability to exploit the instruction level parallelism of the program being executed, increasing the utilization of the execution units 424, and improving the overall performance of the microprocessor 100. Specifically, eliminating the second microinstruction leaves more room in the reorder buffer for other microinstructions, thereby creating a larger pool of microinstructions that may be dispatched to the execution units 424 for execution. A microinstruction cannot be issued for execution until it is "ready", i.e., until all of its source operands produced by previous microinstructions are available. Thus, the larger the pool of microinstructions in which the microprocessor 100 searches for ready microinstructions, the greater the chance of finding one, and therefore the greater the chance that the execution units 424 are utilized. This is commonly referred to as the lookahead capability of a microprocessor, i.e., its ability to fully exploit the instruction level parallelism of the program it executes. The greater the lookahead capability, the more efficiently the execution units 424 are generally utilized. Thus, by translating the immediate operand instruction 124 into a single immediate ALU microinstruction 3346, rather than multiple microinstructions, based on the value of the immediate field b207, the microprocessor 100 of the present invention may increase its lookahead capability.

Although the immediate operand instruction of the embodiments described above is an ARM instruction with data processing immediate encoding, the techniques may be applied to translate immediate operand instructions of other instruction set architectures. It should be noted that the present invention is also applicable to cases where there is no pre-existing microarchitecture, or where the pre-existing microarchitecture supports an instruction set architecture other than the x86 instruction set architecture. Moreover, the present invention is described herein in the general context of a processor that supports immediate operand instructions of an instruction set architecture by translating them into different microinstruction sequences of an out-of-order execution microarchitecture, depending on whether the immediate field value specified by the immediate operand instruction falls within a predetermined subset.

In another embodiment, the instruction translator 104 generates the immediate operand 3266 of FIG. 32 for all values of the immediate field b207 of the immediate operand instruction 124 of FIG. 33. That is, the predetermined subset of values of the immediate field b207 comprises all possible values of the immediate field b207. The following is the Verilog hardware description language encoding of this embodiment.
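By way of illustration only, and not as the Verilog encoding of the embodiment, the constant that a translator would have to produce for every possible immediate field value can be sketched in Python. The sketch assumes the ARM modified immediate constant encoding for data processing instructions, in which the 12-bit immediate field holds a 4-bit rotate amount (the second value) and an 8-bit value (the first value), and the 32-bit constant is the 8-bit value rotated right by twice the rotate amount. The function name `expand_modified_immediate` is hypothetical.

```python
def expand_modified_immediate(imm12):
    """Expand a 12-bit ARM modified immediate field into a 32-bit constant.

    Assumed layout: bits [11:8] = rotate amount (second value),
                    bits [7:0]  = 8-bit value (first value).
    The constant is the 8-bit value rotated right by 2 * rotate amount.
    """
    if not 0 <= imm12 <= 0xFFF:
        raise ValueError("immediate field must fit in 12 bits")
    rotate = (imm12 >> 8) & 0xF   # second portion of the immediate field
    imm8 = imm12 & 0xFF           # first portion of the immediate field
    amount = 2 * rotate           # rotate right by twice the second value
    # 32-bit rotate right; the mask discards bits shifted past bit 31.
    return ((imm8 >> amount) | (imm8 << (32 - amount))) & 0xFFFFFFFF


print(hex(expand_modified_immediate(0x0FF)))  # 0xff        (no rotation)
print(hex(expand_modified_immediate(0x4FF)))  # 0xff000000  (rotate right by 8)
print(hex(expand_modified_immediate(0x1FF)))  # 0xc000003f  (rotate right by 2)
```

In an embodiment that handles only a predetermined subset of immediate field values at translation time, the same computation would apply to the values inside the subset, while the remaining values would be left to a shift/rotate microinstruction executed by the execution pipeline.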

However, the above description is only a preferred embodiment of the present invention, and should not be taken as limiting the scope of the invention, which is intended to cover all the modifications and equivalents of the claims and the description. For example, software can enable the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described herein. This can be accomplished using general programming languages (e.g., C, C++), hardware description languages (HDL) including Verilog HDL, VHDL, and so on, or other available programs. Such software can be disposed on any known computer usable medium such as magnetic tape, semiconductor memory, magnetic disk, optical disk (e.g., CD-ROM, DVD-ROM, etc.), a network, or other communication medium. Embodiments of the apparatus and methods described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (e.g., embodied in a hardware description language), and transformed into hardware through the fabrication of integrated circuits. Furthermore, the apparatus and methods described herein may be embodied as a combination of hardware and software. Accordingly, none of the embodiments described herein is intended to limit the scope of the invention. In addition, the present invention can be applied to a microprocessor device of a general-purpose computer. Finally, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the scope of the invention as defined by the appended claims.

Moreover, no single embodiment or claim of the invention need achieve all of the objects, advantages, or features disclosed herein. In addition, the abstract and the title are provided to assist patent document searching and are not intended to limit the scope of the claims.

[Cross-reference to related applications]

This application is a continuation-in-part of the following U.S. non-provisional patent applications, each of which is incorporated herein by reference in its entirety:

Application No.	Filing Date
13/224,310 (CNTR.2575)	09/01/2011
13/333,520 (CNTR.2569)	12/21/2011
13/333,572 (CNTR.2572)	12/21/2011
13/333,631 (CNTR.2618)	12/21/2011

This application claims priority to the following U.S. provisional patent applications, each of which is incorporated herein by reference in its entirety:

Application No.	Filing Date
61/473,062 (CNTR.2547)	04/07/2011
61/473,067 (CNTR.2552)	04/07/2011
61/473,069 (CNTR.2556)	04/07/2011
61/537,473 (CNTR.2569)	09/21/2011
61/541,307 (CNTR.2585)	09/30/2011
61/547,449 (CNTR.2573)	10/14/2011
61/555,023 (CNTR.2564)	11/03/2011
61/604,561 (CNTR.2552)	02/29/2012

U.S. non-provisional patent application

13/224,310 (CNTR.2575)	09/01/2011

claims priority to the following U.S. provisional applications:

61/473,062 (CNTR.2547)	04/07/2011
61/473,067 (CNTR.2552)	04/07/2011
61/473,069 (CNTR.2556)	04/07/2011

Each of the following three U.S. non-provisional applications

13/333,520 (CNTR.2569)	12/21/2011
13/333,572 (CNTR.2572)	12/21/2011
13/333,631 (CNTR.2618)	12/21/2011

is a continuation of the following U.S. non-provisional application:

13/224,310 (CNTR.2575)	09/01/2011

and claims priority to the following U.S. provisional applications:

61/473,062 (CNTR.2547)	04/07/2011
61/473,067 (CNTR.2552)	04/07/2011
61/473,069 (CNTR.2556)	04/07/2011
61/537,473 (CNTR.2569)	09/21/2011
		61/537,473(CNTR.2569)	09/21/2011

This application is related to the following U.S. non-provisional patent applications:

13/413,258 (CNTR.2552)	03/06/2012
13/412,888 (CNTR.2580)	03/06/2012
13/412,904 (CNTR.2583)	03/06/2012
13/412,914 (CNTR.2585)	03/06/2012
13/413,346 (CNTR.2573)	03/06/2012
13/413,300 (CNTR.2564)	03/06/2012
13/413,314 (CNTR.2568)	03/06/2012

Claims

1. A microprocessor having an instruction set architecture defining an instruction including an immediate field having a first portion specifying a first value and a second portion specifying a second value, the instruction directing the microprocessor to perform an operation with a fixed value as one of source operands, the fixed value being obtained by rotating/shifting the first value by a number of bits based on the second value, the microprocessor comprising:

an instruction translator for translating the instruction into at least one immediate ALU micro-instruction, wherein the immediate ALU micro-instruction is encoded in a different instruction encoding than that defined by the instruction set architecture; and

an execution pipeline that executes micro instructions generated by the instruction translator to generate results defined by the instruction set architecture;

wherein the instruction translator, but not the execution pipeline, generates the fixed value as a source operand for the immediate ALU micro-instruction based on the first and second values for execution by the execution pipeline.

2. The microprocessor of claim 1, wherein the instruction translator translates the instruction into different microinstructions based on whether a value of the immediate field falls within a predetermined subset of values.

3. The microprocessor of claim 1, wherein the execution pipeline comprises:

a plurality of execution units that execute the microinstructions to generate the result; and

an issue unit that issues the fixed value generated by the instruction translator to at least one of the execution units as the source operand of the immediate ALU microinstruction executed by the at least one of the execution units.

4. The microprocessor of claim 1, wherein the execution pipeline comprises:

a plurality of execution units that execute the microinstructions to generate the result;

wherein the microprocessor further comprises:

one or more first buses for transmitting execution results of the microinstructions from the execution units back to the execution units as source operands of other microinstructions; and

a second bus providing the fixed value generated by the instruction translator to the execution pipeline, wherein the second bus is different from the one or more first buses.

5. The microprocessor of claim 4, further comprising:

a plurality of registers to receive results of the execution of the microinstructions from the execution units, wherein the fixed value generated by the instruction translator is not written into the registers by the microprocessor.

6. The microprocessor of claim 1, wherein the fixed value is obtained by rotating/shifting the first value by a number of bits equal to twice the second value.

7. The microprocessor of claim 1, wherein the instruction set architecture of the microprocessor defines a plurality of instructions each including an immediate field, including data processing instructions of the Advanced RISC Machines (ARM) instruction set architecture (ISA) that specify a modified immediate constant.

8. The microprocessor of claim 7, wherein the data processing instruction of the ARM instruction set architecture specifying a modified immediate constant comprises a conditional ALU instruction specifying a modified immediate constant.

9. A processing method for a microprocessor, the method performed by a microprocessor having an instruction set architecture defining an instruction, the instruction including an immediate field having a first portion specifying a first value and a second portion specifying a second value, the instruction directing the microprocessor to perform an operation with a fixed value as one of source operands, the fixed value being obtained by rotating/shifting the first value by a number of bits based on the second value, the method comprising:

translating the instruction into at least one immediate ALU microinstruction encoded in an instruction encoding manner different from that defined by the instruction set architecture, wherein the translating is performed by an instruction translator of the microprocessor; and

executing the microinstructions generated by the instruction translator to generate a result defined by the instruction set architecture, wherein the executing step is performed by an execution pipeline of the microprocessor;

wherein the fixed value is generated by the instruction translator, but not the execution pipeline, as a source operand to the immediate ALU micro-instruction based on the first and second values for execution by the execution pipeline.

10. The method of claim 9, wherein the translating step includes translating the instruction into different micro instructions based on whether a value of the immediate field falls within a predetermined subset of values.

11. The method of claim 9, wherein the fixed value is obtained by rotating/shifting the first value by a number of bits equal to twice the second value.

12. The method of claim 9, wherein the instruction set architecture of the microprocessor defines a plurality of instructions each including an immediate field, including data processing instructions of the Advanced RISC Machines (ARM) instruction set architecture (ISA), the data processing instructions specifying a modified immediate constant.

13. A microprocessor having an instruction set architecture that defines an instruction that includes an immediate field having a first portion specifying a first value and a second portion specifying a second value, the instruction directing the microprocessor to perform an operation with a fixed value as one of source operands, the fixed value being obtained by rotating/shifting the first value by a number of bits based on the second value, the microprocessor comprising:

an instruction translator for translating the instruction into micro instructions; and

an execution pipeline that executes the microinstructions generated by the instruction translator to generate a result defined by the instruction set architecture;

wherein when a value of the immediate field falls within a predetermined subset of values:

the instruction translator translates the instruction into at least one immediate ALU micro-instruction;

the instruction translator, but not the execution pipeline, generates the fixed value according to the first value and the second value; and

the execution pipeline executes the immediate ALU microinstruction using the fixed value generated by the instruction translator as one of the source operands; and

wherein when the value of the immediate field does not fall within the predetermined subset of values:

the instruction translator translates the instruction into at least a first micro instruction and a second micro instruction;

the execution pipeline, rather than the instruction translator, generates the fixed value by executing the first micro instruction; and

the execution pipeline executes the second micro instruction by using the fixed value generated by the execution of the first micro instruction as one of the source operands.

14. The microprocessor of claim 13, wherein the execution pipeline comprises:

a register allocation table for generating the association between the second micro instruction and the fixed value generated by the first micro instruction.

15. The microprocessor of claim 13, wherein all of the microinstructions are defined by a microarchitecture of the microprocessor and are encoded with an instruction encoding different from that defined by the instruction set architecture.

16. The microprocessor of claim 13, wherein the first micro instruction is a shift/rotate micro instruction.

17. A processing method for a microprocessor, the method performed by a microprocessor having an instruction set architecture defining an instruction, the instruction including an immediate field having a first portion specifying a first value and a second portion specifying a second value, the instruction directing the microprocessor to perform an operation with a fixed value as one of source operands, the fixed value being obtained by rotating/shifting the first value by a number of bits based on the second value, the microprocessor including an instruction translator and an execution pipeline, the method comprising:

determining, by the instruction translator, whether a value of the immediate field falls within a predetermined subset of values;

when the value of the immediate field falls within the predetermined subset of values:

translating the instruction into at least an immediate ALU micro-instruction using the instruction translator;

generating the fixed value according to the first value and the second value using the instruction translator instead of the execution pipeline; and

executing the immediate ALU microinstruction using the fixed value generated by the instruction translator as one of the source operands using the execution pipeline; and

when the value of the immediate field does not fall within the predetermined subset of values:

translating the instruction into at least a first micro instruction and a second micro instruction by using the instruction translator;

generating the fixed value by executing the first micro instruction using the execution pipeline instead of the instruction translator; and

executing the second micro instruction using the execution pipeline, with the fixed value generated by the execution of the first micro instruction as one of the source operands.

18. The method of claim 17, further comprising:

generating an association between the second micro instruction and the fixed value generated by the execution of the first micro instruction, wherein the generating the association is performed by a register allocation table of the microprocessor.

19. The method of claim 17, wherein all of the microinstructions are defined by a microarchitecture of the microprocessor and are encoded with an instruction encoding different from that defined by the instruction set architecture.