WO1997041509A1

WO1997041509A1 - Superscalar microprocessor including a high performance instruction alignment unit

Info

Publication number: WO1997041509A1
Application number: PCT/US1996/006164
Authority: WO
Inventors: David B. Witt; Thang M. Tran
Original assignee: Advanced Micro Devices, Inc.
Priority date: 1996-05-01
Filing date: 1996-05-01
Publication date: 1997-11-06
Also published as: EP0896700A1

Abstract

A high performance superscalar microprocessor including an instruction alignment unit is provided which is capable of routing variable byte-length instructions simultaneously to a plurality of decode units which form fixed issue positions within the superscalar microprocessor. The instruction alignment unit may be implemented with a relatively small number of cascaded levels of logic gates, thus accomodating very high frequencies of operation. In one embodiment, the superscalar microprocessor includes an instruction cache for storing a plurality of variable byte-length instructions and a predecode unit for generating predecode tags which identify the location of the start byte of each variable byte-length instruction. An instruction alignment unit is configured to channel a plurality of the variable byte-length instructions simultaneously to predetermined issue positions depending upon the locations of their corresponding start bytes in a cache line. The issue position or positions to which an instruction may be dispatched is limited depending upon the position of the instruction's start byte within a line. By limiting the number of issue positions to which a given instruction within a line may be dispatched, the number of cascaded levels of logic required to implement the instruction alignment unit may be advantageously reduced.

Description

SUPERSCALAR MICROPROCESSOR INCLUDING A HIGH PERFORMANCE INSTRUCTION ALIGNMENT UNIT

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to superscalar microprocessors and more particularly to the alignment and dispatch of variable byte length computer instructions to a plurality of instruction decoders within a high performance and high frequency superscalar microprocessor.

2. Description ofthe Relevant Art

Superscalar microprocessors are capable of attaining performance characteristics which surpass those of conventional scalar processors by allowing the concurrent execution of multiple instructions. Due to the widespread acceptance of the x86 family of microprocessors, efforts have been undertaken by microprocessor manufacturers to develop superscalar microprocessors which execute x86 instructions. Such superscalar microprocessors achieve relatively high performance characteristics while advantageously maintaining backwards compatibility with the vast amount of existing software developed for previous microprocessor generations such as the 8086, 80286, 80386, and 80486.

The x86 instruction set is relatively complex and is characterized by a plurality of variable byte length instructions. A generic format illustrative of the x86 instruction set is shown in Figure 1. As illustrated in the figure, an x86 instruction consists of from one to five optional prefix bytes 102, followed by an operation code (opcode) field 104, an optional addressing mode (Mod R/M) byte 106, an optional scale-index-base (SIB) byte 108, an optional displacement field 1 10, and an optional immediate data field 1 12. 97 1

The opcode field 104 defines the basic operation for a particular instruction. The default operation of a particular opcode may be modified by one or more prefix bytes. For example, a prefix byte may be used to change the address or operand size for an instruction, to override the default segment used in memory addressing, or to instruct the processor to repeat a string operation a number of times. The opcode field 104 follows the prefix bytes 102, if any, and may be one or two bytes in length. The addressing mode (Mod R/M) byte 106 specifies the registers used as well as memory addressing modes. The scale-index-base (SIB) byte 108 is used only in 32-bit base- relative addressing using scale and index factors. A base field ofthe SIB byte specifies which register contains the base value for the address calculation, and an index field specifies which register contains the index value. A scale field specifies the power of two by which the index value will be multiplied before being added, along with any displacement, to the base value. The next instruction field is the optional displacement field 1 10, which may be from one to four bytes in length. The displacement field 1 10 contains a constant used in address calculations. The optional immediate field 112, which may also be from one to four bytes in length, contains a constant used as an instruction operand. The shortest x86 instructions are only one byte long, and comprise a single opcode byte. The 80286 sets a maximum length for an instruction at 10 bytes, while the 80386 and 80486 both allow instruction lengths of up to 15 bytes.

The complexity ofthe x86 instruction set poses difficulties in implementing high performance x86 compatible superscalar microprocessors. One difficulty arises from the fact that instructions must be aligned with respect to the parallel-coupled instruction decoders of such processors before proper decode can be effectuated. In contrast to most RISC instruction formats, since the x86 instruction set consists of variable byte length instructions, the start bytes of successive instructions within a line are not necessarily equally spaced, and the number of instructions per line is not fixed. As a result, employment of simple, fixed-length shifting logic cannot in itself solve the problem of instruction alignment. Although scanning logic has been proposed to dynamically find the boundaries of instructions during the decode pipeline stage (or stages) of the processor, such a solution typically requires that the decode pipeline stage ofthe processor be implemented with a relatively large number of cascaded levels of logic gates and/or the allocation of several clock cycles to perform the scanning operation.

5 A further solution to instruction alignment and decode within x86 compatible superscalar microprocessors is described within the copending, commonly assigned patent application entitled "Superscalar Instruction Decoder", Serial No. 08/146,383, filed October 29, 1993 by Witt et al., the disclosure of which is incoφorated herein by reference in its entirety. The solution proposed within the above-referenced patent i o application involves a translation of each variable length x86 instruction into one or more fixed-length RISC-like instructions. Upon translation, each fixed-length RISC- like instruction is aligned with respect to an allocated instruction decoder. While this solution has been quite successful, it too typically requires a relatively large number of cascaded levels of logic gates. This correspondingly limits the maximum overall

15 clock frequency of the superscalar microprocessor.

SUMMARY OF THE INVENTION

The problems outlined above are in large part solved by a high performance 20 superscalar microprocessor including an instruction alignment unit in accordance with the present invention. In one embodiment, an instruction alignment unit is provided which is capable of routing variable byte length instructions such as x86 instructions simultaneously to a plurality of decode units which form fixed issue positions within the superscalar microprocessor. The instruction alignment unit may be implemented 25 with a relatively small number of cascaded levels of logic gates, thus accommodating very high frequencies of operation.

In one specific implementation, a superscalar microprocessor includes an instruction cache for storing a plurality of variable byte-length instructions and a 30 predecode unit for generating predecode tags which identify the location of the start byte of each variable byte-length instruction. An instruction alignment unit is configured to channel a plurality ofthe variable byte-length instructions simultaneously to predetermined issue positions depending upon the locations of their corresponding start bytes in a cache line. The issue position or positions to which an instruction may be dispatched is limited depending upon the position ofthe instruction's start byte within a line. By limiting the number of issue positions to which a given instruction of a line may be dispatched, the number of cascaded levels of logic required to implement the instruction alignment unit may be advantageously reduced.

In another implementation, instructions that have start bytes located at certain positions within a cache line may be restricted for dispatch to only one issue position. while instructions having start bytes at other positions within the cache line may be dispatched to one of a plurality of possible issue positions. By restricting the dispatch of those instructions having start bytes residing at certain positions within a line to a single issue position, the number of cascaded levels of logic may be reduced even further.

Broadly speaking, the invention contemplates a superscalar microprocessor comprising an instruction cache for storing a plurality of variable byte-length instructions, a predecode unit coupled to the instruction cache for generating a predecode tag associated with each variable byte-length instruction, and a plurality of decode units capable of decoding the variable byte length instructions, wherein each ofthe plurality of decode units is associated with a fixed issue position. An instruction alignment unit is also coupled between the instruction cache and the plurality of decode units, wherein the instruction alignment unit is configured to channel the plurality of variable byte-length instructions to predetermined issue positions depending upon the predecode tag associated with each variable byte-length instruction.

The invention further contemplates a superscalar microprocessor comprising an instruction cache for storing a plurality of variable byte-length instructions, a predecode unit coupled to the instruction cache for generating a predecode tag associated with each variable byte-length instruction, and a plurality of decode units capable of decoding the variable byte length instructions, wherein each of the plurality of decode units is associated with a fixed issue position. An instruction alignment unit is further coupled between the instruction cache and the plurality of decode units, wherein the instruction alignment unit is configured to channel a first instruction starting within a first predetermined range of positions within a cache line to a first decode unit and to channel a second instruction starting within a second range of positions within the cache line to a second decode unit.

The invention additionally contemplates a method for aligning instructions within a superscalar microprocessor comprising the steps of storing a plurality of variable byte-length instructions within an instruction cache, predecoding the plurality of variable byte-length instructions to thereby provide a tag indicative of a boundary of each ofthe plurality ofthe variable byte-length instructions, and detecting predecode tags associated with a line of instructions within the instruction cache. The method comprises the further steps of routing a first instruction starting within a first range of positions within a cache line to a first decode unit, and routing a second instruction starting within a second range of positions within the cache line to a second decode unit.

BRIEF DESCRIPTION OF THE DRAWINGS

Other objects and advantages ofthe invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which:

Figure 1 is a diagram which illustrates the generic x86 instruction set format.

Figure 2 is a block diagram of a superscalar microprocessor which includes an instruction alignment unit to forward multiple instructions to six decode units.

Figure 3 is a block diagram ofthe instruction alignment unit and six decode units.

Figures 4A-4C are block diagrams which depict execution of an MROM instruction. While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope ofthe present invention as defined by the appended claims.

DETAILED DESCRIPTION OF THE INVENTION

Referring next to Figure 2, a block diagram of a superscalar microprocessor 200 including an instruction alignment unit 206 in accordance with the present invention is shown. As illustrated in the embodiment of Figure 2, superscalar microprocessor 200 includes a prefetch/predecode unit 202 and a branch prediction unit 220 coupled to an instruction cache 204. Instruction alignment unit 206 is coupled between instruction cache 204 and a plurality of decode units 208A-208F (referred to collectively as decode units 208). Each decode unit 208A-208F is coupled to a respective reservation station units 210A-210F (referred collectively as reservation stations 210), and each reservation station 210A-210F is coupled to a respective functional unit 212A-212F (referred to collectively as functional units 212). Decode units 208, reservation stations 210, and functional units 212 are further coupled to a reorder buffer 216, a register file 218 and a load/store unit 222. A data cache 224 is finally shown coupled to load/store unit 222, and an MROM unit 209 is shown coupled to instruction alignment unit 206.

Generally speaking, instruction cache 204 is a high speed cache memory provided to temporarily store instructions prior to their dispatch to decode units 208. In one embodiment, instruction cache 204 is configured to cache up to 32 kilobytes of instruction code organized in lines of 16 bytes each (where each byte consists of 8 bits). During operation, instruction code is provided to instruction cache 204 by prefetching code from a main memory (not shown) through prefetch/precode unit 202. It is noted that instruction cache 204 could be implemented in a set-associative, a fully-associative, or a direct-mapped configuration.

Prefetch/predecode unit 202 is provided to prefetch instruction code from the main memory for storage within instruction cache 204. In one embodiment, prefetch/predecode unit 202 is configured to burst 64-bit wide code from the main memory into instruction cache 204. It is understood that a variety of specific code prefetching techniques and algorithms may be employed by prefetch/predecode unit 202.

As prefetch/predecode unit 202 fetches instructions from the main memory, it generates three predecode bits associated with each byte of instruction code: a start bit, an end bit, and a "functional" bit. The predecode bits form tags indicative of the boundaries of each instruction. The predecode tags may also convey additional information such as whether a given instruction can be decoded directly by decode units 208 or whether the instruction must be executed by invoking a microcode procedure controlled by MROM unit 209, as will be described in greater detail below.

Table 1 indicates one encoding ofthe predecode tags. As indicated within the table, if a given byte is the first byte of an instruction, the start bit for that byte is set. If the byte is the last byte of an instruction, the end bit for that byte is set. If a particular instruction cannot be directly decoded by the decode units 208, the functional bit associated with the first byte ofthe instruction is set. On the other hand, if the instruction can be directly decoded by the decode units 208, the functional bit associated with the first byte ofthe instruction is cleared. The functional bit for the second byte of a particular instruction is cleared if the opcode is the first byte, and is set if the opcode is the second byte. It is noted that in situations where the opcode is the second byte, the first byte is a prefix byte. The functional bit values for instruction byte numbers 3-8 indicate whether the byte is a MODRM or an SIB byte, as well as whether the byte contains displacement or immediate data.

Table !__*. Encoding of. Start., End and Fπnr-t.-i onal R-it-c; Instr. Start End Functional

Byte Bit Bit Bit

Number Value. ValUP Value Meani ng

1 1 X 0 Fast decode

1 1 X 1 MROM instr.

2 0 X 0 Opcode is first byte

2 0 X 1 Opcode is this byte, first byte is prefix

3-8 0 X 0 Mod R/M or SIB byte

3-8 0 X 1 Displacement or immediate data; the second functional bit set in bytes

3-8 indicates immediate data

1-8 X 0 X Not last byte of instruction

1-8 X 1 X Last byte of instruction

As stated previously, in one embodiment certain instructions within the x86 instruction set may be directly decoded by decode unit 208. These instructions are referred to as "fast path" instructions. The remaining instructions ofthe x86 instruction set are referred to as "MROM instructions". MROM instructions are executed by invoking MROM unit 209. When an MROM instruction is encountered, MROM unit 209 parses and serializes the instruction into a subset of defined fast path instructions to effectuate a desired operation. A listing of exemplary x86 instructions categorized as fast path instructions as well as a description of the manner of handling both fast path and MROM instructions will be provided further below.

Instruction alignment unit 206 is provided to channel or "funnel" variable byte length instructions from instruction cache 204 to fixed issue positions formed by decode units 208A-208F. As will be described in conjunction with Figures 3 and 4A- 4C, instruction alignment unit 206 is configured to channel instruction code to designated decode units 208A-208F depending upon the locations of the start bytes of instructions within a line as delineated by instruction cache 204. In one embodiment, the particular decode unit 208A-208F to which a given instruction may be dispatched is dependent upon both the location of the start byte of that instruction as well as the location of the previous instruction's start byte, if any. Instructions starting at certain byte locations may further be restricted for issue to only one predetermined issue position. Specific details follow.

Before proceeding with a detailed description ofthe alignment of instructions from instruction cache 204 to decode units 208, general aspects regarding other subsystems employed within the exemplary superscalar microprocessor 200 of Figure 2 will be described. For the embodiment of Figure 2, each ofthe decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above. In addition, each decode unit 208A-208F routes displacement and immediate data to a corresponding reservation station unit 21 OA-21 OF. Output signals from the decode units 208 include bit-encoded execution instructions for the functional units 212 as well as operand address information, immediate data and/or displacement data.

The superscalar microprocessor of Figure 2 supports out of order execution, and thus employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions. As will be appreciated by those of skill in the art, a temporary storage location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register to thereby store speculative register states. Reorder buffer 216 may be implemented in a first-in-first-out configuration wherein speculative results move to the "bottom" ofthe buffer as they are validated and written to the register file, thus making room for new entries at the "top" ofthe buffer. Other specific configurations of reorder buffer 216 are also possible, as will be described further below. If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218. The bit-encoded execution instructions and immediate data provided at the outputs of decode units 208A-208F are routed directly to respective reservation station units 21 OA-21 OF. In one embodiment, each reservation station unit 21 OA-21 OF is capable of holding instruction information (i.e., bit encoded execution bits as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit. It is noted that for the embodiment of Figure 2, each decode unit 208A-208F is associated with a dedicated reservation station unit 21 OA-21 OF, and that each reservation station unit 21 OA-21 OF is similarly associated with a dedicated functional unit 212A-212F. Accordingly, six dedicated "issue positions" are formed by decode units 208, reservation station units 210 and functional units 212. Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 21 OB and into functional unit 212B, and so on.

Upon decode of a particular instruction, if a required operand is a register location, register address information is routed to reorder buffer 216 and register file 218 simultaneously. Those of skill in the art will appreciate that the x86 register file includes eight 32 bit real registers (i.e., typically referred to as EAX, EBX. ECX, EDX, EBP, ESI. EDI and ESP), as will be described further below. Reorder buffer 216 contains temporary storage locations for results which change the contents of these registers to thereby allow out of order execution. A temporary storage location of reorder buffer 216 is reserved for each instruction which, upon decode, modifies the contents of one ofthe real registers. Therefore, at various points during execution of a particular program, reorder buffer 216 may have one or more locations which contain the speculatively executed contents of a given register. If following decode of a given instruction it is determined that reorder buffer 216 has previous location(s) assigned to a register used as an operand in the given instruction, the reorder buffer 216 forwards to the corresponding reservation station either: 1) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous instruction. If the reorder buffer has a location reserved for a given register, the operand value (or tag) is provided from reorder buffer 216 rather than from register file 218. If there is no location reserved for a required register in reorder buffer 216. the value is taken directly from register file 218. If the operand corresponds to a memory location, the operand value is provided to the reservation station unit through load/store unit 222.

Details regarding suitable reorder buffer implementations may be found within the publication "Superscalar Microprocessor Design" by Mike Johnson, Prentice-Hall, Englewood Cliffs, New Jersey, 1991, and within the co-pending, commonly assigned patent application entitled "High Performance Superscalar Microprocessor", Serial No. 08/146,382, filed October 29, 1993 by Witt, et al. These documents are incoφorated herein by reference in their entirety.

Reservation station units 21 OA-21 OF are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212F. As stated previously, each reservation station unit 21 OA-21 OF may store instruction information for up to three pending instructions. Each ofthe six reservation stations 210A-210F contain locations to store bit-encoded execution instructions to be speculatively executed by the corresponding functional unit and the values of operands. If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i.e., by completion ofthe execution of a previous instruction). It is noted that when an instruction is executed by one ofthe functional units 212A-212F, the result of that instruction is passed directly to any reservation station units 21 OA-21 OF that are waiting for that result at the same time the result is passed to update reorder buffer 216 (this technique is commonly referred to as "result forwarding"). Instructions arc issued to functional units for execution after the values of any required operand(s) are made available. That is, if an operand associated with a pending instruction within one ofthe reservation station units 210A- 21 OF has been tagged with a location of a previous result value within reorder buffer 216 which corresponds to an instruction which modifies the required operand, the instruction is not issued to the corresponding functional unit 212 until the operand result for the previous instruction has been obtained. Accordingly, the order in which instructions are executed may not be the same as the order ofthe original program instruction sequence. Reorder buffer 216 ensures that data coherency is maintained in situations where read-after-write dependencies occur.

In one embodiment, each ofthe functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations.

Each ofthe functional units 212 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 220. If a branch prediction was incorrect, instruction cache 204 flushes instructions not needed, and causes prefetch/predecode unit 202 to fetch the required instructions from main memory. It is noted that in such situations, results of instructions in the original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.

Results produced by functional units 212 are sent to the reorder buffer 216 if a register value is being updated, and to the load/store unit 222 if the contents of a memory location is changed. If the result is to be stored in a register, the reorder buffer 216 stores the result in the location reserved for the value ofthe register when the instruction was decoded. As stated previously, results are also broadcast to reservation station units 21 OA-21 OF where pending instructions may be waiting for the results of previous instruction executions to obtain the required operand values.

Generally speaking, load/store unit 222 provides an interface between functional units 212A-212F and data cache 224. In one embodiment, load/store unit 222 is configured with a store buffer with eight storage locations for data and address information for pending loads or stores. Functional units 212 arbitrate for access to the load/store unit 222. When the buffer is full, a functional unit must wait until the load/store unit 222 has room for the pending load or store request information. The load/store unit 222 also performs dependency checking for load instructions against pending store instructions to ensure that data coherency is maintained.

Data cache 224 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the main memory subsystem. In one embodiment, data cache 224 has a capacity of storing up to eight kilobytes of data. It is understood that data cache 224 may be implemented in a variety of specific memory configurations, including a set associative configuration.

Details regarding the dispatch of instructions from instruction cache 204 through instruction alignment unit 206 to decode units 208 will next be considered. Figure 3 is a block diagram which depicts internal portions of one embodiment of instruction alignment unit 206 as well as internal portions of decode units 208A-208F with respect to a line of instruction code to be provided from instruction cache 204. As stated previously, instruction alignment unit 206 is configured to channel variable byte length instructions (in this case certain x86 instructions referred to as fast path instructions) to decode units 208A-208F.

As shown in Figure 3, a latching unit 302 is incorporated as a portion of an output buffer section 301 of instruction cache 204. Latching unit 302 is capable of storing a line of instruction code provided from a storage array (not shown in Figure 3) of instruction cache 204 prior to being dispatched to decode units 208.

The instruction alignment unit 206 of Figure 3 includes a plurality of multiplexer circuits referred to as multiplexer channels 304A-304G coupled between latching unit 302 and decode units 208. A multiplexer control circuit 306 is further shown coupled to each multiplexer channel 304A-304G. In this embodiment, each decode unit 208A-208F includes an associated instruction decoder 318A-318F having an input port coupled to a respective multiplexer channel 304A-304F. Each decode unit 208A-208F further includes a respective displacement/immediate data buffer 330A-330F and a respective instruction issue unit 340A-340F. W

During operation, a line of instruction code to be executed is provided to latching unit 302 from the storage array of instruction cache 204. Each byte of instruction code within instruction cache 204 is associated with a corresponding 5 predecode tag including a start bit, an end bit, and a functional bit. When a line of instruction code is provided to latching unit 302, the predecode tag associated with each byte is provided to an input of multiplexer control circuit 306. As will be described in further detail below, depending upon the predecode tags corresponding to each line of instruction code within latching unit 302, multiplexer control circuit 306 l o controls multiplexer channels 304A-304G such that the instruction bytes are selectively routed to designated instruction decoders 318A-318F. Instruction paths formed by decode units 208A-208F are referred to as issue positions. The channeling of instruction code through multiplexer channels 304A-304G is dependent upon the location ofthe start byte associated with each instruction relative to each line as

15 delineated by latching unit 302. In the embodiment of Figure 3, each ofthe first five multiplexer channels 304A-304F routes four contiguous bytes of instruction code from latching unit 302 to a respective instruction decoder 318A-318F. Multiplexer channel 304G is capable of channeling up to three contiguous bytes of instruction code to instruction decoder 318.

20

Table 2 below illustrates the possible multiplexer channels 304A-304G through which start bytes may be channeled. As stated previously, the channeling of instruction code is dependent upon the location(s) of start bytes within a given line. It is noted that each multiplexer channel 304A-304F is configured to route the lowest- 25 order start byte among those allocated to it, provided the start byte has not been selected for routing by a lower order multiplexer channel. Tabl e 2-.. Tli gpat- r-he.g LO. I SSUft Pos it ions

Rased an start. Byte Locations

Start Byte Dispatch To In Location Tssne Position

0 0

1 0 or 1

2 0 or 1 3 1 or 2

4 1 or 2

5 2

6 2 or 3

7 2 or 3 8 2 or 3

9 3 or 4

10 3 or 4

11 4

12 4 or 5 13 5 or 6

14 5 or 6

15

Referring to Table 2, multiplexer channel 304A is capable of routing start bytes located at byte positions 0-2 to decode unit 318 A. Multiplexer channel 304B is capable of routing start bytes at byte positions 1 -4 to decode unit 318B. Multiplexer channel 304C is capable of transferring start bytes at byte positions 3-8 to decode unit 208C. Similarly, multiplexer channel 304D is capable of transferring start bytes at byte positions 6-10 to decode unit 208D, and multiplexer channel 304E is capable of transferring start bytes at byte positions 9-12 to decode unit 208E. Finally, multiplexer channel 304F is capable of transferring start bytes at byte positions 12-15 to decode unit 318F. Start bytes located at byte positions 13-15 may alternatively be routed through multiplexer channel 304G to a seventh issue position which is employed to wrap bytes of an incomplete instruction (i.e., an instruction which extends into the next line) to the next cache line for decode. As will be described further below, instruction bytes routed through multiplexer channel 304G are provided to instruction decoder 304A upon the next clock cycle when the remaining bytes of that instruction are available within latching unit 302.

If an instruction wraps around to a subsequent cache line, the dispatch of the instruction to a designated position is dependent upon the nature of the remaining bytes ofthe instruction that appear on the next line. For situations where solely displacement or immediate data wrap around to the next cache line, that immediate or displacement data is provided to displacement/immediate data buffer 330F through multiplexer channel 304A. It is noted that in this situation, the preceding bytes of that instruction (which appear on the preceding cache line) will have been dispatched to instruction decoder 318F during the preceding clock cycle. For situations in which prefix, opcode, MODRM, and/or SIB bytes wrap around to the next cache line, the instruction information from the previous line is routed through multiplexer channel 304G to instruction decoder 318 A, and is merged with the rest of the instruction code during the next clock cycle.

It will be appreciated that by limiting the possible number of issue positions to which a given instruction of a line may be dispatched, the number of cascaded levels of logic required to implement the instruction alignment unit 206 may be advantageously reduced. Furthermore, by restricting the dispatch of an instruction having a start byte which resides at one of a select subset of byte locations within a line to a single issue position (i.e., byte positions 5 and 1 1), the number of cascaded levels of logic for instruction alignment may be reduced even further. Accordingly, the instruction alignment unit 206 as described above allows the implementation of a superscalar microprocessor having a relatively small number of gates per pipeline stage to thereby accommodate very high frequencies of operation. For relatively long instructions, although issue positions may be skipped, relatively high performance may still be achieved since other issue positions are available for remaining instructions within a cache line.

The defined fast path instructions may be up to eight bytes in length, and may include a single prefix byte. It is noted that by limiting the defined fast path instructions to only a single prefix byte, 4 through 7 of any fast path instruction contain only displacement or immediate data. Therefore, for situations in which the instruction is greater than four bytes, the first four bytes ofthe instruction are routed through the multiplexer channel allocated to that instruction's start byte. The remaining bytes ofthe instruction are channeled through the next issue position's multiplexer channel. In such situations, the instruction decoder ofthe issue position (i.e., instruction decoder) receiving the remaining bytes of the instruction detects the absence of a start bit at its first-byte position, and accordingly passes the data to the displacement/immediate data buffer 330 of the preceding issue position and issues a NOOP instruction.

Thus, if a start byte of an instruction is located at byte position 0 of latching unit 302, that byte is provided to decode unit 208A along with the next three contiguous bytes residing at byte positions 1, 2, and 3. If the next start byte resides at position 2 (i.e., first instruction was two bytes in length), bytes 2-5 are routed through multiplexer channel 304B to decode unit 208B. For the embodiment of Figure 3, each instruction decoder 318A-318F is capable of decoding only one instruction at a time. Accordingly, although the start bytes of more than one instruction may be provided to, for example, instruction decoder 318A, only the first instruction is decoded. Bytes beyond the first end byte, corresponding to additional instructions within a given instruction decoder, are extraneous and are effectively ignored. It is noted that the multiplexer channels 304 of instruction alignment unit 206 could be alternatively configured such that only a single instruction (or portions thereof), in accordance with the instruction's start and end predecode bits, are channeled to a given instruction decoder 318.

In accordance with the above, if a first instruction starts at byte position 0, bytes 0-3 are provided to instruction decoder 318 A. If the instruction is longer than four bytes, bytes 4-7 of latching unit 302 are provided through multiplexer channel 304B to instruction decoder 318B, which subsequently passes the data to displacement/immediate data buffer 330A. For this situation, multiplexer channel

308C will route the next start byte appearing in the code to instruction decoder 318C. If, on the other hand, the first instruction starting at byte location 0 is four bytes or less, the next instruction is routed through multiplexer channel 304B beginning with the start byte ofthe second instruction. If that instruction is greater than four bytes long, the immediate or displacement data corresponding to that instruction is routed through multiplexer channel 304C to displacement/immediate data buffers 330B. The remaining multiplexer channels operate similarly. It is noted that if immediate or displacement data is wrapped around to a subsequent line from an instruction starting at a previous line, that data is provided to displacement/immediate data buffer 340F through multiplexer channel 304A when the immediate or displacement data is available in latching unit 302. It is further noted that instruction decoding is not affected since no decoding is required for displacement and immediate data. The first instruction ofthe subsequent line is therefore routed to instruction decoder 318B through multiplexer channel 304B.

It is similarly noted that if prefix, opcode, MODRM, and/or SIB information is wrapped around from an instruction beginning on a previous line, multiplexer channel 304G routes the preceding portions ofthe instruction to instruction decoder 318 A, in which case the next instruction (corresponding to the first start byte within latching unit 302 during the next clock cycle) will be routed through multiplexer channel 304B to instruction decoder 318B.

As will be understood better from the following example, situations may arise wherein none ofthe possible issue positions to which a given start byte may be provided are available due to occupation of those issue positions by previous instructions. When such a situation arises, that instruction and any instructions following it must be held until the next clock cycle for dispatch.

A sample sequence of x86 instructions is shown in Table 3 below. Instructions 1 through 7 in addition to the first byte of instruction 8 are shown within cache line 1. Cache line 2 begins with the second byte of instruction 8, and further includes instructions 9 through 16. Table..L-. Sampl e Sequence o£ Instructions.

Instr. Address Num. Cache Line

Number Offset Instruction Bytes Line Byte

1 0000 INC ESI 1 1 0

2 0001 CMP BYTE, [ESI] 3 1 1-3

3 0004 JZ DST1 2 1 4-5

4 0006 CMP BYTE, [ESI] 3 1 6-8

5 0009 JZ DST2 2 1 9-10

6 000B INC [EDX] 2 1 11-12

7 000D OR ECX,ECX 2 1 13-14

8 000F JZ DST3 2 1 15 2 0

9 0011 MOV AL, [ESI] 2 2 1-2

10 0013 MOV [ECX] ,AL 2 2 3-4

11 0015 INC ECX 1 2 5

12 0016 INC ESI 1 2 6

13 0017 CMP BYTE, [ESI] 3 2 7-9

14 001A JNZ DST4 2 2 10-11

15 001C INC [ECX] 2 2 12-13

16 001E OR ECX,ECX 2 2 14-15

Table 4 below illustrates the manner in which the above sequence of instructions in Table 3 are dispatched to the decode units 208A-208F by instruction alignment unit 206.

Table 4_ Instructi on Al ignment tor Sample Spφipnr-p

Of. Tnat- T-nct i ori Pi in Tabl ft 3.

Issue Issue Issue Issue Issue Issue Pos. 0 Pos. 1 Pos. 2 Pos. 3 Pos. 4 Pos.5

Clonk (0:2) (1 :4) (3:R) (9:12) (12:1.5)

1 Ins. 1 Ins. 2 Ins. 3 Ins. 4 Ins . 5 2 Ins . 6 Ins. 7 3 Ins. 8 Ins. 9 Ins. 10 4 Ins. 11 Ins. 12 5 Ins. 13 Ins. 14 Ins. 15 Ins.16

Instructions 1-5 are dispatched to issue positions 0-4 corresponding to decode units 318A-318E, respectively, during a first clock cycle. Instruction 6, which begins at byte position 11 of latching unit 302, can only be channeled to issue position 4 corresponding to decode unit 318E. However, since issue position 4 is already occupied by instruction 5, instruction 6 cannot be dispatched during this cycle. Accordingly, multiplexer control circuit 306 causes decode unit 318F to issue a NOOP (no operation) instruction during the decode stage when instructions 1-4 are decoded.

During clock cycle 2, instruction 6 is dispatched to issue position 4, and instruction 7 is dispatched to issue position 5. It is noted when these instructions are decoded, multiplexer control circuit 306 causes decode units 318A-318D to issue NOOP instructions. Since instruction 8 wraps around to the next cache line, the first byte ofthe instruction is wrapped around to instruction decoder 318 during the next clock cycle through multiplexer channel 304G.

During clock cycle 3, instruction 8 is dispatched to issue position 0. It is noted that the first byte of instruction 8 is wrapped around from byte position 15 ofthe previous line. Instructions 9 and 10 are further dispatched to issue positions 1 and 2 through multiplexer channels 304B and 304C, respectively. Upon decode of instructions 8-10, instruction issue units 340D-E cause NOOP instructions to be issued.

Instructions 1 1 and 12 are dispatched to issue positions 2 and 3 during clock cycle 4. Instruction 13 begins in byte 7, and cannot be routed to issue position 4. Therefore, the dispatch of instruction 13 must be held until the next clock cycle.

During clock cycle 5, instructions 13 through 16 are dispatched to issue positions 2 through 5, respectively. Similar to the above, during decode of instructions 13-16, instruction issue units 340A and 340B cause NOOP instructions to be issued for issue positions 0 and 1.

Referring back to Figure 2, instructions which are not included within the subset of x86 instructions designated as fast path instructions are executed under the control of MROM unit 209 using stored microcode. MROM unit 209 parses instructions into a series of fast path instructions which are dispatched during one or more clock cycles. When an MROM instruction within a line of code in latching unit 202 is detected by MROM unit 209, this instruction and any following it are not dispatched during the current cycle. Any instruction(s) preceding it are dispatched in accordance with the above description.

During the following clock cycle(s), MROM unit 209 provides series of fast path instructions to the decode units 208 through instruction alignment unit 206 in accordance with the microcode for that particular MROM instruction. Once all of the microcoded instructions have been dispatched to decode units 208 through alignment unit 206 to effectuate the desired MROM operation, the instructions which followed the MROM instruction are allowed to be dispatched.

Table 5 below illustrates a sample of x86 assembly language code segment containing an MROM instruction (REP MOVSB).

Table 5— X86 Assembly T.angnagp Code Segmp.nt- With MROM TnRtmπtinn

M MOOVV C CXX,, SS_LEN ;get string length

CLD ; increment indices

REP MOVSB ;move string

POP CX ,-restore registers P POOPP D DII

POP SI

Figures 4A-4C are block diagrams of portions of superscalar processor 200 depicting the dispatch and decode ofthe instructions of Table 5 during consecutive clock cycles. During the first clock cycle as depicted within Figure 4A, the first two instructions (MOVE CX. S_LEN and CLD) are routed through multiplexer channels 304A and 304B to issue positions 0 and 1 (i.e., decode units 318A and 318B). Upon decode MROM unit 209 further causes decode units 208C-208F to issue NOOP instructions.

Microcoded instructions that effectuate the REP MOVSB instruction are dispatched during cycles 2 through N, as depicted by Figure 4B. During these cycles. 4150

a set of fast path instructions in accordance with the microcode stored in MROM unit 209 are dispatched through the instruction alignment unit 206 to decode units 208A- 208F. It is noted that this MROM sequence may take several cycles to complete.

Following complete dispatch ofthe MROM instruction, the remaining instructions ofthe line following the MROM instruction are allowed to be dispatched to issue positions 3-5 through multiplexer channels 304D-304F. Upon decode of these instructions, MROM unit 209 causes decode units 208A-208C issue NOOP instructions.

In accordance with the instruction alignment unit as described above, variable byte-length computer instructions may be dispatched to a plurality of instruction decoders during the same pipeline stage. The instruction alignment unit may be implemented using a relatively small number of cascaded levels of logic gates to thereby accommodate high frequencies of operation.

It is understood that while the instruction alignment unit 206 as described above in conjunction with Figures 2-4 is configured to selectively route instructions to the specific issue positions indicated by Table 2, other configurations are also possible. That is, the specific issue position or positions to which a given instruction within a line of memory is dispatched may be varied from that described above. It is further specifically contemplated that the number of issue positions provided within a superscalar microprocessor employing an instruction alignment unit in accordance with the invention may also vary.

Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be inteφreted to embrace all such variations and modifications.

Claims

WHAT IS CLAIMED IS:

1. A superscalar microprocessor comprising:

5 an instruction cache for storing a plurality of variable byte-length instructions; a predecode unit coupled to said instruction cache for generating a predecode tag associated with each variable byte-length instruction; a plurality of decode units capable of decoding said variable byte length instructions, wherein each of said plurality of decode units is associated l o with a fixed issue position; and an instruction alignment unit coupled between said instruction cache and said plurality of decode units, wherein said instruction alignment unit is configured to channel said plurality of variable byte-length instructions to predetermined issue positions depending upon said predecode tag 15 associated with each variable byte-length instruction.

2. The superscalar microprocessor as recited in claim 1 wherein each of said plurality of said decode units is capable of decoding a predetermined subset of an x86 instruction set.

20

3. The superscalar microprocessor as recited in claim 1 wherein said plurality of variable byte-length instructions are organized in lines within said instruction cache, wherein a line includes a predetermined number of bytes.

25 4. The superscalar microprocessor as recited in claim 1 wherein said instruction alignment unit includes a plurality of multiplexer channels coupled between said instruction cache and said plurality of decode units, wherein a first of said multiplexer channels if configured to route an instruction having a start byte which resides at a first location within a line to a first decode unit, and wherein a second multiplexer

30 channel is configured to route an instruction having a start byte residing a second byte location within said line of instruction code to a second decode unit. 5. The superscalar microprocessor as recited in claim 1 wherein said instruction alignment unit is configured to restrict the dispatch of a variable byte length instruction from said instruction cache to a selected subset of said plurality of decode units depending upon said predecode tag.

6. The superscalar microprocessor as recited in claim 5 wherein said predecode tag associated with each variable byte length instruction is indicative of a start byte of said variable byte length instruction.

7. The superscalar microprocessor as recited in claim 1 further comprising a plurality of functional units configured to receive output signals from said plurality of decode units.

8. The superscalar microprocessor as recited in claim 7 wherein said output signals from said plurality of said decode units include bit-encoded execution instructions.

9. The superscalar microprocessor as recited in claim 7 further comprising a plurality of reservation stations coupled to said plurality of decode units and to said plurality of functional units, wherein said plurality of reservation stations are configured to temporarily store said output signals from said plurality of decode units prior to issuance to said plurality of said functional units.

10. The superscalar microprocessor as recited in claim 7 wherein a dedicated functional unit is associated with each of said plurality of decode units.

1 1. The superscalar microprocessor as recited in claim 1 further comprising a reorder buffer coupled to said plurality of decode units for storing speculatively- executed instruction results.

12. A superscalar microprocessor comprising: an instruction cache for storing a plurality of variable byte-length instructions; a predecode unit coupled to said instruction cache for generating a predecode tag associated with each said variable byte-length instruction; a plurality of decode units capable of decoding said variable byte length 5 instructions, wherein each of said plurality of decode units is associated with a fixed issue position; and an instruction alignment unit coupled between said instruction cache and said plurality of decode units, wherein said instruction alignment unit is configured to channel a first instruction starting within a first l o predetermined range of positions within a cache line to a first decode unit and to channel a second instruction starting within a second range of positions within said cache line to a second decode unit.

13. The superscalar microprocessor as recited in claim 12 wherein each of said 15 plurality of said decode units is capable of decoding a determined subset of instructions defined by an x86 instruction set.

14. The superscalar microprocessor as recited in claim 12 wherein said first range is associated with a first subset of positions within said cache line which is different

20 from a second subset of instructions associated with said second range.

15. The superscalar microprocessor as recited in claim 12 wherein said instruction alignment unit includes a plurality of multiplexer channels coupled between said instruction cache and said plurality of decode units, wherein a first of said multiplexer

25 channels is configured to route said first instruction starting within said first range and wherein a second multiplexer channel is configured to route said second instruction starting within said second range.

16. The superscalar microprocessor as recited in claim 12 wherein said instruction 30 alignment unit is configured to restrict the dispatch of a variable byte length instruction from said instruction cache to a selected subset of said plurality of decode units depending upon said predecode tag. 17. The superscalar microprocessor as recited in claim 16 wherein said predecode tag associated with each variable byte length instruction is indicative of a start byte of said variable byte length instruction.

18. The superscalar microprocessor as recited in claim 12 further comprising a plurality of functional units configured to receive output signals from said plurality of decode units.

19. The superscalar microprocessor as recited in claim 18 wherein said output signals from said plurality of said decode units include bit-encoded execution instructions.

20. The superscalar microprocessor as recited in claim 18 further comprising a plurality of reservation stations coupled to said plurality of decode units and to said plurality of functional units, wherein said plurality of reservation stations are configured to temporarily store said output signals from said plurality of decode units prior to issuance to said plurality of said functional units.

21. The superscalar microprocessor as recited in claim 18 wherein a dedicated functional unit is associated with each of said plurality of decode units.

22. The superscalar microprocessor as recited in claim 12 further comprising a reorder buffer coupled to said plurality of decode units for storing speculatively- executed instruction results.

23. A method for aligning instructions within a superscalar microprocessor comprising the steps of:

storing a plurality of variable byte-length instructions within an instruction cache; predecoding said plurality of variable byte-length instructions to thereby provide a tag indicative of a boundary of each of said plurality of said variable byte-length instructions; detecting predecode tags associated with a line of instructions within said instruction cache; routing a first instruction starting within a first range of positions within a cache line to a first decode unit; and routing a second instruction starting within a second range of positions within said cache line to a second decode unit.

24. An instruction alignment unit for routing a plurality of variable byte-length instructions from a stored line to a plurality of decode positions, said instruction alignment unit comprising: a first multiplexer channel configured to route a first instruction having a corresponding start byte within a first range of byte locations of said line to a first decode position; and a second multiplexer channel configured to route a second instruction having an associated start byte within a second range of byte locations of said line to a second decode position.

25. The instruction alignment unit as recited in claim 24 wherein selected byte locations with said first range are exclusive of said second range.

26. The instruction alignment unit as recited in claim 24 further comprising a multiplexer control unit couled to said first and second multiplexer channels, wherein said multiplexer channel is configured to detect tags indicative of a boundary of each of said plurality of instructions.

27. The instruction alignment unit as recited in claim 26 wherein said multiplexer control unit is further configured to control said first multiplexer channel such that said corresponding start byte of said first instruction along with a plurality of subsequent bytes of said first instruction are routed to said first decode position. 28. The instruction alignment unit as recited in claim 27 wherein said multiplexer control unit is further configured to control said second multiplexer channel such that said associated start byte of said second instruction along with a plurality of subsequent bytes of said second instruction are routed to said second decode position.

29. The instruction alignment unit as recited in claim 24 further comprising a third multiplexer channel configured to route a third instruction having a corresponding start byte within a third range of locations within said line to a third decode position.

30. The instruction alignment unit as recited in claim 29 wherein said third range includes byte locations within said line that are exclusive of said first range of byte locations.

31. The instruction alignment unit as recited in claim 29 wherein said third range includes byte locations within said line that are exclusive of both said second range and said third range of byte locations.

32. The instruction alignment unit as recited in claim 29 further comprising a multiplexer control circuit coupled to said first, second, and third multiplexer channels, wherein said multiplexer channel selectively controls each of said first, second, and third multiplexer channel in response to tags indicative of a boundary of each of said plurality of instructions.

33. The instruction alignment unit as recited in claim 32 wherein said tags are predecode tags.

34. The instruction alignment unit as recited in claim 33 wherein a separate predecode tag is associated with each byte location of said line.

35. The instruction alignment unit as recited in claim 29 wherein said first range includes a plurality of byte locations within said line, wherein said second range is limited to a single byte location with said line, and wherein said third range includes a plurality of byte locations within said line. 36. The instruction alignment unit as recited in claim 29 comprising a fourth multiplexer channel configured to route portions of a fourth instruction having a corresponding start byte within a fourth range of byte locations within said line to said first decode position.

37. The instruction alignment unit as recited in claim 36 wherein said fourth range of byte locations resides at program locations that are subsequent to said first range, said second range, and said third range of byte locations.

38. The instruction alignment unit as recited in claim 24 wherein said first instruction consists of a fewer number of bytes than said second instruction.

39. An instruction alignment unit for routing a plurality of variable byte-length instructions from a stored line to a plurality of decode positions, said instruction alignment unit comprising: a first multiplexer channel configured to route a first instruction having a corresponding start byte within a first range of byte locations of said line to a first decode position; a second multiplexer channel configured to simultaneously route a second instruction having an associated start byte within a second range of byte locations of said line to a second decode position; a third multiplexer channel configured to route a third instruction having a corresponding start byte within a third range of locations within said line to a third decode position; and a multiplexer control circuit coupled to said first, second, and third multiplexer channels, wherein said multiplexer control unit selectively controls each of said first, second, and third multiplexer channels in response to a predecode tag associated with each byte location of said line, wherein said first range includes a plurality of byte locations within said line, and wherein said second range is limited to a single byte location within said line, and wherein said third range includes a plurality of byte locations within said line. 40. The instruction alignment unit as recited in claim 39 wherein said third range includes byte locations within said line that are exclusive of said first range of byte location.

41. The instruction alignment unit as recited in claim 39 wherein said third range includes byte locations within said line that are exclusive of both said second and said third range of byte locations.