CN101853151B

CN101853151B - Device and method applicable to a microprocessor

Info

Publication number: CN101853151B
Application number: CN 201010185625
Authority: CN
Inventors: Thomas C. McDonald; John L. Duncan
Original assignee: Via Technologies Inc
Current assignee: Via Technologies Inc
Priority date: 2009-05-19
Filing date: 2010-05-19
Publication date: 2013-06-26
Anticipated expiration: 2030-05-19
Also published as: CN101853151A

Abstract

The invention provides a device and a method applicable to a microprocessor. The device extracts instructions from a stream of instruction bytes in a microprocessor whose instruction set architecture has variable-length instructions. The device includes a queue and a control logic unit. Each entry of the queue stores a line of the instruction byte stream together with accumulated prefix information associated with each instruction byte of the line, and the queue has a bottom entry (BE). The control logic unit is coupled to the queue and: (a) detects a condition in which an initial portion of an instruction in a first line, held in the BE, has not yet been extracted from the queue, the bytes of the initial portion being prefix bytes; (b) in response to detecting the condition, saves the length of the initial portion, shifts the first line out of the queue, and shifts a second line into the BE; (c) extracts the remaining instruction bytes of the unextracted instruction from the second line in the BE, and extracts the accumulated prefix information from the second line in the BE in place of the already shifted-out prefix bytes of the initial portion; (d) calculates the length of the unextracted instruction from the saved length of the initial portion; and (e) based on the calculated length, extracts an instruction other than the unextracted instruction from the second line in the BE.

Description

Device and method applicable to a microprocessor

Technical field

The present invention relates to the field of microprocessors, and in particular to extracting instructions from a stream of instruction bytes in a microprocessor whose instruction set architecture has variable-length instructions.

Background technology

A microprocessor includes one or more execution units that perform the actual execution of instructions. A superscalar microprocessor can issue multiple instructions to the execution units in each clock cycle, which raises throughput, that is, the average number of instructions executed per clock cycle. However, the instruction fetch and decode functions at the top of the microprocessor pipeline must supply an instruction stream to the execution units at a sufficient rate to keep the execution units busy and raise throughput. The x86 architecture makes this task harder because its instructions are not of fixed length; the length of each instruction varies, as described in detail below. An x86 microprocessor must therefore include a large amount of logic to process the incoming stream of instruction bytes and determine where each instruction begins and ends. Consequently, there is a need to increase the rate at which an x86 microprocessor parses the instruction byte stream into individual instructions.

Summary of the invention

According to one aspect, the invention provides a device applicable to a microprocessor for extracting instructions from a stream of instruction bytes in the microprocessor, whose instruction set architecture has variable-length instructions. The device comprises a queue and a control logic unit. Each entry of the queue stores a line of instruction bytes of the stream together with accumulated prefix information for each instruction byte of the line, and the queue has a bottom entry. The control logic unit is coupled to the queue and is configured to: detect a condition in which an initial portion of an instruction in a first line of instruction bytes, stored in the bottom entry, has not yet been extracted from the queue, wherein the instruction bytes of the initial portion are prefix bytes; in response to detecting the condition, save the length of the initial portion of the instruction, shift the first line out of the queue, and shift a second line of instruction bytes into the bottom entry; extract the remaining instruction bytes of the not-yet-extracted instruction from the second line in the bottom entry, and extract the accumulated prefix information from the second line in the bottom entry in place of the prefix bytes of the initial portion that were shifted out of the queue; calculate the length of the previously unextracted instruction based on the saved length of the initial portion; and, based on the calculated length, extract from the second line in the bottom entry an instruction other than the previously unextracted instruction.

According to another aspect, the invention provides a method applicable to a microprocessor for extracting instructions from a stream of instruction bytes in the microprocessor, whose instruction set architecture has variable-length instructions. The microprocessor includes a queue, each entry of which stores a line of instruction bytes of the stream together with accumulated prefix information for each instruction byte of the line, and the queue has a bottom entry. The method comprises: detecting a condition in which an initial portion of an instruction in a first line of instruction bytes, stored in the bottom entry, has not yet been extracted from the queue, wherein the instruction bytes of the initial portion are prefix bytes; in response to detecting the condition, saving the length of the initial portion of the instruction, shifting the first line out of the queue, and shifting a second line of instruction bytes into the bottom entry; extracting the remaining instruction bytes of the not-yet-extracted instruction from the second line in the bottom entry, and extracting the accumulated prefix information from the second line in the bottom entry in place of the prefix bytes of the initial portion that were shifted out of the queue; calculating the length of the previously unextracted instruction based on the saved length of the initial portion; and, based on the calculated length, extracting from the second line in the bottom entry an instruction other than the previously unextracted instruction.

Description of drawings

Fig. 1 is a block diagram of a microprocessor according to an embodiment of the present invention.

Fig. 2 is a block diagram of the L stage of the instruction formatter of Fig. 1.

Fig. 3 shows the accumulated prefix information 238 of Fig. 2.

Fig. 4 is a flowchart of the operation of the microprocessor of Fig. 1.

Fig. 5 is a block diagram of portions of the L stage and M stage of the instruction formatter of Fig. 1.

Fig. 6 is a flowchart of the operation of the microprocessor elements shown in Fig. 5 to extract instructions (up to three instructions in one embodiment) from the instruction byte stream without incurring a time delay that depends on the number of prefix bytes in the instructions.

Fig. 7 is a block diagram of a portion of the instruction formatter of Fig. 1.

Fig. 8a and Fig. 8b are flowcharts of the operation of the portion of the instruction formatter shown in Fig. 7.

Fig. 9 is a detailed block diagram of the mux queue of Fig. 5.

Fig. 10 is a block diagram of a portion of the M stage of the instruction formatter of Fig. 1.

Fig. 11 is a block diagram of the M-stage control logic unit of Fig. 5.

Fig. 12 is a flowchart of the operation of a portion of the M stage of the instruction formatter of Fig. 1.

Fig. 13 shows the contents of the mux queue of Fig. 5 during two successive clock cycles, illustrating the operation of the M stage.

Fig. 14 shows the contents of the mux queue of Fig. 5 during two successive clock cycles, illustrating the operation of the M stage.

Fig. 15 shows, similarly to Fig. 14, the instruction formatter extracting and sending out, in a single clock cycle, three instructions that contain up to 40 instruction bytes.

Fig. 16 shows a branch error in the microprocessor caused by a bad prediction made by the BTAC of Fig. 1, that is, the branch taken indication of Fig. 1 is true for a byte that is not the opcode of an instruction.

Fig. 17 shows the signals that make up the output of the ripple logic unit.

Fig. 18 is a flowchart of the operation of the microprocessor of Fig. 1.

Fig. 19 is a detailed block diagram of the length decoder of Fig. 2.

Fig. 20 shows the arrangement of the 16 length decoders.

Fig. 21 is a flowchart of the operation of the length decoders of Fig. 20.

[Description of main reference numerals]

100 microprocessor
102 instruction cache
104 x86 instruction byte queue (XIBQ)
106 instruction formatter
108 formatted instruction queue
112 instruction translator
114 translated instruction queue
116 register alias table
118 reservation station
122 execution units
124 retire unit
126 fetch unit
128 branch target address cache (BTAC)
132 instruction bytes
134 instruction bytes
136 x86 instruction stream
142 current fetch address
144 adder
146 predicted target address
148 executed target address
152 next sequential fetch address
154 branch taken indication
202 length decoder
204 ripple logic unit
208 control logic unit
212 output of length decoder
214 output of ripple logic unit
218 operand and address size
222 instruction length
224 decoded any prefix indicator
226 decoded LMP indicator
228 susceptible to LMP indicator
229 prefix information
232 start bit
234 end bit
236 valid bit
238 accumulated prefix information
252 default operand and address size
302 OS
304 AS
306 REX present
308 REX.W
312 REX.R
314 REX.X
316 REX.B
318 REP
322 REPNE
324 LOCK
326 segment override present
328 encoded segment override [2:0]
332 any prefix present
402-414 steps
502 mux queue
504 I1 multiplexer
506 I2 multiplexer
508 I3 multiplexer
512 M-stage control logic unit
514 control signal
516 control signal
518 control signal
524 first instruction I1
526 second instruction I2
528 third instruction I3
534, 536, 538 valid indicators
602-608 steps
702 XIBQ control logic unit
802-824 steps
1002 accumulated prefix array
1004 instruction byte array
1102 subtractor
1104 partial LEN
1106 remaining LEN1
1108 byte position END1
1112 byte position END0
1114 multiplexer
1116 adder
1118 register
1122 instruction length LEN1
1201-1222 steps
1702 bad BTAC bit
1802-1816 steps
1902 programmable logic array (PLA)
1904 adder
1906 multiplexer
1912 eaLen value
1914 control signal
1916 immLen value
1918 eaLen value
2102-2116 steps

Detailed description of the embodiments

Fig. 1 is a block diagram of a microprocessor 100 according to an embodiment of the present invention. The microprocessor 100 includes a pipeline made up of multiple stages, or functional units: a four-stage instruction cache 102, an x86 instruction byte queue (XIBQ) 104, an instruction formatter 106 (which includes three stages, L, M and F), a formatted instruction queue 108, an instruction translator 112, a translated instruction queue 114, a register alias table 116, reservation stations 118, execution units 122 and a retire unit 124. The microprocessor 100 also includes a fetch unit 126, which provides a current fetch address 142 to the instruction cache 102 to select a line of instruction bytes 132 that the instruction cache 102 provides to the XIBQ 104. The microprocessor 100 also includes an adder 144, which increments the current fetch address 142 to produce the next sequential fetch address 152, which is fed back to the fetch unit 126. The fetch unit 126 also receives a predicted target address 146 from a branch target address cache (BTAC) 128. Finally, the fetch unit 126 receives an executed target address 148 from the execution units 122.

The XIBQ 104 is a queue of entries, each of which holds 16 bytes of data from the instruction cache 102. Each XIBQ 104 entry also holds pre-decoded information about the data bytes; the pre-decoded information is generated as the data bytes flow from the instruction cache 102 to the XIBQ 104. The cache data from the XIBQ 104 is a stream of instruction bytes 134 that arrives in 16-byte blocks, and it is not known where the x86 instructions begin or end within the stream or within a given block. The job of the instruction formatter 106 is to determine the start and end bytes of each instruction in the stream and thereby break the byte stream into a stream of x86 instructions 136, which is provided to and stored in the formatted instruction queue 108 for processing by the rest of the microprocessor 100 pipeline. When a reset occurs, or when a flow control instruction (for example a jump instruction, a subroutine call instruction or a return-from-subroutine instruction) is executed or predicted, the reset address or branch target address is provided to the instruction formatter 106 as an instruction pointer, which enables the instruction formatter 106 to determine the first byte of the first valid instruction within the current 16-byte block of the instruction stream. From then on, the instruction formatter 106 determines the start position of the next instruction by adding the length of the first target instruction to its start position. The instruction formatter 106 repeats this process until another flow control instruction is executed or predicted.
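The boundary walk described above (next start = current start + current length) can be sketched as follows. This is only an illustration: the function name, the argument names and the assumption that the instruction lengths are already known are hypothetical, since in the device the lengths are produced by the length decoders.

```python
def instruction_starts(lengths, first_start, block_size=16):
    """Return the start offsets of the instructions that begin inside one block.

    lengths     -- hypothetical list of instruction lengths in bytes, in stream order
    first_start -- offset of the first valid instruction (from the instruction pointer)
    block_size  -- width of one block of instruction bytes (16 in the text)
    """
    starts = []
    pos = first_start
    for length in lengths:
        if pos >= block_size:
            break  # the next instruction begins in a later block
        starts.append(pos)
        pos += length  # next start = current start + current length
    return starts
```

For example, with a branch target at offset 2 and instruction lengths 3, 1, 5, 7 and 2, the instructions that begin in the block start at offsets 2, 5, 6 and 11; the fifth begins in the following block.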

The BTAC 128 also provides a branch taken indication 154 to the XIBQ 104. There is one branch taken indication 154 for each instruction byte 132 that the instruction cache 102 provides to the XIBQ 104. The branch taken indication 154 indicates whether the BTAC 128 predicts that a branch instruction is present in the line of instruction bytes 132 provided to the XIBQ 104; if so, the fetch unit 126 selects the predicted target address 146 provided by the BTAC 128. More specifically, the BTAC 128 outputs a true branch taken indication 154 for the first byte of the branch instruction (even if that first byte is a prefix byte) and a false branch taken indication 154 for the other bytes of the instruction.

The microprocessor 100 is an x86 architecture microprocessor. A microprocessor is considered an x86 architecture microprocessor if it can correctly execute most of the application programs designed to run on an x86 microprocessor, and an application program executes correctly if it produces its expected results. One characteristic of the x86 architecture is that the length of the instructions in its instruction set architecture is variable, rather than fixed as in some other instruction set architectures. Moreover, for a given x86 opcode, the length of the instruction may depend on whether a prefix precedes the opcode. In addition, the length of some instructions may be a function of the default operand and/or address size of the current operating mode of the microprocessor 100 (for example, the D bit of the code segment descriptor, or whether the microprocessor 100 is operating in IA-32e or 64-bit mode). Finally, an instruction may include a length-modifying prefix that selects an address/operand size other than the default. For example, the operand size (OS) prefix (0x66), the address size (AS) prefix (0x67) and the REX.W bit (bit 3) of the REX prefix (0x4x) can be used to change the default address/operand size. Intel calls these length-changing prefixes (LCP); this specification calls them length-modifying prefixes (LMP). The format and length of x86 instructions are well known; for details, see Chapter 2 of the IA-32 Intel Architecture Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M, June 2006.

According to the Intel 64 and IA-32 Architectures Optimization Reference Manual, March 2009, pages 3-21 to 3-23 (available for download at http://www.intel.com/Assets/PDF/manual/248966.pdf): "When the pre-decoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length decoding algorithm, the pre-decoder decodes in six cycles instead of the usual one cycle. Queuing in the machine pipeline generally cannot hide the delay caused by an LCP."

Fig. 2 is a block diagram of the L stage of the instruction formatter 106 of Fig. 1. The instruction formatter 106 includes a plurality of length decoders 202 whose outputs 212 are coupled to respective ripple logic units 204; the outputs 214 of the ripple logic units 204 are coupled to a control logic unit 208 and are provided to the M stage of the instruction formatter 106. In one embodiment, the length decoders 202 generate their outputs 212 during the first phase of a two-phase clock signal of the microprocessor 100, and the ripple logic units 204 generate their outputs 214 during the second phase of the two-phase clock signal.

The length decoders 202 receive the instruction bytes 134 from the XIBQ 104. In one embodiment, each XIBQ 104 entry is 16 bytes wide, so there are 16 corresponding length decoders 202, shown as 0 through 15 in Fig. 20. Each length decoder 202 receives and decodes its corresponding instruction byte from the bottom entry of the XIBQ 104, and also receives and decodes the next three adjacent instruction bytes. The last three length decoders 202 receive one or more of these instruction bytes from the next-to-bottom entry of the XIBQ 104 (if the next-to-bottom entry of the XIBQ 104 is not valid, the last three length decoders 202 must wait until the next clock cycle to generate valid outputs). The length decoder 202 is described in detail with respect to Fig. 19. This arrangement enables the length decoders 202 to determine and output the instruction length 222 of the instructions in the bottom entry of the XIBQ 104. In one embodiment, the instruction length 222 is the number of bytes of the instruction excluding prefix bytes; in other words, it is the number of bytes from the opcode through the last byte of the instruction. Specifically, the instruction length 222 of an instruction is the length output by the length decoder 202 that corresponds to the first instruction byte of the instruction.

To generate the instruction length 222, the length decoders 202 also use the operand and address size 218 received from the control logic unit 208. The control logic unit 208 outputs an operand and address size 218 for each instruction byte 134. The control logic unit 208 determines the operand and address size 218 from the current default operand and address size 252 of the microprocessor 100 and from the outputs 214 of the ripple logic units 204. If the outputs 214 of the ripple logic units 204 indicate that an instruction contains no LMP, the control logic unit 208 outputs the default operand and address size to the corresponding length decoder 202 for each byte of the instruction. If, however, the outputs 214 indicate that an instruction contains one or more LMPs, then for each byte of the instruction the control logic unit 208 modifies the default operand and address size 252 and outputs the resulting operand and address size 218 to the corresponding length decoder 202. The control logic unit 208 makes the modification according to the values of the OS 302, AS 304 and REX.W 308 bits, which are part of the accumulated prefix information 238 in the outputs 214 of the ripple logic units 204, as shown in Fig. 3.
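The text does not spell out how the OS, AS and REX.W bits modify the defaults. The sketch below applies the general x86 rules as an assumption: the OS prefix (0x66) toggles the operand size between 16 and 32 bits, the AS prefix (0x67) toggles the address size, and in 64-bit mode REX.W selects a 64-bit operand size and takes precedence over the OS prefix. The function name and its parameters are hypothetical and are not taken from the patent.

```python
def effective_sizes(default_operand, default_address,
                    os_prefix=False, as_prefix=False, rex_w=False, mode64=False):
    """Return (operand_size, address_size) in bits after applying the LMP bits.

    default_operand / default_address -- default sizes of the current mode (16, 32 or 64)
    os_prefix, as_prefix, rex_w       -- accumulated OS, AS and REX.W bits
    mode64                            -- True when running in 64-bit mode (assumed input)
    """
    operand = default_operand
    address = default_address

    if mode64 and rex_w:
        operand = 64                       # REX.W wins over the 0x66 prefix
    elif os_prefix:
        operand = 16 if default_operand == 32 else 32

    if as_prefix:
        if mode64:
            address = 32                   # 0x67 in 64-bit mode selects 32-bit addressing
        else:
            address = 16 if default_address == 32 else 32

    return operand, address
```

With no LMP bits set the function returns the defaults unchanged, which corresponds to the case in which the control logic unit simply forwards the default operand and address size.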

As shown in Fig. 2, the output 212 of each length decoder 202 includes the instruction byte 134, the instruction length 222, a decoded any prefix indicator 224, a decoded LMP indicator 226, a susceptible to LMP indicator 228 and prefix information 229.

The decoded any prefix indicator 224 is true if the byte decoded by the length decoder 202 corresponds to any x86 prefix, whether or not it is an LMP; otherwise it is false.

The decoded LMP indicator 226 is true if the byte decoded by the length decoder 202 corresponds to any x86 LMP, namely the OS prefix (0x66), the AS prefix (0x67) or a REX.W prefix (0x48-0x4F); otherwise it is false.

The susceptible to LMP indicator 228 is false if the byte decoded by the length decoder 202 is an opcode byte whose instruction length is not affected by an LMP (for example, the OS prefix is mandatory for some SIMD instructions and therefore does not change their length); otherwise it is true.

The prefix information 229 comprises a plurality of bits that indicate whether the instruction byte has the value of one of the various x86 prefixes. These bits are similar to those of the accumulated prefix information 238 shown in Fig. 3. However, the prefix information 229 output by a length decoder 202 describes only a single prefix, namely the prefix value of the single corresponding instruction byte decoded by that length decoder 202. In contrast, because the ripple logic units 204 accumulate the prefix information 229 provided by all of the length decoders 202, the accumulated prefix information 238 describes all of the prefixes of the instruction.
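A minimal sketch of the two byte-value classifications, the decoded any prefix indicator 224 and the decoded LMP indicator 226, is given below. The LMP values (0x66, 0x67, 0x48-0x4F) come from the text; the remaining prefix values are the standard x86 prefix encodings, and the `mode64` parameter is an assumption added because REX bytes are prefixes only in 64-bit mode. The function names are hypothetical.

```python
# Standard x86 legacy prefixes: OS, AS, LOCK, REPNE, REP and the six segment overrides.
LEGACY_PREFIXES = {0x66, 0x67, 0xF0, 0xF2, 0xF3,
                   0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65}

def decoded_any_prefix(byte, mode64=True):
    """True if the byte value matches any x86 prefix (LMP or not)."""
    return byte in LEGACY_PREFIXES or (mode64 and 0x40 <= byte <= 0x4F)

def decoded_lmp(byte, mode64=True):
    """True if the byte value matches a length-modifying prefix:
    OS (0x66), AS (0x67), or a REX prefix with the W bit set (0x48-0x4F)."""
    return byte in (0x66, 0x67) or (mode64 and 0x48 <= byte <= 0x4F)
```

Both functions look only at the byte value, so a displacement or immediate byte that happens to equal 0x67 is also reported as a prefix; the result is a per-byte guess that later logic has to confirm.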

As shown in Fig. 2, the output 214 of each ripple logic unit 204 includes the instruction byte 134, a start bit 232, an end bit 234, a valid bit 236 and accumulated prefix information 238. The output 214 of each ripple logic unit 204 is also fed to the next adjacent ripple logic unit 204. In one embodiment, the 16 ripple logic units 204 are organized as four logic blocks, each of which processes four instruction bytes and their associated information. Each ripple logic unit 204 also outputs its corresponding instruction byte.

The start bit 232 is true if the byte processed by the ripple logic unit 204 is the opcode byte of an instruction, that is, the first byte of the instruction that is not a prefix byte. The instruction formatter 106 increments a pointer past all of the prefix bytes, so that when the pointer points to a non-prefix byte it is pointing to the opcode byte of the instruction.

The end bit 234 is true if the byte processed by the ripple logic unit 204 is the last byte of an instruction; otherwise it is false.

Starting from the first of the 16 valid bits 236 output by the ripple logic units 204, each valid bit 236 is true up to the first unprocessed LMP.

The accumulated prefix information 238 is shown in Fig. 3 and discussed above. The control logic unit 208 uses the accumulated prefix information 238 together with the valid bits 236 to determine whether to use the default operand and address size 252 or to modify it.

It should be noted that the outputs 212 of the length decoders 202 are tentative. That is, they are generated without knowing where the associated instruction byte lies within its instruction. In particular, the prefix-related indicators 224/226/228/229 are generated on the assumption that the byte is in fact a valid prefix, and that assumption may be wrong. A byte may happen to have the value of a prefix but actually be, for example, a displacement byte that has the same value as an LMP. For example, 0x67 is the value of the AS prefix, which is an LMP. However, an address displacement byte, an immediate data value byte, a Mod R/M byte or an SIB byte of an instruction may also have the value 0x67 without being a prefix byte. Only when all of the LMPs in the current block of instruction bytes have been processed is it certain that the outputs 212 and 214 for all of the bytes in the block are correct.

If no LMP is decoded in any of the instruction bytes of the XIBQ 104 entry during the current clock cycle, the L stage generates the ripple logic unit 204 outputs 214 (in particular the start bits 232 and end bits 234) for the entire entry in a single clock cycle. If one or more LMPs are decoded in the current XIBQ 104 entry, the number of clock cycles needed to generate ripple logic unit 204 outputs 214 with correct start bits 232 and end bits 234 is N+1, where N is the number of instructions in the current XIBQ 104 entry that have at least one LMP. The L stage does this regardless of how many prefixes any instruction in the entry has, as shown in the flowchart of Fig. 4. The control logic unit 208 includes state that indicates which bytes of the current block of instruction bytes have been processed and which have not. This state enables the control logic unit 208 to generate the valid bit 236 and the operand and address size 218 for each instruction byte. Processing a block of instruction bytes that contains instructions with LMPs is iterative: in the first clock cycle, the instruction length 222, start bit 232 and end bit 234 of the first instruction that contains an LMP may be incorrect; in the next clock cycle, however, the instruction length 222, start bits 232 and end bits 234 of that first instruction and of any adjacent instructions that do not contain an LMP become correct; and in each subsequent clock cycle, the instruction length 222, start bits 232 and end bits 234 of the next instruction that contains an LMP and of its adjacent instructions that do not contain an LMP become correct. In one embodiment, the state comprises a 16-bit register that indicates whether each associated instruction byte has been processed.

[Marking the start and end bytes of instructions that contain an LMP]

Fig. 4 shows the operation of the microprocessor 100 of Fig. 1. Flow begins at step 402.

In step 402, the control logic unit 208 outputs the default operand and address size 218 to the length decoders 202. Flow proceeds to step 404.

In step 404, during the first phase of the clock cycle, the length decoders 202 decode the instruction bytes of the bottom entry of the XIBQ 104 according to the operand and address size 218 provided by the control logic unit 208 and generate their outputs 212. As described above, for each instruction byte of the bottom entry of the XIBQ 104, the output 212 of the length decoder 202 includes the instruction length 222 and the prefix-related indicators 224/226/228/229 (Fig. 2). Flow proceeds to step 406.

In step 406, during the second phase of the clock cycle, the ripple logic units 204 generate their outputs 214 from the outputs 212 of the length decoders 202. As described above, the outputs 214 of the ripple logic units 204 include the start bits 232, end bits 234, valid bits 236 and accumulated prefix information 238 (Fig. 3). Flow proceeds to step 408.

In step 408, the control logic unit 208 examines the outputs 214 of the ripple logic units 204 to determine whether any instruction in the bottom entry of the XIBQ 104 still contains an unprocessed LMP (length-modifying prefix). If so, flow proceeds to step 412; otherwise, flow proceeds to step 414.

In step 412, the control logic unit 208 updates its internal state and the operand and address size according to the accumulated prefix information 238 provided by the ripple logic units 204. Flow then returns to step 404 to process the instruction bytes of the bottom entry again using the new operand and address sizes.

In step 414, the control logic unit 208 determines that the instruction bytes of the bottom entry have been fully processed, shifts them out of the XIBQ 104, and sends them to the M stage together with the ripple logic unit 204 output 214 corresponding to each instruction byte 134. Specifically, as described above, the outputs 214 of the ripple logic units 204 include the start bits 232 and end bits 234, which mark the boundaries of each instruction within the instruction stream provided by the instruction cache 102. This enables the M and F stages of the instruction formatter 106 to further process the instruction stream and place the individual instructions into the FIQ (formatted instruction queue) 108 for processing by the instruction translator 112. Flow ends at step 414.

As described above, if the instruction bytes contain no LMP (length-modifying prefix), the L stage generates the start bits 232 and end bits 234 for an entire XIBQ (x86 instruction byte queue) 104 entry in a single clock cycle. If one or more instructions in the XIBQ 104 entry have an LMP, the number of clock cycles needed to generate the start bits 232 and end bits 234 becomes N+1, where N is the number of instructions in the XIBQ 104 entry that contain at least one LMP. The L stage achieves this regardless of how many prefixes the instructions contain.
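The N+1 cycle count can be illustrated with a small model of the loop of Fig. 4. The model is a deliberate simplification and an assumption, not the patented circuit: an entry is reduced to a list of booleans, one per instruction, saying whether that instruction contains at least one LMP, and each pass through steps 404-412 resolves instructions in order until it meets an instruction whose LMP has not yet been applied. The function and variable names are hypothetical.

```python
def cycles_to_format(has_lmp):
    """Count the clock cycles needed to mark one entry, per the Fig. 4 loop.

    has_lmp -- list of booleans, one per instruction in the entry, True if the
               instruction contains at least one length-modifying prefix
    """
    cycles = 0
    next_instr = 0          # first instruction whose start/end bits are not final
    sizes_updated = False   # True once step 412 has applied that instruction's LMP

    while True:
        cycles += 1         # steps 404 and 406: one decode pass plus one ripple pass
        while next_instr < len(has_lmp):
            if has_lmp[next_instr] and not sizes_updated:
                break       # step 408: an unprocessed LMP remains
            sizes_updated = False
            next_instr += 1
        if next_instr == len(has_lmp):
            return cycles   # step 414: entry fully processed
        sizes_updated = True  # step 412: update operand and address size, then retry
```

An entry with no LMP instruction takes one cycle, and an entry with two LMP instructions takes three cycles, however many non-LMP instructions sit between them.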

[accumulating preposition effectively to process the instruction that contains a plurality of prefix bytes]

The x86 framework allows instruction to contain 0 to 14 prefix byte.This causes the difficulty of pipeline (pipeline) front end when processing instruction byte crossfire.In the past when processing contains the instruction of prefix byte of a great deal of, delay that can encounter time.According to Intel 64 and IA-32 framework optimization reference manual ( 64and IA-32Architectures Optimization Reference Manual), in March, 2009 in Christian era, page 12-5, Intel mentions for the ATOM micro-architecture: " contain instruction meeting preposition more than three and produce the MSROM transfer, cause two clock cycle delays of front end." according to the micro-architecture (The microarchitecture of Intel and AMD CPU ' s) of another Research Literature-Intel and AMD central processing unit; author Agner Fog; Copenhagen University College of Enginerring; May 5 2009 Christian era last the renewal; page 93 (can in following page download www.agner.org/optimize/microarchitecture.pdf), it is mentioned: " containing a plurality of preposition instructions needs extra time to decode.It is one preposition that the instruction decoder of P4 only can be processed in the cycle in a clock.On P4, contain its each preposition cost one clock cycle decoder that needs of a plurality of preposition instructions ", and " instruction decoder of P4E can be preposition in two of clock period treatment.Therefore, decodable code contains at the most two preposition instructions in the single clock cycle, and containing three or four preposition instructions needs decode within two clock period.So P4E increases this function, be because under 64 bit patterns, a lot of instructions all contain two preposition (for example the operand size is preposition and REX is preposition).」

However, embodiments of the present invention can process all of the prefix bytes that the architecture allows in an instruction (up to 14) without incurring additional time delay, regardless of the number of prefix bytes. This holds as long as the prefix is not an LMP (length-modifying prefix); if it is, each instruction that contains one or more LMPs incurs one additional clock cycle of processing time, as described above. The embodiments achieve this because the length decoder 202 generates prefix information 229, and the ripple logic unit 204 accumulates the prefix information 229 onto the opcode byte of the instruction to produce accumulated prefix information 238, as described in detail below.

Fig. 5 shows a block diagram of portions of the L stage and the M stage (mux stage) of the instruction formatter 106 of Fig. 1. The M stage includes a mux queue 502. In one embodiment, the mux queue 502 has four entries, each storing 16 bytes. The next empty entry of the mux queue 502 receives the output 214 (Fig. 2) of the corresponding ripple logic unit 204, which includes the instruction bytes 134, the start bits 232, the end bits 234 and the accumulated prefix information 238.

The M stage also includes an M-stage control logic unit 512, which receives the start/end bits 232/234 from the bottom entry of the mux queue 502 and (in one embodiment) from the first ten bytes of the next-to-bottom entry (NTBE) of the mux queue 502. Based on the start/end bits 232/234, the M-stage control logic unit 512 controls three sets of mux logic: an I1 mux 504, an I2 mux 506 and an I3 mux 508. The I1 mux 504 outputs a first instruction I1 524 to the F stage of the instruction formatter 106; the I2 mux 506 outputs a second instruction I2 526 to the F stage; and the I3 mux 508 outputs a third instruction I3 528 to the F stage. In addition, the M-stage control logic unit 512 outputs three valid indicators 534/536/538 that indicate whether the corresponding first, second and third instructions 524/526/528 are valid. In this way, the M stage can extract up to three formatted instructions from the instruction stream and provide them to the F stage in a single clock cycle. In other embodiments, the M stage can extract and provide more than three formatted instructions to the F stage in a single clock cycle. Each of the three instructions 524/526/528 includes its instruction bytes 134, with its prefix bytes replaced by the corresponding accumulated prefix information 238. In other words, each instruction 524/526/528 includes the opcode byte, the remaining instruction bytes and the accumulated prefix information 238. Each mux 504/506/508 receives the information 214 (except the start bits 232 and end bits 234) from the bottom entry of the mux queue 502 and (in one embodiment) from the first ten bytes of the NTBE of the mux queue 502, in order to individually select and output the instructions 524/526/528.

Fig. 6 shows a flowchart of the operation of the elements of the microprocessor 100 shown in Fig. 5 to extract instructions from the instruction byte stream (up to three instructions in one embodiment) without incurring time delay, regardless of the number of prefix bytes in the instructions. As described above, the ripple logic unit 204 accumulates the prefix information 229 onto the opcode byte of the instruction to produce the accumulated prefix information 238. Flow begins at step 602.

In step 602, during the first phase of the clock cycle, the length decoder 202 decodes the stream of instruction bytes 134 to produce the outputs 212 (Fig. 2), in particular the prefix information 229, in a manner similar to step 404. Flow then proceeds to step 604.

In step 604, during the second phase of the clock cycle, the ripple logic unit 204 uses the prefix information 229 to determine which byte of each instruction in the stream is the opcode byte (that is, the first non-prefix byte). The ripple logic unit 204 also accumulates the prefix information 229 of all of the prefix bytes in the instruction (up to 14) onto the opcode byte of the instruction to produce the accumulated prefix information 238. Specifically, the ripple logic unit 204 begins accumulating prefix information 229 at the first prefix byte of the instruction and continues byte by byte until it detects the opcode byte. At that point the ripple logic unit 204 stops accumulating, so that the accumulated prefix information 238 of the current instruction does not carry over into the next instruction. The ripple logic unit 204 then begins accumulating prefix information 229 again at the first prefix byte of the next instruction and stops at its opcode byte. This process repeats for each instruction in the stream. The ripple logic unit 204 uses the other outputs 212 of the length decoder 202 to perform the accumulation. For example, as described above, the ripple logic unit 204 uses the instruction length 222 to determine the first byte of each instruction, which may be a prefix byte, at which to begin accumulating prefix information. The ripple logic unit 204 also uses the other information 224/226/228 to determine the position of the opcode byte, which is the first non-prefix byte of the instruction (indicated by the start bit 232), and the position of the last byte of the instruction (indicated by the end bit 234). Flow then proceeds to step 606.
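The accumulation described in step 604 can be illustrated with a small software model. This is a sketch only: the hardware is ripple logic, not a loop, and the `DecodedByte` record, its field names and the bitmask encoding of prefix information are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class DecodedByte:
    is_first: bool    # first byte of an instruction (may be a prefix byte)
    is_last: bool     # last byte of an instruction
    prefix_info: int  # hypothetical bitmask; 0 means "not a prefix byte"


def ripple_accumulate(stream):
    """Return (start_bit, end_bit, accumulated_prefix) for every byte.

    The start bit is set on the opcode byte (first non-prefix byte), and
    the prefix information of all preceding prefix bytes of the same
    instruction is OR-ed together and attached to that opcode byte.
    """
    out = []
    acc = 0
    opcode_seen = False
    for b in stream:
        if b.is_first:            # a new instruction begins: restart accumulation
            acc = 0
            opcode_seen = False
        start = False
        accumulated = 0
        if not opcode_seen:
            if b.prefix_info:     # still in the prefix run
                acc |= b.prefix_info
            else:                 # opcode byte reached: stop accumulating
                start = True
                accumulated = acc
                opcode_seen = True
        out.append((start, b.is_last, accumulated))
    return out


# Two prefixes (0x1, 0x4), an opcode byte and one more byte, followed by
# a one-byte instruction without prefixes.
example = [
    DecodedByte(True, False, 0x1),
    DecodedByte(False, False, 0x4),
    DecodedByte(False, False, 0),
    DecodedByte(False, True, 0),
    DecodedByte(True, True, 0),
]
marked = ripple_accumulate(example)
```

In this model the prefix bytes themselves carry no start bit, which is why later stages can drop them and rely on the accumulated value stored with the opcode byte.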

In step 606, the instruction bytes 134, together with their corresponding start/end bits 232/234 and accumulated prefix information 238, are loaded into the next available entry of the mux queue 502. In one embodiment, steps 602, 604 and 606 are performed in a single clock cycle (assuming the instructions contain no LMP (length-modifying prefix)). Flow then proceeds to step 608.

In step 608, in the next clock cycle, the M-stage control logic unit 512 controls the muxes 504/506/508 to extract up to three instructions. In other words, the M stage extracts instructions without additional time delay, regardless of the number of prefix bytes. After muxing, each of the instructions 524/526/528 is fed to the F stage. Specifically, the M stage extracts the opcode byte and subsequent bytes of each instruction along with the accumulated prefix information 238. The F stage decodes the instructions 524/526/528 for instruction type, some possible exception conditions, pairability and other characteristics, in order to begin translating the instructions 524/526/528. The F stage and the instruction translator 112 can make use of the accumulated prefix information 238. Flow ends at step 608.
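As a rough illustration of step 608, the sketch below picks up to three instructions per cycle from per-byte start and end bits. The function name, the list-of-booleans representation and the returned offset are assumptions for the example; the actual M stage uses muxes controlled by combinatorial logic.

```python
def extract_instructions(start_bits, end_bits, offset=0, max_per_cycle=3):
    """Pick up to max_per_cycle complete instructions in one modeled cycle.

    start_bits[i] marks an opcode byte (prefix bytes carry no start bit);
    end_bits[i] marks the last byte of an instruction.
    Returns the (start, end) byte positions picked and the next offset.
    """
    picked = []
    start = None
    for i in range(offset, len(start_bits)):
        if len(picked) == max_per_cycle:
            break
        if start_bits[i]:
            start = i
        if end_bits[i] and start is not None:
            picked.append((start, i))
            start = None
    next_offset = picked[-1][1] + 1 if picked else offset
    return picked, next_offset


# Byte layout of the bottom entry in cycle 1 of the Figure 13 example:
# start bytes at 0, 6 and 9; end bytes at 0, 8 and 15.
starts = [i in (0, 6, 9) for i in range(16)]
ends = [i in (0, 8, 15) for i in range(16)]
picked, next_offset = extract_instructions(starts, ends)
```

Because prefix bytes never carry a start bit in this scheme, the five prefix bytes at positions 1 to 5 are skipped without costing an extra cycle.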

The present embodiment differs from conventional designs. As described above, the ripple logic unit 204 is more complex than its conventional counterpart: the start bit 232 it generates points to the opcode byte of the instruction rather than, as is conventional, to the first byte of the instruction (which may be a prefix byte), and it also generates the accumulated prefix information 238. As a result, instructions can be extracted without time delay regardless of the number of prefix bytes (with the exception of LMPs (length-modifying prefixes), as described above). In contrast, the conventional approach marks the actual first byte of the instruction as the first byte, so if the instruction contains prefix bytes, a prefix byte is marked as the first byte of the instruction. When an instruction contains multiple prefix bytes, the conventional mux logic incurs time delay in order to strip the prefix bytes.

[Start/end marking that allows cache data to be released as early as possible when a partial opcode occurs]

Fig. 7 shows a block diagram of a portion of the instruction formatter 106 of Fig. 1. In Fig. 1, the instruction cache 102 provides instruction bytes 132 to the XIBQ 104. In one embodiment, the instruction formatter 106 includes pre-decode logic (not shown) that pre-decodes the instruction bytes 132 from the instruction cache 102; the pre-decode information is loaded into the XIBQ 104 together with the instruction bytes 132. The instruction formatter 106 includes an XIBQ control logic unit 702, which controls the loading and shifting out of the entries of the XIBQ 104.

The length decoder 202 and ripple logic unit 204 (Fig. 2) receive the instruction bytes 134 from the XIBQ 104 and produce the output 214, which is provided to the mux queue 502 of Fig. 5 and to the M-stage control logic unit 512 of the instruction formatter 106. The M-stage control logic unit 512 controls the loading and shifting out of the entries of the mux queue 502. The mux queue 502 provides the information 214 in its entries to the muxes 504/506/508 and to the M-stage control logic unit 512, which in turn controls the muxes 504/506/508, as described above.

A problem arises when the following conditions hold: (1) the bottom entry of the XIBQ 104 contains valid instruction bytes but the NTBE does not; (2) only part of an instruction (for example, its first or second byte) is in the bottom entry; and (3) that partial instruction does not provide enough information for the length decoder 202/ripple logic unit 204 to determine the instruction length 222 (and the start/end bits 232/234), that is, some bytes of the instruction have yet to arrive in the NTBE. For example, suppose the start bit 232 of byte 15 (the last byte) of the bottom entry of the XIBQ 104 is true and the value of that byte is 0x0F. In an x86 instruction, a first non-prefix byte of 0x0F indicates an extended opcode, so the following bytes are needed to determine the instruction type. In other words, the instruction length cannot be determined from the 0x0F byte alone (in some cases, bytes up to the fifth byte may be needed to determine the instruction length). However, it may take some time before the instruction cache 102 provides the next line of cache data to the XIBQ 104; for example, a miss may occur in the instruction cache 102 or in the instruction translation lookaside buffer (TLB). A scheme is therefore needed that proceeds without waiting for the remaining instruction bytes. Furthermore, in some cases the microprocessor 100 needs the instructions that precede the unknown-length instruction, and if those instructions are not processed, the microprocessor 100 will wait indefinitely. A way to proceed is therefore needed.

Fig. 8 shows a flowchart of the operation of the portion of the instruction formatter 106 shown in Fig. 7. Flow begins at step 802.

In step 802, the XIBQ control logic unit 702 detects that an instruction at the end of the bottom entry of the XIBQ 104 spans into another line of the instruction cache data stream, that the portion of the instruction in the bottom entry of the XIBQ 104 is insufficient for the length decoder 202/ripple logic unit 204 to determine the instruction length (and the start/end bits 232/234), and that the subsequent instruction bytes needed to determine the instruction length are not yet in the NTBE of the XIBQ 104, that is, the NTBE of the XIBQ 104 is invalid or empty. Flow then proceeds to step 804.

In step 804, the M-stage control logic unit 512 loads the output 214 that the ripple logic unit 204 generated for the bottom entry of the XIBQ 104 into the mux queue 502. However, the M-stage control logic unit 512 does not shift out the bottom entry of the XIBQ 104, because the end bit 234 of the unknown-length instruction still has to be determined. In other words, the bytes of the unknown-length instruction that are in the bottom entry of the XIBQ 104 must be retained so that, when the remaining bytes of the instruction arrive in the XIBQ 104, the instruction length and end bit can be determined. Flow then proceeds to step 806.

In step 806, the output 214 loaded in step 804 reaches the bottom entry of the mux queue 502. At this point, the M-stage control logic unit 512 extracts all of the instructions and sends them to the F stage, except for the unknown-length instruction. However, the M-stage control logic unit 512 does not shift out the bottom entry of the mux queue 502, because the end bit 234 of the unknown-length instruction is not yet known and the remaining bytes of the instruction are not yet available. The M-stage control logic unit 512 knows that an unknown-length instruction is present because the instruction has no valid end bit 234. In other words, a valid start bit 232 points to the first byte of the instruction, but no valid end bit 234 points to a byte of the bottom entry of the mux queue 502, and the NTBE is invalid. Flow then proceeds to step 808.
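The condition the M-stage control logic recognizes in step 806 can be expressed as a simple predicate. The sketch below is a software analogy under assumed inputs (lists of per-byte start and end bits for the bottom entry, plus an NTBE valid flag); the function name is hypothetical.

```python
def has_unknown_length_instruction(start_bits, end_bits, ntbe_valid):
    """True if the bottom entry holds an instruction whose end is unknown.

    The last start bit in the bottom entry has no end bit at or after it
    within the entry, and the next-to-bottom entry (NTBE) is not valid,
    so the end of that instruction cannot yet be located.
    """
    start_positions = [i for i, s in enumerate(start_bits) if s]
    if not start_positions:
        return False
    last_start = start_positions[-1]
    ended_in_entry = any(end_bits[last_start:])
    return not ended_in_entry and not ntbe_valid


# The 0x0F example: a start bit on byte 15 and no end bit in the entry.
starts = [False] * 15 + [True]
no_ends = [False] * 16
stalled = has_unknown_length_instruction(starts, no_ends, ntbe_valid=False)
```

When the predicate is true, the bottom entry is kept and the queue stalls until the NTBE is filled, as the following steps describe.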

In step 808, the M-stage control logic unit 512 stalls the mux queue 502 until the NTBE is filled with a valid output 214. Flow then proceeds to step 812.

In step 812, the XIBQ 104 finally receives a line of instruction bytes 132 from the instruction cache 102 and loads it into the NTBE. This line of instruction bytes 132 contains the remaining bytes of the unknown-length instruction. Flow then proceeds to step 814.

In step 814, the length decoder 202/ripple logic unit 204 generates the instruction length 222 and the start/end bits 232/234 for the unknown-length instruction. In one embodiment, the XIBQ control logic unit 702 uses the instruction length 222 to calculate the number of remaining bytes of the unknown-length instruction (those located in the NTBE of the XIBQ 104 that was loaded in step 812). This remaining byte count is used in step 818 below to determine the position of the end bit 234. Flow then proceeds to step 816.

In step 816, the XIBQ control logic unit 702 shifts out the bottom entry. However, the M-stage control logic unit 512 does not load the ripple logic unit 204 output 214 for that bottom entry, because it was already placed in the mux queue 502 in step 804. Flow then proceeds to step 818.

In step 818, the length decoder 202/ripple logic unit 204 processes the new bottom entry of the XIBQ 104 (that is, the cache data received in step 812), and the M-stage control logic unit 512 loads the output 214 of the ripple logic unit 204 (which includes the end bit 234 of the unknown-length instruction) into the NTBE of the mux queue 502. Flow then proceeds to step 822.

In step 822, the M-stage control logic unit 512 extracts the unknown-length instruction (and any other instructions that can be extracted) from the bottom entry and the NTBE of the mux queue 502 and sends them to the F stage. Flow then proceeds to step 824.

In step 824, the M-stage control logic unit 512 shifts out the bottom entry of the mux queue 502. Flow ends at step 824.

As described above, the instruction formatter 106 of this embodiment solves the foregoing problem by releasing information (instruction bytes, start/end bits and accumulated prefix information) from the L stage as early as possible for those instructions whose information is available, even when not all of the information associated with the bottom entry of the XIBQ (x86 instruction byte queue) 104 is available yet.

[Improving instruction extraction through prefix accumulation]

Fig. 9 shows a detailed block diagram of the mux queue 502 of Fig. 5. In the embodiment of Fig. 9, the mux queue 502 has four entries: the bottom entry (BE), the next-to-bottom entry (NTBE), the second-from-bottom entry (SFBE) and the third-from-bottom entry (TFBE). Each entry of the mux queue 502 holds 16 bytes, and each byte position stores an instruction byte along with its start bit 232, end bit 234 and accumulated prefix information 238. As shown, the bytes of the BE are labeled 0 to 15 and the bytes of the NTBE are labeled 16 to 31; these labels also appear in Figure 10. The bytes of the SFBE are labeled 32 to 47.

Figure 10 shows a block diagram of a portion of the M stage of the instruction formatter 106 of Fig. 1. Figure 10 shows the accumulated prefix array 1002 and the instruction byte array 1004 of the mux queue 502. The information in the accumulated prefix array 1002 and the instruction byte array 1004 is actually stored in the BE and NTBE of the mux queue 502. However, the information in the mux queue 502 is provided by wires to the selection circuitry (dynamic logic in one embodiment), which includes the muxes 504/506/508 of Fig. 5. Figure 10 shows only the I1 mux 504; the I2 mux 506 and I3 mux 508 receive the same inputs as the I1 mux 504. The instruction muxes 504/506/508 are 16:1 muxes. As shown in Figure 10, the inputs of the I1 mux 504 are labeled 0 to 15. Each input of the I1 mux 504 receives 11 instruction bytes and the accumulated prefix information 238 that corresponds to the lowest-order byte of the 11 received instruction bytes. The lowest-order byte is the byte of the instruction byte array 1004 whose byte number equals the input number of the I1 mux 504. For example, input 8 of the I1 mux 504 receives bytes 8 to 18 of the mux queue 502 (that is, bytes 8-15 of the BE and bytes 16-18 of the NTBE) and the accumulated prefix information 238 for byte 8. The I1 mux 504 receives 11 instruction bytes because, although an x86 instruction may be up to 15 bytes long, it has at most 11 non-prefix bytes. The foregoing embodiment extracts and sends only the non-prefix bytes to the rest of the pipeline (that is, it strips the prefix bytes and replaces them with the accumulated prefix information 238), which greatly reduces the decoding work required of the later pipeline stages and allows the microprocessor 100 to realize various benefits.
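The input wiring of the I1 mux can be pictured as a sliding 11-byte window over the BE followed by the NTBE. The sketch below models only the instruction bytes (the accumulated prefix information for byte k would travel alongside); the function name and list representation are assumptions for illustration.

```python
def i1_mux_input(be, ntbe, k, width=11):
    """Bytes presented to input k of a 16:1 instruction mux.

    be and ntbe are the 16-byte bottom and next-to-bottom entries; input k
    sees bytes k .. k+width-1 of the concatenation BE + NTBE.
    """
    if not 0 <= k < len(be):
        raise ValueError("input number must address a byte of the BE")
    window = list(be) + list(ntbe)
    return window[k:k + width]


# Use the byte labels of Fig. 9 as stand-in byte values.
be = list(range(0, 16))
ntbe = list(range(16, 32))
input_8 = i1_mux_input(be, ntbe, 8)    # bytes 8..18
input_15 = i1_mux_input(be, ntbe, 15)  # bytes 15..25
```

Input 15 reaches byte 25, which is the tenth byte of the NTBE, so in this model no input ever needs more than the first ten bytes of the NTBE.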

Figure 11 shows a block diagram of the M-stage control logic unit 512 of Fig. 5. The M-stage control logic unit 512 includes a 2:1 mux 1114 that generates an instruction length LEN1 1122, which is the length of the first instruction (the first instruction I1 524 of Fig. 5) passing through the instruction formatter 106. The instruction length LEN1 1122 is passed down the pipeline and processed together with the first instruction I1 524. The mux 1114 selects either the output of a subtractor 1102 or the output of an adder 1116, depending on whether a partial-length condition existed in the previous clock cycle. The mux 1114 is controlled by a register 1118 that stores a bit indicating whether a partial-length condition existed in the previous clock cycle, as described in detail with respect to Figures 12 to 14. If a partial-length condition occurred, the mux 1114 selects the output of the adder 1116; otherwise, the mux 1114 selects the output of the subtractor 1102. The first input of the adder 1116 is the remaining length of the instruction, denoted remaining LEN1 1106, which is described in detail with respect to Figures 12 to 14. The M-stage control logic unit 512 also includes other logic (not shown) that calculates the remaining LEN1 1106 from the end bit 234 of the first instruction I1 524 (provided to the M-stage control logic unit 512 by the mux queue 502). The second input of the adder 1116 is the partial length of the current instruction, denoted partial LEN 1104, which is provided by a register loaded in the previous clock cycle, as described in detail with respect to Figure 12. The subtractor 1102 subtracts the byte position in the mux queue 502 of the end bit 234 of the previous instruction (END0 1112) from the byte position in the mux queue 502 of the end bit 234 of the first instruction I1 524 (END1 1108). It should be noted that although Figure 11 shows the operations of the M-stage control logic unit 512 as arithmetic, the M-stage control logic unit 512 need not use conventional adders/subtractors and may instead be implemented with combinatorial logic. For example, in one embodiment the bits are operated on in decoded form, so that a subtraction can be performed with Boolean AND-OR operations. The subtractors (not shown) used to calculate the lengths of the second instruction I2 526 and the third instruction I3 528 are similar to the subtractor for the first instruction I1 524, but subtract END1 from END2 and END2 from END3, respectively. Finally, the current offset into the mux queue 502 entry is determined as the byte following the last byte of the last instruction selected by the muxes 504/506/508.
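The selection performed by the 2:1 mux 1114 can be summarized in a few lines. This is a behavioral sketch only: the parameter names mirror the signal names of Figure 11, but the function itself and its use of ordinary integer arithmetic are assumptions (the text notes that the hardware may use decoded-form combinatorial logic instead).

```python
def compute_len1(partial_pending, partial_len, remaining_len1, end1, end0):
    """Behavioral model of the LEN1 selection of Figure 11.

    partial_pending: bit held in register 1118 (partial length last cycle)
    partial_len:     prefix bytes already shifted out (partial LEN 1104)
    remaining_len1:  bytes of the instruction still in the queue (1106)
    end1, end0:      end-bit byte positions of this and the previous
                     instruction (END1 1108, END0 1112)
    """
    adder_out = partial_len + remaining_len1
    subtractor_out = end1 - end0
    return adder_out if partial_pending else subtractor_out


# Partial-length case: 14 prefix bytes were shifted out earlier and one
# byte remains in the queue.
len_with_partial = compute_len1(True, 14, 1, end1=0, end0=0)
# Ordinary case: the instruction ends at byte 8, the previous one at byte 0.
len_ordinary = compute_len1(False, 0, 0, end1=8, end0=0)
```

The adder path exists because, once the prefix bytes have been shifted out of the queue, the previous end position is no longer available to the subtractor.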

Figure 12 shows a flowchart of the operation of a portion of the M stage of the instruction formatter 106 of Fig. 1. Flow begins at step 1201.

In step 1201, a new clock cycle begins, and the M-stage control logic unit 512 examines the BE and NTBE (Fig. 9) of the mux queue 502. Flow then proceeds to step 1202.

In step 1202, the M-stage control logic unit 512 controls the muxes 504/506/508 to send the instructions in the BE and (where possible) the NTBE of the mux queue 502 to the F stage of the instruction formatter 106. As described above, in one embodiment the M stage can extract three instructions in one clock cycle. Because an x86 instruction can be from one to 15 bytes long, the bottom entry of the mux queue 502 may hold from one to 16 x86 instructions. Multiple clock cycles may therefore be needed to extract all of the instructions in the BE of the mux queue 502. Furthermore, depending on whether the last byte of the BE is a prefix byte, an end byte or another type of byte, an instruction may span the BE and the NTBE, so the M-stage control logic unit 512 operates differently when extracting instructions and shifting out the BE of the mux queue 502, as described in detail below. In addition, the M-stage control logic unit 512 calculates the length of each extracted and sent instruction; in particular, it uses the logic of Figure 11 to calculate the length of the first instruction I1 524 (instruction length LEN1 1122 of Figure 11). If a partial-length condition existed in the previous clock cycle (described in detail in step 1212), the M-stage control logic unit 512 uses the stored partial LEN 1104 to calculate the instruction length LEN1 1122; otherwise, the M-stage control logic unit 512 uses the subtractor 1102 (Figure 11) to calculate the instruction length LEN1 1122. Flow then proceeds to step 1204.

In step 1204, the M-stage control logic unit 512 determines whether all of the instructions that end in the BE have been sent to the F stage. In one embodiment, the M stage can extract and send at most three instructions to the F stage in one clock cycle. Therefore, if the M stage has extracted three instructions from the bottom entry and the start bit 232 of at least one more instruction remains in the bottom entry, that instruction must be extracted in the next clock cycle. If all of the instructions that end in the BE have been sent to the F stage, flow proceeds to step 1206; otherwise, flow proceeds to step 1205.

In step 1205, the M-stage control logic unit 512 does not shift out the BE, so that in the next clock cycle the M-stage control logic unit 512 can extract and send more instructions from the BE. Flow returns to step 1201 to process the next clock cycle.

In step 1206, the M-stage control logic unit 512 determines whether the last byte of the BE is a prefix byte or a non-prefix byte. If the last byte of the BE is a non-prefix byte, flow proceeds to step 1216; if the last byte of the BE is a prefix byte, flow proceeds to step 1212.

In step 1212, the M-stage control logic unit 512 calculates the partial length of the instruction whose prefix bytes occupy the end of the BE, that is, the number of prefix bytes from the end byte of the previous instruction through the last byte (byte 15) of the BE. This calculation is performed by arithmetic logic (not shown) of the M-stage control logic unit 512. For example, in the example of Figure 13, the partial length of instruction b is 14. The prefix bytes between an end byte and the next start byte are in "no-man's land": they are in effect unnecessary in the mux queue 502, because their content is already captured in the accumulated prefix information 238, which is stored in the mux queue 502 with the opcode byte of the instruction. Consequently, if the BE ends in prefix bytes and all of the other instructions in the BE have been extracted in this clock cycle, the M-stage control logic unit 512 can shift out the BE (step 1214), because the content of the prefix bytes is preserved (it is accumulated onto the opcode byte in the following 16-byte line) and the M-stage control logic unit 512 saves the number of prefix bytes (in the partial LEN register 1104 of Figure 11) when it shifts them out of the mux queue 502. On the other hand, if the BE ends in a non-prefix byte that has not yet been extracted and sent, the M-stage control logic unit 512 cannot shift it out of the mux queue 502 (see step 1222). Flow then proceeds to step 1214.

In step 1214, the M-stage control logic unit 512 controls the mux queue 502 to shift out the BE. Flow returns to step 1201 to process the next clock cycle.

In step 1216, the M-stage control logic unit 512 determines whether the last byte of the BE is the end byte of an instruction, that is, whether its end bit 234 is true. If so, flow proceeds to step 1214; otherwise, flow proceeds to step 1218.

In step 1218, the M-stage control logic unit 512 determines whether the NTBE is valid. When the end byte of the last extracted instruction is the last byte of the BE (byte 15), or when the instruction spans into the NTBE and the NTBE is valid, the M-stage control logic unit 512 shifts out the BE; otherwise, the M-stage control logic unit 512 holds the BE until the next clock cycle. If the NTBE is valid, flow proceeds to step 1214; otherwise, flow proceeds to step 1222.

In step 1222, the M-stage control logic unit 512 does not shift out the BE. This is because the actual bytes of the instruction (that is, its non-prefix bytes) span the BE and the NTBE, and the NTBE is invalid. In this situation, the M-stage control logic unit 512 cannot determine the instruction length, because the end bit 234 of the instruction cannot be obtained from the invalid NTBE. Flow returns to step 1201 to process the next clock cycle and wait for the NTBE to be filled with valid data.
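The decision sequence of steps 1204 through 1222 can be collapsed into a single function that states what happens to the BE at the end of a cycle. The sketch below is an interpretation of the flowchart, not the patent's logic design; the function name, the boolean inputs and the returned strings are all assumptions chosen for readability.

```python
def be_action(all_ending_in_be_sent, last_byte_is_prefix,
              last_byte_is_end, ntbe_valid):
    """Decide what to do with the bottom entry (BE) this clock cycle.

    Returns one of:
      "hold"                    keep the BE (steps 1205 and 1222)
      "save_partial_and_shift"  save the partial length, then shift (1212, 1214)
      "shift"                   shift out the BE (step 1214)
    """
    if not all_ending_in_be_sent:      # step 1204 -> 1205
        return "hold"
    if last_byte_is_prefix:            # step 1206 -> 1212 -> 1214
        return "save_partial_and_shift"
    if last_byte_is_end:               # step 1216 -> 1214
        return "shift"
    if ntbe_valid:                     # step 1218 -> 1214
        return "shift"
    return "hold"                      # step 1222


# Cycle 0 of the Figure 13 example: instruction a was sent and the BE
# ends in prefix bytes of instruction b.
cycle0 = be_action(True, True, False, False)
# Cycle 1 of the Figure 13 example: b, c and d were sent and byte 15 is
# an end byte.
cycle1 = be_action(True, False, True, False)
```

Only the prefix-byte path updates the partial length register; every other path that shifts the BE leaves it untouched in this model.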

Figure 13 shows the contents of the mux queue 502 of Fig. 5 during two consecutive clock cycles to illustrate the operation of the M stage. The first set of mux queue 502 contents is for the first clock cycle, cycle 0, and the second set is for the second clock cycle, cycle 1. The figure shows only the bottom three entries. In Figure 13, "S" denotes a start byte (that is, its start bit 232 is true), "E" denotes an end byte (that is, its end bit 234 is true) and "P" denotes a prefix byte (as indicated by the accumulated prefix information 238). Four instructions are denoted a, b, c and d, and their start, end and prefix bytes are shown. The byte numbers shown correspond to those of Fig. 9, that is, bytes 0 to 47 located in the BE, NTBE and SFBE of the mux queue 502.

At the start of cycle 0, byte 1 of the BE contains the end byte Ea of instruction a, and bytes 2 to 15 of the BE contain the 14 prefix bytes Pb of instruction b. Because instruction b begins in the BE but its start byte is located in the NTBE rather than the BE, its partial length is calculated as fourteen bytes. The contents of the NTBE and SFBE are invalid, that is, the XIBQ (x86 instruction byte queue) 104 and the length decoder 202/ripple logic unit 204 have not yet provided cache data of the instruction stream and its associated information (for example, start bits 232, end bits 234 and accumulated prefix information 238) to any entry other than the BE.

During cycle 0, the M-stage control logic unit 512 examines the contents of the BE and NTBE (step 1201 of Figure 12) and sends instruction a to the F stage (step 1202). The M-stage control logic unit 512 also calculates the length of instruction a, which equals the difference between the end byte position of instruction a and the end byte position of the previous instruction. Then, because all of the instructions that end in the BE (instruction a) have been sent (step 1204) and the last byte of the BE (byte 15) is a prefix byte (step 1206), the M-stage control logic unit 512 calculates the partial length of instruction b as fourteen bytes and stores it in the partial LEN 1104 register (step 1212). Finally, the M-stage control logic unit 512 shifts the BE out of the mux queue 502 (step 1214).

Because the BE was shifted out in step 1214 during cycle 0 and the ripple logic unit 204 output 214 for another 16-byte line was shifted in, at the start of cycle 1 the BE contains: the start byte (Sb) and end byte (Eb) of instruction b at byte 0 (that is, instruction b has only a single non-prefix byte); five prefix bytes (Pc) of instruction c at bytes 1 to 5; the start byte (Sc) of instruction c at byte 6; the end byte (Ec) of instruction c at byte 8; the start byte (Sd) of instruction d at byte 9; and the end byte (Ed) of instruction d at byte 15.

During cycle 1, the M-stage control logic unit 512 examines the contents of the BE and NTBE (step 1201) and sends instructions b, c and d to the F stage (step 1202). The M-stage control logic unit 512 also calculates the following: the length of instruction b (LEN1 1122) (step 1202), which is 15 bytes in this example and equals the partial LEN 1104 (fourteen bytes in this example) plus the remaining length of instruction b (one byte in this example); the length of instruction c (eight bytes in this example), which equals the difference between the end byte position of instruction c and the end byte position of instruction b; and the length of instruction d (seven bytes in this example), which equals the difference between the end byte position of instruction d and the end byte position of instruction c. Then, because all of the instructions that end in the BE (instructions b, c and d) have been sent (step 1204), the last byte of the BE (byte 15) is a non-prefix byte (step 1206) and the last byte of the BE is an end byte (step 1216), the M-stage control logic unit 512 shifts the BE out of the mux queue 502 (step 1214).
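The arithmetic of the Figure 13 example can be checked with a short calculation. The byte positions below are taken from the description of cycles 0 and 1; the helper function and variable names are assumptions introduced for the example.

```python
ENTRY_SIZE = 16  # bytes per mux queue entry


def partial_length(prev_end_pos, entry_size=ENTRY_SIZE):
    """Prefix bytes from just after the previous end byte through byte 15."""
    return (entry_size - 1) - prev_end_pos


# Cycle 0: instruction a ends at byte 1, bytes 2..15 are prefixes of b.
partial_len_b = partial_length(prev_end_pos=1)

# Cycle 1 (new BE): Eb at byte 0, Ec at byte 8, Ed at byte 15.
end_b, end_c, end_d = 0, 8, 15
remaining_b = end_b + 1              # one byte of b is in the new BE
len_b = partial_len_b + remaining_b  # adder path
len_c = end_c - end_b                # subtractor path
len_d = end_d - end_c                # subtractor path
```

The three lengths add up to the 14 bytes shifted out in cycle 0 plus the 16 bytes of the new BE, which is a quick consistency check on the example.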

In the embodiment of Figure 13, accumulating the prefix information of instruction b onto its opcode byte as accumulated prefix information 238 and saving the partial LEN 1104 of instruction b allow the instruction formatter 106 to shift out the BE that contains the prefix bytes of instruction b and, in the next clock cycle, to extract and send up to three instructions from the mux queue 502. Without the accumulated prefix information 238 and the saved partial LEN 1104, this would not be possible (that is, instructions c and d could not be extracted and sent in the same cycle as instruction b, but would have to be handled in a following clock cycle). Keeping the functional units of the microprocessor 100 supplied with enough instructions to process reduces under-utilization of the resources of the microprocessor 100.

Figure 14 shows the contents of the mux queue 502 of Figure 5 during two successive clock cycles to illustrate the operation of the M stage. The example of Figure 14 is similar to that of Figure 13; however, the positions of the instructions and the timing with which lines enter and leave the mux queue 502 differ.

At the start of cycle 0, the BE contains the end byte of instruction a (Ea) at byte 1 and 14 prefix bytes of instruction b (Pb) at bytes 2 through 15. Because instruction b begins in the BE but its start byte lies in the NTBE, partial LEN 1104 is calculated as 14. The NTBE contains: the start byte (Sb) and end byte (Eb) of instruction b at byte 16 (that is, apart from its prefix bytes, instruction b is a single byte); five prefix bytes of instruction c (Pc) at bytes 17 through 21; the start byte of instruction c (Sc) at byte 22; the end byte of instruction c (Ec) at byte 27; three prefix bytes of instruction d (Pd) at bytes 28 through 30; and the start byte of instruction d (Sd) at byte 31. The SFBE contains the end byte of instruction d (Ed) at byte 41 and the start byte of instruction e (Se) at byte 42.

During cycle 0, the M-stage control logic 512 examines the contents of the BE and NTBE (step 1201 of Figure 12) and sends instruction a to the F stage (step 1202). The M-stage control logic 512 also calculates the length of instruction a, which equals the difference between the end byte position of instruction a and the end byte position of the previous instruction. Furthermore, because all instructions that end in the BE (instruction a) have been sent (step 1204) and the last byte of the BE (byte 15) is a prefix byte (step 1206), the M-stage control logic 512 calculates the partial length of instruction b as 14 bytes and stores it in the partial LEN 1104 register (step 1212). Finally, the M-stage control logic 512 shifts the BE out of the mux queue 502 (step 1214).

Because of the shift-out at step 1214 in cycle 0, cycle 1 begins with the BE holding the cycle-0 contents of the NTBE and the NTBE holding the cycle-0 contents of the SFBE.

During cycle 1, the M-stage control logic 512 examines the contents of the BE and NTBE (step 1201) and sends instructions b, c, and d to the F stage (step 1202). In addition, the M-stage control logic 512 calculates the following: the length of instruction b (LEN1 1122) (step 1202), which is 15 bytes in this example and equals partial LEN 1104 (14 bytes in this example) plus the remaining length of instruction b (one byte in this example); the length of instruction c (11 bytes in this example), which equals the difference between the end byte position of instruction c and the end byte position of instruction b; and the length of instruction d (14 bytes in this example), which equals the difference between the end byte position of instruction d and the end byte position of instruction c. Furthermore, because all instructions that end in the BE (instructions b, c, and d) have been sent (step 1204), the last byte of the BE (byte 15) is not a prefix byte (step 1206), the last byte of the BE is not an end byte (step 1216), and the NTBE is valid (step 1218), the M-stage control logic 512 shifts the BE out of the mux queue 502 (step 1214).
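
The shift decision exercised in cycles 0 and 1 above can be modeled, purely for illustration, by the short Python sketch below. The function and field names are hypothetical, and the behavior on the branches not exercised by the example (for instance, holding the BE when not every instruction ending in it has been sent) is an assumption inferred from the walkthrough rather than something stated in this passage.

```python
from dataclasses import dataclass


@dataclass
class ShiftDecision:
    shift_out: bool   # whether step 1214 is performed this cycle
    partial_len: int  # value saved in partial LEN 1104 (0 if nothing is saved)


def decide_shift(all_ending_sent, last_is_prefix, last_is_end,
                 ntbe_valid, trailing_prefix_bytes=0):
    """Model of steps 1204-1218 for one clock cycle (illustrative only)."""
    if not all_ending_sent:          # step 1204: instructions still pending in the BE
        return ShiftDecision(False, 0)   # assumed: keep the BE for the next cycle
    if last_is_prefix:               # step 1206: an instruction starts with prefixes here
        # step 1212 saves the prefix count, then step 1214 shifts the BE out
        return ShiftDecision(True, trailing_prefix_bytes)
    if last_is_end:                  # step 1216: the BE ends exactly on an instruction
        return ShiftDecision(True, 0)
    # step 1218: an instruction spills into the NTBE; shift only if the NTBE is valid
    return ShiftDecision(ntbe_valid, 0)
```

With the Figure 14 inputs, cycle 0 corresponds to `decide_shift(True, True, False, True, 14)` and cycle 1 to `decide_shift(True, False, False, True)`; both shift the BE out, and only the first saves a partial length.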

In the embodiment shown in Figure 14, the instruction formatter 106 can extract and send three instructions containing up to 40 instruction bytes in a single clock cycle, as shown in Figure 15.
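
The length arithmetic of Figures 13 and 14 reduces to differences of end-byte positions plus the saved partial length. A minimal Python sketch, assuming byte positions are numbered as in the figures and using a hypothetical helper name, is:

```python
def instruction_lengths(end_positions, last_counted, partial_len=0):
    """Lengths of the instructions extracted in one cycle.

    end_positions: end-byte position of each extracted instruction, in order.
    last_counted:  position of the last byte already accounted for
                   (the byte just before the first byte of the current BE,
                   or the end byte of the previously extracted instruction).
    partial_len:   saved partial LEN 1104 (prefix bytes already shifted out);
                   it applies only to the first instruction.
    """
    lengths = []
    for i, end in enumerate(end_positions):
        n = end - last_counted
        if i == 0:
            n += partial_len
        lengths.append(n)
        last_counted = end
    return lengths
```

For Figure 13, `instruction_lengths([0, 8, 15], -1, 14)` gives 15, 8, and 7 bytes; for Figure 14, `instruction_lengths([16, 27, 41], 15, 14)` gives 15, 11, and 14 bytes, 40 in total.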

[Detection, marking, and accumulation of bad branch predictions for fast processing of the instruction stream]

Referring again to Figure 1, when the fetch unit 126 outputs the current fetch address 142 to fetch a line of instruction bytes from the instruction cache 102 for the XIBQ 104, the BTAC 128 also receives the current fetch address 142. If the current fetch address 142 hits in the BTAC 128, a branch instruction was previously executed at that fetch address; the BTAC 128 can therefore predict whether the branch instruction will be taken and, if so, also predicts the target address 146. Notably, the BTAC 128 makes its prediction before the microprocessor 100 has extracted or decoded the branch instruction from the instruction byte stream. Consequently, the branch instruction predicted by the BTAC 128 may not actually be present in the fetched cache line of instruction bytes; that is, the BTAC 128 has made a bad prediction, causing the microprocessor 100 to branch erroneously. Note that a bad prediction is not the same as an incorrect prediction. Because program execution is dynamic (for example, the condition codes or status data on which a branch instruction depends change), all branch predictors can, by their nature, mispredict. A bad prediction here, however, means that the BTAC 128 made its prediction for a different cache line, or for the same cache line whose contents have since changed. As described in U.S. Patent 7,134,005, this can happen for several reasons: the BTAC 128 stores only a partial address tag rather than the full address tag, which causes tag aliasing; the BTAC 128 stores only virtual address tags rather than physical address tags, which causes virtual aliasing; and self-modifying code. When this occurs, the microprocessor 100 must ensure that it does not send down the pipeline the badly predicted instruction or the subsequent instructions that were erroneously fetched because of it.

If the taken indication 154 (Figure 1) is true for an instruction byte that is not in fact the first byte of an instruction, as shown in Figure 16, the BTAC 128 has made a bad prediction and has caused the microprocessor 100 to branch erroneously. As described above, a true taken indication 154 from the BTAC 128 means that the BTAC 128 considers the instruction byte to be the first byte (that is, the opcode) of a branch instruction, and the fetch unit 126 has branched to the target address 146 predicted by the BTAC 128.

One way to detect a bad BTAC prediction would be to wait until the individual instructions have been extracted from the instruction byte stream and their lengths are known, and then scan the non-first bytes of each instruction to see whether any taken indication 154 is true. However, this check is too slow: it requires a great deal of masking and shifting, and the per-byte results must be ORed together, which creates timing problems.

To avoid the timing problem, embodiments of the present invention accumulate the information provided by the taken indications 154 as part of the processing performed by the ripple logic 204, and use the accumulated information after the M stage extracts an instruction. Specifically, the ripple logic 204 detects the condition and ripples an indicator through to the last byte of the instruction, so that only a single byte, the last byte of the instruction, needs to be examined. When an instruction is extracted at the M stage, it is determined whether the instruction is a bad instruction, that is, whether the instruction should be included in the instruction stream and continue down the pipeline.

Figure 17 shows the signals that make up the ripple logic 204 output 214. The ripple logic 204 output signals shown in Figure 17 are similar to those shown in Figure 2, but additionally include a bad BTAC bit 1702 for each instruction byte, described in detail below. In addition, the ripple logic 204 output includes: a signal that, when true, indicates that the corresponding instruction byte is the first byte of a branch instruction predicted by the BTAC 128, but that the BTAC 128 predicted the branch will not be taken; and another signal indicating that the preceding byte is the end byte of an instruction.

Figure 18 is a flowchart of the operation of the microprocessor 100 of Figure 1. Flow begins at step 1802.

At step 1802, the BTAC 128 predicts that a branch instruction is present in the cache line specified by the current fetch address 142 provided by the fetch unit 126, and that the branch instruction will be taken. The BTAC 128 also predicts the target address 146 of the branch instruction. Consequently, the XIBQ 104 receives a first line of 16 instruction bytes from the instruction cache 102 at the location specified by the current fetch address 142, and then receives a second line of 16 instruction bytes from the instruction cache 102 at the location specified by the predicted target address 146. Flow proceeds to step 1804.

At step 1804, the XIBQ 104 stores each taken indication 154 (Figure 1) along with the corresponding instruction bytes of the two lines received at step 1802. Flow proceeds to step 1806.

At step 1806, the length decoders 202 and the ripple logic 204 process the first line of instruction bytes and detect the condition in which an instruction byte has a true taken indication 154 but is not the first byte of an instruction, which is the error condition shown in Figure 16. That is, the ripple logic 204 knows which of the 16 instruction bytes of the line are first bytes, as it must in order to set the end bit 234. Accordingly, knowing the first non-prefix byte of each instruction, the ripple logic 204 examines the taken indications 154 and detects the condition. Flow proceeds to step 1808.

At step 1808, upon detecting a true taken indication 154 on a byte that is not the first byte of an instruction, the ripple logic 204 sets the bad BTAC bit 1702 of that instruction byte to true. The ripple logic 204 also ripples the true bad BTAC bit 1702 from that byte position through the remaining bytes of the 16-byte line. Furthermore, if the end byte of the instruction does not appear in the first line of instruction bytes, the ripple logic 204 updates state (for example, a flip-flop, not shown) to indicate that a bad BTAC 128 prediction has been detected for an instruction in the current line. Then, when the ripple logic 204 processes the second line of instruction bytes, because the state is true, the ripple logic 204 sets the bad BTAC bit 1702 for every byte of the second line. Flow proceeds to step 1812.
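
The rippling described in step 1808 may be sketched in Python as follows. This is a behavioral illustration only, not the circuit: the function name and the list-of-booleans representation are hypothetical, and the rule used for the carried state (carry into the next line when no end byte follows the point of detection) is an assumption drawn from the description above.

```python
def ripple_bad_btac(taken, first_byte, end_byte, carry_in=False):
    """Return (bad_bits, carry_out) for one line of instruction bytes.

    taken[i]      -- taken indication 154 for byte i
    first_byte[i] -- True if byte i is the first (opcode) byte of an instruction
    end_byte[i]   -- True if byte i is the end byte of an instruction
    carry_in      -- state carried over from the previous line
    """
    bad_bits = []
    state = carry_in
    ended = False
    for t, f, e in zip(taken, first_byte, end_byte):
        if t and not f:
            state = True          # taken indication on a non-first byte
        bad_bits.append(state)    # once set, the bit ripples to the end of the line
        if state and e:
            ended = True          # an end byte will carry the bit to the M stage
    # If no end byte saw the bit, remember it for the next line.
    return bad_bits, state and not ended
```

When the state is carried into the next line, every byte of that line receives a true bad BTAC bit, matching the behavior described for the second line.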

At step 1812, the mux queue 502 stores the ripple logic 204 output 214, including the bad BTAC bits 1702, along with each instruction byte of the first and second lines. Flow proceeds to step 1814.

At step 1814, the M-stage control logic 512 finds an instruction byte whose bad BTAC bit 1702 is true and whose end bit 234 is also true (that is, it detects a bad BTAC 128 prediction). The M-stage control logic 512 therefore refrains from sending the offending instruction and all subsequent instructions to the F stage by clearing the corresponding valid bits 534/536/538. However, an instruction that precedes the offending instruction is valid and is sent to the F stage. As described above, because the true bad BTAC bit 1702 is rippled to the end byte of the offending instruction, the M-stage control logic 512 needs to examine only a single byte, namely the byte marked by the end bit 234, which significantly eases the timing constraints. Flow proceeds to step 1816.
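
The single-byte check of step 1814 can be illustrated with the Python sketch below. The function name and data layout are hypothetical; the sketch only assumes, as the step describes, that the M stage looks at the bad BTAC bit on each instruction's end byte and invalidates the offending instruction together with everything after it.

```python
def m_stage_valid_bits(end_positions, bad_btac):
    """Valid flag for each instruction extracted this cycle (up to three).

    end_positions -- end-byte index of each extracted instruction, in order
    bad_btac      -- bad BTAC bit 1702 for every byte position examined
    """
    valid = []
    poisoned = False
    for end in end_positions:
        if bad_btac[end]:       # only the end byte is inspected
            poisoned = True     # this instruction and all later ones are dropped
        valid.append(not poisoned)
    return valid
```

An instruction that ends before the bad bit is first set keeps its valid bit, which corresponds to the statement that a preceding instruction is still sent to the F stage.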

At step 1816, the microprocessor 100 invalidates the erroneous entry in the BTAC 128. In addition, the microprocessor 100 flushes all contents of the XIBQ 104 and the mux queue 502 and causes the fetch unit 126 to update the current fetch address 142 so as to refetch the instruction bytes at the location where the BTAC 128 made the bad prediction. On the refetch, the BTAC 128 does not produce the bad prediction because the bad entry has been invalidated; that is, on the refetch the BTAC 128 predicts that no branch is taken. In one embodiment, step 1816 is performed by the F stage of the instruction formatter 106 and/or by the instruction translator 112. Flow ends at step 1816.

[Efficient determination of x86 instruction length]

Determining the length of an x86 instruction is complicated, as described in Chapter 2 of the Intel IA-32 Architecture Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M. The total instruction length is the sum of: the number of prefix bytes (if any); the number of opcode bytes (1, 2, or 3); the ModR/M byte, if present; the SIB byte, if present; the length of the address displacement (if any); and the length of the immediate data (if any). The following characteristics or requirements of x86 instructions affect the determination of the length (other than prefixes):

The number of opcode bytes is:

3, if the first two bytes are 0F 38/3A

2, if the first byte is 0F and the second byte is not 38/3A

1, otherwise

Whether a ModR/M byte is present is determined by the opcode, as follows:

If the opcode is three bytes, the ModR/M byte is mandatory

If the opcode is one or two bytes, the opcode bytes are examined

Whether a SIB byte is present is determined by the ModR/M byte.

Whether a displacement is present is determined by the ModR/M byte.

The displacement size is determined by the ModR/M byte and the current address size (AS).

Whether immediate data is present is determined by the opcode bytes.

The size of the immediate data is determined by the opcode bytes, the current operand size (OS), the current AS, and the REX.W prefix; notably, the ModR/M byte does not affect the immediate data size.

If there is no ModR/M byte, there is no SIB byte or displacement.

For purposes of determining instruction length, the opcode and ModR/M bytes of an instruction take only five forms:

opcode

0F + opcode

opcode + ModR/M

0F + opcode + ModR/M

0F + 38/3A + opcode + ModR/M
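
The opcode-byte rule and the five forms can be expressed compactly. The Python sketch below is illustrative: both function names are hypothetical, and whether a one-byte or two-byte opcode takes a ModR/M byte is passed in as a parameter because this passage does not give the per-opcode table.

```python
def opcode_byte_count(b0, b1):
    """Number of opcode bytes, given the first two non-prefix bytes."""
    if b0 == 0x0F:
        return 3 if b1 in (0x38, 0x3A) else 2
    return 1


def form_of(b0, b1, has_modrm):
    """Name which of the five forms an instruction takes."""
    n = opcode_byte_count(b0, b1)
    if n == 3:
        # ModR/M is mandatory for three-byte opcodes
        return "0F + 38/3A + opcode + ModR/M"
    lead = "0F + " if n == 2 else ""
    return lead + "opcode" + (" + ModR/M" if has_modrm else "")
```

For example, `form_of(0x89, 0xC8, True)` returns `"opcode + ModR/M"` and `form_of(0x0F, 0x05, False)` returns `"0F + opcode"`.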

Figure 19 is a detailed block diagram of a length decoder 202 of Figure 2. Figure 2 shows 16 length decoders 202; Figure 19 shows a representative length decoder 202, denoted n. As shown in Figure 2, each length decoder 202 corresponds to one byte of the instruction byte stream 134: length decoder 0 corresponds to instruction byte 0, length decoder 1 corresponds to instruction byte 1, and so on through length decoder 15, which corresponds to instruction byte 15. Each length decoder 202 includes a programmable logic array (PLA) 1902, a 4:1 multiplexer 1906, and an adder 1904.

The PLA 1902 receives the address size (AS), operand size (OS), and REX.W values 218 shown in Figure 2. AS indicates the address size, OS indicates the operand size, and the REX.W value indicates the presence of a REX.W prefix. The PLA 1902 also receives its corresponding instruction byte 134 (denoted n) and the next higher-order instruction byte 134 (denoted n+1). For example, PLA 3 1902 receives instruction bytes 3 and 4.

The PLA 1902 generates an immLen value 1916, which is provided to a first input of the adder 1904. The immLen value 1916 is between 1 and 9, inclusive, and is the sum of the number of opcode bytes and the size of the immediate data (0, 1, 2, 4, or 8). In determining the immLen value 1916, the PLA 1902 assumes that its two instruction bytes 134 are the first two opcode bytes of an instruction, and generates the immLen value 1916 from those two opcode bytes (or from a single opcode byte if the first is not 0F), the address size (AS), the operand size (OS), and the REX.W value 218.

The PLA 1902 also generates an eaLen value 1912, which is provided to the multiplexers 1906 of the three next lower-order length decoders 202. The eaLen value 1912 is between 1 and 6, inclusive, and is the sum of the number of ModR/M bytes (the PLA assumes that a ModR/M byte is present), the number of SIB bytes (0 or 1), and the displacement size (0, 1, 2, or 4). In determining the eaLen value 1912, the PLA 1902 assumes that its first instruction byte 134 is a ModR/M byte, and generates the eaLen value 1912 from the ModR/M byte and the address size (AS) 218.
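
The eaLen rule can be illustrated with the standard x86 ModR/M encoding. The Python sketch below is not the PLA 1902 itself; the function name is hypothetical, and the handling of a SIB byte whose base field is 5 (which adds a 32-bit displacement when mod is 0) is taken from the x86 encoding rules and is assumed, not stated, to be covered by the PLA through its second input byte.

```python
def ea_len(modrm, next_byte, addr_size):
    """ModR/M byte + optional SIB byte + displacement size, in bytes.

    modrm     -- the byte assumed to be the ModR/M byte
    next_byte -- the following byte (the SIB byte, if one is present)
    addr_size -- effective address size in bits: 16, 32, or 64
    """
    mod = (modrm >> 6) & 0x3
    rm = modrm & 0x7
    sib = 0
    disp = 0
    if addr_size == 16:
        # 16-bit addressing has no SIB byte
        if mod == 1:
            disp = 1
        elif mod == 2 or (mod == 0 and rm == 6):
            disp = 2
    else:
        if mod != 3 and rm == 4:
            sib = 1                      # rm = 100b selects a SIB byte
        if mod == 1:
            disp = 1
        elif mod == 2:
            disp = 4
        elif mod == 0:
            if rm == 5:
                disp = 4                 # disp32 with no base register
            elif rm == 4 and (next_byte & 0x7) == 5:
                disp = 4                 # SIB with base = 101b and mod = 00b
    return 1 + sib + disp
```

The result always falls in the range 1 to 6, consistent with the range given for the eaLen value 1912.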

One input of the multiplexer 1906 receives a zero value. The other three inputs of the multiplexer 1906 receive the eaLen values 1912 from the three next higher-order PLAs 1902. The multiplexer 1906 selects one of its inputs to provide as its output an eaLen value 1918, which is in turn provided to a second input of the adder 1904. In one embodiment, to reduce propagation delay, the multiplexer 1906 is omitted and each eaLen value 1912 is fed to the adder 1904 as a tri-state wired-OR signal.

The adder 1904 adds the immLen value 1916 and the selected eaLen value 1918 to produce the final instruction length 222 shown in Figure 2.

The PLA 1902 generates a control signal 1914 that controls the multiplexer 1906 according to which of the five forms described above it detects, as follows:

1. For the following instruction forms, which have no ModR/M byte, select the zero value:

opcode only, or

0F + opcode

2. For the following instruction form, select PLA n+1:

opcode + ModR/M

3. For the following instruction form, select PLA n+2:

0F + opcode + ModR/M

4. For the following instruction form, select PLA n+3:

0F + 38/3A + opcode + ModR/M
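
The selection rules above, followed by the addition performed in the adder 1904, can be sketched in Python as shown below. The function and parameter names are hypothetical, and the immLen and eaLen inputs are taken as given rather than derived from opcode tables, which this passage does not provide.

```python
def instruction_length(imm_len, ea_lens, opcode_bytes, has_modrm):
    """Model of multiplexer 1906 followed by adder 1904 for decoder n.

    imm_len      -- immLen from PLA n (opcode bytes + immediate size)
    ea_lens      -- eaLen values from PLA n+1, PLA n+2, PLA n+3, in that order
    opcode_bytes -- 1, 2, or 3, as detected by PLA n
    has_modrm    -- whether the detected form includes a ModR/M byte
    """
    if not has_modrm:
        selected = 0                          # rule 1: zero input
    else:
        # rules 2-4: the ModR/M byte sits right after the opcode bytes,
        # so it is the byte examined by PLA n + opcode_bytes
        selected = ea_lens[opcode_bytes - 1]
    return imm_len + selected
```

As a usage example, an instruction of the form 0F + opcode + ModR/M with no immediate data and a register-direct ModR/M byte has immLen 2 and a selected eaLen of 1, giving a length of 3 bytes, not counting prefixes.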

Figure 20 shows the arrangement of the 16 length decoders 202. PLA 15 1902 receives instruction byte 15 and instruction byte 0 of the previous line, and multiplexer 15 1906 receives the eaLen values 1912 of three PLAs 1902 (not shown) that examine instruction bytes 0/1, 1/2, and 2/3 of the previous line, respectively.

An advantage of having each PLA 1902 examine only two bytes at a time is that it greatly reduces the total number of minterms required, and thus the size of the logic on the die. This design strikes a balance between reducing the total number of minterms and the delay that the timing requirements allow.

Figure 21 is a flowchart of the operation of the length decoders 202 of Figure 20. Flow begins at step 2102.

At step 2102, for each instruction byte 134 from the XIBQ 104, the corresponding PLA 1902 examines two instruction bytes 134: its corresponding instruction byte 134 and the next instruction byte 134. For example, PLA 3 1902 examines instruction bytes 3 and 4. Flow proceeds to steps 2104 and 2106 concurrently.

At step 2104, each PLA 1902 assumes that its two instruction bytes 134 are the first two opcode bytes of an instruction and generates the immLen value 1916 from the two instruction bytes 134, the operand size (OS), the address size (AS), and the REX.W value. Specifically, the immLen value 1916 is the sum of the number of opcode bytes (1, 2, or 3) and the size of the immediate data (0, 1, 2, 4, or 8). Flow proceeds to step 2114.

At step 2106, each PLA 1902 assumes that its first instruction byte 134 is a ModR/M byte, generates the eaLen value 1912 from the ModR/M byte and the address size (AS), and provides the eaLen value 1912 to the multiplexers 1906 of the next three lower-order length decoders 202. Specifically, the eaLen value 1912 is the sum of the number of ModR/M bytes (1), the number of SIB bytes (0 or 1), and the displacement size (0, 1, 2, or 4). Flow proceeds to step 2108.

At step 2108, each multiplexer 1906 receives the zero input and the eaLen values 1912 from the three next higher-order PLAs 1902. For example, multiplexer 3 1906 receives the eaLen values 1912 from PLAs 4, 5, and 6 1902. Flow proceeds to step 2112.

At step 2112, each PLA 1902 generates the control signal 1914 to its corresponding multiplexer 1906 to select one of the inputs according to the five forms described above. Flow proceeds to step 2114.

At step 2114, each adder 1904 adds the immLen value 1916 to the eaLen value 1918 selected by the multiplexer 1906 to produce the instruction length 222. Flow proceeds to step 2116.

At step 2116, if an LMP is present, the L stage spends an additional clock cycle for each instruction that contains an LMP, as described above, particularly with respect to Figures 1 through 4.

The foregoing describes only embodiments of the present invention and is not intended to limit the scope of its claims. Equivalent changes or modifications made by persons skilled in the computer arts without departing from the spirit disclosed by the invention shall fall within the scope of the claims. For example, software can enable the function, fabrication, modeling, simulation, description, and/or testing of the disclosed apparatus and method. This can be accomplished with programming languages (for example, C or C++), hardware description languages (HDL), including Verilog HDL and VHDL, or other available programs. Such software can be disposed in a computer-usable medium such as a semiconductor, magnetic disk, or optical disc (for example, CD-ROM or DVD-ROM). Embodiments of the disclosed apparatus and method may be included in an intellectual property core (IP core), such as a microcontroller core (for example, embodied in HDL), and transformed into hardware in the production of integrated circuits. Moreover, embodiments of the disclosed apparatus and method may be implemented as a combination of hardware and software. Therefore, the scope of the invention is not limited to any of the illustrative embodiments, but should be defined by the claims and their equivalents. Specifically, the invention may be implemented within a microprocessor device that may be used in a general-purpose computer. Finally, those skilled in the art can use the disclosed concepts and specific embodiments as a basis for designing or modifying other structures to carry out the same purposes without departing from the scope of the claims.

Claims

1. An apparatus for a microprocessor, for extracting instructions from a stream of instruction bytes in the microprocessor, the microprocessor having an instruction set architecture with variable-length instructions, the microprocessor comprising a queue, each entry of the queue storing a line of the instruction byte stream and accumulated prefix information corresponding to each instruction byte of the line, wherein the queue has a bottom entry, the apparatus comprising:

(a) means for detecting a condition, the condition comprising that an initial portion of an instruction in a first line of the lines of instruction bytes, stored in the bottom entry, has not yet been extracted from the queue, wherein the instruction bytes of the initial portion of the instruction are prefix bytes;

(b) means for, in response to detecting the condition, saving the length of the initial portion of the instruction, shifting the first line in the bottom entry out of the queue, and shifting a second line of the lines of instruction bytes into the bottom entry;

(c) means for extracting, from the second line in the bottom entry, the instruction bytes of the not-yet-extracted instruction, and for extracting the accumulated prefix information from the second line in the bottom entry in place of the prefix bytes of the initial portion of the instruction that were shifted out of the queue;

(d) means for calculating the length of the previously unextracted instruction based on the saved length of the initial portion of the instruction; and

(e) means for extracting, based on the calculated length, instructions other than the previously unextracted instruction from the second line in the bottom entry.

2. The apparatus according to claim 1, wherein means (a) and means (b) operate during a first clock cycle of the microprocessor, and means (c), means (d), and means (e) operate during a second clock cycle following the first clock cycle.

3. The apparatus according to claim 1, wherein the queue further comprises a next-to-bottom entry, and the apparatus further comprises:

(f) means for shifting a third line of the lines of instruction bytes into the next-to-bottom entry when means (b) operates; and

(g) means for, when means (e) operates, extracting, based on the calculated length, instructions other than the previously unextracted instruction from the second line in the bottom entry and the third line in the next-to-bottom entry.

4. The apparatus according to claim 1, wherein the condition further comprises: all instructions that end in the first line in the bottom entry have been extracted.

5. The apparatus according to claim 1, wherein the accumulated prefix information extracted from the second line in the bottom entry corresponds to the opcode byte of the not-yet-extracted instruction.

6. The apparatus according to claim 1, wherein the apparatus further comprises:

(f) means for, before means (a) operates, extracting one or more instructions from the first line in the bottom entry, and refraining from shifting the first line in the bottom entry out of the queue until all instructions that end in the bottom entry have been extracted.

7. The apparatus according to claim 6, wherein the apparatus further comprises:

(g) means for, after means (f) operates, if the condition has not been detected and the last byte of the bottom entry is an end byte of an instruction: shifting the first line in the bottom entry out of the queue; and

(h) means for, after means (f) operates, if the condition has not been detected and the last byte of the bottom entry is not an end byte of an instruction: determining whether a next-to-bottom entry of the queue is valid; and, if the next-to-bottom entry is valid, shifting the first line in the bottom entry out of the queue; otherwise, refraining from shifting the first line in the bottom entry out of the queue.

8. The apparatus according to claim 1, wherein the apparatus further comprises:

means for, before means (a) operates, determining an opcode byte for each of a plurality of instructions in the first line, accumulating the prefix information onto the opcode byte of each instruction, and loading the accumulated prefix information into the queue.

9. A method for a microprocessor, for extracting instructions from a stream of instruction bytes in the microprocessor, the microprocessor having an instruction set architecture with variable-length instructions, the microprocessor comprising a queue, each entry of the queue storing a line of the instruction byte stream and accumulated prefix information corresponding to each instruction byte of the line, the queue having a bottom entry, the method comprising:

(a) detecting a condition, the condition comprising that an initial portion of an instruction in a first line of the lines of instruction bytes, stored in the bottom entry, has not yet been extracted from the queue, wherein the instruction bytes of the initial portion of the instruction are prefix bytes;

(b) in response to detecting the condition, saving the length of the initial portion of the instruction, shifting the first line in the bottom entry out of the queue, and shifting a second line of the lines of instruction bytes into the bottom entry;

(c) extracting, from the second line in the bottom entry, the instruction bytes of the not-yet-extracted instruction, and extracting the accumulated prefix information from the second line in the bottom entry in place of the prefix bytes of the initial portion of the instruction that were shifted out of the queue;

(d) calculating the length of the previously unextracted instruction based on the saved length of the initial portion of the instruction; and

(e) extracting, based on the calculated length, instructions other than the previously unextracted instruction from the second line in the bottom entry.

10. The method according to claim 9, wherein steps (a) and (b) are performed during a first clock cycle of the microprocessor, and steps (c), (d), and (e) are performed during a second clock cycle following the first clock cycle.

11. The method according to claim 9, wherein the queue further comprises a next-to-bottom entry, and step (b) further comprises:

(f) shifting a third line of the lines of instruction bytes into the next-to-bottom entry;

wherein step (e) further comprises:

(g) extracting, based on the calculated length, instructions other than the previously unextracted instruction from the second line in the bottom entry and the third line in the next-to-bottom entry.

12. The method according to claim 9, wherein the condition further comprises: all instructions that end in the first line in the bottom entry have been extracted.

13. The method according to claim 9, wherein the accumulated prefix information extracted from the second line in the bottom entry corresponds to the opcode byte of the not-yet-extracted instruction.

14. The method according to claim 9, further comprising:

(f) before performing step (a), extracting one or more instructions from the first line in the bottom entry, and refraining from shifting the first line in the bottom entry out of the queue until all instructions that end in the bottom entry have been extracted.

15. The method according to claim 14, further comprising:

after performing step (f), if the condition has not been detected and the last byte of the bottom entry is an end byte of an instruction:

(g) shifting the first line in the bottom entry out of the queue;

after performing step (f), if the condition has not been detected and the last byte of the bottom entry is not an end byte of an instruction:

(h) determining whether a next-to-bottom entry of the queue is valid; and

after performing step (f), if the next-to-bottom entry is valid, shifting the first line in the bottom entry out of the queue; otherwise, refraining from shifting the first line in the bottom entry out of the queue.

16. The method according to claim 9, further comprising:

before performing step (a), determining an opcode byte for each of a plurality of instructions in the first line, accumulating the prefix information onto the opcode byte of each instruction, and loading the accumulated prefix information into the queue.