EP0924603A2 - Compiler controlled dynamic scheduling of program instructions - Google Patents

Publication number
EP0924603A2
EP0924603A2 EP98310063A EP98310063A EP0924603A2 EP 0924603 A2 EP0924603 A2 EP 0924603A2 EP 98310063 A EP98310063 A EP 98310063A EP 98310063 A EP98310063 A EP 98310063A EP 0924603 A2 EP0924603 A2 EP 0924603A2
Authority
EP
European Patent Office
Prior art keywords
program instructions
instruction
computer
instructions
dependency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP98310063A
Other languages
English (en)
French (fr)
Other versions
EP0924603A3 (de
Inventor
Gerard Paul D'arcy
Sanjay Jinturkar
C. John Glossner
Stamatis Vassiliadis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia of America Corp
Original Assignee
Lucent Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lucent Technologies Inc
Publication of EP0924603A2
Publication of EP0924603A3

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/43Checking; Contextual analysis
    • G06F8/433Dependency analysis; Data or control flow analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/45Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/456Parallelism detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30072Arrangements for executing specific machine instructions to perform conditional operations, e.g. using predicates or guards
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3808Instruction prefetching for instruction reuse, e.g. trace cache, branch target cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818Decoding for concurrent execution
    • G06F9/3822Parallel decoding, e.g. parallel decode units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838Dependency mechanisms, e.g. register scoreboarding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3842Speculative instruction execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3853Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution of compound instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3854Instruction completion, e.g. retiring, committing or graduating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3854Instruction completion, e.g. retiring, committing or graduating
    • G06F9/3858Result writeback, i.e. updating the architectural state or memory

Definitions

  • This invention relates broadly to increasing a computer's speed in executing instructions in a program.
  • In a modern computer system, a compiler translates a program written in a high level programming language (generally referred to as source code) into a lower level language program that physically executes on the computer system's processor (generally referred to as machine or object code).
  • In some computers or machines, in addition to the compiler, an assembler is provided to allow human programmers to produce object code. Other compilers also perform the functions of the assembler to produce the object code that can be passed directly to the processor.
  • One measure of the performance of the processor, e.g., a central processing unit (CPU)
  • the processor is termed one machine cycle or cycle.
  • Improvements in instruction execution rates of computers are achieved by circuit-level or technology improvements and organizational techniques, such as exploitation of instruction level parallelism (ILP), cache memories, out-of-order execution of instructions and multiple, concurrent execution units in the processor.
  • Increasingly popular is exploitation of ILP, or allowing multiple object code instructions which result from the compiler (and/or the assembler) to be issued and executed by the processor simultaneously in a single machine cycle.
  • Optimizing the object code is one means by which to exploit ILP.
  • An important optimization technique is determining dependencies (also referred to as hardware interlocks) between instructions contained in the code. Instructions are independent when, within a given number of instructions (or instruction sequence), the order of execution of the instructions does not affect the results.
  • Various methods for optimizing code are known in the art, as shown in A. Aho et al., Compilers: Principles, Techniques, and Tools, Addison-Wesley (1988), Chapter 10. Reference is made to this publication for further description of such optimization techniques.
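  • The notion of independence described above can be illustrated with a small register-level check. This is a minimal sketch, not the method of the specification: the `Instr` record, the `dependencies` function and the hazard labels (RAW, WAR, WAW, the conventional names for read-after-write, write-after-read and write-after-write) are assumptions introduced for illustration, and memory dependencies are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, several sources."""
    op: str
    dest: Optional[str]       # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]     # registers read


def dependencies(first: Instr, second: Instr) -> Set[str]:
    """Return the register hazards that force `second` to stay after `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")      # second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")      # second overwrites what first still reads
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")      # both write the same register
    return found


def independent(first: Instr, second: Instr) -> bool:
    """True when reordering the two instructions cannot change register results."""
    return not dependencies(first, second)
```

  • Two instructions for which `independent` returns true are candidates for issue in the same machine cycle; any non-empty hazard set means their order matters.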
  • the processor includes complex hardware which decodes the object code instructions to determine the dependencies between two or more instructions (up to m instructions, where m is an integer) at the time that such code is executed by the processor.
  • the hardware is referred to as dynamic checking logic.
  • Such processing is very complex because it is necessary to implement a large number of rules to determine whether instruction dependencies exist.
  • The execution units in Very Long Instruction Word (VLIW) machines require that the compiler permanently (or statically) preallocate instructions to the execution units of the processor.
  • the execution units are hardware in the processor which receive issued instructions and execute such instructions according to the operation specified by the instruction. For example, an instruction to multiply two operands can be sent to a multiply execution unit but cannot be sent to an arithmetic logic unit (also referred to as an ALU execution unit). However, an add instruction can be sent to either an ALU or a multiply execution unit, which can execute both operations.
  • the compiler preallocates the multiply instruction to a single execution unit, in this example, the multiply execution unit.
  • Our invention is directed to compiler controlled dynamic scheduling, or a system and method of dynamically storing multiple instruction dependencies which a compiler has prespecified. This is accomplished through the use of a single dep instruction, which instructs the processor hardware that the next m instructions (where m is an integer) associated with such dep instruction can be executed in parallel with one another.
  • the dep instruction, and instructions delimited by the dep instruction can be stored in a Multiple Issue Buffer (MIB) implemented in the processor.
  • MIB is a special storage buffer separate from, smaller and faster than the main memory of the computer. It can store the instructions to be executed in parallel separately from the main memory so that when such instructions are to be executed, the processor can retrieve them from the MIB rather than the main memory.
  • main memory need not include hardware (e.g., transmission lines) to accommodate sending multiple instructions at the same time because the MIB is accessed for that purpose.
  • our invention can be implemented in any organization which includes the inventive features described herein, including superscalar and VLIW architectures modified to include such inventive features.
  • An advantage of our invention is that the dep instruction encodes inter-instruction dependencies at the compiler level, or before processing by the processor, in order to alleviate hardware dependency checking.
  • the object code encoded with dep instructions is then extant in main memory of the computer. This eliminates the need for complex dynamic checking logic in the processor to determine hardware interlocks at execution time. Accordingly, the processor operation is simplified, therefore affords high performance, and can be operated with a reduction in power.
  • this advantage is magnified because the optimized code is permanent. Therefore, the dep instruction and subsequent delimited instructions need not be reencoded each time the instruction sequence is executed or the instruction sequence is used in another processor with a different number of execution units.
  • a further advantage of our invention is that the dep instruction does not preallocate the instructions delimited by it to a predetermined execution unit. Accordingly, in addition to supporting a high performance low power processor, the dep instruction does not limit the processor from allocating instructions to particular execution units at execution time. In this way, the gain in exploiting ILP at the compiler level can be maintained or further optimized by allowing the processor to allocate such instructions freely to the processor's execution units. Accordingly, prior to processing by the processor, the code is optimized to achieve the highest degree of ILP that can exist and to avoid the disadvantages of preallocating instructions to execution units.
  • an additional advantage of our invention is that the processor hardware can cache or store the object code instructions in the MIB, which is smaller and faster than the main storage of the computer. In this way, when the processor executes instructions associated with the dep instruction (referred to as the dep instruction packet), it can retrieve such packet from the MIB rather than from the main memory of the computer. This organization produces permanent, optimized preprocessing which can be available quickly to the processor.
  • the architecture of the MIB implemented according to our invention allows instructions of the instruction packet within the delimited buffer boundary to issue simultaneously if enough processor execution units are available. For example, for a packet including five instructions associated with the dep instruction, the processor may include five execution units in order to process such instruction packet.
  • the dep instruction can also include additional information or tags about the dep instruction packet to provide to the processor through the MIB. Such additional information can be used in the processor for additional optimization logic implemented at the processor level.
  • a further advantage of our invention is object code compatibility within multiple implementations of the same organization, or organizations which differ only in the number of execution units of the processor and have one or more execution units in common. For example, two organizations each provide multiply, ALU, load and store execution units and the second organization additionally provides another multiply execution unit.
  • If the dep instruction of the present invention is designed for the first organization, it can also be executed by the second organization, and vice versa.
  • the first organization can issue four instructions in parallel because it includes four execution units.
  • the second organization can issue five instructions in parallel.
  • Such execution on alternative first and second organizations is achieved because a dep instruction packet containing four instructions can be executed on both organizations. As to a dep instruction packet containing five instructions, it can execute on the second organization in parallel and on the first organization with four instructions in parallel followed by a single instruction in series.
  • our invention provides a high degree of flexibility and versatility in implementing the dep instruction by the processor for object code execution.
  • Such features provide maximum exploitation of ILP in such a manner that instruction execution rate increases are maintained at execution time because additional dynamic checking logic processing by the processor is unnecessary.
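  • The object code compatibility example above (a five-instruction packet issuing in one cycle on five execution units, or as four instructions plus one on four units) can be sketched as follows. This is an illustrative model only: the function name `issue_groups` and the opcode strings are assumptions, and the sketch treats every execution unit as able to accept every instruction, which a real organization would not guarantee.

```python
def issue_groups(packet, num_units):
    """Split one dep instruction packet into groups, one group issued per cycle.

    Simplifying assumption: any unit can execute any instruction, so only the
    number of execution units limits how many instructions issue together.
    """
    if num_units < 1:
        raise ValueError("the processor needs at least one execution unit")
    return [packet[i:i + num_units] for i in range(0, len(packet), num_units)]


# Hypothetical five-instruction packet; the opcodes are placeholders.
packet = ["load", "load", "mpy", "add", "store"]
first_org = issue_groups(packet, 4)    # four in parallel, then one in series
second_org = issue_groups(packet, 5)   # all five in parallel
```

  • The same packet is used unchanged for both organizations; only the grouping chosen at execution time differs.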
  • a program 2 provides source code as input to a compiler/preprocessor 3.
  • The compiler/preprocessor 3 performs both a compiler function and a preprocessing function in the illustrative embodiment of Fig. 1.
  • the compiler and preprocessor functions can be implemented by separate devices.
  • assembler operations could be performed by the compiler or separately by an assembler (not shown).
  • the compiler/preprocessor 3 examines the source code (code is also referred to as instructions within an instruction set architecture (ISA)) and identifies instruction dependencies which can be delimited by a dep instruction (shown in Fig. 2) in order to implement instruction level parallelism (ILP).
  • The compiler/preprocessor 3 uses a set of optimization rules 4 for this purpose.
  • the compiler/preprocessor 3 produces object code optimized by the inclusion of dep instructions in order to exploit ILP.
  • dep instructions are added as the first instruction of a packet of instructions delimited by it.
  • the instruction sequence containing the dep instruction and the instructions delimited by it are hereinafter referred to as the dep instruction packet (an example of which is shown in Fig. 2 as a dep instruction packet 11).
  • a device other than a compiler 3 and/or preprocessor 3 can implement the dep instructions.
  • the facility which implements the dep instructions can be a software facility implemented separately from the compiler, e.g., a post-compiler, or it can be a hardware facility in the form of a hardware preprocessor located between an architected storage area, for example, a cache (or a special storage buffer smaller and faster than the main storage of the machine 1; an example of a cache is a MIB) in the machine 1, and another subsystem of such architected storage area.
  • the output of the compiler/preprocessor 3 is object code which is compiled and optimized to include the dep instructions.
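  • The preprocessing step described above, in which a dep instruction is added as the first instruction of each packet of mutually independent instructions, might be sketched as a greedy pass over a straight-line instruction sequence. Everything here is an assumption made for illustration: the tuple layout `(op, dest, srcs)`, the `("dep", "indep", m)` header, the `max_m` limit, and the choice to leave single instructions without a header. The optimization rules 4 are not specified in this text and are not reproduced.

```python
def conflicts(earlier, later):
    """True if `later` has a register dependence on `earlier`."""
    _, e_dest, e_srcs = earlier
    _, l_dest, l_srcs = later
    if e_dest is not None and (e_dest in l_srcs or e_dest == l_dest):
        return True                      # read-after-write or write-after-write
    return l_dest is not None and l_dest in e_srcs   # write-after-read


def insert_dep(instrs, max_m=4):
    """Greedily group consecutive independent instructions into dep packets."""
    out, group = [], []

    def flush():
        if len(group) > 1:               # a header only pays off for 2+ instructions
            out.append(("dep", "indep", len(group)))
        out.extend(group)
        group.clear()

    for ins in instrs:
        if len(group) == max_m or any(conflicts(g, ins) for g in group):
            flush()                      # close the packet before a dependent instruction
        group.append(ins)
    flush()
    return out


# Hypothetical input: two loads feed a multiply, which feeds an add.
code = [
    ("load", "r0", ("base0", "offset0")),
    ("load", "r1", ("base1", "offset1")),
    ("mpy", "r2", ("r0", "r1")),
    ("add", "r3", ("r3", "r2")),
]
packed = insert_dep(code)
```

  • In this input only the two loads are independent of each other, so one two-instruction packet is formed and the multiply and add remain serial.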
  • The object code is then applied to a processor 5 (e.g., a central processing unit (CPU)), constructed according to the present invention, as described further below.
  • the processor 5 hardware then fetches and issues the instructions for execution based in part on the dep instructions.
  • An important advantage of our invention is the static or permanent implementation of the dep instruction during processing by the compiler/preprocessor 3. Accordingly, complex dynamic checking logic in the processor to determine hardware interlocks at execution time is unnecessary. This results in simplified, and therefore high performance, processing and a reduction in power required by the processor. Moreover, this advantage is magnified because the optimized code is permanent such that every time instructions associated with loop programming techniques are executed by the processor, the instructions already contain the information encoded within the dep instructions. Therefore, dep instruction encoding need not be refetched in the event that the instructions are reexecuted by the processor.
  • Assembly code is shown because object code, as a lower level language, appears as a stream of 0s and 1s whereas assembly code, as a higher level language, provides understandable notation and terms for ease of discussing the functionality of the code and, accordingly, the functionality of the dep instruction.
  • the dep instruction of the assembly code shown would be translated to object code for execution by the processor 5.
  • The Fig. 2 assembly code includes an exemplary dep instruction on line 1 for the instruction address location symbolically denoted by label and instructions delimited by such dep instruction on lines 2 to 6 (hereinafter referred to as the dep instruction packet 11).
  • Line 7 represents that any number of instructions may be included between line 6 and line 8.
  • Line 8 contains a branch instruction, which indicates to the processor 5 to execute the instruction at the instruction address location symbolically denoted by label, or line 1 of Fig. 2.
  • the branch instruction on line 8 is also referred to as a looping instruction because it returns to a previously defined instruction address in the code.
  • the branch would create a recursive loop.
  • The assembly code shown does not have a particular overall function; rather, it is a representation of several operations, e.g., add and multiply, performed on values loaded from a series of register files 22 to 25 (shown in Fig. 3), with the results also stored in the register files 22 to 25.
  • the dep instruction shown on line 1 contains information pertinent to how a sequence of instructions is to interact.
  • One form of interaction is for the instructions to be executed concurrently, as is shown in the dep instruction packet 11.
  • the label is the symbol name that refers to the instruction address at which the dep instruction is found.
  • The term "dep" indicates that it is a dep instruction.
  • The information between the parentheses, namely "indep", specifies the type of the dep instruction (additional types of dep instructions are described below). In this example, the type is independent.
  • the independent dep instruction is the primary type of dep instruction of our invention. It indicates to the hardware of the processor 5 that the next m instructions can be executed concurrently.
  • the instructions following the dep instruction on lines 2 to 8 include instruction types, namely, load, add, multiply (shown as mpy) and store instructions.
  • the names of the instruction types also indicate their functions.
  • the add instruction type performs an arithmetic addition operation.
  • the references to the right of the instructions, (e.g., the load instruction on line 2 is followed by "r0, base0, offset0") are pointers to addresses in the processor 5 main memory contained in register files 22 to 25 shown in Fig. 3. Such addresses contain data which the instruction operates on.
  • the "r0" indicates where data in main memory will be loaded to from an address calculated by base0 plus offset0.
  • Implementation of the dep instruction according to the present invention requires an optimizing compiler/preprocessor 3 (shown in Fig. 1) or a programmer to identify ILP opportunities within an instruction sequence of the program 2 (shown in Fig. 1).
  • A number of techniques which uncover ILP are known in the art. These include, for example, trace scheduling, percolation scheduling and software pipelining, as described in the following articles: C. Foster et al., Percolation of code to enhance parallel dispatching and execution, IEEE Transactions on Computers, C-21:1411-1415 (Dec. 1972); M. Lam, Software Pipelining: An effective scheduling technique for VLIW machines, In Proceedings of the SIGPLAN'88 Conference on Programming Language Design and Implementation, pp.
  • each version provides the same operations, namely multiplying the values of a series of 0 to N vectors (denoted by the variable i) and adding the results of such series of multiplication.
  • the functionality is shown in the first version written in the C programming language, which is a higher level language.
  • This higher level language code is then translated into two versions of assembly code, i.e., lower level languages.
  • the first assembly code version translates the C programming code without implementing the dep instruction and the second assembly code version translates the C programming code including implementing the dep instruction.
  • the C language code is shown as follows (and will hereinafter be referred to as the C code example; the lines shown can be contained within a larger loop in the program 2 and include a simplified inner loop):
  • the first assembly language version is the translation of the C code example to its assembly code equivalent where the compiler/preprocessor 3 does not optimize the code based on the dep instruction. It contains the same functionality as the C code example.
  • The following is hereinafter referred to as the non-dep code example:
  • This non-dep code example has the same functionality as the C code example above.
  • the operations are directed to the lower level processing implemented in assembly code necessary to execute the C code example, such as multiply and subtract operations on data from the register files 22 to 25.
  • move, load and store operations are shown.
  • a term "bne" is shown.
  • This term means a condition referred to as branch if not equal (i.e., the value contained in r0 is not equal to zero).
  • the instruction instructs the processor to go to the instruction address location symbolically denoted by the term "loop" shown next to bne and to execute the instruction at that address.
  • The second assembly language version is the translation of the C code example to its assembly code equivalent where the compiler/preprocessor 3 optimizes the code based on the dep instruction. It contains the same functionality as the C code example and the non-dep code example.
  • the following is hereinafter referred to as the dep code example:
  • the bind_branch indicates to the processor 5 hardware that all instructions within the dep instruction must execute prior to the branch taking effect, as further discussed below.
  • the bne means branch if not equal and is used to implement the loop.
  • The difference between the dep code example and the non-dep code example is that the dep code example implements greater optimization techniques than the non-dep code example. As a result, for multiple loop iterations, the dep code example can be executed in fewer machine cycles than the non-dep code example.
  • In the dep code example, where one instruction is executed per cycle, the first time the code sequence is executed, three additional cycles are required versus the non-dep code example. This is because three dep instructions have been added to the assembly code. While the dep instructions can initiate the execution of instructions in parallel, such dep instructions themselves are not executed in parallel. However, on the first execution of lines 8 to 12, they can be stored in a storage device, for example, a Multiple Issue Buffer (MIB) 26 (shown in Fig. 3) for fast parallel retrieval.
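  • The cycle trade-off described above can be made concrete with a rough count. This is a back-of-the-envelope sketch under loudly stated assumptions: the first pass executes one instruction per cycle and each dep instruction costs one extra cycle; every later pass retrieves each packet from the MIB and issues it in a single cycle (enough execution units are assumed); all loop instructions sit inside packets; and the packet sizes used below are hypothetical, since the code examples themselves are not reproduced in this text.

```python
def non_dep_cycles(instrs_per_iteration, iterations):
    """Serial execution: one instruction per cycle on every iteration."""
    return instrs_per_iteration * max(iterations, 0)


def dep_cycles(packet_sizes, iterations):
    """Cycle estimate for a loop whose body is split into dep packets.

    packet_sizes[i] is the number of instructions delimited by the i-th dep
    instruction (hypothetical values; see the assumptions in the lead-in).
    """
    if iterations <= 0:
        return 0
    # First pass: every instruction serially, plus one cycle per dep instruction.
    first_pass = sum(packet_sizes) + len(packet_sizes)
    # Later passes: each packet issues from the MIB in one cycle.
    later_pass = len(packet_sizes)
    return first_pass + later_pass * (iterations - 1)


sizes = [2, 3, 5]                       # three hypothetical packets, ten instructions
extra_first_pass = dep_cycles(sizes, 1) - non_dep_cycles(sum(sizes), 1)
```

  • With three dep instructions the first pass costs three extra cycles, matching the text; under these assumptions ten iterations take 40 cycles with dep packets against 100 without.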
  • The processor includes architected storage areas 21 to 26, execution units 27 to 36, a fetch 37, a decoder 38, an issue control 39 and a parallel decode 40.
  • The architected storage areas 21 to 26 are a main memory 21, a set of register files 22 to 25 and a MIB 26.
  • The register files are separate register devices which are grouped together in order to use common transmission lines for the input and the output of data to and from such files.
  • The register files 22 to 25 are a register file offset 22, a register file base 23, a register file r 24 and a register file f 25.
  • The execution units 27 to 36 are a branch unit 27, a branch unit 28, a load ALU 29, a store ALU 30, a data service unit (DSU) 31, a multiply (MPY) 32, an ALU 33, an ALU 34, an Fp unit 35 and an Fp unit 36.
  • the main memory 21 stores the instructions of the object code, which includes the dep instruction packet 11.
  • the object code can be stored in a separate memory storage area, such as, for example, a cache or a disk.
  • the register files 22 to 25 are storage devices with data intended for particular execution units 27 to 36.
  • The register file offset 22 corresponds to the branch units 27 and 28, the load ALU 29, the store ALU 30 and the DSU 31.
  • The register file base 23 corresponds to the load ALU 29, the store ALU 30 and the DSU 31.
  • The register file r 24 corresponds to the DSU 31, the MPY 32, the ALU 33 and the ALU 34.
  • The register file f 25 corresponds to the Fp unit 35 and the Fp unit 36.
  • the execution units 27 to 36 are logic devices which implement specific types of mathematical operations and are dedicated to these specific operations.
  • the processor 5 determines the operations indicated by the instruction and, based on the operation, which of the execution units 27 to 36 can implement the instruction.
  • the branch units 27 and 28 execute assembly program instructions that may branch to another instruction address.
  • the load ALU 29 loads a value into the r or f register files 24 or 25 to be used in arithmetic operations and the store ALU 30 stores register file contents to main memory.
  • the DSU 31 performs shift, bit manipulation and data permutation.
  • the MPY unit 32 performs multiplication and possibly arithmetic and logical functions.
  • the ALU units 33 and 34 perform arithmetic operations.
  • the Fp units 35 and 36 perform floating point operations.
  • the execution units 27 to 36 additionally communicate with the register files 22 to 25 which are used in the operation of such units 27 to 36.
  • The operations and the signals for such operations of the execution units 27 to 36 and register files 22 to 25 are standard, as shown in J. L. Hennessy, D. Goldberg and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann (2d Ed., Aug. 1995). Reference is made to this publication for a further description of execution units and register files.
  • the general operation of the processor 5 according to the present invention using the dep instruction packet 11 of Fig. 2 is as follows:
  • the fetch 37 fetches instructions from the main memory 21 based on the instruction address pointed to by an Instruction Address Register (not shown) contained in an Instruction Fetch Unit 37A (the IAR and Instruction Fetch Unit 37A are contained within the fetch 37; the Instruction Fetch Unit 37A is shown in Fig. 4).
  • the instruction is thereafter sent to the decoder 38, which in the illustrative embodiment is a serial decode unit used to determine the type of operation to be performed based on the instruction.
  • the instruction is then sent to the issue control 39.
  • the issue control 39 is responsible for comparing the IAR in the MIB 26 with the IAR in the fetch 37.
  • the instruction is a dep instruction and the IAR is cached in the MIB 26 along with the inter-instruction dependencies for transmittal to the parallel decode 40 which issues the instructions for parallel execution by the execution units 27 to 36.
  • the issue control 39 can also map program instructions, including instructions in the dep instruction packet 11, to execution units 27 to 36. The issue control 39 implements additional optimization of program instructions by performing this mapping function.
  • another logic device for example the parallel decode 40 alone or in combination with the issue control 39 (both of which are included in the processor 5) can perform this mapping function.
  • one or more logic devices located outside the processor 5 can perform the mapping function.
  • Fig. 3 also shows control and data signals between the main memory 21 and the fetch 37.
  • the execution controls (shown in Fig. 3 as exec ctls) are outputs from the parallel decode 40 and are used to control the execution units 27 to 36 (for ease of reference, the inputs of the execution controls to the units 27 to 36 are not shown).
  • the MIB 26 is further described with reference to Fig. 4, in which is shown the MIB 26, the instruction fetch unit 37A, the main memory 21, the issue control 39, the parallel decode 40 and a set of decode units 41 to 50 (for ease of reference, only units 41, 42 and 50 are shown).
  • Each of the decode units 41 to 50 is associated with one of the execution units 27 to 36.
  • Such units 41 to 50 further process the instructions before transmitting them for execution by the execution units 27 to 36. This additional processing is known in the art and therefore will not be further described herein.
  • the MIB 26 of the illustrative embodiment includes a series of storage areas, hereinafter referred to as records (for ease of reference, 3 records are shown in Fig. 4).
  • Each record within the MIB 26 can contain an IAR field 26A, a DEP field 26B, a Num field 26C, and instruction fields 26D (shown as Instr0 to Instr n, where n is the number of instructions within a given record of the MIB 26).
  • n is equal to the number of execution units 27 to 36 (or the value ten) in the processor 5 constructed according to the present invention.
  • the number of instruction fields 26D equals the number of execution units 27 to 36 because, during parallel execution of the dep instruction packet 11, where the number of delimited instructions is less than n such that not all of the execution units 27 to 36 are needed for parallel execution, each unused execution unit must receive a noop.
  • Noops, or no-operation instructions, are instructions indicating to the execution units 27 to 36 corresponding to such instruction fields 26D that no operations are to be performed. For those instruction fields 26D which do not contain an instruction from the dep instruction packet 11, the decoder 38 writes noops to such fields 26D.
  • Another important advantage of the present invention is that the noops are implemented in the instruction fields 26D of the MIB 26 rather than in the main memory 21. This allows for compressed instructions to be stored in the main memory 21. In addition, the number of transmission lines of the main memory 21 required to transmit an instruction sequence in the main memory 21 to the processor 5 is reduced where noops are not stored in the main memory 21. Finally, this feature of our invention allows the processor 5 to allocate instructions associated with the dep instruction packet 11 in the MIB 26 according to the optimal use of execution units 27 to 36, rather than based on a predetermined binding of instructions and noops. Once the processor 5 has allocated the dep instruction packet 11 to the MIB 26, the instruction fields 26D for which no instruction is written can contain noops or have noops written to them. Therefore, main memory 21 storage space and hardware are reduced and the processor 5 is able to maintain or further optimize the gain achieved by exploiting ILP by freely allocating the execution units 27 to 36.
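The record layout and noop padding described above can be sketched as follows. This is an illustrative Python model only: the class name `MIBRecord`, the helper `make_record`, and the use of strings for instructions are assumptions. For simplicity the sketch places the packet in the first fields, whereas the description lets the processor allocate instructions to whichever fields suit the execution units best.

```python
from dataclasses import dataclass, field

NUM_UNITS = 10    # execution units 27 to 36
NOOP = "noop"


@dataclass
class MIBRecord:
    iar: int      # IAR field 26A: instruction address under which the packet is cached
    dep: str      # DEP field 26B: dependency information, e.g. "indep"
    num: int      # Num field 26C: number of delimited instructions
    instrs: list = field(default_factory=list)   # instruction fields 26D


def make_record(iar, dep, packet):
    """Build a record whose unused instruction fields hold noops."""
    if len(packet) > NUM_UNITS:
        raise ValueError("packet exceeds the number of execution units")
    # Main memory holds only the real instructions; noops exist only in the buffer.
    padded = list(packet) + [NOOP] * (NUM_UNITS - len(packet))
    return MIBRecord(iar=iar, dep=dep, num=len(packet), instrs=padded)
```

A four-instruction packet therefore occupies four words in main memory but a full ten-field record in the buffer, which is the compression benefit noted above.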
  • the general operation of the MIB 26 according to the present invention using the dep instruction packet 11 of Fig. 2 is as follows:
  • the execution of the dep instruction causes the contents of the IAR (at the instruction address label) to be written into the MIB 26.
  • the processor 5 may sequentially execute the (indep) type dep instruction while writing the instruction concurrently into the MIB 26.
  • Another alternative is for the processor to fetch all the instructions delimited by the dep instruction and then issue them in parallel.
  • the branch label instruction is executed, the IAR is found in the IAR field 26A of the MIB 26 and all four instructions are issued in parallel.
  • the instructions are written into the MIB 26 in much the same manner as a trace cache, as described in Rotenberg et al., Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching, Proceedings of the 29th Annual International Symposium on Microarchitecture, pp. 24-34 (Dec. 1996). Later, when the object code branches to an instruction address that is within the MIB 26, the machine can issue all of the instructions in parallel subject to the constraints of the stored dependency information.
  • the processor 5 can issue two of the instructions of the dep instruction packet 11 in parallel followed by the remaining two instructions in series.
  • the Num field 26C specifies how to increment the IAR in the fetch 37 so that the original instruction sequence of the dep instruction packet 11 is preserved.
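The lookup behavior described above can be sketched as a single fetch step. This Python fragment is a simplified assumption rather than the disclosed logic: it models the MIB as a dictionary keyed by instruction address, treats every instruction as occupying one address, and assumes the IAR advances past the dep instruction and its `num` delimited instructions on a hit. The actual increment depends on the instruction encoding.

```python
def fetch_step(mib, iar):
    """Return (instructions issued in parallel, next IAR) for one fetch.

    mib maps an instruction address to a (num, instrs) pair.
    """
    if iar in mib:
        # Hit: the whole cached packet is issued in parallel.
        num, instrs = mib[iar]
        # Num tells the fetch how far to advance so program order is preserved.
        return instrs, iar + 1 + num
    # Miss: nothing is issued from the buffer; fetch proceeds serially.
    return None, iar + 1
```

With a four-instruction packet cached at address `0x40`, a branch to `0x40` issues all four instructions at once and resumes fetching at `0x45` under these assumptions.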
  • Fig. 5 depicts an implementation of a MIB 26' according to an alternative embodiment in which there are no noops stored in the instruction fields 26D'.
  • the devices of Fig. 5 are identical to those of Fig. 4 and have been assigned identical reference characters.
  • In the MIB 26', the difference is that the instruction fields 26D' (i.e., instr0 to instr n) are located in a storage device separate from the MIB 26'.
  • the MIB 26' is generally referred to as a two-level buffer and requires the storage of additional information.
  • an address field 26E' (shown as Addr in this figure) stores the memory address of the first delimited instruction within the dep instruction packet 11. This has the advantage of rendering unnecessary additional noop storage space in the MIB 26.
  • the Num field 26C specifies how many instructions to read from the instruction fields 26D' and additionally how to increment the IAR field 26A.
  • a possible requirement of the MIB 26' embodiment of the present invention is the defragmentation of memory. Because the number of instructions stored at each address field 26E' is variable, it is possible that the memory may become fragmented. This may cause unnecessary evictions from the cache or require occasional compaction. Defragmentation is related to a constraint of architected storage spaces and is known in the art. Therefore, it will not be described further herein.
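The two-level arrangement can be illustrated with a short Python sketch. The class and method names are hypothetical, and the sketch assumes that the address field points into the separate instruction store and that new packets are simply appended. It does not model eviction, so the fragmentation issue noted above does not arise here.

```python
class TwoLevelMIB:
    """First level holds (Addr, Num) per IAR; second level holds the instructions."""

    def __init__(self):
        self.records = {}   # IAR -> (addr, num), modelling fields 26E' and 26C
        self.store = []     # separate storage for instruction fields 26D'

    def insert(self, iar, packet):
        addr = len(self.store)           # address of the first delimited instruction
        self.store.extend(packet)        # variable length, no noop padding
        self.records[iar] = (addr, len(packet))

    def lookup(self, iar):
        if iar not in self.records:
            return None
        addr, num = self.records[iar]
        # Num specifies how many instructions to read from the store.
        return self.store[addr:addr + num]
```

Because packets of two and three instructions occupy exactly five store entries, no space is spent on noops, at the cost of the extra address field per record.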
  • the MIB 26' instruction fields 26D' can be physically executed by the hardware of the processor 5 (shown in Fig. 3).
  • the issue control 39' logic is responsible for assembling the object code instructions into executable packets, allocating execution units 27 to 36 (as shown in Fig. 3) and routing the instructions over a number of cycles.
  • Because the MIB 26 and the MIB 26' shown in Figs. 4 and 5, respectively, operate as a cache, only a limited number of instructions can be stored at a given time. Accordingly, there are numerous known methods for systematically clearing instructions from the MIB in order to allow for the storage of other instructions. However, it is possible to introduce some additional operations which help to ensure real-time behavior. (For purposes of clearing instructions, the MIB 26 and the MIB 26' can be used interchangeably.) In particular, specific instructions delimited by the dep instruction can be "locked" within the MIB 26 or 26'. This can avoid a potential thrashing condition (as is known in the art and will not be further described herein) on implementations that have restricted numbers of entries.
  • a lock-MIB operation may be specified in a number of ways.
  • a separate instruction can lock the contents of the entire MIB in place until a separate unlock_mib instruction is executed. Under program control, it can also specify that the current contents cannot be evicted but additional locations can be free to cache the dep instructions in the MIB.
  • a bit (or a binary 0 or 1) in the dep instruction can lock an individual record of the MIB. If no MIB is present, this bit is ignored. Additionally, if the MIB is full and all the records are currently locked, the bit is ignored. This can have a negative impact on performance.
  • This instruction sequence at line 4 loads the instruction address label into the register file r24 at location 30. Because the MIB 26 stores the dep instruction packet 11 by the IAR, an instruction can be unlocked by referencing the IAR at which it is stored in the IAR field 26A of the MIB 26.
  • Another important operation for use with a single-level MIB, or the MIB 26, is a flush_mib. This instruction clears the contents of the MIB 26 and sets all instructions in the instruction fields 26D to noops. In this way, noops are extant in each instruction field 26D such that they do not need to be written again when the dep instruction packet contains fewer than n instructions.
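The per-record lock bit, unlocking by IAR, and flush operations can be modelled as follows. This Python sketch makes several assumptions not stated in the description: the class name `LockableMIB`, eviction of the oldest unlocked record, and a flush that also clears all locks. It is meant only to show how a full, fully locked buffer causes a new lock request to be ignored.

```python
class LockableMIB:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = {}     # IAR -> packet, kept in insertion order
        self.locked = set()   # IARs whose records may not be evicted

    def insert(self, iar, packet, lock=False):
        """Cache a packet; return False if the MIB is full and every record is locked."""
        if iar not in self.records and len(self.records) >= self.capacity:
            # Assumed policy: evict the oldest record that is not locked.
            victim = next((a for a in self.records if a not in self.locked), None)
            if victim is None:
                return False  # lock bit ignored; the packet is not cached
            del self.records[victim]
        self.records[iar] = packet
        if lock:
            self.locked.add(iar)
        return True

    def unlock(self, iar):
        """Unlock a record by referencing the IAR under which it is stored."""
        self.locked.discard(iar)

    def flush(self):
        """Model of flush_mib: clear all records (and, by assumption, all locks)."""
        self.records.clear()
        self.locked.clear()
```

In a two-entry buffer with both records locked, a third insert fails until one record is unlocked, which is the performance penalty mentioned above.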
  • the dep instruction according to the present invention can be used even if there is no MIB 26 (or any type of cache) present.
  • all that is required is an instruction bandwidth, which is a storage area in a device of the processor 5 capable of storing the instructions to be issued in parallel. If the fetch 37 can hold the instructions delimited by the dep instruction, then the instruction bandwidth would still be reduced. In this case, all instructions which operate on the MIB 26 or any type of cache (e.g., lock_mib and flush_mib) are ignored.
  • additional types of dep instructions can be used alone or in addition to the independent type of dep instructions.
  • One such additional type is the concurrent type. This type indicates to the processor 5 (shown in Fig. 3) that the delimited instructions should appear to be issued concurrently. This affects the values read from the register files 22 to 25 which are used for implementing such instructions. Rather than receiving the updated values (as viewed from the serial instruction sequence), the pertinent register files 22 to 25 receive the value contained in the pertinent register files 22 to 25 prior to the dep instruction. Accordingly, where each of multiple instructions affects overlapping addresses in the register files 22 to 25 and requires the values in such addresses prior to either instruction being executed, the concurrent type of dep instruction is used. An example of this is a swap.
  • a temporary register must be used to perform a swap operation.
  • a temporary register r3 is established to store one of the values in either r0 or r1 in order to ensure that the values prior to instruction execution are swapped, rather than overwritten values.
  • a swap can be accomplished as:
  • the temporary register r3 is not needed. Rather, the swap operation is executed using two separate execution units which have separately loaded the values of r0 and r1 prior to any instruction execution. In this way, the values written as a result of the operations are based on the original values in the r0 and r1 registers rather than any overwritten values.
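The difference between serial and concurrent semantics for a swap can be demonstrated with a small register model. This Python sketch is illustrative only: the representation of a move as a `(dst, src)` pair and the function names are assumptions, and it does not reproduce the instruction listings of the original example.

```python
def run_serial(regs, moves):
    """Each (dst, src) move sees the results of the moves before it."""
    regs = dict(regs)
    for dst, src in moves:
        regs[dst] = regs[src]
    return regs


def run_concurrent(regs, moves):
    """All source operands are read before any destination is written."""
    values = [regs[src] for _, src in moves]   # values prior to the dep instruction
    regs = dict(regs)
    for (dst, _), value in zip(moves, values):
        regs[dst] = value
    return regs


regs = {"r0": 1, "r1": 2}
swap = [("r0", "r1"), ("r1", "r0")]
```

Under serial semantics the two moves leave both registers holding the original value of r1, which is why a temporary register is otherwise needed. Under concurrent semantics the same two moves exchange r0 and r1.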
  • Another type of dep instruction for use in alternative embodiments is the bind_branch type dep instruction. This informs the processor 5 hardware that all of the instructions can issue in parallel but the branch instruction may not execute until all other instructions within the dep instruction packet have completed execution. For a processor 5 with enough resources to execute all of the delimited instructions in a single cycle, this is equivalent to an (indep) type dep instruction. However, for a processor 5 which requires multiple cycles to execute an entire dep instruction packet, it is necessary to delay the effects of the branch until all the instructions within the packet have executed.
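One way to picture the bind_branch constraint is as a grouping of the packet into issue cycles with the branch held until the final cycle. The Python sketch below is an assumption-laden illustration: the function name, the `width` parameter (instructions issued per cycle) and the separation of the branch from the other instructions are all hypothetical, and execution unit types are ignored.

```python
def schedule_bind_branch(others, branch, width):
    """Group instructions into issue cycles of at most `width`, branch last."""
    cycles = [others[i:i + width] for i in range(0, len(others), width)]
    if cycles and len(cycles[-1]) < width:
        # A free slot remains in the final cycle, so the branch shares it.
        cycles[-1] = cycles[-1] + [branch]
    else:
        # No free slot (or no other instructions): the branch gets its own cycle.
        cycles.append([branch])
    return cycles
```

When `width` is large enough for the whole packet, the result is a single cycle, matching the statement that bind_branch is then equivalent to the (indep) type. With a narrower machine the branch is pushed to the last cycle.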
  • A branch prediction dep instruction specifies the equivalent of the bind_branch type except that the processor 5 hardware also statically predicts that the branch will be taken, for example:
  • the bne or branch function (as described above) is used in this example.
  • the instruction instructs the processor to go to the instruction address shown next to bne, namely the symbolic address label, and to execute the instruction at that address, namely the prod_taken dep instruction.
  • the prod_taken dep instruction type allows the processor 5 hardware to begin fetching instructions from the address label at the earliest possible stage of the processor and the pipeline (not shown) which can accommodate such instruction.
  • a standard processor 5 pipeline includes four stages, namely fetch, decode, execute and write back while a high performance processor 5 can include more than four stages such that the number of stages is greatly increased.
  • the processing of instructions occurs during a particular stage of the processor 5.
  • the operations of the processor 5 as to stages and processing instructions is known in the art and will not be discussed further herein.
  • Using the branch prediction type, instructions are processed in a shorter cycle time than under normal processing by the processor 5.
  • a speculative operation type involves executing a series of instructions in order to use the processor 5 at the earliest possible time it is available to execute an instruction but waiting until the outcome of a condition is known (e.g., based on a branch instruction) to store the results of such execution (also referred to as committing results). For example, in an exemplary speculative operation, while the series of instructions is executed, the results will be stored only if the branch condition is met. In this way, the processor 5 is used at maximum efficiency with the expectation that the outcome of the condition will enable storage of instruction executions (depending on how the speculative operation type is set up, whether meeting the condition enables storing the results or not).
  • the outcome of the condition can also result in rendering the results of such execution moot such that the results are discarded.
  • the operation is speculative because there is a chance that the results of execution of a series of instructions may be discarded. However, efficiencies can be gained when the results are usable. More particularly, with the speculative operation type, the results are not committed until the outcome of some condition is known. In some cases, speculative operations allow the processor 5 hardware to optimize the utilization of the execution units 27 to 36. However, it can require that some results be discarded. For example: This dep instruction packet specifies that the entire delimited packet is a speculative operation.
  • The condition is a branch instruction and, if the branch is not taken, the instruction sequence is stored in the register files 22 to 25 for execution (also referred to as committing an instruction). Otherwise, if the branch is taken, the results are discarded.
  • This type of operation is particularly important for store instructions because it addresses the difficult problem of moving store instructions above branches when attempting to issue a large number of parallel instructions.
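The commit-or-discard behavior of the speculative type can be sketched with a shadow copy of the register state. This Python example is a minimal sketch under stated assumptions: the function name, the `(dst, fn, srcs)` instruction encoding and the use of a dictionary for registers are hypothetical, and results are committed when the branch is not taken, as in the example described above.

```python
import operator


def run_speculative(regs, packet, branch_taken):
    """Execute a packet into a shadow copy; commit only if the branch is not taken."""
    shadow = dict(regs)   # stands in for architecturally invisible storage
    for dst, fn, srcs in packet:
        shadow[dst] = fn(*(shadow[s] for s in srcs))
    # Branch taken: discard the speculative results and keep the old state.
    return regs if branch_taken else shadow


regs = {"r0": 1, "r1": 2, "r2": 0}
packet = [("r2", operator.add, ("r0", "r1"))]
```

Calling `run_speculative(regs, packet, branch_taken=False)` yields a state in which r2 holds the sum, while `branch_taken=True` returns the original state unchanged. The same pattern applies to stores, where the buffered write is released to memory only after the condition resolves.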
  • the issue control 39 may have to handle complex issue strategies and may need to contain enough architecturally invisible registers 22 to 27 to hold all intermediate computations. This is particularly true of the concurrent type dep instruction.
  • Such issue strategies and organization for the processor 5 are known in the art and therefore will not be further discussed herein.
  • the dep instruction types can be used alone or in combination with one or more of any other such type. The use and combination of the dep instruction types is a matter of design preference and does not limit the present invention.
  • An advantage of our invention is object code compatibility within multiple implementations of the same organization, or organizations which differ in the number of execution units of the processor and have one or more execution units in common. For example, two organizations each provide multiply, ALU, load and store execution units and the second organization additionally provides another multiply execution unit.
  • If the dep instruction of the present invention is compiled for the first organization, it can also be executed by the second organization without recompilation, and vice versa.
  • the first organization can issue four instructions in parallel because it includes four execution units.
  • the second organization can issue five instructions in parallel.
  • Such execution on alternative first and second organizations is achieved because a dep instruction packet containing four instructions can be executed on both organizations.
  • A dep instruction packet containing five instructions can execute on the second organization in parallel, and on the first organization with four instructions in parallel followed by a single instruction in series, without having to recompile the object code.
  • the performance time for each of the organizations to execute the dep instruction packet can differ based on different performance times for parallel processing of all instructions in the packet compared to parallel processing of some instructions followed by serial processing of the remainder in the packet.
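The compatibility argument above can be checked with a simple grouping function. This Python sketch is deliberately simplified: the function name is hypothetical and it considers only the number of execution units, not which unit type each instruction needs.

```python
def issue_groups(packet, num_units):
    """Split a dep packet into successive groups that fit the available units."""
    return [packet[i:i + num_units] for i in range(0, len(packet), num_units)]


five = ["i0", "i1", "i2", "i3", "i4"]
first_org = issue_groups(five, 4)    # four execution units
second_org = issue_groups(five, 5)   # five execution units
```

The same five-instruction packet forms one group on the second organization and two groups (four instructions, then one) on the first, so the object code runs unchanged on both and only the execution time differs.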
  • dep instruction packets and the dep instructions shown herein are merely exemplary of types of dep instructions and are not required alone or in combination.
  • packets and instructions as well as the processor organizations shown herein are merely exemplary of dep instructions and processor organizations which exploit ILP.
  • processors which can be constructed according to the present invention.

EP98310063A 1997-12-16 1998-12-08 Compilergesteuerte dynamische Ablauffolgeplanung von Programmbefehlen Withdrawn EP0924603A3 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US99711797A 1997-12-16 1997-12-16
US997117 1997-12-16

Publications (2)

Publication Number Publication Date
EP0924603A2 true EP0924603A2 (de) 1999-06-23
EP0924603A3 EP0924603A3 (de) 2001-02-07

Family

ID=25543667

Family Applications (1)

Application Number Title Priority Date Filing Date
EP98310063A Withdrawn EP0924603A3 (de) 1997-12-16 1998-12-08 Compilergesteuerte dynamische Ablauffolgeplanung von Programmbefehlen

Country Status (2)

Country Link
EP (1) EP0924603A3 (de)
JP (1) JPH11242599A (de)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0455966A2 (de) * 1990-05-10 1991-11-13 International Business Machines Corporation Vorverarbeitungsprozessor zur Verbindung von Befehlen für einen Cache-Speicher
EP0652509A1 (de) * 1993-11-05 1995-05-10 Intergraph Corporation Befehlscachespeicher mit Kreuzschienenschalter
US5504932A (en) * 1990-05-04 1996-04-02 International Business Machines Corporation System for executing scalar instructions in parallel based on control bits appended by compounding decoder
WO2000038059A2 (en) * 1998-12-23 2000-06-29 Cray Inc. Method and system for calculating instruction lookahead

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002027479A1 (en) * 2000-09-27 2002-04-04 University Of Bristol Computer instructions
WO2003046712A2 (en) 2001-11-26 2003-06-05 Koninklijke Philips Electronics N.V. Wlim architecture with power down instruction
WO2003046712A3 (en) * 2001-11-26 2003-10-02 Koninkl Philips Electronics Nv Wlim architecture with power down instruction
US8141068B1 (en) * 2002-06-18 2012-03-20 Hewlett-Packard Development Company, L.P. Compiler with flexible scheduling
EP1378825A1 (de) * 2002-07-02 2004-01-07 STMicroelectronics S.r.l. Verfahren zur Ausführung von Programmen in einem Prozessor mit auswählbaren Befehlslängen, und entsprechendes Prozessorsystem
US7395532B2 (en) 2002-07-02 2008-07-01 Stmicroelectronics S.R.L. Process for running programs on processors and corresponding processor system
US7617494B2 (en) 2002-07-02 2009-11-10 Stmicroelectronics S.R.L. Process for running programs with selectable instruction length processors and corresponding processor system
US8176478B2 (en) 2002-07-02 2012-05-08 Stmicroelectronics S.R.L Process for running programs on processors and corresponding processor system
GB2411025A (en) * 2003-09-10 2005-08-17 Hewlett Packard Development Co Compiler which inserts diagnostic instructions to be executed by idle functional units
US7206969B2 (en) 2003-09-10 2007-04-17 Hewlett-Packard Development Company, L.P. Opportunistic pattern-based CPU functional testing
CN109313554A (zh) * 2016-05-27 2019-02-05 Arm有限公司 用于在非均匀计算装置中进行调度的方法和设备
CN109313554B (zh) * 2016-05-27 2023-03-07 Arm有限公司 用于在非均匀计算装置中进行调度的方法和设备

Also Published As

Publication number Publication date
JPH11242599A (ja) 1999-09-07
EP0924603A3 (de) 2001-02-07

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): DE FR GB

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

17P Request for examination filed

Effective date: 20010720

AKX Designation fees paid

Free format text: DE FR GB

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20020702