WO1998006039A1 - Disambiguation memory circuit and operating method - Google Patents

Disambiguation memory circuit and operating method

Info

Publication number
WO1998006039A1
WO1998006039A1 PCT/RU1996/000215 RU9600215W WO9806039A1 WO 1998006039 A1 WO1998006039 A1 WO 1998006039A1 RU 9600215 W RU9600215 W RU 9600215W WO 9806039 A1 WO9806039 A1 WO 9806039A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
dam
loop
register
registers
Prior art date
Application number
PCT/RU1996/000215
Other languages
English (en)
Inventor
Boris Artashesovich Babayan
Feodor Anatolievich Gruzdov
Yuli Khanaanovich Sakhin
Vladimir Jurievich Volkonski
Zinaida Nikolaevna Zaitzeva
Mikhail Viktorovich Laptev
Original Assignee
Sun Microsystems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Microsystems, Inc.
Priority to PCT/RU1996/000215
Publication of WO1998006039A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30072 Arrangements for executing specific machine instructions to perform conditional operations, e.g. using predicates or guards
    • G06F9/30076 Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/30094 Condition code generation, e.g. Carry, Zero flag
    • G06F9/30098 Register arrangements
    • G06F9/3012 Organisation of register space, e.g. banked or distributed register file
    • G06F9/3013 Organisation of register space, e.g. banked or distributed register file according to data content, e.g. floating-point registers, address registers
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 Operand accessing
    • G06F9/3834 Maintaining memory consistency
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838 Dependency mechanisms, e.g. register scoreboarding
    • G06F9/3842 Speculative instruction execution
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3893 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator

Definitions

  • the present invention relates to processors and computing devices. More specifically, the present invention relates to a method and apparatus for handling memory aliasing arising from potential dependencies between memory operations.
  • VLIW very long instruction word
  • superscalar architecture processors achieve efficiency by performing multiple operations in parallel.
  • Parallelism is limited, however, by memory aliasing arising from potential dependencies between memory operations.
  • a load operation is potentially dependent upon all preceding stores so that the load is not to be rescheduled to execute before the store operations.
  • the latency of a load operation must be considered to prevent aliasing.
  • One technique for solving the problem of memory operation dependencies is compile-time memory disambiguation.
  • compile-time disambiguation is a difficult problem since operational information is lacking during compilation.
  • Another technique is run-time disambiguation, which allows scheduling of a load before potentially aliasing stores even if conclusive aliasing information is not available at compile time.
  • One run-time disambiguation method is disclosed in "HPL PlayDoh Architecture Specification: Version 1.0" by V. Kathail et al. PlayDoh run-time disambiguation is attained at an architectural level by support of three related operation families: data speculative load (LDS), data verify load (LDV) and data verify branch (BRDV). Run-time disambiguation is attained through usage of an LDS operation in conjunction with either an LDV operation or a BRDV operation.
  • a load is scheduled before potentially aliasing stores using an LDS-LDV operation pair in which both operations specify the same memory address, the same destination register and the same width modifier.
  • an LDS-BRDV pair schedules not only a load but also the operations that depend upon the loaded value before potentially aliasing stores.
  • the BRDV operation is used to branch to a compiler-generated compensation code.
  • PlayDoh run-time disambiguation employs an abstract concept of an LDS log, which may be implemented in various structures.
  • the LDS log records information relating to a previously-issued subset of LDS operations.
  • the LDS log includes a plurality of entries, each entry including at least a target register field, an address field and a construct for marking an entry as valid or invalid.
  • the target register field contains the register loaded by the operation.
  • the address field contains either the memory address referenced by the operation or a "syndrome" derived from the address.
  • Operations that access the LDS log include LDS, LDV, and BRDV operations and also store operations.
  • the PlayDoh run time disambiguation employs a content addressable memory for disambiguation memory addressing.
  • the addresses are placed by speculative load operations and associatively accessed by check operations.
  • load operations typically cause a large number of aliases among loop store operations.
  • the checking of multiple addresses in the disambiguation memory is prohibitive in a content addressable memory due to the large amount of circuitry necessary for acquiring multiple destination addresses and concurrently performing a multiple associative search of the addresses.
  • a major failing of conventional disambiguation memory handling is that load-store dependency handling is typically achieved only for linear program code. Dependency handling for loop operations is very difficult or impossible for conventional implementations.
  • a disambiguation memory incorporates checking of load/store aliasing for a plurality of destination addresses into a single wide instruction operating in a very long instruction word (VLIW) processor.
  • the disambiguation memory utilizes direct addressing and a compiler-generated check mask to check multiple destination addresses concurrently in a compact circuit size.
  • software-pipelined loops are used by a compiler to accelerate calculations.
  • Several logical loop iterations are typically executed simultaneously, potentially causing a large number of address dependencies that cannot be resolved at compile-time.
  • a particular loop operation has the same static destination address, which is known by the compiler, in all loop iterations.
  • dynamic destination addresses change as successive iterations are initiated.
  • the dynamic addresses are not available at compile time.
  • the disambiguation memory uses based addressing to access dynamic destination addresses, thereby accessing rotating regions of the disambiguation memory. Accordingly, the disambiguation memory accesses dynamic destination addresses for software pipelined loop operation in a compact disambiguation memory size.
  • a disambiguation memory that temporarily stores a source memory address of a speculative load instruction in a disambiguation memory register having a position determined by the location of a destination register of the speculative load instruction.
  • the disambiguation memory is selectably based addressed and direct addressed.
  • store operations issued to a source register of the speculative load instruction are indicative of a load-store scheduling fault and are logged by setting a "violation" bit in the disambiguation memory register designated by both the source operand of the speculative load instruction and the destination operand of the store instruction.
  • a check disambiguation memory instruction clears the disambiguation memory but, if a violation bit is set, diverts program control to a compiler-generated recovery code.
  • the recovery code executes as directed by the violation bits in the disambiguation memory registers to variously recover from multiple simultaneous load-store faults that may occur during loop processing.
  • the described system and method achieve several advantages.
  • One advantage is that potential compile-time load-store dependencies are efficiently addressed not only for linear execution, but also for operating in a loop body.
  • Another advantage is that direct addressing in the disambiguation memory results in a highly compact circuit size.
  • a further advantage is that based addressing allows the disambiguation memory to be implemented in a structure having many fewer registers while still efficiently handling pipelined loop program code.
  • FIGURE 1 is a schematic block diagram showing a central processing unit (CPU) which includes a disambiguation memory in accordance with an embodiment of the present invention.
  • CPU central processing unit
  • FIGURE 2 is a pictorial illustration of loop scheduling of a simple inner loop in accordance with the teachings of the present invention.
  • FIGURE 3 is a block diagram of loop control logic constructed in accordance with the teachings of the present invention.
  • FIGURE 4 is a pictorial illustration of control transfer logic constructed in accordance with the teachings of the present invention.
  • FIGURE 5 is a schematic block diagram showing a disambiguation memory including a plurality of associative disambiguation memory (DAM) registers.
  • DAM associative disambiguation memory
  • FIGURE 6 is a flow chart showing the operation of a disambiguation memory in accordance with an embodiment of the present invention.
  • a Central Processor Unit (CPU) 100 has a wide instruction word architecture and uses Instruction Level Parallelism (ILP) to ensure high performance.
  • the CPU compiler is used to plan the operations to be executed by the CPU 100 in each cycle.
  • the processor structure allows concurrent execution of a few simple independent instructions (operations) that constitute a wide instruction (load, store, add, multiply, divide, shift, logical, branch, etc.).
  • Wide instructions are stored in a memory 111 connected to CPU 100 in packed form as sets of 16-bit and 32-bit syllables. Particular operations can occupy part of a syllable, a whole syllable or several syllables.
  • the EU 150 operation execution time is one cycle for integer and logic operations, two cycles for floating point addition, three or four cycles for floating point multiplication, seven cycles for word format division and ten to eleven cycles for two-word format, normalized operands. All operations except division may be run in every cycle. Division may be run every other cycle.
  • the Central Processor Unit 100 contains an Instruction Buffer (IB) 110, a Control Unit (CU) 120, a multiport Predicate File (PF) 131, a multiport Register File (RF) 130, a Calculate Condition Unit (CCU) 133, a Data Cache (DCACHE) 180, four Arithmetic Logic Units (ALU0 - ALU3) generally identified as 140, each of which includes a plurality of execution units (EUs) which are shown generally as EU 150, an Array Prefetch Buffer (APB) 135, four Array Access Channels (AAC0 - AAC3) generally identified as 160, a Memory Management Unit (MMU) 190, and Memory Access Unit (MAU) 170.
  • the combination of wide instruction operation and a large number of execution units 150 allows several alternative program branches to execute concurrently in a speculative mode.
  • the Instruction Buffer (IB) 110 fetches wide instructions from memory 111 and includes an instruction buffer memory, instruction alignment logic, a program counter register (PC) 116, control transfer preparation registers (CTPR1 113 and CTPR2 114), a control transfer execution register (CTER 115), and the instruction cache (ICACHE) 182.
  • the instruction buffer memory is filled in response to both linear program path prefetches and control transfer preparation instructions.
  • the Instruction Buffer (IB) 110 contains 2048 64-bit words and is divided into sixteen sectors. Program code is stored in virtual memory (not shown) which is common with data code storage.
  • IB 110 has a separate Instruction Translate Lookaside Buffer (ITLB) 117 with 32 entries.
  • IB 110 filling is initiated by hardware for the direct way when direct way code is exhausted in IB 110, and by a program when a prepare control transfer operation is executed. IB 110 performs program code filling for three branches. In the case of an IB 110 miss, the program code is loaded from memory 111 by four memory access channels in parallel (four 64-bit words simultaneously). IB 110 and Control Unit (CU) 120 read from IB 110 and dispatch a maximum-size wide instruction (eight 64-bit words) every cycle.
  • the control unit (CU) 120 generates wide instructions in an unpacked form, transforms indirect based operand addresses of a wide instruction to absolute addresses in a register file 130, and checks the conditions of the wide instruction issue. The wide instruction issue conditions that are checked include the absence of exceptions, the absence of interlock conditions from other units of CPU 100, and the availability of operands in the register file (RF) 130.
  • the Control Unit (CU) 120 issues wide instruction operations for execution and performs several tasks including reading of up to ten operands from the register file (RF) 130 to ALU0 - ALU3 140, reading up to three predicate values from the Predicate File (PF) 131 to Control Unit (CU) 120 as condition code for control transfer operations, reading up to eight predicate values from the Predicate File (PF) 131 to the Calculate Condition Unit (CCU) 133 for calculation of new predicate values and generation of a mask for conditional execution of operations in ALU0 - ALU3 140 and AAC0 - AAC3 160, issuing literal values to ALU0 - ALU3 140 and AAC0 - AAC3 160, issuing up to four operations to ALU0 - ALU3 140, issuing up to four operations to AAC0 - AAC3 160, and issuing up to four operations to the Calculate Condition Unit (CCU) 133.
  • the Control Unit (CU) 120 also issues a prepare control transfer operation to Control Unit (CU) 120 and checks for the possibility of the execution of three control transfer operations in Control Unit (CU) 120.
  • the control unit 120 receives an "H-syllable" of an instruction word, transforms operand addresses from the instruction that are base-relative into effective register file addresses, and checks conditions of the next instruction delivery from an unpacked instruction register (not shown) to an execution unit 150.
  • the control unit 120 also executes control transfer operations (CTOPs) and includes loop parameter and status registers 124 such as a loop parameters register (LPR) and loop state registers (LSR1 and LSR2).
  • the Predicate File (PF) 131 is a storage of predicate values generated by integer and floating point compare operations. Predicate values are used to control the conditional execution of operations.
  • the Predicate File (PF) 131 contains 32 two-bit registers.
  • the Calculate Condition Unit (CCU) 133 generates a mask for the conditional execution of ALUi 140 and AACi 160 operations and calculates values of the secondary predicates as a function of the primary predicates.
  • the Register File (RF) 130 contains 256 66-bit registers and has ten read ports and eight write ports. All ten read ports are used to read ALU 140 operands, and two of the read ports are also used to read store values for the Data Cache (DCACHE) 180 and the Memory Management Unit (MMU) 190. Four write ports are used to write ALU results and the other four write ports are used to write values loaded from memory.
  • the register file 130 accesses the 256 66-bit registers using four address bases (CWP, CWPAR, BR1 and BR2). Each base addresses up to 64 registers.
  • ALU0 - ALU3 140 are four parallel execution channels with nearly the same sets of arithmetic and logic operations. ALU1 and ALU3 are used to calculate addresses of scalar memory accesses. All ALUs receive operands from the register file (RF) 130 and bypass buses 142. The bypass reduces the time needed to deliver ALU operation results to subsequent operations. ALU0 and ALU2 receive two operands, and ALU1 and ALU3 receive three operands for execution of combined three-argument operations. ALU 140 operation results are written to the register file (RF) 130 through four RF write channels.
  • the Array Access Channels AAC0 - AAC3 160 are four parallel channels for generating array element addresses for loops. Each AACi contains eight pairs of address registers, each pair consisting of a current address register and an increment register. All AACi 160 have the same operation set, including a current array element address generation operation with or without the next element address calculation. For memory accesses, one pair of address registers in each channel is used in every cycle. AAC0 and AAC2 are used only for load memory accesses. AAC1 and AAC3 are used for load and store memory accesses.
  • the Memory Management Unit (MMU) 190 contains a four-port Data Translate Lookaside Buffer (DTLB) 137 with 64 entries and performs a hardware search of the Page Table on a DTLB 137 miss.
  • DTLB Data Translate Lookaside Buffer
  • MMU Disambiguation Memory
  • a Disambiguation Memory 194 checks the correctness of the rearrangement of load and store operations performed by an optimizing compiler.
  • the disambiguation memory 194 includes sixteen associative registers for detecting such memory conflicts.
  • the virtual address is used as an associative tag. Invoking a load instruction causes the load address to be driven to the disambiguation memory 194 which checks for subsequent stores to the same address.
  • the disambiguation memory 194 responds to a store at the address by generating a trap. The address is deleted from the disambiguation memory 194 using a CHECKDAM instruction.
  • the MAU 170 is an interface for communicating between the CPU 100 and external memory at an exchange rate of up to four information words transferred during a cycle.
  • the Memory Access Unit contains an entry buffer for memory requests and a crossbar connecting four data memory access channels and one group instruction buffer (IB) 110 memory access channel to four physical memory channels. The two least significant bits of a physical address give the physical memory channel number.
  • the Data Cache (DCACHE) 180 caches data for scalar memory access.
  • Data Cache (DCACHE) 180 is write-through, 32 Kbytes, four-way set associative with 64-byte blocks, virtually addressed and virtually tagged, dual-ported, with 64-bit data paths.
  • Data Cache (DCACHE) 180 output is combined with the ALU output, which permits the bypass buses 142 to be used to speed data transfer to the ALUs. On a DCACHE miss, data from memory are transferred to Data Cache (DCACHE) 180 through four channels simultaneously.
  • the Array Prefetch Buffer (APB) 135 is used to prefetch array elements for loops from memory.
  • the Array Prefetch Buffer (APB) 135 is a four-channel FIFO buffer.
  • the Array Prefetch Buffer (APB) 135 contains 4x48 66-bit registers. Data are transferred from the Array Prefetch Buffer (APB) 135 to the register file (RF) 130 when the data are ready.
  • the CPU 100 has four memory access channels. Each channel has a 64-bit data path.
  • S iteration interval
  • In FIGURE 2, which shows a loop execution diagram, a first stage of a first iteration executes during the first I cycles. During the next I cycles, the first stage of a second iteration and the second stage of the first iteration execute. The loop progresses in this manner until S different iterations are executing in different stages.
  • the first S-1 iterations of a loop, when fewer than all stages are executing, are called a prologue interval 230.
  • the final S iterations are executing while early iterations of early cycles have terminated so that not all stages are executing.
  • the final S-1 iterations of a loop, when not all stages are executing, are called an epilogue interval 240.
  • the intermediate iterations when all stages are executing concurrently, are called a kernel interval 290.
  • a compiler for generating instruction code for a VLIW processor acts upon loop code for overlapped execution by overlapping portions of the instruction code corresponding to several sequential iterations of a loop. Operations from several iterations are combined, or overlapped, into a single wide instruction word.
  • VLIW compilers which are well known in the computing arts, implement variations of a software pipelining technique.
  • Logical iterations are iterations within an original loop code before the code is compiled.
  • Physical iterations are run-time iterations of a software pipelined loop. Multiple logical iterations are overlapped in a physical iteration.
  • In overlapped loop code, the number of overlapped logical iterations in a physical iteration is NOVL. NOVL physical iterations are executed to complete a logical iteration, so that one logical iteration is executed in NOVL stages.
  • a timing diagram shows iterations of a simple inner loop compiled for execution on CPU 100.
  • Logical iterations including a first logical iteration 270 and a second logical iteration 280 and physical iterations including a first physical iteration 250 and a second physical iteration 260 are illustrated.
  • Five logical iterations are overlapped in each physical iteration and each logical iteration is executed in five stages.
  • stages of logical iterations 3, 4, 5, 6, and 7 are executed.
  • a single physical iteration can require the evaluation of more than one instruction word, i.e., "n" very long instruction words evaluated in "n" cycles such as 217, 218, and 219.
  • n very long instruction words evaluated in "n" cycles
  • not every very long instruction required for a physical iteration will contribute an operation to the set of operations evaluated for a stage of a logical iteration, i.e., some cycles will not contribute an operation to some stages.
  • physical iterations of prologue 230 and epilogue 240 portions of the body of a simple inner loop do not include a full set of stages.
  • certain stages include garbage operations 210 which are associated with non-existent logical iterations.
  • garbage operations 220 are associated with other non-existent logical iterations.
  • garbage operations 210 and 220 occur because each physical iteration of loop body 200 includes the same set of operations, encoded by the one or more VLIW instruction cycles which make up a physical iteration.
  • Of the full set of operations encoded for a physical iteration of loop body code, only one valid stage exists in the first physical iteration 250, only two valid stages exist in the second physical iteration 260, and so on, until all five stages are valid in the initial physical iteration of kernel portion 290, for example physical iteration NOVL.
  • Garbage operations 210 are the invalid operations.
  • Garbage operations 220 are similar, but result from increasing numbers of stages containing invalid operations during the epilogue portion 240 of loop body 200.
  • the prologue/epilogue control technique implemented by control logic of CPU 100 selectively enables and disables the execution of categories of operations, rather than providing prologue/epilogue control by exploiting predicated execution codings to successively enable additional stages during successive physical iterations of the prologue and to successively disable stages during successive physical iterations of the epilogue.
  • Although the prologue/epilogue control technique is not a general solution for all inner loop body code, the technique can be applied to a large class of loop programs.
  • the loop body code for performing prologue/epilogue conforms to two reasonable constraints on the structure of the pipelined logical iterations.
  • memory read operations such as loads are confined to the first stage of a logical iteration.
  • operations with side-effects such as memory write operations or stores, loop breaks and the like are confined to the last stage of a logical iteration.
  • These constraints are imposed by the compiler.
  • the restriction of memory read operations to memory read stages 212 and of operations having side-effects to side-effects stages 214 is illustrative of the loop body code constraints.
  • memory read operations associated with logical iteration 270 are constrained to the first stage 271 of the logical iteration.
  • side-effects operations associated with logical iteration 270 are constrained to the last stage 272 of the logical iteration.
  • Loop control logic 300 is connected to receive values for loop control variables from VLIW instruction decoder 323. These values are used to initialize fields of various loop parameters and loop control registers which are collectively shown as loop parameter and status registers 340. In particular, these values initialize an epilogue counter field (ecnt) 341, a shift register (sh) 347, a side-effects enabled flag (seen) 348, a current loop counter field (clc) 345, a loop mode flag (lm) 344, and side-effects manual control (semc) and loads manual control (ldmc) flags (342 and 346).
  • ecnt epilogue counter field
  • sh shift register
  • clc current loop counter field
  • lm loop mode flag
  • ldmc load manual control
  • Side- effects enabling logic 310 and load enabling logic 320 respectively issue the side- effects enabled predicate (ls_se_enbl) and the loads enabled predicate (ls_ld_enbl) to respective subsets of execution units illustratively grouped as 330.
  • STU 0 333 through STU m 334 are illustrative of executive units which implement operations with side-effects and which are distributed among ALC 1 442 and ALC3 444 as described above with reference to FIGURE 1.
  • STU 0 333 through STU m 334 are also illustrative of the AACI and AAC3 channels of AAU 450.
  • STU 0 333 through STU m 334 are each responsive to the ls_se_enbl predicate, enabling side-effects operations when ls_se_enbl is asserted and disabling side-effects operations when ls_se_enbl is de-asserted.
  • LDU 0 335 through LDU n 336 are similarly illustrative of execution units which implement load operations and which are distributed among ALC1 242 and ALC3 244 as described above with reference to FIGURE 1.
  • LDU 0 335 through LDU n 336 are also illustrative of array access channels (AAC0, AAC1, AAC2, and AAC3) 250.
  • LDU 0 335 through LDU n 336 are each responsive to the ls_ld_enbl predicate, enabling load operations when ls_ld_enbl is asserted and disabling load operations when ls_ld_enbl is de-asserted.
  • ALU 0 331 through ALU k 332 are illustrative of execution units which implement arithmetic and logic operations, including non-load and non-side-effects operations, and which are distributed among ALC0 241, ALC1 242, ALC2 243, and ALC3 244 as described above with reference to FIGURE 1.
  • the operation of ALU 0 331 through ALU k 332 is unaffected by the state of either the ls_se_enbl predicate or the ls_ld_enbl predicate.
  • load enabling logic 320 implements:
  • ls_ld_enbl 1 lm
  • Side-effects enabling logic 310 and load enabling logic 320 may be implemented using various other known circuits.
  • the illustrated logic includes comparison logic, such as less-than-zero comparison logic 321, and OR gates, such as OR gates 312 and 322.
  • side-effects enabling logic 310 and load enabling logic 320 may be implemented in positive or negative logic, using AND, OR, NAND, or NOR gates. Suitable transformations of the respective logic equations are well known. Additionally, the initialization and transition sequencing of register fields may be alternately defined with suitable modifications to the logic equations. Similarly, many suitable designs for comparing register values to trigger values are known. Side-effects enabling logic 310 and load enabling logic 320 are of any such suitable designs.
  • the operation of loop control logic 300 involves three types of operations and is described with reference to FIGUREs 2 and 3.
  • the operation types include operations that cause side-effects, including store and loop-break operations; load operations, including load address modifications; and arithmetic and logic type operations.
  • operations with side-effects are restricted to the last stage of a logical iteration and load operations are restricted to the first stage of a logical iteration.
  • side-effects operations of the first logical iteration 270 are scheduled for stage 5 272 for the fifth physical iteration of loop body 200.
  • Load operations of the first logical iteration 270 are scheduled for stage 1 271 for the first physical iteration 250 of a loop body 200.
  • Arithmetic and logic operations of the first logical iteration 270 are scheduled for any of the stages from stage 1 271 to stage 5 272 for any of the first five physical iterations of a loop body 200.
  • the first four (N_OVL - 1) physical iterations in the prologue portion 230 of loop body 200 include stages having operations collectively shown as garbage operations 210.
  • Loop control logic 300 disables garbage operations of the prologue portion 230 of loop body 200 by de-asserting the side-effects enabled predicate supplied to side-effect execution units 333 through 334.
  • Arithmetic and logic operations are included in the set of garbage operations 210 and evaluations of arithmetic and logic operations of the ALU channels 331 and 332 are unaffected by the side-effects enabled predicate. Since these garbage arithmetic and logic operations are not part of any valid logical iteration, they operate on uninitialized data and produce unpredictable garbage-type result values. However, since these garbage result values are used only inside a logical iteration boundary and since operations with side-effects are disabled by the side-effects enabled predicate, the garbage result values do not propagate.
  • side-effects enabling logic 310 supplies the side-effects enabled predicate, disables side-effects operations during the prologue portion of a loop, and otherwise enables side-effects operations.
  • the side-effects enable flag (seen) 348 enables and disables the side-effects enabling logic.
  • the side-effects enabling logic 310 disables operations with side-effects during the first four physical iterations while side-effects enable flag (seen) 348 is reset. On the fifth physical iteration and thereafter, operations with side-effects are enabled and remain enabled for the remainder of the inner loop.
  • the last four (N_OVL - 1) physical iterations of a loop body 200, which make up the epilogue portion 240, include stages having operations collectively shown as garbage operations 220.
  • Loop control logic 300 disables these garbage operations of the epilogue portion 240 of loop body 200 by de-asserting the loads enabled predicate supplied to load execution units 335 through 336.
  • the arithmetic and logic operations are also included in the set of garbage operations 220 and the evaluation of the arithmetic and logic operations at ALU channels 331 and 332 is unaffected by the loads enabled predicate. Since the arithmetic and logic operations are not part of a valid logical iteration, the operations operate on uninitialized data and produce unpredictable garbage result values.
  • the garbage result values are used only inside a logical iteration boundary.
  • Loop body code restricts operations having side-effects to the last stage of a logical iteration. Since garbage operations 220 include no operations with side-effects, garbage result values do not propagate.
  • Load enabling logic 320 supplies the loads enabled predicate, disables load operations during the epilogue portion of a loop, and otherwise enables load operations.
  • the loop counter register 345 and the epilogue counter register 341 are used by the load enabling logic 320 to distinguish the epilogue portion of a loop.
  • a loop initialization operation loads loop counter register 345 with a value equal to the number of logical iterations, N_LI, and loads epilogue counter register 341 with a value equal to N_OVL - 1.
  • Loop counter register 345 is decremented at the end of each physical iteration until the loop counter reaches zero. When the loop counter is one, the first stage of the last logical iteration begins, illustratively shown as logical iteration 8 in FIGURE 2.
  • loop counter register 345 is initialized with the value 8, the loop mode flag 344 is set, and the manual control flag 346 is cleared.
  • Load enabling logic 320 enables load operations during the first eight physical iterations, while loop counter register 345 contains a non-zero value.
  • once loop counter register 345 reaches zero, load operations are disabled and remain disabled for the remainder of the inner loop.
  • Epilogue counter register 341 is decremented at the end of each physical iteration of the epilogue until the value in the epilogue counter register 341 reaches zero, signaling termination of a simple inner loop.
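  • The counter behaviour described above can be traced with a small simulation. The following Python sketch is illustrative only and is not the circuit of loop control logic 300: the names clc and ecnt follow the text, while the function name and the `prologue_left` countdown are hypothetical stand-ins (the text does not say here how the seen flag is set). It assumes 8 logical iterations and an overlap of 5, as in the example of FIGURE 2.

```python
def simulate_loop(n_li, n_ovl):
    """Return one (loads_enabled, side_effects_enabled) pair per physical iteration."""
    clc = n_li                 # current loop counter, loaded with N_LI
    ecnt = n_ovl - 1           # epilogue counter, loaded with N_OVL - 1
    prologue_left = n_ovl - 1  # hypothetical: physical iterations before seen is set
    trace = []
    while clc > 0 or ecnt > 0:
        ls_ld_enbl = clc > 0             # loads run only while clc is non-zero
        ls_se_enbl = prologue_left == 0  # side-effects run once the prologue is over
        trace.append((ls_ld_enbl, ls_se_enbl))
        # End of the physical iteration: update the counters.
        if prologue_left > 0:
            prologue_left -= 1
        if clc > 0:
            clc -= 1
        else:
            ecnt -= 1
    return trace


trace = simulate_loop(n_li=8, n_ovl=5)
# 1-based physical iteration numbers in which each predicate is asserted.
loads = [i + 1 for i, (ld, _) in enumerate(trace) if ld]
side_effects = [i + 1 for i, (_, se) in enumerate(trace) if se]
```

Under these assumptions the loop runs for N_LI + N_OVL - 1 physical iterations, with loads enabled in the first N_LI of them and side-effects enabled in the last N_LI of them.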
  • garbage arithmetic and logic operations which are included in either the garbage operations 210 of prologue period 230 or the garbage operations 220 of epilogue period 240 occasionally produce garbage exception conditions.
  • garbage arithmetic and logical operations in the prologue portion 230 of loop body 200 occasionally operate on uninitialized operand values and trigger an exception condition.
  • Garbage arithmetic and logical operations are an artifact of the software pipelining model rather than valid operations so that exception conditions or traps which result are superfluous. The problem of garbage exceptions is addressed by deferring the handling of an exception until the last stage of a logical iteration, when an iteration, and thus an exception occurring during the iteration, is known to be superfluous.
  • Speculative execution in the VLIW processor 100 proceeds as each operand is tagged with a diagnostic bit (db).
  • Data paths, register files and functional units in processor 100 support the diagnostic bit.
  • the operation marks the result as a diagnostic value.
  • the marking is set, for example, in the register file 130 as a diagnostic value.
  • the actual exception handling event or trap is deferred.
  • the diagnostic value typically contains information about the operations and the triggering exception. If a subsequent speculatively-executed operation uses a marked value as an input operand, the diagnostic bit tagging is passed through to the result, propagating the exception along the speculatively executed execution path and deferring the exception or trap.
  • the tagged diagnostic operand causes an exception and trap when the input operand is used in a non-speculatively executed operation.
  • all operations of a logical iteration, except operations having side-effects, are executed speculatively. Operations with side-effects are executed non-speculatively so that all kinds of side-effects, including exceptions and traps, are deferred until the last stage of a logical iteration. Side-effects associated with exceptions and traps are therefore controlled by the loop status registers.
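  • The deferral of exceptions through the diagnostic bit can be illustrated in software. The sketch below is a simplified model under stated assumptions, not the processor's data path: the `Operand` class, the `info` field, and the choice of a divide and a store as the example operations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operand:
    value: float
    db: bool = False  # diagnostic bit
    info: str = ""    # hypothetical description of the triggering exception


def speculative_div(a, b):
    """Speculative divide: never traps, marks the result instead."""
    if a.db or b.db:
        # Pass the tag of the first marked input through to the result.
        src = a if a.db else b
        return Operand(0.0, True, src.info)
    if b.value == 0:
        return Operand(0.0, True, "divide by zero")
    return Operand(a.value / b.value)


def store(memory, address, operand):
    """Non-speculative store: a marked operand raises the deferred trap."""
    if operand.db:
        raise ArithmeticError(f"deferred exception: {operand.info}")
    memory[address] = operand.value


memory = {}
bad = speculative_div(Operand(1.0), Operand(0.0))  # marked, no trap yet
chained = speculative_div(bad, Operand(2.0))       # the mark propagates
store(memory, 0x10, speculative_div(Operand(6.0), Operand(3.0)))
```

In this model a garbage value produced in the prologue or epilogue never traps, because the store that would consume it is disabled; only a store that actually executes on a marked operand raises the exception.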
  • Multi-way control transfer logic 400 includes multiplexer 410 which supplies a next address selected from an incremented next instruction address, a start patch address, and a loop body address.
  • the next instruction address is supplied to multiplexer 410 by adder 430 based on a program counter value supplied from Program Counter (PC) register 432 and an instruction length supplied from instruction decoder 323.
  • the start patch and loop body addresses are respectively supplied from Control Transfer Preparation Registers (CTPR2a 440 and CTPR3a 450).
  • Branch target coder 420 provides multiplexer 410 with an address selection signal based on an outer loop exit predicate (ls_exit) represented in predicate file 231, based on a last iteration begin predicate (ls_lst_itr_bgn), and based on a last iteration end predicate (ls_lst_itr_end). Branch target coder 420 also receives Control Transfer Operations (CTOPs) from instruction decoder 323.
  • Last iteration begin and end loop predicates are respectively supplied by last iteration begin logic 470 and last iteration end logic 460 based on values stored in fields of various loop parameter and loop status registers, which are collectively shown as loop parameter and status registers 340.
  • last iteration begin logic 470 compares current loop counter field (clc) 345 to the value one (1), supplying a true predicate if current loop counter field (clc) 345 indicates that the current physical iteration begins the last logical iteration.
  • the ls_lst_itr_bgn predicate is used by branch target coder 420 to identify points in a nested loop schedule for transferring control to StartPatch 112 shown in FIGURE 1.
  • Upon receiving an appropriate Control Transfer OP (CTOP) from instruction decoder 323 and a true ls_lst_itr_bgn predicate from last iteration begin logic 470, branch target coder 420 supplies an address selection signal selective for the start patch address stored in Control Transfer Preparation Register (CTPR2a) 440.
  • Values of current loop counter field (clc) 345 from LPR.lc to one (1) indicate valid logical iterations.
  • the zero (0) value indicates the epilogue period.
  • Alternate encodings for identifying the beginning of the last logical iteration of an inner loop body will be appreciated by those of ordinary skill in the art. For example, a shift register configuration, a count up (rather than count down) configuration, alternate counter base points, etc. are all suitable alternatives.
  • Last iteration end logic 460 supplies the last iteration end predicate (ls_lst_itr_end) if both the side-effects enabled flag (seen) 348 and bit one (sh[1]) of shift register (sh) 347 are set.
  • Side-effects enabled flag (seen) 348 is set at the end of prologue period 230 and marks the non-prologue portion of loop body schedule 200 shown in FIGURE 2.
  • the last inner loop body code physical iterations associated with each of multiple overlapped outer loop passes are encoded by set bits of shift register, sh 347.
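  • The two last-iteration predicates reduce to short Boolean expressions. The Python sketch below restates them under one assumption that the text does not settle: that "bit one" of the shift register means bit index 1 counting from a least significant bit 0. The function names simply reuse the predicate names.

```python
def ls_lst_itr_bgn(clc):
    """True when the current physical iteration begins the last logical iteration."""
    return clc == 1


def ls_lst_itr_end(seen, sh):
    """True when the seen flag and bit one of the shift register are both set."""
    # Assumption: sh[1] is bit index 1, with bit 0 as the least significant bit.
    return bool(seen) and bool((sh >> 1) & 1)
```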
  • the ls_lst_itr_end predicate is used by branch target coder 420 to identify points in an overlapped loop schedule for transferring control to FinishPatch 120 in the context of FIGURE 1.
  • Upon receiving an appropriate Control Transfer OP (CTOP) from instruction decoder 323 and a true ls_lst_itr_end predicate from last iteration end logic 460, branch target coder 420 supplies an address selection signal selective for the finish patch address stored in Control Transfer Preparation Register (CTPR3a) 450.
  • branch target coder 420 supplies an address selection signal based on the particular Control Transfer OPeration (CTOP) received from instruction decoder 323 and based on the states of loop predicates such as ls_exit, ls_lst_itr_bgn, and ls_lst_itr_end.
  • branch target coder 420 selects one of four program paths (see column 3) as indicated by the address from the corresponding Control Transfer Preparation Register (CTPR/a).
  • Loop control transfer semantics encode a "fall through," i.e., continuation with the next long instruction (according to the program counter value from Program Counter (PC) register 432), as path 0, i.e., the address associated with "CTPR0a."
  • Loop control condition expressions are prioritized from 0 to 3, with path 0 (fall through) having the highest priority.
  • branch target coder 420 provides multiplexer 410 with an address selection signal selective for the next instruction address input from adder 430 if condition expression 0 evaluates to true.
  • branch target coder 420 provides multiplexer 410 with an address selection signal selective for the start patch address from Control Transfer Preparation Register (CTPR2a) 440 if condition expressions 0 and 1 evaluate to false and condition expression 2 evaluates to true.
  • branch target coder 420 provides multiplexer 410 with an address selection signal selective for the loop body address from Control Transfer Preparation Register (CTPR3a) 450 if condition expressions 0, 1, and 2 evaluate to false and condition expression 3 evaluates to true.
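  • The prioritized selection performed by branch target coder 420 amounts to picking the lowest-numbered condition expression that evaluates to true. The Python sketch below shows only that priority rule; the function name is hypothetical, and returning `None` when no condition holds is an assumption, since the text does not describe that case.

```python
def select_path(conditions):
    """Return the number of the highest-priority true condition expression.

    `conditions` holds four booleans indexed by condition expression number;
    path 0 (fall through) has the highest priority.
    """
    for path, condition in enumerate(conditions):
        if condition:
            return path
    return None  # assumption: behaviour with no true condition is not specified


fall_through = select_path([True, False, True, True])
start_patch = select_path([False, False, True, True])
loop_body = select_path([False, False, False, True])
```

Here path 2 corresponds to the address held in CTPR2a 440 and path 3 to the address held in CTPR3a 450, as described above.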
  • Condition expression 1 and Control Transfer Preparation Register CTPR1 can be configured to control transfers to a middle patch for supporting nested loops with a prefetch buffer.
  • condition expression 0 for Control Transfer OPs (CTOPs) 1, 2, and 3 includes the ls_lst_itr_end predicate and two additional terms, which are not closely related to the implementation of loop schedule 200.
  • the ls_break term provides for inner loop exit on a condition from the predicate file, and the (ls_prlg && LPR.ext) term provides for handling of inner loop body code with an extension fragment for handling nested loops with vector invariants (where ls_prlg indicates the prologue period of each inner loop and ext is a flag in Loop Parameters Register (LPR) 410).
  • the ls_ldovl_limit predicate provides support for branches in response to a maximum load overlap condition for array prefetch operations.
  • Referring to FIGURE 5, a schematic block diagram shows a disambiguation memory 500 including a plurality of associative disambiguation memory (DAM) registers 510.
  • the disambiguation memory 500 includes 32 DAM registers 510.
  • Each DAM register 510 includes a DAM address register 512, a violation bit V 514, and a size register 516.
  • a DAM register 510 has a format as follows:
  • v is the violation bit V 514
  • size is the size register 516
  • memory address is the DAM address register 512.
  • the memory address of a store operation, like the memory address for a load instruction (LDS), is transferred to the disambiguation memory 500.
  • a compare is performed using the size field to account for different formats of loaded and stored values.
  • the violation bit is set for the particular DAM register 510.
  • the disambiguation memory 500 is accessed through usage of various specialized disambiguation memory operations which permit the compiler to schedule a load before potentially dependent store operations even when full information about address dependence is not available.
  • Operations which access and control the disambiguation memory 500 include a speculative load operation (LDS), a check disambiguation memory operation (CHECKDAM), a shift right logical cyclic (SRLC) operation, a decrement disambiguation memory base (DBDAM) operation, and a set disambiguation memory base (SBDAM) operation.
  • the LDS operation performs a check of memory disambiguation during the speculative load operation.
  • Modifiers of the LDS operation include an address field and a size field.
  • the LDS operation is called to place a memory address in the disambiguation memory 500 for comparison with memory addresses of subsequent store operations.
  • All load operations include a version incorporating a check of memory disambiguation.
  • the LDS operation loads a designated value from memory to a designated destination register, performing the essential function of all load operations. In addition, the LDS operation transfers the address of the memory access to the disambiguation memory 500 to check for coincidence of the transferred memory access address with the memory access addresses of subsequent store operations.
  • the modifiers associated with the LDS operation, including two operands for address calculation and a modifier identifying the destination register, have the same meaning and function as the modifiers for standard load operations.
  • the modifiers for both the LDS and conventional load operations define the same data widths, including bytes, half-words, words, doublewords and the like.
  • the byte, half-word and word LDS operations are unsigned (zero-extended) or signed (sign-extended) load operations.
  • the LDS instruction only addresses the disambiguation memory 500 using based addressing and direct addressing.
  • the rd format is as follows:
  • BASE is the number of the base for the physical register address computation and BINCR is a field that determines the bias with respect to the base.
  • the CHECKDAM operation checks and resets violation bits of the disambiguation memory 500 according to a mask argument.
  • the result of the check is written to the address specified by the rd code in the Predicate File (PF) 431.
  • the result predicate is 1 if the violation bit is equal to 1 in at least one of the DAM cells specified by the mask.
  • the shift right logical cyclic (SRLC) operation is used to generate a plurality of CHECKDAM operation mask arguments in a loop.
  • the SRLC operation performs a one-bit right logical cyclic shift of the data designated by operand 1 by a shift amount identified by operand 2.
  • the SRLC is performed in an arithmetic logic unit (ALU) 140.
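  • A cyclic right shift of this kind can be written in a few lines. The Python sketch below is a hedged illustration rather than the ALU's implementation: the function name mirrors the operation name, and the default width of 32 bits is an assumption chosen to match the 32 DAM registers mentioned above.

```python
def srlc(value, amount, width=32):
    """Rotate `value` right by `amount` bits within a `width`-bit field."""
    mask = (1 << width) - 1
    amount %= width
    value &= mask
    # Bits shifted out on the right re-enter on the left.
    return ((value >> amount) | (value << (width - amount))) & mask


# Generating successive CHECKDAM-style masks in a loop (4-bit field for brevity).
masks = []
mask = 0b0001
for _ in range(4):
    masks.append(mask)
    mask = srlc(mask, 1, width=4)
```

Rotating the mask once per iteration lets one static mask pattern follow a rotating region of the disambiguation memory.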
  • the decrement base DAM (DBDAM) operation decrements bdcur, a register holding a current value of the disambiguation memory base pointer.
  • the DBDAM operation decrements bdcur, modulo bdsz+1, where bdsz is the size of the based area in the disambiguation memory 500.
  • the base register contains a base that supplies rotatability of the disambiguation memory 500.
  • Other architectural files, such as r-register files and predicate files, use similar based addressing and include base registers.
  • a standard rd field in the LDS operation is used as an index when the DAM is accessed so that the DAM line distribution correlates with the data distribution in the register file 130.
  • the DAM address corresponds to the address of a register in a defined window.
  • rd is considered an index in the DAM that is based-addressed.
  • the DAM line is calculated as follows:
  • bdsz is the size of the based area of DAM minus one, which describes an adjacent area that includes lines from 0 to bdsz.
  • bdcur is the current value of the base pointer.
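  • The formula for the DAM line did not survive in the text above. The Python sketch below shows one form consistent with the surrounding definitions, under the assumption that the line is the sum of the base pointer and the BINCR field of rd, taken modulo bdsz+1; the function names are hypothetical and the expression should not be read as the exact formula of the original.

```python
def dam_line(bincr, bdcur, bdsz):
    """Assumed based-addressing rule: offset from the base, wrapped to lines 0..bdsz."""
    return (bdcur + bincr) % (bdsz + 1)


def dbdam(bdcur, bdsz):
    """Decrement the base pointer modulo bdsz + 1, as the DBDAM operation is described."""
    return (bdcur - 1) % (bdsz + 1)
```

With bdsz = 7 (eight lines), for example, an offset of 2 from base 6 wraps around to line 0, and decrementing a base of 0 wraps to 7.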
  • the DAM address register 512 and the size register 516 of a DAM register 510 are filled by corresponding fields of the speculative load operation LDS which are communicated by a respective memory address bus 520 and memory operation size bus 522.
  • the LDS operation causes the violation bit V 514 to be reset.
  • Each DAM register 510 is connected to a DAM register associative comparison circuit 530 including an address comparator 532 and an overlap circuit 534. For each DAM register 510, when a memory store operation occurs subsequent to an LDS instruction that fills the DAM register 510, the address comparator 532 compares the address held in the DAM address register 512 to the address of the memory store operation on the memory address bus 520.
  • the overlap circuit 534 receives size information from the size register 516 and size information of the memory store operation via the operation size bus 522, and uses this information in conjunction with the result from the address comparator 532 to determine whether the addresses conflict. If the addresses do conflict, the violation bit V 514 is set.
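  • The combined effect of the address comparator and the overlap circuit can be modelled as a byte-range overlap test. The Python sketch below is a behavioural approximation only: the class and method names are hypothetical, and treating the size field as a byte count is an assumption about its encoding.

```python
from dataclasses import dataclass


@dataclass
class DamRegister:
    address: int = 0
    size: int = 0    # access size in bytes (assumed encoding)
    v: bool = False  # violation bit

    def lds(self, address, size):
        """A speculative load fills the register and resets the violation bit."""
        self.address, self.size, self.v = address, size, False

    def on_store(self, address, size):
        """A later store sets the violation bit if the byte ranges overlap."""
        overlap = (address < self.address + self.size
                   and self.address < address + size)
        if overlap:
            self.v = True


reg = DamRegister()
reg.lds(0x100, 4)       # speculative 4-byte load from 0x100
reg.on_store(0x104, 4)  # adjacent store, no overlap
clean = reg.v
reg.on_store(0x102, 2)  # 2-byte store inside the loaded word
violated = reg.v
```

Comparing ranges rather than bare addresses is what lets a narrow store be caught against a wider load, which is the role the text assigns to the size field.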
  • the disambiguation memory 500 also includes a destination address input register 524 including an increment field 528 and a based field 526.
  • the destination address input register 524 is activated by an operation such as an LDS operation.
  • the increment field 528 includes a plurality of low-order destination bits for addressing the disambiguation memory 500 and the based field 526 bit specifies whether the disambiguation memory 500 is accessed using direct addressing or based addressing.
  • the DAM registers 510 serve as a temporary storage for the destination registers in the register file 130. To conserve size of the disambiguation memory 500, the DAM registers 510 account for only a fraction of the total register space in the register file 130. Note that a content-addressable DAM, rather than the present direct-addressed disambiguation memory 500, adequately covers the register file 130 by virtue of storing the target destination register in the register file in a tag field. The direct-addressed disambiguation memory 500 operates with a reduced register count by usage of based addressing.
  • Based addressing of the DAM registers 510 is highly advantageous for handling of software-pipelined loops, which are used by the compiler to accelerate calculations. Several logical loop iterations are typically executed simultaneously, potentially causing a large number of address dependencies that cannot be resolved at compile-time. Thus, multiple speculative loop loads are placed into the DAM registers 510 having a high risk of aliasing.
  • the knowledge of the program code at compile time has limited utility for handling software-pipelined loop scheduling due to dynamic changes in the destination address within a loop.
  • a particular loop operation has the same static destination address, which is known by the compiler, in all loop iterations. However, dynamic destination addresses change as successive iterations are initiated. The dynamic addresses are not available at compile time.
  • the disambiguation memory uses based addressing to access dynamic destination addresses, thereby accessing rotating regions of the disambiguation memory.
  • the disambiguation memory accesses dynamic destination addresses for software pipelined loop operation in a compact disambiguation memory size.
  • loop invariant loads, loads which are inherent in all loop iterations, may have aliases among loop store operations.
  • the compiler typically extracts invariant loads from the loop body so that the invariant loads are repeated in successive loops. Nonetheless, the loads should be stored in the disambiguation memory 500 and checked every iteration of the loops. Based addressing cannot be used for invariant loads so that direct addressing is used. Accordingly, the disambiguation memory supports selection between direct addressing and based addressing operation.
  • the disambiguation memory 500 also includes a based addressing circuit 540 including a base pointer 542, a based area size register 544, a cyclic adder 546, a cyclic subtractor 548 and a multiplexer 550.
  • the base pointer 542 and based area size register 544 are established using the set disambiguation memory base (SBDAM) operation.
  • the cyclic adder 546 increments the base pointer 542 modulo the based area size from the based area size register 544.
  • the multiplexer 550 determines whether an access address is supplied from the increment field 528 of the destination address input register 524 or from the cyclic adder 546 under control of the based field 526 bit of the destination address input register 524.
  • the access address from the multiplexer 550 is applied to a DAM decoder 560, which determines the particular DAM register 510 of the plurality of DAM registers to receive the memory address and the memory operation size modifiers of a LDS operation.
  • the base pointer 542 is modified within a loop operation using the decrement disambiguation memory base (DBDAM) operation, which activates the cyclic subtractor 548 modulo the based area size from the based area size register 544 and stores the decremented value to the base pointer 542.
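  • The interaction of the base pointer, the cyclic adder and subtractor, and the multiplexer can be followed in a small model. The Python sketch below is an assumption-laden illustration: the class and method names are hypothetical, the arithmetic is taken modulo bdsz+1 as in the description of DBDAM, and the `based` argument plays the role of the based field 526 bit.

```python
class BasedAddressing:
    """Behavioural model of based addressing circuit 540 (names are hypothetical)."""

    def __init__(self):
        self.bdcur = 0  # base pointer 542
        self.bdsz = 0   # based area size register 544 (size minus one)

    def sbdam(self, base, bdsz):
        """Set the base pointer and the based area size."""
        self.bdcur, self.bdsz = base, bdsz

    def dbdam(self):
        """Cyclic subtractor: decrement the base pointer with wrap-around."""
        self.bdcur = (self.bdcur - 1) % (self.bdsz + 1)

    def access_line(self, increment, based):
        """Multiplexer: direct address, or cyclic adder output when based."""
        if based:
            return (self.bdcur + increment) % (self.bdsz + 1)
        return increment


circuit = BasedAddressing()
circuit.sbdam(base=0, bdsz=3)
rotating = []
for _ in range(4):
    rotating.append(circuit.access_line(1, based=True))  # same static index each time
    circuit.dbdam()                                       # rotate once per iteration
invariant = circuit.access_line(5, based=False)          # direct addressing ignores the base
```

The same static index lands on a different DAM line in each iteration, while a direct-addressed access, as used for invariant loads, always lands on the same line.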
  • a checkdam mask circuit 562 masks bits supplied by the CHECKDAM operation to select particular DAM registers for checking and resetting of violation bits according to a mask argument established by a CHECKMASK signal on a plurality of check mask lines 564.
  • the CHECKMASK signal determines which DAM registers 510 have violation bits to be checked.
  • the checkdam mask circuit 562 in response to the CHECKDAM operation, generates a plurality of write strobe signals.
  • the write strobe signals from the checkdam mask circuit 562 are transmitted to the individual DAM registers 510 of the plurality of DAM registers.
  • the CHECKDAM operation resets violation bits and checks to determine whether a particular violation bit is set.
  • CHECKMASK is used to check scheduling validity for multiple potential load-store dependencies simultaneously by one operation.
  • the CHECKMASK signal is generated by the compiler so that a plurality of destination addresses are compacted into a single VLIW instruction, thereby saving on program code, compensation code and execution time.
  • the disambiguation memory 500 incorporates checking of load/store aliasing for a plurality of destination addresses into a single wide instruction operating in a very long instruction word (VLIW) processor using a direct-addressed structure.
  • An alternative content-addressable memory would include circuits for obtaining several destination addresses and for concurrently performing a multiple associative search of the disambiguation memory. These circuits typically consume a large amount of circuit space.
  • the described disambiguation memory 500 utilizes direct addressing and a compiler-generated CHECKMASK to check multiple destination addresses concurrently in a compact circuit size.
  • the compiler constructs a program code using knowledge of the contents of the disambiguation memory 500 and generates the CHECKMASK based on this knowledge, which points to the DAM registers 510 to be checked.
  • the described disambiguation memory hardware simply checks to determine whether any violation bit is set. All other tasks are avoided.
  • a single DAM violation output signal is supplied to a violation line 566 through a logical OR circuit 568 which is connected to each violation bit V 514 through a plurality of AND gates 570.
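  • The masked check reduces to an AND of each violation bit with its mask bit, followed by an OR across all registers. The Python sketch below models that behaviour together with the reset of the checked bits; the function name and the convention that mask bit i selects DAM register i are assumptions made for illustration.

```python
def checkdam(violation_bits, mask):
    """Return True if any selected violation bit is set, then reset the selected bits.

    `violation_bits` is a mutable list of booleans, one per DAM register.
    Assumption: bit i of `mask` selects register i.
    """
    selected = [i for i in range(len(violation_bits)) if (mask >> i) & 1]
    # AND each violation bit with its mask bit, then OR the results together.
    predicate = any(violation_bits[i] for i in selected)
    for i in selected:
        violation_bits[i] = False
    return predicate


bits = [False, True, False, True]
first = checkdam(bits, 0b0011)   # checks registers 0 and 1
second = checkdam(bits, 0b0011)  # the same registers were reset by the first check
third = checkdam(bits, 0b1000)   # checks register 3
```

A single call therefore validates several speculative loads at once, which is the saving the compiler-generated CHECKMASK is meant to provide.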
  • a compiler may optimize VLIW instructions so that a read instruction is moved to an execution position which precedes a write instruction. Such an optimization may lead to a conflict or collision in which the read and the write instruction operate upon an address that designates the same memory location.
  • the disambiguation memory 500 is used to detect these conflicts or collisions.
  • the disambiguation memory 500 scan is performed simultaneously with the TLB scan.
  • the virtual address of the read instruction is stored in a DAM register 510, thereby serving as a tag.
  • the virtual address of a write instruction is used for comparing with the tags stored in the DAM registers 510.
  • the collision is logged by setting the violation bit 514 in the DAM register 510.
  • a CHECKDAM instruction executed by the disambiguation memory 500 causes invalidation of the conflicting address.
  • Invalidation of a DAM register 510 with a violation bit V 514 present causes a trap.
  • the disambiguation memory 500 is first activated using a speculative load (LDS) operation in step 610, which operates upon a designated source operand and a designated destination operand in the manner of a conventional load operation, including usage of the same modifiers.
  • the LDS operation, like a load (LD) operation, loads a datum from a specified memory address into a specified destination register. Unlike the load (LD) operation, the LDS operation performs two further functions. First, any entry in the registers 510 of the disambiguation memory 500 having a target register field which corresponds to the destination register of the LDS operation is invalidated in step 612. Second, information concerning the LDS operation is entered by storing the designated source memory address referenced by the LDS operation into a register in the disambiguation memory 500 in step 614.
  • the specific register in the disambiguation memory 500 is determined by the destination address of the LDS operation using the destination address input register 524, the multiplexer 550 and the decoder 560. Accordingly, for a speculative load operation (LDS) the disambiguation memory 500 stores the addresses, in memory, of the storage locations from which data is transferred to a destination register via the LDS operation.
  • the location of the register 510 in the disambiguation memory 500 is software-defined as a function of the destination register.
  • the LDS instruction potentially conflicts with subsequent store operations. To determine whether actual conflicts arise, the address of the source memory is stored in step 616 in the disambiguation memory 500 in a position determined by the position of the destination register designated by the LDS instruction.
  • the disambiguation memory 500 differs in operation from the operation of a conventional LDS instruction such as the LDS instruction defined within the PlayDoh architecture.
  • the PlayDoh disambiguation memory stores a designation of both the destination register and the source memory location in an LDS log and thereby operates as a content addressable memory (CAM).
  • the DAM registers 510 are addressed using both direct addressing and based addressing.
  • based addressing of registers in the disambiguation memory 500 is highly advantageous for handling a plurality of speculative load operations in a loop body, as directed by an optimizing compiler.
  • based addressing is advantageous for allowing the number of DAM registers 510 to be substantially fewer than the number of registers in the register file 130.
  • loop processing is advantageously achieved using based addressing since the loops generally step through portions of the registers in order.
  • the option for direct addressing is advantageous for handling invariant loads that are not accessible using based addressing.
  • the DAM base pointer register 542 is specified as a base address of the disambiguation memory 500 which is set by writing to the DAM base pointer register 542 using a set DAM base (SBDAM) operation 604 during an initialization step 602 so that subsequent base address references of instructions appropriately address the DAM registers 510.
  • the DAM base pointer register 542 is also used to calculate an address in the disambiguation memory 500 as a function of the destination register.
  • the base area size register 544 specifies the size of the disambiguation memory 500.
  • when a store operation executes in step 618, the processor 100 performs multiple actions upon the disambiguation memory 500. For each valid entry in a register 510 within the disambiguation memory 500, the processor 100 compares the memory address stored in the entry with the memory address referenced by the store operation to determine, in step 620, whether the store operation potentially writes into a physical memory location that is accessed by the LDS operation corresponding to the entry. If the store operation references an address corresponding to the disambiguation memory 500 entry, then the processor 100 invalidates the entry by writing a "violation bit" within the corresponding register of the disambiguation memory 500 in step 622. This violation bit is indicative of incorrect load-store scheduling. The address comparison detects all cases in which the store operation and the LDS operation access a common physical memory location.
  • the CHECKDAM operation is issued in conjunction with the previously issued data speculative load (LDS) operation.
  • the CHECKDAM operation checks registers 510 within the disambiguation memory 500 which have a violation bit set and causes predicate bit setting in response to a set violation bit.
  • the CHECKDAM instruction is issued with a CHECKDAM operand which serves as a mask for application in a perpendicular fashion across the violation bits of the entire disambiguation memory 500. Specifically, the violation bits of all registers 510 of the disambiguation memory 500 are located in the same bit-position.
  • the CHECKDAM operand mask is applied across the violation bits of all registers 510 to designate an incorrect load-store scheduling condition for a plurality of LDS operations in parallel.
  • Each bit of the CHECKDAM operand mask corresponds to a separate and independent register of the disambiguation memory 500.
  • the SRLC operation utilizes an operand mask in a range corresponding to the disambiguation memory-based area size.
  • the predicate bits analysis causes a conditional branch to a recovery code which is generated by the compiler to ensure the correct execution of the program.
  • the recovery code, or compensation code, is directed by predicate bits that act as logical parameters for logical branch instructions in the recovery code.
  • the LDS, CHECKDAM and SRLC operations are used to perform conditional and unconditional branches.
  • the compiler prepares for LDS, CHECKDAM and SRLC operations by compiling and organizing instruction code which includes routines for handling incorrect load-store scheduling. These routines include an initial branch address that is made accessible according to a particular branch target address which is stored by the compiler.
  • other operating codes may be supplied that generate additional suitable target addresses at run-time using a prepare-to-branch operation, thereby specifying a branch target address in advance of the branch point to allow a prefetch of instructions from the target address. Information for controlling instruction prefetch is also specified.
  • an appropriate type of compare-to-predicate operation computes a branch condition in the Predicate File (PF) 431 which designates a target address for a conditional branch routine.
  • the SRLC operation is used to direct control to one or more branch operations to perform the actual transfer of control if the branch is taken.
  • a shift SRLC operation behaves in the manner of a conditional branch that transfers control to a location in compiler-generated compensation code to ensure correct execution of a program.
  • unlike a conventional conditional branch instruction, the SRLC operation is not limited to handling a single load-store scheduling fault; rather, it allows branching to the address of recovery program code, which executes based on the setting of predicate bits to perform various different functions.
  • an optimizing compiler is used for handling speculative loads with potentially unresolved compile-time address dependencies not only for a linear program flow but also for execution in a loop body.
  • the processor 100 is adapted to resolve multiple load-store scheduling faults using only a single branch to a compiler-generated recovery code.
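
The LDS, store, and CHECKDAM behaviour described in the items above can be illustrated with a small software model. The sketch below is only an illustration in Python, not the patented circuit: the names `DamEntry`, `DisambiguationMemory`, `lds`, `store`, and `checkdam`, the exact-match address comparison, and returning the flagged bits as an integer (instead of setting predicate bits) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DamEntry:
    """One DAM register: a load address plus status bits (hypothetical layout)."""
    valid: bool = False
    violation: bool = False
    address: Optional[int] = None


class DisambiguationMemory:
    """Toy model of a disambiguation memory (all names are illustrative assumptions)."""

    def __init__(self, size: int) -> None:
        self.entries = [DamEntry() for _ in range(size)]

    def lds(self, index: int, address: int) -> None:
        # Speculative load: the entry chosen by the destination register is
        # overwritten, which discards any older record and clears its violation bit.
        self.entries[index] = DamEntry(valid=True, violation=False, address=address)

    def store(self, address: int) -> None:
        # A store flags every valid entry that recorded the same address.
        # Assumption: exact address equality stands in for the hardware comparison.
        for entry in self.entries:
            if entry.valid and entry.address == address:
                entry.violation = True

    def checkdam(self, mask: int) -> int:
        # Bit i of the mask selects entry i; the result keeps only the selected
        # bits whose entries have the violation bit set.
        result = 0
        for i, entry in enumerate(self.entries):
            if (mask >> i) & 1 and entry.violation:
                result |= 1 << i
        return result


dam = DisambiguationMemory(8)
dam.lds(2, 0x1000)   # speculative load recorded in entry 2
dam.lds(5, 0x2000)   # speculative load recorded in entry 5
dam.store(0x2000)    # later store hits the address recorded in entry 5
flags = dam.checkdam(0b00100100)  # check entries 2 and 5 with one mask
```

In this model a nonzero `flags` value plays the role of the condition that would direct a single branch to compiler-generated recovery code; here only bit 5 is set, because only the load recorded in entry 5 was overtaken by a conflicting store.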

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

This invention relates to a disambiguation memory that performs load/store collapse checking for a plurality of destination addresses in a single wide instruction for use in a Very Long Instruction Word (VLIW) processor. The disambiguation memory uses direct addressing and a compiler-generated check mask to check multiple destination addresses concurrently in a compact circuit size. A compiler uses software-pipelined loops to accelerate calculations. Multiple logical loop iterations are executed concurrently, which can give rise to a large number of address dependencies that cannot be resolved at compile time. A particular loop operation has the same static destination address in all loop iterations, and this address is known to the compiler. However, the dynamic destination addresses change as successive iterations are initiated. The dynamic addresses are not available at compile time. The disambiguation memory uses based addressing to access the dynamic destination addresses, thereby allowing access to rotating areas of the disambiguation memory. The disambiguation memory thus accesses dynamic destination addresses for software-pipelined loop operation in a compact-size disambiguation memory. However, loop-invariant loads, that is, loads that are common to all loop iterations, cannot be accessed by based addressing. The disambiguation memory therefore provides for selection between direct addressing and based addressing operation.
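
The selection between direct and based addressing summarized in the abstract can be sketched as a small index calculation. This is a hedged illustration in Python, not the circuit itself: the function name `dam_index`, the modulo wraparound inside the rotating area, and advancing the base by one per loop iteration are assumptions made here for illustration.

```python
def dam_index(static_index: int, base: int, area_size: int, based: bool) -> int:
    """Select a DAM entry for a load with a compile-time (static) index."""
    if based:
        # Based addressing: offset from the current base, wrapping inside
        # the rotating area (modulo wraparound is an assumption).
        return (base + static_index) % area_size
    # Direct addressing: the static index is used unchanged,
    # as would suit loop-invariant loads.
    return static_index


AREA_SIZE = 4  # hypothetical size of the rotating area

# The same static index maps to a different entry as the base advances...
rotating = [dam_index(1, base=i, area_size=AREA_SIZE, based=True) for i in range(5)]
# ...while a directly addressed invariant load always uses the same entry.
invariant = [dam_index(6, base=i, area_size=AREA_SIZE, based=False) for i in range(5)]
```

With these assumed parameters, `rotating` cycles through the entries of the rotating area while `invariant` stays fixed, which is why a compact memory can serve several overlapping loop iterations without losing track of loads that never change.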
PCT/RU1996/000215 1996-08-07 1996-08-07 Circuit de memoire de desambiguisation et procede de fonctionnement WO1998006039A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RU1996/000215 WO1998006039A1 (fr) 1996-08-07 1996-08-07 Circuit de memoire de desambiguisation et procede de fonctionnement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU1996/000215 WO1998006039A1 (fr) 1996-08-07 1996-08-07 Circuit de memoire de desambiguisation et procede de fonctionnement

Publications (1)

Publication Number Publication Date
WO1998006039A1 true WO1998006039A1 (fr) 1998-02-12

Family

ID=20130022

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU1996/000215 WO1998006039A1 (fr) 1996-08-07 1996-08-07 Circuit de memoire de desambiguisation et procede de fonctionnement

Country Status (1)

Country Link
WO (1) WO1998006039A1 (fr)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3618027A (en) * 1970-03-27 1971-11-02 Research Corp Associative memory system with reduced redundancy of stored information
GB1447297A (en) * 1972-12-06 1976-08-25 Amdahl Corp Data processing system
SU741269A1 (ru) * 1978-01-04 1980-06-15 Специальное Конструкторское Бюро Вычислительных Машин Микропрограммный процессор
SU1161950A1 (ru) * 1982-12-30 1985-06-15 Предприятие П/Я Г-6429 8-Битный микропроцессор
SU1246108A1 (ru) * 1984-04-20 1986-07-23 Предприятие П/Я М-5339 Процессор
EP0293851A2 (fr) * 1987-06-05 1988-12-07 Mitsubishi Denki Kabushiki Kaisha Processeur de traitement numérique de signaux
EP0299537A2 (fr) * 1987-07-17 1989-01-18 Sanyo Electric Co., Ltd. Dispositif et méthode pour le traitement des signaux numériques

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015520905A (ja) * 2013-05-30 2015-07-23 インテル・コーポレーション パイプライン化されたスケジュールにおけるエイリアスレジスタ割り当て
US9495168B2 (en) 2013-05-30 2016-11-15 Intel Corporation Allocation of alias registers in a pipelined schedule
KR101752042B1 (ko) * 2013-05-30 2017-06-28 인텔 코포레이션 파이프라이닝된 스케줄에서의 에일리어스 레지스터들의 할당
US20170123658A1 (en) * 2015-11-04 2017-05-04 Samsung Electronics Co., Ltd. Method and apparatus for parallel processing data
US10013176B2 (en) * 2015-11-04 2018-07-03 Samsung Electronics Co., Ltd. Method and apparatus for parallel processing data including bypassing memory address alias checking

Similar Documents

Publication Publication Date Title
US5958048A (en) Architectural support for software pipelining of nested loops
US5794029A (en) Architectural support for execution control of prologue and eplogue periods of loops in a VLIW processor
US5983336A (en) Method and apparatus for packing and unpacking wide instruction word using pointers and masks to shift word syllables to designated execution units groups
Ditzel et al. Branch folding in the CRISP microprocessor: Reducing branch delay to zero
US5649145A (en) Data processor processing a jump instruction
KR0149658B1 (ko) 데이터 처리장치 및 데이터 처리방법
Fox Formal specification and verification of ARM6
JP2931890B2 (ja) データ処理装置
US5692169A (en) Method and system for deferring exceptions generated during speculative execution
US5889985A (en) Array prefetch apparatus and method
US6076158A (en) Branch prediction in high-performance processor
EP0463977B1 (fr) Branchement dans un processeur en pipeline
US8583905B2 (en) Runtime extraction of data parallelism
US6301705B1 (en) System and method for deferring exceptions generated during speculative execution
US6157994A (en) Microprocessor employing and method of using a control bit vector storage for instruction execution
US20050010743A1 (en) Multiple-thread processor for threaded software applications
US5398321A (en) Microcode generation for a scalable compound instruction set machine
Benitez et al. Code generation for streaming: An access/execute mechanism
KR20010109354A (ko) 프로세서내의 기록 트래픽을 감소시키는 시스템 및 방법
US7272704B1 (en) Hardware looping mechanism and method for efficient execution of discontinuity instructions
US6292845B1 (en) Processing unit having independent execution units for parallel execution of instructions of different category with instructions having specific bits indicating instruction size and category respectively
EP1261914B1 (fr) Architecture de traitement a fonction de controle des limites de matrice
US7051193B2 (en) Register rotation prediction and precomputation
Case Intel Reveals Pentium Implementation Details
US6799266B1 (en) Methods and apparatus for reducing the size of code with an exposed pipeline by encoding NOP operations as instruction operands

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 08733833

Country of ref document: US

AK Designated states

Kind code of ref document: A1

Designated state(s): RU US