US20050010743A1 - Multiple-thread processor for threaded software applications - Google Patents
Multiple-thread processor for threaded software applications
- Publication number
- US20050010743A1, US10/818,785
- Authority
- US
- United States
- Prior art keywords
- instruction
- bit
- sfu
- address
- ufu
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/3826—Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30123—Organisation of register space, e.g. banked or distributed register file according to context, e.g. thread buffers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3877—Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3888—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel
Definitions
- the present invention relates to a processor architecture. More specifically, the present invention relates to a single-chip processor architecture including structures for multiple-thread operation.
- an automated system may handle multiple events or processes concurrently.
- a single process is termed a thread of control, or “thread”, and is the basic unit of operation of independent dynamic action within the system.
- a program has at least one thread.
- a system performing concurrent operations typically has many threads, some of which are transitory and others enduring.
- Systems that execute among multiple processors allow for true concurrent threads.
- Single-processor systems can only have illusory concurrent threads, typically attained by time-slicing of processor execution, shared among a plurality of threads.
- Some programming languages are particularly designed to support multiple-threading.
- One such language is the Java™ programming language that is advantageously executed using an abstract computing machine, the Java Virtual Machine™.
- a Java Virtual Machine™ is capable of supporting multiple threads of execution at one time. The multiple threads independently execute Java code that operates on Java values and objects residing in a shared main memory.
- the multiple threads may be supported using multiple hardware processors, by time-slicing a single hardware processor, or by time-slicing many hardware processors. In 1990, programmers at Sun Microsystems developed a universal programming language, eventually known as “the Java™ programming language”.
- Java™, Sun, Sun Microsystems and the Sun Logo are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks, including UltraSPARC I and UltraSPARC II, are used under license and are trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
- Java™ supports the coding of programs that, though concurrent, exhibit deterministic behavior, by including techniques and structures for synchronizing the concurrent activity of threads.
- Java™ uses monitors, high-level constructs that allow only a single thread at one time to execute a region of code protected by the monitor. Monitors use locks associated with executable objects to control thread execution.
- a thread executes code by performing a sequence of actions.
- a thread may use the value of a variable or assign the variable a new value. If two or more concurrent threads act on a shared variable, the actions on the variable may produce a timing-dependent result, an inherent consequence of concurrent programming.
- Each thread has a working memory that may store copies of the values of master copies of variables from main memory that are shared among all threads.
- a thread usually accesses a shared variable by obtaining a lock and flushing the working memory of the thread, guaranteeing that shared values are thereafter loaded from the shared memory to the working memory of the thread. By unlocking a lock, a thread guarantees that the values held by the thread in the working memory are written back to the main memory.
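- The monitor and lock behavior described above can be illustrated with a minimal Java sketch. The class name `MonitorSketch` and its methods are hypothetical and are not taken from the patent; the sketch only assumes the standard `synchronized` keyword and `java.lang.Thread`. Two threads update one shared variable, and the monitor guarantees that only one thread at a time executes the protected region.

```java
// Minimal sketch: two threads update one shared variable under a monitor.
// Class and method names are hypothetical, not taken from the patent.
final class MonitorSketch {
    private int shared = 0; // the shared variable guarded by this object's lock

    // Entering a synchronized method acquires the object's lock; leaving it
    // releases the lock, which makes the update visible to the next thread
    // that acquires the same lock.
    synchronized void increment() {
        shared++;
    }

    synchronized int get() {
        return shared;
    }

    // Runs two threads that each increment the shared variable perThread times.
    static int runTwoThreads(int perThread) {
        MonitorSketch m = new MonitorSketch();
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                m.increment();
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return m.get();
    }

    public static void main(String[] args) {
        System.out.println(runTwoThreads(100_000)); // prints 200000
    }
}
```

- Without the `synchronized` keyword the two threads could interleave their read-modify-write sequences, giving the timing-dependent result that the text attributes to unsynchronized access to a shared variable.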
- actions performed by one thread are totally ordered so that for any two actions performed by a thread, one action precedes the other.
- Actions performed by the main memory for any one variable are totally ordered so that for any two actions performed by the main memory on the same variable, one action precedes the other.
- Actions performed by the main memory for any one lock are totally ordered so that for any two actions performed by the main memory on the same lock, one action precedes the other.
- an action is not permitted to follow itself. Threads do not interact directly but rather only communicate through the shared main memory.
- each lock or unlock is performed jointly by some thread and the main memory.
- Each load action by a thread is uniquely paired with a read action by the main memory such that the load action follows the read action.
- Each store action by a thread is uniquely paired with a write action by the main memory such that the write action follows the store action.
- An implementation of threading incurs some overhead. For example, a single processor system incurs overhead in time-slicing between threads. Additional overhead is incurred in allocating and handling accessing of main memory and local thread working memory.
- a processor has an improved architecture for multiple-thread operation on the basis of a highly parallel structure including multiple independent parallel execution paths for executing in parallel across threads and a multiple-instruction parallel pathway within a thread.
- the multiple independent parallel execution paths include functional units that execute an instruction set including special data-handling instructions that are advantageous in a multiple-thread environment.
- a general-purpose processor includes two independent processor elements in a single integrated circuit die.
- the dual independent processor elements advantageously execute two independent threads concurrently during multiple-threading operation.
- the second processor element is advantageously used to perform garbage collection, Just-In-Time (JIT) compilation, and the like.
- the independent processor elements are Very Long Instruction Word (VLIW) processors.
- one illustrative processor includes two independent Very Long Instruction Word (VLIW) processor elements, each of which executes an instruction group or instruction packet that includes up to four instructions, otherwise termed subinstructions. Each of the instructions in an instruction group executes on a separate functional unit.
- the two threads execute independently on the respective VLIW processor elements, each of which includes a plurality of powerful functional units that execute in parallel.
- the VLIW processor elements have four functional units including three media functional units and one general functional unit. All of the illustrative media functional units include an instruction that executes both a multiply and an add in a single cycle, either floating point or fixed point.
- an individual independent parallel execution path has operational units including instruction supply blocks and instruction preparation blocks, functional units, and a register file that are separate and independent from the operational units of other paths of the multiple independent parallel execution paths.
- the instruction supply blocks include a separate instruction cache for each of the individual independent parallel execution paths; however, the multiple independent parallel execution paths share a single data cache since multiple threads sometimes share data.
- the data cache is dual-ported, allowing data access in both execution paths in a single cycle.
- the instruction supply blocks in an execution path include an instruction aligner, and an instruction buffer that precisely format and align the full instruction group to prepare to access the register file.
- An individual execution path has a single register file that is physically split into multiple register file segments, each of which is associated with a particular functional unit of the multiple functional units. At any point in time, the register file segments as allocated to each functional unit each contain the same content.
- a multi-ported register file is typically metal-limited, with the area consumed by the circuit proportional to the square of the number of ports. It has been discovered that a processor having a register file structure divided into a plurality of separate and independent register files forms a layout structure with an improved layout efficiency. The read ports of the total register file structure are allocated among the separate and individual register files. Each of the separate and individual register files has write ports that correspond to the total number of write ports in the total register file structure. Writes are fully broadcast so that all of the separate and individual register files are coherent.
- FIG. 1 is a schematic block diagram illustrating a single integrated circuit chip implementation of a processor in accordance with an embodiment of the present invention.
- FIG. 2 is a schematic block diagram showing the core of the processor.
- FIG. 3 is a schematic block diagram that illustrates an embodiment of the split register file that is suitable for usage in the processor.
- FIG. 4 is a schematic block diagram that shows a logical view of the register file and functional units in the processor.
- FIG. 5 is a pictorial schematic diagram depicting an example of instruction execution among a plurality of media functional units.
- FIG. 6 illustrates a schematic block diagram of an SRAM array used for the multi-port split register file.
- FIGS. 7A and 7B are, respectively, a schematic block diagram and a pictorial diagram that illustrate the register file and a memory array insert of the register file.
- FIG. 8 is a schematic block diagram showing an arrangement of the register file into the four register file segments.
- FIG. 9 is a schematic timing diagram that illustrates timing of the processor pipeline.
- FIGS. 10A, 10B, 10C, 10D, 10E and 10F illustrate instruction formats.
- FIG. 11 illustrates operation of a bitext instruction.
- Referring to FIG. 1, a schematic block diagram illustrates a processor 100 having an improved architecture for multiple-thread operation on the basis of a highly parallel structure including multiple independent parallel execution paths, shown herein as two media processing units 110 and 112.
- the execution paths execute in parallel across threads and include a multiple-instruction parallel pathway within a thread.
- the multiple independent parallel execution paths include functional units executing an instruction set having special data-handling instructions that are advantageous in a multiple-thread environment.
- the multiple-threading architecture of the processor 100 is advantageous for usage in executing multiple-threaded applications using a language such as the Java™ language running under a multiple-threaded operating system on a multiple-threaded Java Virtual Machine™.
- the illustrative processor 100 includes two independent processor elements, the media processing units 110 and 112, forming two independent parallel execution paths.
- a language that supports multiple threads, such as the Java™ programming language, generates two threads that respectively execute in the two parallel execution paths with very little overhead incurred.
- the special instructions executed by the multiple-threaded processor include instructions for accessing arrays, and instructions that support garbage collection.
- a single integrated circuit chip implementation of a processor 100 includes a memory interface 102, a geometry decompressor 104, the two media processing units 110 and 112, a shared data cache 106, and several interface controllers.
- the interface controllers support an interactive graphics environment with real-time constraints by integrating fundamental components of memory, graphics, and input/output bridge functionality on a single die.
- the components are mutually linked and closely linked to the processor core with high bandwidth, low-latency communication channels to manage multiple high-bandwidth data streams efficiently and with a low response time.
- the interface controllers include an UltraPort Architecture Interconnect (UPA) controller 116 and a peripheral component interconnect (PCI) controller 120.
- the illustrative memory interface 102 is a direct Rambus dynamic RAM (DRDRAM) controller.
- the shared data cache 106 is a dual-ported storage that is shared among the media processing units 110 and 112 with one port allocated to each media processing unit.
- the data cache 106 is four-way set associative, follows a write-back protocol, and supports hits in the fill buffer (not shown).
- the data cache 106 allows fast data sharing and eliminates the need for a complex, error-prone cache coherency protocol between the media processing units 110 and 112.
- the UPA controller 116 is a custom interface that attains a suitable balance between high-performance computational and graphic subsystems.
- the UPA is a cache-coherent, processor-memory interconnect.
- the UPA attains several advantageous characteristics including a scaleable bandwidth through support of multiple bused interconnects for data and addresses, packets that are switched for improved bus utilization, higher bandwidth, and precise interrupt processing.
- the UPA performs low latency memory accesses with high throughput paths to memory.
- the UPA includes a buffered cross-bar memory interface for increased bandwidth and improved scaleability.
- the UPA supports high-performance graphics with two-cycle single-word writes on the 64-bit UPA interconnect.
- the UPA interconnect architecture utilizes point-to-point packet switched messages from a centralized system controller to maintain cache coherence. Packet switching improves bus bandwidth utilization by removing the latencies commonly associated with transaction-based designs.
- the PCI controller 120 is used as the primary system I/O interface for connecting standard, high-volume, low-cost peripheral devices, although other standard interfaces may also be used.
- the PCI bus effectively transfers data among high bandwidth peripherals and low bandwidth peripherals, such as CD-ROM players, DVD players, and digital cameras.
- Two media processing units 110 and 112 are included in a single integrated circuit chip to support an execution environment exploiting thread level parallelism in which two independent threads can execute simultaneously.
- the threads may arise from any sources such as the same application, different applications, the operating system, or the runtime environment.
- Parallelism is exploited at the thread level since parallelism is rare beyond four, or even two, instructions per cycle in general purpose code.
- the illustrative processor 100 is an eight-wide machine with eight execution units for executing instructions.
- a typical “general-purpose” processing code has an instruction level parallelism of about two so that, on average, most (about six) of the eight execution units would be idle at any time.
- the illustrative processor 100 employs thread level parallelism and operates on two independent threads, possibly attaining twice the performance of a processor having the same resources and clock rate but utilizing traditional non-thread parallelism.
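- The utilization argument above reduces to simple arithmetic, sketched here in Java. The class `UtilizationSketch` and its method are hypothetical, and the sketch assumes that each thread independently sustains the instruction level parallelism of about two that the text cites for general-purpose code.

```java
// Minimal sketch of the utilization argument; names are hypothetical.
final class UtilizationSketch {
    // Execution units kept busy per cycle when each thread sustains `ilp`
    // instructions per cycle, capped by the units available to that thread.
    static int busyUnits(int threads, int unitsPerThread, int ilp) {
        return threads * Math.min(ilp, unitsPerThread);
    }

    public static void main(String[] args) {
        // One thread spread across all eight units with ILP of about two.
        System.out.println(busyUnits(1, 8, 2)); // prints 2 (six units idle)
        // Two independent threads, each with four units and ILP of about two.
        System.out.println(busyUnits(2, 4, 2)); // prints 4 (twice the throughput)
    }
}
```

- Under these assumptions the two-thread arrangement keeps four units busy instead of two, which matches the "twice the performance" estimate in the text.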
- Thread level parallelism is particularly useful for Java™ applications, which are bound to have multiple threads of execution.
- Java™ methods including “suspend”, “resume”, “sleep”, and the like include effective support for threaded program code.
- Java™ class libraries are thread-safe to promote parallelism.
- the thread model of the processor 100 supports a dynamic compiler which runs as a separate thread using one media processing unit 110 while the second media processing unit 112 is used by the current application.
- the compiler applies optimizations based on “on-the-fly” profile feedback information while dynamically modifying the executing code to improve execution on each subsequent run. For example, a “garbage collector” may be executed on a first media processing unit 110, copying objects or gathering pointer information, while the application is executing on the other media processing unit 112.
- Although the processor 100 shown in FIG. 1 includes two processing units on an integrated circuit chip, the architecture is highly scaleable so that one to several closely-coupled processors may be formed in a message-based coherent architecture and resident on the same die to process multiple threads of execution.
- a limitation on the number of processors formed on a single die thus arises from capacity constraints of integrated circuit technology rather than from architectural constraints relating to the interactions and interconnections between processors.
- the processor 100 is a general-purpose processor that includes the media processing units 110 and 112, two independent processor elements in a single integrated circuit die.
- the dual independent processor elements 110 and 112 advantageously execute two independent threads concurrently during multiple-threading operation.
- the second processor element is advantageously used to perform garbage collection, Just-In-Time (JIT) compilation, and the like.
- the independent processor elements 110 and 112 are Very Long Instruction Word (VLIW) processors.
- one illustrative processor 100 includes two independent Very Long Instruction Word (VLIW) processor elements, each of which executes an instruction group or instruction packet that includes up to four instructions. Each of the instructions in an instruction group executes on a separate functional unit.
- a VLIW processor advantageously reduces complexity by avoiding usage of various structures such as schedulers or reorder buffers that are used in superscalar machines to handle data dependencies.
- a VLIW processor typically uses software scheduling and software checking to avoid data conflicts and dependencies, greatly simplifying hardware control circuits.
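- The software checking mentioned above can be sketched as a packet validity test in Java. The names `VliwPacketSketch`, `Unit`, and `isValidPacket` are hypothetical, and the sketch assumes that the compiler binds each subinstruction to one of the four functional units (one general functional unit and three media functional units) before the packet is issued.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;

// Minimal sketch of a compile-time packet check; names are hypothetical.
final class VliwPacketSketch {
    // One general functional unit and three media functional units.
    enum Unit { GFU, MFU1, MFU2, MFU3 }

    static final int MAX_PACKET = 4; // up to four subinstructions per packet

    // A packet is modeled as the list of units claimed by its subinstructions.
    // It is valid when it holds one to four subinstructions and no functional
    // unit is claimed twice, so every subinstruction executes on a separate unit.
    static boolean isValidPacket(List<Unit> packet) {
        if (packet.isEmpty() || packet.size() > MAX_PACKET) {
            return false;
        }
        EnumSet<Unit> used = EnumSet.noneOf(Unit.class);
        for (Unit u : packet) {
            if (!used.add(u)) { // add returns false if the unit is already taken
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidPacket(
                Arrays.asList(Unit.GFU, Unit.MFU1, Unit.MFU2, Unit.MFU3))); // prints true
        System.out.println(isValidPacket(
                Arrays.asList(Unit.MFU1, Unit.MFU1)));                      // prints false
    }
}
```

- Because the check runs in software before execution, the hardware needs no scheduler or reorder buffer to resolve conflicts between subinstructions of the same packet.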
- the two threads execute independently on the respective VLIW processor elements 110 and 112, each of which includes a plurality of powerful functional units that execute in parallel.
- the VLIW processor elements 110 and 112 have four functional units including three media functional units 220 and one general functional unit 222. All of the illustrative media functional units 220 include an instruction that executes both a multiply and an add in a single cycle, either floating point or fixed point.
- a processor with two VLIW processor elements can execute twelve floating point operations each cycle.
- the processor runs at a 6 gigaflop rate, even without accounting for general functional unit operation.
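- The peak-rate figures can be checked with a short calculation, sketched here in Java. The class `PeakFlopsSketch` is hypothetical. The 500 MHz clock rate is an assumption inferred from the stated numbers (6 gigaflops divided by twelve operations per cycle) and does not appear in the text above.

```java
// Minimal sketch of the peak floating point arithmetic; names are hypothetical.
final class PeakFlopsSketch {
    // Floating point operations per cycle across all processor elements.
    static long flopsPerCycle(int elements, int mfusPerElement, int flopsPerMfu) {
        return (long) elements * mfusPerElement * flopsPerMfu;
    }

    // Peak rate in gigaflops for a given clock rate in MHz.
    static double gigaflops(long flopsPerCycle, double clockMHz) {
        return flopsPerCycle * clockMHz / 1000.0;
    }

    public static void main(String[] args) {
        // Two VLIW elements, three media functional units each, and a
        // multiply plus an add (two operations) per unit per cycle.
        long perCycle = flopsPerCycle(2, 3, 2);
        System.out.println(perCycle);                   // prints 12
        // ASSUMPTION: 500 MHz is inferred from 6 gigaflops / 12, not stated.
        System.out.println(gigaflops(perCycle, 500.0)); // prints 6.0
    }
}
```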
- the media processing units 110 and 112 each include an instruction cache 210, an instruction aligner 212, an instruction buffer 214, a pipeline control unit 226, a split register file 216, a plurality of execution units, and a load/store unit 218.
- the media processing units 110 and 112 use a plurality of execution units for executing instructions.
- the execution units for a media processing unit 110 include three media functional units (MFU) 220 and one general functional unit (GFU) 222 .
- An individual independent parallel execution path 110 or 112 has operational units including instruction supply blocks and instruction preparation blocks, functional units 220 and 222, and a register file 216 that are separate and independent from the operational units of other paths of the multiple independent parallel execution paths.
- the instruction supply blocks include a separate instruction cache 210 for each of the individual independent parallel execution paths; however, the multiple independent parallel execution paths share a single data cache 106 since multiple threads sometimes share data.
- the data cache 106 is dual-ported, allowing data access in both execution paths 110 and 112 in a single cycle. Sharing of the data cache 106 among independent processor elements 110 and 112 advantageously simplifies data handling, avoiding a need for a cache coordination protocol and the overhead incurred in controlling the protocol.
- the instruction supply blocks in an execution path include the instruction aligner 212 and the instruction buffer 214, which precisely format and align a full instruction group of four instructions to prepare to access the register file 216.
- An individual execution path has a single register file 216 that is physically split into multiple register file segments, each of which is associated with a particular functional unit of the multiple functional units. At any point in time, the register file segments as allocated to each functional unit each contain the same content.
- a multi-ported register file is typically metal-limited, with the area consumed by the circuit proportional to the square of the number of ports.
- the processor 100 has a register file structure divided into a plurality of separate and independent register files to form a layout structure with an improved layout efficiency.
- the read ports of the total register file structure 216 are allocated among the separate and individual register files.
- Each of the separate and individual register files has write ports that correspond to the total number of write ports in the total register file structure. Writes are fully broadcast so that all of the separate and individual register files are coherent.
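- The split register file with broadcast writes can be modeled with a short Java sketch. The class `SplitRegisterFileSketch` and its methods are hypothetical. The port counts used in `main` (12 read ports and 4 write ports in total) are illustrative assumptions chosen only to show the quadratic area rule and are not taken from the text above.

```java
import java.util.Arrays;

// Minimal sketch of a split register file; names and port counts are hypothetical.
final class SplitRegisterFileSketch {
    private final int[][] segments; // segments[segment][register]

    SplitRegisterFileSketch(int segmentCount, int registerCount) {
        segments = new int[segmentCount][registerCount];
    }

    // A write is broadcast to every segment so that all copies stay identical.
    void write(int register, int value) {
        for (int[] segment : segments) {
            segment[register] = value;
        }
    }

    // Each functional unit reads only from its own segment.
    int read(int segment, int register) {
        return segments[segment][register];
    }

    // True when every segment holds the same content as the first one.
    boolean isCoherent() {
        for (int s = 1; s < segments.length; s++) {
            if (!Arrays.equals(segments[0], segments[s])) {
                return false;
            }
        }
        return true;
    }

    // Relative area when the area of one array grows with the square of its ports.
    static long relativeArea(int copies, int portsPerCopy) {
        return (long) copies * portsPerCopy * portsPerCopy;
    }

    public static void main(String[] args) {
        SplitRegisterFileSketch rf = new SplitRegisterFileSketch(4, 8);
        rf.write(3, 42);
        System.out.println(rf.read(0, 3) + " " + rf.read(3, 3)); // prints 42 42
        System.out.println(rf.isCoherent());                     // prints true
        // ASSUMED ports: one array with 12 read + 4 write ports, versus
        // four arrays with 3 read + 4 write ports each.
        System.out.println(relativeArea(1, 16)); // prints 256
        System.out.println(relativeArea(4, 7));  // prints 196
    }
}
```

- With these assumed port counts, four smaller arrays occupy less relative area than one large array even though the register contents are stored four times, which is the layout efficiency the text describes.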
- the media functional units 220 are multiple single-instruction-multiple-datapath (MSIMD) media functional units. Each of the media functional units 220 is capable of processing parallel 16-bit components. Various parallel 16-bit operations supply the single-instruction-multiple-datapath capability for the processor 100 including add, multiply-add, shift, compare, and the like.
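- A parallel 16-bit add of the kind listed above can be sketched in Java. The class `Parallel16Sketch` and the method `parallelAdd16` are hypothetical. The sketch assumes two 16-bit components packed into one 32-bit value and wraparound (non-saturating) arithmetic; the text above does not specify the register width or the overflow behavior.

```java
// Minimal sketch of a parallel 16-bit add; names and layout are hypothetical.
final class Parallel16Sketch {
    // Adds the two 16-bit lanes of a and b independently. A carry out of the
    // low lane is discarded and never propagates into the high lane.
    static int parallelAdd16(int a, int b) {
        int low = (a + b) & 0xFFFF;                    // low 16 bits of the sum
        int high = ((a >>> 16) + (b >>> 16)) & 0xFFFF; // high lanes added alone
        return (high << 16) | low;
    }

    public static void main(String[] args) {
        // High lanes: 0x0001 + 0x0001 = 0x0002.
        // Low lanes:  0xFFFF + 0x0001 wraps to 0x0000.
        int result = parallelAdd16(0x0001FFFF, 0x00010001);
        System.out.println(Integer.toHexString(result)); // prints 20000
    }
}
```

- An ordinary 32-bit add of the same operands would give 0x00030000, because the carry from the low half would leak into the high half; keeping the lanes separate is what makes one instruction act on multiple data components.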
- the media functional units 220 operate in combination as tightly coupled digital signal processors (DSPs). Each media functional unit 220 has a separate and individual sub-instruction stream, but all three media functional units 220 execute synchronously so that the subinstructions progress lock-step through pipeline stages.
- the general functional unit 222 is a RISC processor capable of executing arithmetic logic unit (ALU) operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power operations, reciprocal square root operations, and many others.
- the general functional unit 222 supports less common parallel operations such as the parallel reciprocal square root instruction:
- the illustrative instruction cache 210 is two-way set-associative, has a 16 Kbyte capacity, and includes hardware support to maintain coherence, allowing dynamic optimizations through self-modifying code.
- Software is used to indicate that the instruction storage is being modified when modifications occur.
- the 16 Kbyte capacity is suitable for performing graphic loops, other multimedia tasks or processes, and general-purpose Java™ code.
- Coherency is maintained by hardware that supports write-through, non-allocating caching.
- Self-modifying code is supported through explicit use of “store-to-instruction-space” instruction store2i.
- Software uses the store2i instruction to maintain coherency with the instruction cache 210 so that the instruction caches 210 do not have to be snooped on every single store operation issued by the media processing unit 110 .
- the pipeline control unit 226 is connected between the instruction buffer 214 and the functional units and schedules the transfer of instructions to the functional units.
- the pipeline control unit 226 also receives status signals from the functional units and the load/store unit 218 and uses the status signals to perform several control functions.
- the pipeline control unit 226 maintains a scoreboard and generates stalls and bypass controls.
- the pipeline control unit 226 also generates traps and maintains special registers.
- Each media processing unit 110 and 112 includes a split register file 216 , a single logical register file including 128 thirty-two bit registers.
- the split register file 216 is split into a plurality of register file segments 224 to form a multi-ported structure that is replicated to reduce the integrated circuit die area and to reduce access time.
- a separate register file segment 224 is allocated to each of the media functional units 220 and the general functional unit 222 .
- each register file segment 224 has 128 32-bit registers.
- the first 96 registers (0-95) in the register file segment 224 are global registers. All functional units can write to the 96 global registers.
- the global registers are coherent across all functional units (MFU and GFU) so that any write operation to a global register by any functional unit is broadcast to all register file segments 224 .
- Registers 96-127 in the register file segments 224 are local registers. Local registers allocated to a functional unit are not accessible or “visible” to other functional units.
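The global and local register behavior described above can be modeled in a few lines. The following is a minimal sketch, not the hardware design: the class name `SplitRegisterFile`, the choice of four functional units, and the method names are assumptions made for illustration. Only the 128-register size, the 0-95 global range, and the 96-127 local range come from the description.

```python
class SplitRegisterFile:
    """Toy model: one full-size segment per functional unit (hypothetical API)."""

    NUM_REGISTERS = 128
    NUM_GLOBAL = 96  # registers 0-95 are global, 96-127 are local

    def __init__(self, num_units=4):
        # Each functional unit owns a segment holding all 128 registers.
        self.segments = [[0] * self.NUM_REGISTERS for _ in range(num_units)]

    def write(self, unit, reg, value):
        if not 0 <= reg < self.NUM_REGISTERS:
            raise IndexError("register number out of range")
        value &= 0xFFFFFFFF  # registers are 32 bits wide
        if reg < self.NUM_GLOBAL:
            # Global write: broadcast so every segment stays coherent.
            for segment in self.segments:
                segment[reg] = value
        else:
            # Local write: visible only to the writing unit.
            self.segments[unit][reg] = value

    def read(self, unit, reg):
        # A unit only ever reads its own segment.
        return self.segments[unit][reg]
```

In this model a write to register 5 by any unit is readable by all units, while a write to register 100 by unit 1 leaves the other units' copies of register 100 unchanged.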
- the media processing units 110 and 112 are highly structured computation blocks that execute software-scheduled data computation operations with fixed, deterministic and relatively short instruction latencies, operational characteristics yielding simplification in both function and cycle time.
- the operational characteristics support multiple instruction issue through a pragmatic very large instruction word (VLIW) approach that avoids hardware interlocks to account for software that does not schedule operations properly.
- a VLIW instruction word always includes one instruction that executes in the general functional unit (GFU) 222 and from zero to three instructions that execute in the media functional units (MFU) 220 .
- a MFU instruction field within the VLIW instruction word includes an operation code (opcode) field, three source register (or immediate) fields, and one destination register field.
- Instructions are executed in-order in the processor 100 but loads can finish out-of-order with respect to other instructions and with respect to other loads, allowing loads to be moved up in the instruction stream so that data can be streamed from main memory.
- the execution model eliminates the usage and overhead resources of an instruction window, reservation stations, a re-order buffer, or other blocks for handling instruction ordering. Elimination of the instruction ordering structures and overhead resources is highly advantageous since the eliminated blocks typically consume a large portion of an integrated circuit die. For example, the eliminated blocks consume about 30% of the die area of a Pentium II processor.
- the media processing units 110 and 112 are high-performance but simplified with respect to both compilation and execution.
- the media processing units 110 and 112 are most generally classified as a simple 2-scalar execution engine with full bypassing and hardware interlocks on load operations.
- the instructions include loads, stores, arithmetic and logic (ALU) instructions, and branch instructions so that scheduling for the processor 100 is essentially equivalent to scheduling for a simple 2-scalar execution engine for each of the two media processing units 110 and 112 .
- the processor 100 supports full bypasses between the first two execution units within the media processing unit 110 and 112 and has a scoreboard in the general functional unit 222 for load operations so that the compiler does not need to handle nondeterministic latencies due to cache misses.
- the processor 100 scoreboards long latency operations that are executed in the general functional unit 222 , for example a reciprocal square-root operation, to simplify scheduling across execution units.
- the scoreboard (not shown) operates by tracking a record of an instruction packet or group from the time the instruction enters a functional unit until the instruction is finished and the result becomes available.
- a VLIW instruction packet contains one GFU instruction and from zero to three MFU instructions. The source and destination registers of all instructions in an incoming VLIW instruction packet are checked against the scoreboard.
- any true dependencies or output dependencies stall the entire packet until the result is ready.
- Use of a scoreboarded result as an operand causes instruction issue to stall for a sufficient number of cycles to allow the result to become available. If the referencing instruction that provokes the stall executes on the general functional unit 222 or the first media functional unit 220 , then the stall only endures until the result is available for intra-unit bypass. For the case of a load instruction that hits in the data cache 106 , the stall may last only one cycle. If the referencing instruction is on the second or third media functional units 220 , then the stall endures until the result reaches the writeback stage in the pipeline where the result is bypassed in transmission to the split register file 216 .
- the scoreboard automatically manages load delays that occur during a load hit.
- all loads enter the scoreboard to simplify software scheduling and eliminate NOPs in the instruction stream.
- the scoreboard is used to manage most interlock conditions between the general functional unit 222 and the media functional units 220 . All loads and non-pipelined long-latency operations of the general functional unit 222 are scoreboarded. The long-latency operations include division idiv, fdiv instructions, reciprocal square root frecsqrt, precsqrt instructions, and power ppower instructions. None of the results of the media functional units 220 is scoreboarded. Non-scoreboarded results are available to subsequent operations on the functional unit that produces the results following the latency of the instruction.
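The packet-level check described above can be sketched as follows. This is an illustrative model under stated assumptions: the class `Scoreboard`, its methods, and the representation of a packet as `(sources, destination)` tuples are hypothetical, and the cycle counts passed to `record` are placeholders rather than real latencies.

```python
class Scoreboard:
    """Tracks registers whose results are still outstanding (hypothetical model)."""

    def __init__(self):
        # Maps a destination register to the cycles left until its result is ready.
        self.pending = {}

    def record(self, dest, cycles_until_ready):
        # Called when a load or long-latency operation is issued.
        self.pending[dest] = cycles_until_ready

    def must_stall(self, packet):
        """packet: list of (sources, dest) tuples, one per instruction."""
        for sources, dest in packet:
            # True dependency: an operand is still being produced.
            if any(src in self.pending for src in sources):
                return True
            # Output dependency: the destination is still being written.
            if dest is not None and dest in self.pending:
                return True
        return False

    def tick(self):
        # Advance one cycle and retire entries whose results are now ready.
        self.pending = {r: c - 1 for r, c in self.pending.items() if c > 1}
```

Because the check returns as soon as any instruction in the packet conflicts, the whole packet stalls together, which matches the behavior described for true and output dependencies.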
- the illustrative processor 100 has a rendering rate of over fifty million triangles per second without accounting for operating system overhead. Therefore, data feeding specifications of the processor 100 are far beyond the capabilities of cost-effective memory systems.
- Sufficient data bandwidth is achieved by rendering of compressed geometry using the geometry decompressor 104 , an on-chip real-time geometry decompression engine.
- Data geometry is stored in main memory in a compressed format. At render time, the data geometry is fetched and decompressed in real-time on the integrated circuit of the processor 100 .
- the geometry decompressor 104 advantageously saves memory space and memory transfer bandwidth.
- the compressed geometry uses an optimized generalized mesh structure that explicitly calls out most shared vertices between triangles, allowing the processor 100 to transform and light most vertices only once.
- the triangle throughput of the transform-and-light stage is increased by a factor of four or more over the throughput for isolated triangles.
- multiple vertices are operated upon in parallel so that the utilization rate of resources is high, achieving effective spatial software pipelining.
- operations are overlapped in time by operating on several vertices simultaneously, rather than overlapping several loop iterations in time.
- high trip count loops are software-pipelined so that most media functional units 220 are fully utilized.
- a schematic block diagram illustrates an embodiment of the split register file 216 that is suitable for usage in the processor 100 .
- the split register file 216 supplies all operands of processor instructions that execute in the media functional units 220 and the general functional units 222 and receives results of the instruction execution from the execution units.
- the split register file 216 operates as an interface to the geometry decompressor 104 .
- the split register file 216 is the source and destination of store and load operations, respectively.
- the split register file 216 in each of the media processing units 110 and 112 has 128 registers. Graphics processing places a heavy burden on register usage. Therefore, a large number of registers is supplied by the split register file 216 so that performance is not limited by loads and stores or handling of intermediate results including graphics “fills” and “spills”.
- the illustrative split register file 216 includes twelve read ports and five write ports, supplying total data read and write capacity between the central registers of the split register file 216 and all media functional units 220 and the general functional unit 222 .
- the five write ports include one 64-bit write port that is dedicated to load operations. The remaining four write ports are 32 bits wide and are used to write operations of the general functional unit 222 and the media functional units 220 .
- a large total read and write capacity promotes flexibility and facility in programming both of hand-coded routines and compiler-generated code.
- the illustrative split register file 216 is divided into four register file segments 310 , 312 , 314 , and 316 , each having three read ports and four write ports so that each register file segment has a size and speed proportional to 49 for a total area for the four segments that is proportional to 196. The total area is therefore potentially smaller and faster than a single central register file. Write operations are fully broadcast so that all files are maintained coherent. Logically, the split register file 216 is no different from a single central register file. However, from the perspective of layout efficiency, the split register file 216 is highly advantageous, allowing for reduced size and improved performance.
- the new media data that is operated upon by the processor 100 is typically heavily compressed. Data transfers are communicated in a compressed format from main memory and input/output devices to pins of the processor 100 , subsequently decompressed on the integrated circuit holding the processor 100 , and passed to the split register file 216 .
- the register file 216 is a focal point for attaining the very large bandwidth of the processor 100 .
- the processor 100 transfers data using a plurality of data transfer techniques.
- cacheable data is loaded into the split register file 216 through normal load operations at a low rate of up to eight bytes per cycle.
- streaming data is transferred to the split register file 216 through group load operations, which transfer thirty-two bytes from memory directly into eight consecutive 32-bit registers.
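As a rough illustration of a group load, the sketch below copies thirty-two bytes into eight consecutive 32-bit registers. The function name `group_load`, the list-based register model, and the byte-order handling are assumptions; the description only states the transfer size and the register count.

```python
import struct

def group_load(memory, address, registers, first_reg, little_endian=False):
    """Copy 32 bytes from memory into registers[first_reg:first_reg + 8]."""
    block = bytes(memory[address:address + 32])
    if len(block) != 32:
        raise ValueError("group load needs 32 readable bytes")
    if first_reg < 0 or first_reg + 8 > len(registers):
        raise IndexError("group load needs eight consecutive registers")
    # Assumption: words are big-endian unless the caller says otherwise.
    fmt = "<8I" if little_endian else ">8I"
    registers[first_reg:first_reg + 8] = struct.unpack(fmt, block)
```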
- the processor 100 utilizes the streaming data operation to receive compressed video data for decompression.
- Compressed graphics data is received via a direct memory access (DMA) unit in the geometry decompressor 104 .
- the compressed graphics data is decompressed by the geometry decompressor 104 and loaded at a high bandwidth rate into the split register file 216 via group load operations that are mapped to the geometry decompressor 104 .
- Load operations are non-blocking and scoreboarded so that early scheduling can hide a long latency inherent to loads.
- dedicating fields for globals, trap registers, and the like leverages the split register file 216 .
- a schematic block diagram shows a logical view of the register file 216 and functional units in the processor 100 .
- the physical implementation of the core processor 100 is simplified by replicating a single functional unit to form the three media functional units 220 .
- the media functional units 220 include circuits that execute various arithmetic and logical operations including general-purpose code, graphics code, and video-image-speech (VIS) processing.
- VIS processing includes video processing, image processing, digital signal processing (DSP) loops, speech processing, and voice recognition algorithms, for example.
- a simplified pictorial schematic diagram depicts an example of instruction execution among a plurality of media functional units 220 .
- Results generated by various internal function blocks within a first individual media functional unit are immediately accessible internally to the first media functional unit 510 but are only accessible globally by other media functional units 512 and 514 and by the general functional unit five cycles after the instruction enters the first media functional unit 510 , regardless of the actual latency of the instruction. Therefore, instructions executing within a functional unit can be scheduled by software to execute immediately, taking into consideration the actual latency of the instruction. In contrast, software that schedules instructions executing in different functional units is expected to account for the five cycle latency.
- the shaded areas represent the stage at which the pipeline completes execution of an instruction and generates final result values.
- a result is not available internal to a functional unit until a final shaded stage completes.
- media processing unit instructions have three different latencies—four cycles for instructions such as fmuladd and fadd, two cycles for instructions such as pmuladd, and one cycle for instructions like padd and xor.
- Software that schedules instructions for which a dependency occurs between a particular media functional unit, for example 512 , and other media functional units 510 and 514 , or between the particular media functional unit 512 and the general functional unit 222 , is to account for the five cycle latency between entry of an instruction to the media functional unit 512 and the five cycle pipeline duration.
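The scheduling rule can be summarized in a small helper. This is a sketch under assumptions: the function name, the unit identifiers, and the reading of "five cycles after the instruction enters" as "issue cycle plus five" are illustrative. The per-mnemonic latencies are the ones listed above.

```python
# Latencies in cycles for the example mnemonics named in the description.
LATENCY = {"fmuladd": 4, "fadd": 4, "pmuladd": 2, "padd": 1, "xor": 1}
CROSS_UNIT_LATENCY = 5  # fixed delay before other units can see a result

def earliest_consumer_cycle(issue_cycle, mnemonic, producer_unit, consumer_unit):
    """Earliest cycle at which a dependent instruction can be scheduled."""
    if producer_unit == consumer_unit:
        # Same unit: the result is bypassed after the actual latency.
        return issue_cycle + LATENCY[mnemonic]
    # Different unit: always five cycles, regardless of the actual latency.
    return issue_cycle + CROSS_UNIT_LATENCY
```

For example, a `padd` issued at cycle 0 can feed a dependent instruction on the same unit at cycle 1, but a consumer on another unit has to wait until cycle 5.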
- In FIG. 6 , a schematic block diagram depicts an embodiment of the multiport register file 216 .
- a plurality of read address buses RA 1 through RAN carry read addresses that are applied to decoder ports 616 - 1 through 616 -N, respectively.
- Decoder circuits are well known to those of ordinary skill in the art, and any of several implementations could be used as the decoder ports 616 - 1 through 616 -N.
- the address is decoded and a read address signal is transmitted by a decoder port 616 to a register in a memory cell array 618 .
- Data from the memory cell array 618 is output using output data drivers 622 .
- Data is transferred to and from the memory cell array 618 under control of control signals carried on some of the lines of the buses of the plurality of read address buses RA 1 through RAN.
- In FIGS. 7A and 7B , a schematic block diagram and a pictorial diagram, respectively, illustrate the register file 216 and a memory array insert 710 .
- the register file 216 is connected to four functional units 720 , 722 , 724 , and 726 that supply information for performing operations such as arithmetic, logical, graphics, data handling operations and the like.
- the illustrative register file 216 has twelve read ports 730 and four write ports 732 .
- the twelve read ports 730 are illustratively allocated with three ports connected to each of the four functional units.
- the four write ports 732 are connected to receive data from all of the four functional units.
- the register file 216 includes a decoder, as is shown in FIG. 6 , for each of the sixteen read and write ports.
- the register file 216 includes a memory array 740 that is partially shown in the insert 710 illustrated in FIG. 7B and includes a plurality of word lines 744 and bit lines 746 .
- the word lines 744 and bit lines 746 are simply a set of wires that connect transistors (not shown) within the memory array 740 .
- the word lines 744 select registers so that a particular word line selects a register of the register file 216 .
- the bit lines 746 are a second set of wires that connect the transistors in the memory array 740 .
- the word lines 744 and bit lines 746 are laid out at right angles.
- the word lines 744 and the bit lines 746 are constructed of metal laid out in different planes such as a metal 2 layer for the word lines 744 and a metal 3 layer for the bit lines 746 .
- bit lines and word lines may be constructed of other materials, such as polysilicon, or can reside at different levels than are described in the illustrative embodiment, that are known in the art of semiconductor manufacture.
- a distance of about 1 μm separates the word lines 744 and a distance of approximately 1 μm separates the bit lines 746 .
- Other circuit dimensions may be constructed for various processes.
- Although the illustrative example shows one bit line per port, other embodiments may use multiple bit lines per port.
- When a particular functional unit reads a particular register in the register file 216 , the functional unit sends an address signal via the read ports 730 that activates the appropriate word lines to access the register.
- In a register file having a conventional structure and twelve read ports, each cell, which stores a single bit of information, is connected to twelve word lines to select an address and twelve bit lines to carry data read from the address.
- the four write ports 732 address registers in the register file using four word lines 744 and four bit lines 746 connected to each cell.
- the four word lines 744 address a cell and the four bit lines 746 carry data to the cell.
- If the illustrative register file 216 were laid out in a conventional manner with twelve read ports 730 and four write ports 732 for a total of sixteen ports and the ports were 1 μm apart, one memory cell would have an integrated circuit area of 256 μm² (16×16). The area is proportional to the square of the number of ports.
- the register file 216 is alternatively implemented to perform single-ended reads and/or single-ended writes utilizing a single bit line per port per cell, or implemented to perform differential reads and/or differential writes using two bit lines per port per cell.
- the register file 216 is not laid out in the conventional manner and instead is split into a plurality of separate and individual register file segments 224 .
- In FIG. 8 , a schematic block diagram shows an arrangement of the register file 216 into the four register file segments 224 .
- the register file 216 remains operational as a single logical register file in the sense that the four of the register file segments 224 contain the same number of registers and the same register values as a conventional register file of the same capacity that is not split.
- the separated register file segments 224 differ from a register file that is not split through elimination of lines that would otherwise connect ports to the memory cells.
- Because each register file segment 224 has connections to only three of the twelve read ports 730 , lines connecting a register file segment to the other nine read ports are eliminated. All writes are broadcast so that each of the four register file segments 224 has connections to all four write ports 732 . Thus each of the four register file segments 224 has three read ports and four write ports for a total of seven ports.
- the individual cells are connected to seven word lines and seven bit lines so that a memory array with a spacing of 1 μm between lines has an area of approximately 49 μm².
- the four register file segments 224 each have an area proportional to seven squared. The total area of the four register file segments 224 is therefore proportional to 49 times 4, a total of 196.
- the split register file thus advantageously reduces the area of the memory array by a ratio of approximately 256/196 (1.3× or 30%).
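The area comparison follows from simple arithmetic, reproduced here as a short sketch. The function name `cell_area` and the square-cell model (one word line and one bit line per port, each at the stated pitch) are simplifying assumptions taken from the description's own proportionality argument.

```python
def cell_area(read_ports, write_ports, pitch_um=1.0):
    """Area of one memory cell in square micrometers under the square-cell model."""
    ports = read_ports + write_ports
    # One word line and one bit line per port, each spaced pitch_um apart.
    return (ports * pitch_um) ** 2

# Conventional layout: every cell sees all twelve read and four write ports.
conventional = cell_area(12, 4)

# Split layout: four segments, each cell sees three read and four write ports.
split_total = 4 * cell_area(3, 4)

ratio = conventional / split_total
```

Under this model `conventional` is 256, `split_total` is 196, and `ratio` is about 1.31. Put the other way around, the split layout occupies roughly 77% of the conventional area, a saving of about 23%.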
- the reduction in area further advantageously corresponds to an improvement in speed performance due to a reduction in the length of the word lines 744 and the bit lines 746 connecting the array cells that reduces the time for a signal to pass on the lines.
- the improvement in speed performance is highly advantageous due to strict time budgets that are imposed by the specification of high-performance processors and also to attain a large capacity register file that is operational at high speed.
- the operation of reading the register file 216 typically takes place in a single clock cycle.
- a cycle time of two nanoseconds is imposed for accessing the register file 216 .
- register files typically only have up to about 32 registers in comparison to the 128 registers in the illustrative register file 216 of the processor 100 .
- a register file 216 substantially larger than the register file in conventional processors is highly advantageous in high-performance operations such as video and graphic processing.
- the reduced size of the register file 216 is highly useful for complying with time budgets in a large capacity register file.
- a simplified schematic timing diagram illustrates timing of the processor pipeline 900 .
- the pipeline 900 includes nine stages including three initiating stages, a plurality of execution phases, and two terminating stages.
- the three initiating stages are optimized to include only those operations necessary for decoding instructions so that jump and call instructions, which are pervasive in the Java™ language, execute quickly. Optimization of the initiating stages advantageously facilitates branch prediction since branches, jumps, and calls execute quickly and do not introduce many bubbles.
- the first of the initiating stages is a fetch stage 910 during which the processor 100 fetches instructions from the 16 Kbyte two-way set-associative instruction cache 210 .
- the fetched instructions are aligned in the instruction aligner 212 and forwarded to the instruction buffer 214 in an align stage 912 , a second stage of the initiating stages.
- the aligning operation properly positions the instructions for storage in a particular segment of the four register file segments 310 , 312 , 314 , and 316 and for execution in an associated functional unit of the three media functional units 220 and one general functional unit 222 .
- In a decoding stage 914 of the initiating stages, the fetched and aligned VLIW instruction packet is decoded and the scoreboard (not shown) is read and updated in parallel.
- the four register file segments 310 , 312 , 314 , and 316 each holds either floating-point data or integer data.
- the register files are read in the decoding (D) stage.
- the two terminating stages include a trap-handling stage 960 and a write-back stage 962 during which result data is written-back to the split register file 216 .
- any suitable programming language is also supported.
- Other programming languages that support multiple-threading are generally more advantageously used in the described system.
- any suitable processing engine is also supported.
- Other processing engines that support multiple-threading are generally more advantageously used in the described system.
- Although the illustrative register file has one bit line per port, in other embodiments more bit lines may be allocated for a port.
- the described word lines and bit lines are formed of a metal. In other examples, other conductive materials such as doped polysilicon may be employed for interconnects.
- the described register file uses single-ended reads and writes so that a single bit line is employed per bit and per port. In other processors, differential reads and writes with dual-ended sense amplifiers may be used so that two bit lines are allocated per bit and per port, resulting in a bigger pitch. Dual-ended sense amplifiers improve memory fidelity but greatly increase the size of a memory array, imposing a heavy burden on speed performance.
- the spacing between bit lines and word lines is described to be approximately 1 μm. In some processors, the spacing may be greater than 1 μm. In other processors the spacing between lines is less than 1 μm.
- the Café assembler uses the .proc pseudo-op similar to SPARC's, but defines no operand for it. This pseudo-op should be used to mark the beginning of a function so the assembler can know to require the beginning of an instruction word. This only makes a practical difference if an immediately preceding function ends with an instruction that does not appear to consummate an instruction word, but using the .proc pseudo-op is a good habit in any case.
- register alias symbols may be defined as register use conventions evolve. None of these symbols is case-sensitive.
- a general purpose register can be denoted by a register expression of the form: % % [ Rr ] ? <constant-expression>
- Documentation refers to a program counter, % pc, and its sidekick % npc, but it's not apparent that either is used by the assembler.
- the processor has a program status register, for which this assembler uses the symbol % psr.
- the layout of % psr follows. % psr can be read and modified by the getir and setir instructions using its internal register ordinal, 1.
- Bit 24 of % psr specifies the processor ID. Clear denotes cpu0, and set denotes cpu1.
- Bits 23 and 22 of % psr specify the current Trap Level.
- Bit 21 of % psr determines the endianness of loads and stores. The initial state of this bit is clear, which means big-endian. When set it means little-endian.
- Bit 20 of % psr is the Instruction Address Check Enable flag. Its description is yet to be supplied.
- Bit 19 of % psr is the Data Address Check Enable flag. Its description is yet to be supplied.
- Bit 18 of % psr is the Garbage Check Enable flag. Its description is yet to be supplied.
- Bit 17 of % psr is the Data Cache Enable flag. When set, the data cache is enabled; when clear, data cache is disabled.
- Bit 16 of % psr is the Instruction Cache Enable flag. When set, the instruction cache is enabled; when clear, instruction cache is disabled.
- Bit 15 of % psr is the Supervisor Mode flag. When set it indicates the processor is in supervisor mode, which allows certain privileged activities. Among these privileged activities is the ability to change all but the two right-most fields of % psr.
- the Supervisor Mode flag is most often set during trap handling, which is explained in the Traps section of the microarchitecture manual.
- Bit 14 of % psr is the Interrupt Enable flag. When set, interrupts are enabled. Its use is explained in the Traps section of the microarchitecture manual.
- Bits 13 through 10 of % psr hold the Processor Interrupt Level, which is explained in the Traps section of the microarchitecture manual.
- the % psr fields that can be set when the Supervisor Mode flag is clear are grouped together at the low-order end. They follow.
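The supervisor-controlled fields listed above (bits 24 down to 10) can be extracted with shifts and masks. The sketch below is illustrative only: the field names in `PSR_FIELDS` and the function `decode_psr` are hypothetical labels, while the bit positions and widths are the ones given in the description.

```python
# name -> (lowest bit position, width in bits), per the layout described above
PSR_FIELDS = {
    "processor_id": (24, 1),
    "trap_level": (22, 2),
    "little_endian": (21, 1),
    "instruction_address_check_enable": (20, 1),
    "data_address_check_enable": (19, 1),
    "garbage_check_enable": (18, 1),
    "data_cache_enable": (17, 1),
    "instruction_cache_enable": (16, 1),
    "supervisor_mode": (15, 1),
    "interrupt_enable": (14, 1),
    "processor_interrupt_level": (10, 4),
}

def decode_psr(psr):
    """Return a dict of field values extracted from a raw % psr value."""
    return {
        name: (psr >> low) & ((1 << width) - 1)
        for name, (low, width) in PSR_FIELDS.items()
    }
```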
- a 2-bit field of % psr specifies the mode in effect for the saturated arithmetic performed by some of the parallel integer operations.
- the bounds for saturation are given in the adjacent table.
- Modes 00 and 01 are expressed as two's-complement 16-bit integers.
- Mode 10 is expressed in S.15 fixed-point.
- Mode 11 is S2.13 fixed-point.
- the simulator is using bits 8 and 9 of % psr for this specification. The saturation bounds for each mode are:

  | mode | low bound | high bound |
  |---|---|---|
  | 00 | 000000000000 . . . 0 | 011111111 . . . 1 |
  | 01 | 100000000000000 . . . 0 | 011111111 . . . 1 |
  | 10 | 100000000000 . . . 0 | 011111111 . . . 1 |
  | 11 | 111000000 . . . 0 | 001000000 . . . 0 |
- the sign-bit S is one bit of the integer part of the number.
- the integer part is a two's complement 3-bit number. There is no “dedicated” sign-bit as with the floating-point representation, and thus no negative zero to worry about.
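To make the four saturation modes concrete, the sketch below clamps a raw integer to the bounds of each mode. It rests on an assumption: the bit patterns in the bounds table are elided, and they are read here as 16-bit two's-complement values, consistent with the 16-bit integer and fixed-point formats named above. The names `SATURATION_BOUNDS` and `saturate` are hypothetical.

```python
# mode -> (low, high) as signed 16-bit raw values (assumed widths)
SATURATION_BOUNDS = {
    0b00: (0x0000, 0x7FFF),    # 0 .. 32767
    0b01: (-0x8000, 0x7FFF),   # -32768 .. 32767
    0b10: (-0x8000, 0x7FFF),   # same raw range, interpreted as S.15
    0b11: (-0x2000, 0x2000),   # S2.13 raw values for -1.0 .. +1.0
}

def saturate(value, mode):
    """Clamp a raw integer result to the bounds of the given saturation mode."""
    low, high = SATURATION_BOUNDS[mode]
    return max(low, min(high, value))
```

Under this reading, the mode 11 bounds correspond to -1.0 and +1.0, because an S2.13 raw value is scaled by 2^13 and 8192 / 2^13 equals 1.0.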
- the low-order eight bits of % psr are used as dirty bits for octants of the general purpose register file.
- a new process begins with all the dirty bits clear, and an octant's dirty bit is set when a register in that octant is written.
- Café's large register file is a daunting lot of state to manage during a context-switch.
- An isolated region of a long-running program that causes dirty bits to be set should clear them when it's safe to do so.
- a given register number N corresponds to the bit (1 << (N >> 5)).
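The register-to-dirty-bit mapping can be written out directly. This sketch reads the expression above as a left shift of 1 by N >> 5, which is an assumption about the intended operator; the function names are hypothetical.

```python
def dirty_bit_mask(n):
    """Mask of the dirty bit covering general purpose register n."""
    if n < 0:
        raise ValueError("register number must be non-negative")
    # Each dirty bit covers a block of 32 consecutive register numbers.
    return 1 << (n >> 5)

def mark_written(psr, n):
    """Return psr with the dirty bit for register n set."""
    return psr | dirty_bit_mask(n)
```

With this formula, register numbers 0 through 127 map onto bits 0 through 3 only; the remaining dirty bits would be reached by higher register numbers.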
- a vector of trap handler addresses is pointed-to by a trap base register, for which the assembler uses the symbol % tbr. Only the high-order 19 bits of % tbr are used to address the vector, so the vector must be positioned at an 8192-byte boundary. Details for use of the vector are described in the “Traps” chapter of the Café Architecture Manual.
- the assembler's only concern is the ability to read or set % tbr using the getir and setir instructions. Reads of the low-order 13 bits of % tbr always return zero, and writes to the low-order 13 bits of % tbr are always ignored.
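The read and write behavior of % tbr amounts to masking off the low-order 13 bits. The following sketch assumes a 32-bit register (19 used bits plus 13 ignored bits); the class `TrapBaseRegister` is a hypothetical model, not the assembler's or the hardware's interface.

```python
TBR_IGNORED_BITS = 13
# Keep only the high-order 19 bits of a 32-bit value.
TBR_MASK = 0xFFFFFFFF & ~((1 << TBR_IGNORED_BITS) - 1)

class TrapBaseRegister:
    """Models % tbr: low 13 bits ignored on write and zero on read."""

    def __init__(self):
        self._value = 0

    def write(self, value):
        # Writes to the low-order 13 bits are dropped.
        self._value = value & TBR_MASK

    def read(self):
        # The stored value always has its low-order 13 bits clear.
        return self._value
```

Any value read back is therefore a multiple of 8192, which is why the trap vector has to sit on an 8192-byte boundary.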
- An SFU instruction begins with a 2-bit header field that is a count of the UFU instructions that follow in the instruction word. All of the instructions in an instruction word are issued in the same cycle.
- UFU instructions need not be present.
- the UFU on which an instruction executes is determined by the position of the instruction in the instruction word.
- the assembler infers the beginning of an instruction word from the presence of an SFU instruction. UFU instructions that follow form the rest of the instruction word. More than three consecutive UFU instructions are reported as a fatal error, since the assembler cannot create a well-formed Café instruction word from that.
- mnemonics denote instructions implemented both as SFU and UFU operations. These mnemonics indicate an SFU instruction only when used at the beginning of an instruction word. An instruction word boundary is established when the immediately preceding instruction word uses all three of its UFU slots or by the presence of two adjacent semicolons (;;), the instruction word delimiter.
- the double semicolon is a full colon, meaning it's time to flush the instruction word.
- An SFU instruction begins with a 2-bit header (labeled hdr in the instruction format diagrams appearing later in this section) that gives the number of UFU instructions that follow the SFU instruction in the instruction word. That is, the header values and instruction word contents they indicate are:

  | header value | instructions in instruction word |
  |---|---|
  | 00 | SFU only |
  | 01 | SFU + UFU1 |
  | 10 | SFU + UFU1 + UFU2 |
  | 11 | SFU + UFU1 + UFU2 + UFU3 |
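The header encoding is a plain count, which the sketch below makes explicit in both directions. The function names and the slot labels returned are illustrative assumptions; only the meaning of the four header values comes from the description.

```python
def instruction_word_slots(header):
    """List the slots present in an instruction word for a 2-bit header value."""
    if not 0 <= header <= 0b11:
        raise ValueError("header is a 2-bit field")
    # The header counts the UFU instructions that follow the SFU instruction.
    return ["SFU"] + [f"UFU{i}" for i in range(1, header + 1)]

def header_for(ufu_instructions):
    """Compute the header value for a word with the given UFU instructions."""
    count = len(ufu_instructions)
    if count > 3:
        # More than three UFU instructions cannot form a well-formed word.
        raise ValueError("an instruction word holds at most three UFU instructions")
    return count
```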
- the first two bits of an SFU opcode determine the class of the operation.
- the values and classes are:
  00   Call and branch
  01   Compute
  10   Memory (uncacheable)
  11   Memory (cacheable)
- the third bit of an SFU opcode is set when an operation uses an immediate for its second source operand and clear when it does not.
- SFU opcodes beginning with 00 (call and branch) are 6 bits, and all others are 8 bits.
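- as a rough illustration of the opcode structure just described, the sketch below classifies an SFU opcode given as a string of bits. The function name and the returned dictionary layout are hypothetical; the i-bit is reported as None for the call and branch quadrant, which the text elsewhere says does not use it.

```python
SFU_CLASSES = {
    "00": "call and branch",
    "01": "compute",
    "10": "memory (uncacheable)",
    "11": "memory (cacheable)",
}


def describe_sfu_opcode(bits):
    """bits: string of '0'/'1' characters starting at the first opcode bit."""
    cls = bits[:2]
    length = 6 if cls == "00" else 8            # call and branch opcodes are 6 bits
    if len(bits) < length:
        raise ValueError("need at least %d opcode bits" % length)
    info = {"class": SFU_CLASSES[cls], "length": length, "opcode": bits[:length]}
    # The third bit is the i-bit, except in the call and branch quadrant.
    info["immediate"] = None if cls == "00" else bits[2] == "1"
    return info
```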
- Opcodes for the memory operations can be shown in a matrix where the bits usually indicate cacheability, signedness, size, and direction:
  Memory (cacheable, leading 11) Opcodes
               opcode[2:0]
               0xx (unsigned)                      1xx (signed)
  opcode[7:3]  byte     short    word     long     byte     short    word     long
  11ixx        000      001      010      011      100      101      110      111
  11i00        ldub     ldus     lduw     ldpair   ldb      lds      ldw      (ldg)
  11i01        -        lduso    lduwo    ld_diag  -        ldso     ldwo     prefetch
  11i10        stb      sts      stw      stpair   cstb     csts     cstw     -
  11i11        s2ib     stso     stwo     st_diag  -        -        cas      -
- Opcodes in the compute (leading 01) quadrant of the SFU opcode space generally are not assigned in ways where the bit patterns reveal much other than where the i-bit is used.
- Mnemonics in both the upper and lower halves of this table are those for opcodes that are the same except for a clear or set i-bit. Note that there are only three free spaces in the upper half and sixteen free in the lower half.
- Opcodes in the call and branch (leading 00) quadrant of the SFU opcode space have some irregularities compared to other SFU opcodes. Since call and nop opcodes must be unique in their higher-order six bits, they have effective footprints of four opcodes each. Similarly, bz and bnz, with their prediction qualifiers, each use up four opcode slots. This quadrant does not use the i-bit as the other three do.
- the immediate field of the UFU two-source instruction is 14 bits wide, to be consistent with very similar operations using the SFU compute format.
- Logical Instructions
  mnemonic  argument list         opcode      L  operation
  add       rs1,reg_or_imm14,rd   S-01i00110  1  Add
                                  U-010110
  cccb      rs1,reg_or_imm14,rd   U-111110    1  Count consecutive clear bits
  not       rs1,rd                S-01i00100  1  Not
                                  U-010100
  or        rs1,reg_or_imm14,rd   S-01i00101  1  Or
                                  U-010101
  pshll     rs1,reg_or_imm14,rd   U-011100    1  Parallel shift left logical
  pshra     rs1,reg_or_imm14,rd   U-011110    1  Parallel shift right arithmetic
  pshrl     rs1,reg_or_imm14,rd   U-011101    1  Parallel shift right logical
  shll      rs1,reg_or_imm14,rd   S-01i10000  1  Shift left logical
                                  U-011000
  shra      rs1,reg
- the fixed-point operands of these instructions are in S2.13 format; that is, these instructions are unaffected by the fixed-point mode bits of the psr. Precision may be lost when a result is rendered in that format.
- Overflows saturate.
- mnemonic  argument list  opcode      L  operation
  ppower    rs1,rs2,rd     S-01i01010  6  Parallel exponentiation
  precsqrt  rs1,rd         S-01011101  6  Parallel reciprocal square root
- Instructions are stored big-endian.
- the assembler and linker assume that relocations and other initializations in sections other than code sections should also be treated as big-endian.
- [address] may be [rs1+rs2], [rs1+simm14], or [rs1].
- for [rs1], the assembler infers an immediate zero for the second address component of the instruction.
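- the three address forms can be illustrated with a small parser. This is a sketch under stated assumptions: the function name, the regular expression, and the register spelling (for example r3) are hypothetical, and anything after the plus sign that does not parse as an integer is treated as a register name.

```python
import re

# Matches "[rs1]", "[rs1+rs2]" or "[rs1+simm14]" with optional spaces.
_ADDRESS = re.compile(r"^\[\s*(\w+)\s*(?:\+\s*(-?\w+)\s*)?\]$")


def parse_address(text):
    """Return (rs1, second, uses_immediate) for an [address] operand."""
    match = _ADDRESS.match(text.strip())
    if match is None:
        raise ValueError("not an address operand: " + text)
    rs1, second = match.group(1), match.group(2)
    if second is None:
        return rs1, 0, True              # [rs1]: an immediate zero is inferred
    try:
        imm = int(second, 0)
    except ValueError:
        return rs1, second, False        # [rs1+rs2]: second component is a register
    if not -(1 << 13) <= imm < (1 << 13):
        raise ValueError("immediate does not fit in a signed 14-bit field")
    return rs1, imm, True                # [rs1+simm14]
```

- the third element of the result corresponds to the i-bit that the memory instructions set when the second address component is an immediate.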
- mnemonic  argument list  opcode      L  operation
  cas       rs1,[rs2],rd   S-11011110  ?  Compare and swap (atomic)
  cstb      rd,rs1,[rs2]   S-11010100  ?
- Pixel Instructions
  mnemonic     argument list   opcode    L  operation
  bitext       rs1,rs2,rs3,rd  U-111111  2  Bit extract
  byteshuffle  rs1,rs2,rs3,rd  U-100001  2  Byte extract
  pack         rs1,rs2,rs3,rd  U-100000  1  Pack
  pdist        rs1,rs2,rd      U-100101  4  Pixel distance
  pmean        rs1,rs2,rd      U-100100  1  Parallel mean
  more as I learn more . . .
Scheduling
- Non-scoreboarded results are available to subsequent operations on the unit that produces them after their latencies; earlier use is erroneous. Latencies are shown in the columns labeled “L” in the tables in the preceding section.
- a result produced on one unit is available as an operand on another unit when it is being written to the register file (that is, when it reaches pipe stage W1).
- Earlier use is erroneous. This takes 5 cycles on a UFU. On the SFU this takes one cycle more than the latency of the producing operation. The difference arises because UFU results have to pass through all 4 E-stages of a pipe before reaching W1, and SFU results do not.
- referencing a scoreboarded result register as an operand causes instruction issue to stall for as many cycles as it takes for that result to become available. If the referencing instruction that provokes the stall is also on the SFU, the stall is only until the result is available for intra-unit bypass. In the case of a load that hits in the cache, the stall could be as short as a single cycle. If the referencing instruction is on a UFU, the stall lasts until the result reaches stage W1, where it can be bypassed on its way to the register file.
- a completed (that is, finished but not yet in W1) instruction's result can be bypassed from the first UFU to the SFU if its destination register is r4.
- a completed instruction's result can be bypassed from the SFU to the first UFU if its destination register is r5.
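- the availability rules for non-scoreboarded results can be summarized in a short sketch. The function name and the unit labels ("SFU", "UFU1", "UFU2", "UFU3") are hypothetical, and the sketch deliberately leaves out scoreboarded results and the r4/r5 bypass paths described above.

```python
UFU_CROSS_UNIT_CYCLES = 5    # UFU results pass through all 4 E-stages before W1


def cycles_until_visible(producer_unit, latency, consumer_unit):
    """Cycles before a non-scoreboarded result may be used as an operand."""
    if producer_unit == consumer_unit:
        return latency                   # same unit: available after the listed latency
    if producer_unit == "SFU":
        return latency + 1               # SFU to another unit: one cycle more than the latency
    return UFU_CROSS_UNIT_CYCLES         # UFU to another unit: 5 cycles
```

- a scheduler built on such a function would reject, or pad with other work, any use that comes earlier than the returned cycle count.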
- add is an integer instruction that computes “r[rs1]+r[rs2]” or “r[rs1]+sign_ext(imm14)”.
- the use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The resulting sum is left in r[rd].
- the suggested assembler syntax is:
- the SFU version of this instruction uses the SFU compute format, and the UFU version uses the UFU 2-source format.
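- the add semantics can be sketched as below. This is an illustration only: the function names and the list-based register file are hypothetical, and wrapping the sum modulo 2**32 is an assumption based on the 32-bit registers used elsewhere in this description.

```python
MASK32 = 0xFFFFFFFF


def sign_ext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def add(regs, rs1, src2, rd, immediate=False):
    """src2 is a register number, or a raw 14-bit field when immediate is True."""
    operand = sign_ext(src2, 14) if immediate else regs[src2]
    regs[rd] = (regs[rs1] + operand) & MASK32    # assumed: the sum wraps to 32 bits
```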
- addc and subc are integer instructions that compute “r[rs1]+r[rs2]+r[rs3]” or “r[rs1]-r[rs2]-r[rs3]”, respectively. Only the least significant bit of r[rs3], which is expected to be the result of a gencarry or genborrow, is used. The result is left in r[rd].
- the suggested assembler syntax is:
- addc and subc are UFU instructions that use the UFU 3-source format.
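- one use of addc together with gencarry is multi-word arithmetic. The sketch below builds a 64-bit addition from 32-bit pieces; the Python function names mirror the mnemonics but are only models, and add64 is a hypothetical helper.

```python
MASK32 = 0xFFFFFFFF


def gencarry(a, b):
    """1 if the 32-bit addition a + b carries out of bit 31, else 0."""
    return ((a & MASK32) + (b & MASK32)) >> 32


def addc(a, b, c):
    """a + b + (least significant bit of c), truncated to 32 bits."""
    return (a + b + (c & 1)) & MASK32


def add64(a_hi, a_lo, b_hi, b_lo):
    """64-bit addition expressed with 32-bit operations."""
    carry = gencarry(a_lo, b_lo)         # carry out of the low words
    lo = (a_lo + b_lo) & MASK32          # plain add for the low words
    hi = addc(a_hi, b_hi, carry)         # carry folded into the high words
    return hi, lo
```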
- the suggested assembler syntax is:
- bitext is a pixel instruction that extracts bits from the pair of registers r[rs1] and r[rs2].
- the extracted field is described by a 6-bit length in bits 21:16 of r[rs3] and a 5-bit skip count in bits 4:0 of r[rs3].
- the skip count is applied at the left-most (high-order) end of r[rs1].
- the extracted field is right-justified in r[rd] without sign-extension.
- the suggested assembler syntax is:
- bitext is a UFU operation that uses the UFU 3-source format.
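- the field extraction can be modeled as follows. This is a sketch under assumptions: r[rs1] is taken as the high-order half of the 64-bit pair, and lengths above 32 are rejected because the text does not say how a field wider than r[rd] is handled. The function name simply mirrors the mnemonic.

```python
MASK32 = 0xFFFFFFFF


def bitext(rs1, rs2, rs3):
    length = (rs3 >> 16) & 0x3F          # 6-bit length, bits 21:16 of r[rs3]
    skip = rs3 & 0x1F                    # 5-bit skip count, bits 4:0 of r[rs3]
    if length > 32:
        raise ValueError("field wider than the destination register (not modeled)")
    # Assumed: r[rs1] supplies the high-order 32 bits of the pair.
    pair = ((rs1 & MASK32) << 32) | (rs2 & MASK32)
    shift = 64 - skip - length           # bits remaining to the right of the field
    return (pair >> shift) & ((1 << length) - 1)   # right-justified, no sign extension
```

- with a skip of 28 and a length of 8 the field straddles the two registers, taking the low four bits of r[rs1] and the high four bits of r[rs2].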
- the second source operand can be either r[rs2] or sign_ext(imm8)
- the third source operand can be either r[rs3] or sign_ext(imm8).
- the suggested assembler syntax is:
- bnz is a control flow instruction for a branch to the offset implied by the difference between the program counter and the specified label if the value in “r[rd]” is not equal to the integer zero.
- label is a label at the branch target; the assembler will either determine the displacement or generate relocation information so a linker can determine it. Whenever the displacement is determined, it must be expressible in a signed 22-bit field.
- the mnemonic may be followed optionally by the qualifier ,pt, which means this conditional branch is statically predicted to be taken.
- the use of this qualifier sets the T-bit in the instruction.
- the suggested assembler syntax is:
- byteshuffle is a pixel instruction that copies the bytes from its sources r[rs1] and r[rs2] to byte positions of r[rd] according to the pattern described by the bits of the least significant two bytes of r[rs3].
- Each group of four contiguous bits in the low-order two bytes of r[rs3] is the ordinal of the byte position, among the eight bytes of the register pair formed by r[rs1] and r[rs2], from which a byte is copied to the corresponding byte of r[rd].
- An out-of-range byte ordinal (that is, a value greater than 7) means the corresponding byte of r[rd] will be zeroed.
- the suggested assembler syntax is:
- byteshuffle is a UFU operation that uses the UFU 3-source format.
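- a possible reading of byteshuffle is sketched below. The byte numbering is an assumption: ordinal 0 is taken to be the most significant byte of r[rs1], and the most significant 4-bit selector is taken to control the most significant byte of r[rd]. The text does not state either ordering, so the sketch may need to be mirrored.

```python
MASK32 = 0xFFFFFFFF


def byteshuffle(rs1, rs2, rs3):
    # Eight source bytes; ordinal 0 = most significant byte of r[rs1] (assumed).
    source = (rs1 & MASK32).to_bytes(4, "big") + (rs2 & MASK32).to_bytes(4, "big")
    result = 0
    for position in range(4):            # position 0 = most significant byte of r[rd] (assumed)
        selector = (rs3 >> (12 - 4 * position)) & 0xF
        # A selector greater than 7 is out of range and zeroes the byte.
        byte = source[selector] if selector <= 7 else 0
        result = (result << 8) | byte
    return result
```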
- bz is a control flow instruction for a branch to the offset implied by the difference between the program counter and the specified label if the value in “r[rd]” is equal to the integer zero.
- label is a label at the branch target; the assembler will either determine the displacement or generate relocation information so a linker can determine it. Whenever the displacement is determined, it must be expressible in a signed 22-bit field.
- the mnemonic may be followed optionally by the qualifier ,pt, which means this conditional branch is statically predicted to be taken.
- the use of this qualifier sets the T-bit in the instruction.
- Unconditional branches are commonly coded using bz with r0 for the register operand. When the assembler sees this “unconditional conditional” branch without prediction, it will infer the “,pt” qualification, which can improve instruction prefetching.
- the suggested assembler syntax is:
- bz is an SFU operation that uses the branch instruction format.
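- the displacement check and the inferred prediction for unconditional branches can be sketched as follows. The function names, the dictionary layout, and the field name disp22 are hypothetical; the unit of the displacement (bytes or instruction words) is not stated here and is left to the caller.

```python
def fits_signed(value, bits):
    """True if value is representable in a signed field of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def encode_branch(mnemonic, register, displacement, predicted_taken=False):
    if mnemonic not in ("bz", "bnz"):
        raise ValueError("not a conditional branch: " + mnemonic)
    if not fits_signed(displacement, 22):
        raise ValueError("displacement does not fit in a signed 22-bit field")
    # "bz r0,label" is the unconditional idiom; ",pt" is inferred for it.
    if mnemonic == "bz" and register == "r0":
        predicted_taken = True
    return {
        "mnemonic": mnemonic,
        "rd": register,
        "disp22": displacement & 0x3FFFFF,   # two's complement in 22 bits
        "T": int(predicted_taken),           # the T-bit set by the ,pt qualifier
    }
```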
- call is a control flow instruction that causes a control transfer to the address specified by its label operand.
- a call to address zero is an illegal instruction.
- the return address (%npc at the time of the call) is left in r2, an implicit operand of this instruction. For that reason the assembler has the alias lp (“link pointer”) for r2.
- the suggested assembler syntax is:
- cas is a memory access instruction that compares the content of register r[rs1] with the content of the 32-bit word in memory addressed by r[rs2]. If those values are equal, the content of register r[rd] is swapped with the word addressed by r[rs2]. Otherwise, the content of the addressed memory word is unchanged, but the value at that memory address replaces the content of register r[rd].
- cas uses dcache, which makes it unsuitable for thread synchronization in a multi-Café configuration.
- the suggested assembler syntax is:
- cas is an SFU operation that uses the 2-source register variant of the SFU memory format. Since it always uses two source registers, the i-bit of its opcode is always clear.
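- the compare-and-swap behavior can be written out as a short model. This is a sketch only: memory is a hypothetical dictionary keyed by address, the register file is a list, and the atomicity of the real instruction is not modeled.

```python
def cas(regs, memory, rs1, rs2, rd):
    """Model of cas rs1,[rs2],rd on a list of registers and a dict of words."""
    address = regs[rs2]
    old = memory[address]
    if regs[rs1] == old:
        memory[address] = regs[rd]       # equal: r[rd] is stored, completing the swap
    regs[rd] = old                       # either way r[rd] receives the old memory word
```

- note that r[rd] ends up holding the previous memory word in both cases, so software can compare r[rd] with r[rs1] afterwards to learn whether the swap took place.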
- cccb is a logical instruction that counts consecutive clear bits in its first source operand, r[rs1], beginning from the high-order bit, first skipping the number of bits specified by the second source operand.
- the second source operand may be either a register or an immediate. In either case, only the low-order 5 bits of the skip-count are used. The count of clear bits is left in r[rd].
- the suggested assembler syntax is:
- cccb is a UFU operation that uses the UFU 2-source format.
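- the counting rule can be sketched as below. The function name mirrors the mnemonic; the second argument is the already-resolved value of the second source operand (register content or immediate), which is a simplification made for this sketch.

```python
def cccb(rs1, src2):
    skip = src2 & 0x1F                       # only the low-order 5 bits of the skip count are used
    count = 0
    for bit in range(31 - skip, -1, -1):     # start just below the skipped high-order bits
        if (rs1 >> bit) & 1:
            break                            # first set bit ends the run
        count += 1
    return count
```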
- the suggested assembler syntax is:
- cmovenz is an integer instruction that copies the value of the second source operand, specified by “r[rs2] ” or “sign_ext(imm)”, to the result register r[rd] only if the first source operand, r[rs1], is non-zero. Note that cmovenz allows a 14-bit immediate on the SFU but only an 8-bit immediate on a UFU.
- the suggested assembler syntax is: cmovenz rs1,reg_or_imm14,rd ! SFU cmovenz rs1,reg_or_imm8,rd ! UFU
- the SFU version of the cmovenz uses the SFU compute format.
- the UFU version of cmovenz is a pseudo-op for the cpicknz instruction with r[rd] replicated in the r[rs3] field.
- cmovez is an integer instruction that copies the value of the second source operand, specified by “r[rs2]” or “sign_ext(imm)”, to the result register r[rd] only if the first source operand, r[rs1], is zero. Note that cmovez allows a 14-bit immediate on the SFU but only an 8-bit immediate on a UFU.
- the suggested assembler syntax is: cmovez rs1,reg_or_imm14,rd ! SFU cmovez rs1,reg_or_imm8,rd ! UFU
- the SFU version of cmovez uses the SFU compute format.
- the UFU version of cmovez is a pseudo-op for the cpickz instruction with r[rd] replicated in the r[rs3] field.
- the use of an immediate for the second source operand sets the i-bit of the opcode. The resulting zero or one is left in r[rd].
- the suggested assembler syntax is: cmpeq rs1,reg_or_imm14,rd
- the SFU version of this instruction uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- cmplt is an integer instruction that computes “r[rs1] < r[rs2]” or “r[rs1] < sign_ext(imm14)”.
- the use of an immediate for the second source operand sets the i-bit of the opcode. The resulting zero or one is left in r[rd].
- the suggested assembler syntax is: cmplt rs1,reg_or_imm14,rd
- the SFU version of this instruction uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- cmpult is an integer instruction that computes “(unsigned) r[rs1] < (unsigned) r[rs2]” or “(unsigned) r[rs1] < (unsigned) imm14”.
- the use of an immediate for the second source operand sets the i-bit of the opcode. The resulting zero or one is left in r[rd].
- the suggested assembler syntax is: cmpult rs1,reg_or_imm14,rd
- the SFU version of this instruction uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- cpicknz is an integer instruction that assigns its second source operand to r[rd] if the first source operand, r[rs1], is non-zero; otherwise, the third source operand is assigned to r[rd].
- Each of the second and third source operands can be either a register or a sign-extended 8-bit immediate.
- cpicknz is a UFU operation that uses the UFU 3-source format.
- cpickz is an integer instruction that assigns its second source operand to r[rd] if the first source operand, r[rs1], is zero; otherwise, the third source operand is assigned to r[rd].
- Each of the second and third source operands can be either a register or a sign-extended 8-bit immediate.
- the suggested assembler syntax is: cpickz rs1,reg_or_imm8,reg_or_imm8,rd
- cpickz is a UFU operation that uses the UFU 3-source format.
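- the relationship between the conditional pick and conditional move instructions can be sketched as follows. The functions take already-resolved operand values (register contents or sign-extended immediates), and cmovenz_ufu and cmovez_ufu are hypothetical helper names that model the UFU pseudo-ops.

```python
def cpicknz(rs1, src2, src3):
    """Second operand if rs1 is non-zero, otherwise the third operand."""
    return src2 if rs1 != 0 else src3


def cpickz(rs1, src2, src3):
    """Second operand if rs1 is zero, otherwise the third operand."""
    return src2 if rs1 == 0 else src3


def cmovenz_ufu(regs, rs1, src2, rd):
    # UFU cmovenz: cpicknz with r[rd] replicated in the rs3 field,
    # so r[rd] keeps its old value when the condition fails.
    regs[rd] = cpicknz(regs[rs1], src2, regs[rd])


def cmovez_ufu(regs, rs1, src2, rd):
    # UFU cmovez: cpickz with r[rd] replicated in the rs3 field.
    regs[rd] = cpickz(regs[rs1], src2, regs[rd])
```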
- cstb, csts, and cstw are memory access instructions that, if the value in the register r[rs1] is non-zero, store the value in the register r[rd] at the address in register r[rs2].
- the suggested assembler syntax is: cstb rd,rs1,[rs2] csts rd,rs1,[rs2] cstw rd,rs1,[rs2]
- cst[b, s, w] are SFU operations that use the 2-source register variant of the SFU memory format. Since they always use two source registers, the i-bit of their opcodes is always clear.
- control flow instruction causes a control transfer from a trap handler to the next instruction word after the instruction that caused the trap. Please refer to the Traps chapter of the Café Microarchitecture specification for the complete description.
- the suggested assembler syntax is:
- SFU operation that uses the SFU compute format, but it has no use for any operand.
- fadd is a floating point instruction that computes “r[rs1]+r[rs2]”, where the values of the source operands are IEEE single-precision floating point numbers. The result is delivered in r[rd].
- the suggested assembler syntax is: fadd rs1,rs2,rd
- fadd is a UFU operation that uses the UFU 2-source format.
- fcmpeq is a floating point instruction that compares the single-precision floating point operands in its source registers r[rs1] and r[rs2] for equality.
- the destination register r[rd] is set to the integer value 1 if the source operands are equal and zero otherwise. If either source operand is a NaN, they are not equal.
- the suggested assembler syntax is: fcmpeq rs1,rs2,rd
- fcmpeq is an SFU operation that uses the SFU compute format.
- fcmple is a floating point instruction that sets its destination register r[rd] to the integer value 1 if the single-precision floating point value in r[rs1] is less than or equal to the single-precision floating point value in r[rs2] and to zero otherwise. If the value of either source operand is a NaN, the result is zero.
- the suggested assembler syntax is: fcmple rs1,rs2,rd
- fcmple is an SFU operation that uses the SFU compute format.
- fcmplt is a floating point instruction that sets its destination register r[rd] to the integer value 1 if the single-precision floating point value in r[rs1] is less than the single-precision floating point value in r[rs2] and to zero otherwise. If the value of either source operand is a NaN, the result is zero.
- the suggested assembler syntax is: fcmplt rs1,rs2,rd
- fcmplt is an SFU operation that uses the SFU compute format.
- fdiv is a floating point instruction that computes “r[rs1]
- the suggested assembler syntax is: fdiv rs1,rs2,rd
- fdiv is an SFU operation that uses the SFU compute format.
- fix2flt is a convert instruction that converts a fixed point value in r[rs1], with its binary point specified by the low-order 5 bits of r[rs2] or imm14, to a single precision floating point result in r[rd].
- the suggested assembler syntax is: fix2flt rs1,reg_or_imm14,rd
- fix2flt is a UFU operation that uses the UFU 2-source format.
- flt2fix is a convert instruction that converts a single precision floating point value in r[rs1] to a fixed point result in r[rd] with the binary point as specified by the low-order 5 bits of r[rs2] or imm14.
- the suggested assembler syntax is: flt2fix rs1,reg_or_imm14,rd
- flt2fix is a UFU operation that uses the UFU 2-source format.
- fmul is a floating point instruction that computes “r[rs1]*r[rs2]”, where the values of the source operands are IEEE single-precision floating point numbers. The result is delivered in r[rd].
- the suggested assembler syntax is: fmul rs1,rs2,rd
- fmul is a UFU operation that uses the UFU 2-source format.
- fmuladd is a floating point instruction that computes “(r[rs1]*r[rs2])+r[rs3] ”, where the values of the source operands are IEEE single-precision floating point numbers. The result is delivered in r[rd].
- the suggested assembler syntax is: fmuladd rs1,rs2,rs3,rd
- fmuladd is a UFU operation that uses the UFU 3-source format.
- fmulsub is a floating point instruction that computes “(r[rs1]*r[rs2])-r[rs3]”, where the values of the source operands are IEEE single-precision floating point numbers. The result is delivered in r[rd].
- the suggested assembler syntax is: fmulsub rs1,rs2,rs3,rd
- fmulsub is a UFU operation that uses the UFU 3-source format.
- frecsqrt is a floating point instruction that computes the reciprocal square root of the single-precision floating-point number in r[rs1] and puts that result in r[rd]. What will this do with an argument less than or equal to zero?
- the suggested assembler syntax is: frecsqrt rs1,rd
- frecsqrt is an SFU operation that uses the SFU compute format, but has no use for the second source operand of that format.
- fsub is a floating point instruction that computes “r[rs1]-r[rs2]”, where the values of the source operands are IEEE single-precision floating point numbers. The result is delivered in r[rd].
- the suggested assembler syntax is: fsub rs1,rs2,rd
- fsub is a UFU operation that uses the UFU 2-source format.
- genborrow and gencarry are integer instructions that generate a one or zero in r[rd] if subtracting or adding, respectively, the source operands generates a borrow or carry, respectively. The result would be useful as the third source operand of subsequent addc and subc operations.
- the suggested assembler syntax is: genborrow rs1,reg_or_imm14,rd gencarry rs1,reg_or_imm14,rd
- genborrow and gencarry are UFU operations that use the UFU 2-source format.
getir
- getir is an integer instruction that gets the value of the internal register the ordinal of which is its r[rs1] operand and puts that value in the register specified by r[rd].
- the suggested assembler syntax is: getir rs1,rd
- getir is an SFU operation that uses the SFU compute format, though in an irregular way. It has no use for the second source field, and the first source operand is an internal register number, NOT one of the general purpose registers.
- idiv is an integer instruction that computes “r[rs1]
- the use of an immediate for the second source operand sets the i-bit of the opcode. The result is left in r[rd].
- the suggested assembler syntax is: idiv rs1,reg_or_imm14,rd
- idiv is an SFU operation that uses the SFU compute format.
- iflush is a memory access instruction that is used to make sure that modifications to code space are visible to the processor executing the iflush. iflush invalidates all younger instructions that have already entered the pipe.
- the suggested assembler syntax is:
- jmpl is a control flow instruction that causes a register-indirect control transfer to the address in r[rs1].
- the current value of % npc is left in r[rd].
- the suggested assembler syntax is: jmpl rs1,rd
- jmpl is an SFU operation that uses the compute instruction format.
- the second source operand of that format is not used by jmpl.
- ldb, lds, and ldw are memory access instructions that load an 8-bit byte, a 16-bit short, or a 32-bit word from address into the destination register r[rd].
- the value loaded is sign-extended in the destination register.
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- the suggested assembler syntax is: ldb [address],rd lds [address],rd ldw [address],rd
- ldso and ldwo are memory access instructions that load a 16-bit short or a 32-bit word from address into the destination register r[rd] using the opposite endianness from that indicated by the endian-bit of %psr. The value loaded is sign-extended in the destination register. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- the suggested assembler syntax is: ldso [address],rd ldwo [address],rd
- ldso and ldwo are SFU operations that use the SFU memory format.
- ldpair is a memory access instruction that performs a load into a pair of adjacent registers beginning at the register specified by r[rd] from address.
- the use of an immediate for the second component of the address sets the i-bit of the opcode. r[rd] must be an even-numbered register.
- the suggested assembler syntax is: ldpair rd,[address]
- ldpair is an SFU operation that uses the SFU memory format.
- ldub, ldus, and lduw are memory access instructions that load an unsigned 8-bit byte, an unsigned 16-bit short, or an unsigned 32-bit word from address into the destination register r[rd].
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- the suggested assembler syntax is: ldub [address],rd ldus [address],rd lduw [address],rd
- ldu [b, s, w] instructions are SFU operations that use the SFU memory format.
- lduso and lduwo are memory access instructions that load an unsigned 16-bit short or an unsigned 32-bit word from address into the destination register r[rd] using the opposite endianness from that indicated by the endian-bit of %psr.
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- the suggested assembler syntax is: lduso [address],rd lduwo [address],rd
- lduso and lduwo are SFU operations that use the SFU memory format.
- membar is a memory access instruction that specifies that all memory reference instructions already issued must be performed before any subsequent memory reference instruction may be initiated.
- the suggested assembler syntax is:
- membar is an SFU operation that uses the compute instruction format, but uses none of that format's operands.
- moveind is an integer instruction that copies the content of its first source operand, r[rs1], to the register indicated by the least significant eight bits of its second source register, r[rs2].
- the suggested assembler syntax is:
- moveind is a UFU operation that uses the UFU 2-source format, but it has no use for the destination register field and does not accept an immediate for the second source operand.
- mul is an integer instruction that computes “r[rs1]*r[rs2]” or “r[rs1]* sign_ext(imm14)”.
- the use of an immediate for the second source operand sets the first bit of the instruction header. The result is left in r[rd].
- the suggested assembler syntax is:
- mul is a UFU operation that uses the UFU 2-source format.
- muladd is an integer instruction that computes “(r[rs1]*r[rs2])+r[rs3]”, “(r[rs1]*r[rs2])+sign_ext(imm8)”, “(r[rs1]*sign_ext(imm8))+r[rs3]”, or “(r[rs1]*sign_ext(imm8))+sign_ext(imm8)” and puts the result in r[rd].
- the use of an immediate for the second or third source operand sets the first or second bit, respectively, of the instruction header.
- the suggested assembler syntax is:
- mulsub is an integer instruction that computes “(r[rs1]*r[rs2])-r[rs3]”, “(r[rs1]*r[rs2])-sign_ext(imm8)”, “(r[rs1]*sign_ext(imm8))-r[rs3]”, or “(r[rs1]*sign_ext(imm8))-sign_ext(imm8)” and puts the result in r[rd].
- the use of an immediate for the second or third source operand sets the first or second bit, respectively, of the instruction header.
- the suggested assembler syntax is:
- mulsub is a UFU operation that uses the UFU 3-source format.
- ncldb, nclds, and ncldw are memory access instructions that perform a non-cacheable load of an 8-bit byte, a 16-bit short, or a 32-bit word from address into the destination register r[rd]. The value loaded is sign-extended in the destination register. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- the suggested assembler syntax is: ncldb [address],rd nclds [address],rd ncldw [address],rd
- ncld[b, s, w] instructions are SFU operations that use the SFU memory format.
- ncldg is a memory access instruction that does an uncached load of a group of eight consecutive 32-bit words from address into eight consecutive registers beginning with the one specified by r[rd].
- the use of an immediate for the second component of the address sets the i-bit of the opcode. r[rd] must be 8-register aligned.
- the suggested assembler syntax is:
- ncldg was formerly known as ldg.
- the assembler temporarily knows the former name as an alias for the new name to ease the transition.
- ncldg is an SFU operation that uses the SFU memory format.
- ncldso and ncldwo are memory access instructions that perform a non-cacheable load of a 16-bit short or a 32-bit word from address into the destination register r[rd] using the opposite endianness from that indicated by the endian-bit of %psr.
- the value loaded is sign-extended in the destination register.
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- the suggested assembler syntax is: ncldso [address],rd ncldwo [address],rd
- ncldso and ncldwo are SFU operations that use the SFU memory format.
- ncldub, ncldus, and nclduw are memory access instructions that perform a non-cacheable load of an unsigned 8-bit byte, an unsigned 16-bit short, or an unsigned 32-bit word from address into the destination register r[rd].
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- the suggested assembler syntax is: ncldub [address],rd ncldus [address],rd nclduw [address],rd
- ncldu [b, s, w] instructions are SFU operations that use the SFU memory format.
- nclduso and nclduwo are memory access instructions that perform a non-cacheable load of an unsigned 16-bit short or an unsigned 32-bit word from address into the destination register r[rd] using the opposite endianness from that indicated by the endian-bit of %psr.
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- the suggested assembler syntax is: nclduso [address],rd nclduwo [address],rd
- nclduso and nclduwo are SFU operations that use the SFU memory format.
- ncstb, ncsts, and ncstw are memory access instructions that perform a non-cacheable store of an 8-bit byte, a 16-bit short, or a 32-bit word from the register specified by r[rd] to address.
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- for ncsts and ncstw, the effective address to which to store must be 2-byte aligned or 4-byte aligned, respectively, but the consequences of failing to do that are not defined.
- the suggested assembler syntax is: ncstb rd,[address] ncsts rd,[address] ncstw rd,[address]
- ncst[b, s, w] are SFU operations that use the SFU memory format.
- ncstso and ncstwo are memory access instructions that perform a non-cacheable store of a 16-bit short or a 32-bit word from the register specified by r[rd] to address using the opposite endianness from that indicated by the endian-bit of %psr.
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- for ncstso and ncstwo, the effective address to which to store must be 2-byte aligned or 4-byte aligned, respectively, but the consequences of failing to do that are not defined.
- the suggested assembler syntax is: ncstso rd,[address] ncstwo rd,[address]
- ncstso and ncstwo are SFU operations that use the SFU memory format.
- ncstpair is a memory access instruction that performs an uncached store of a pair of adjacent registers beginning at the register specified by r[rd] to address.
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- r[rd] must be an even-numbered register.
- the suggested assembler syntax is: ncstpair rd,[address]
- ncstpair is an SFU operation that uses the SFU memory format.
- nop is a control flow instruction that does nothing. It has the special property of being unique in its leading byte with its remaining bytes ignored.
- the suggested assembler syntax is:
- nop is an SFU operation that uses the branch instruction format, but only the leading 6 bits of its opcode are significant and none of the other fields is used.
- This instruction has no use for a second source operand; the assembler infers r0 in its place for purely neurotic reasons.
- the suggested assembler syntax is:
- pack is a pixel instruction that treats its first two source operands, r[rs1] and r[rs2], as two pairs of unsigned 16-bit operands. Each 16-bit operand is shifted right by the value of the low-order 4 bits of the third source operand, r[rs3].
- the low-order 8 bits of the resulting values are packed into the result register r[rd], with the value derived from 31:16 of r[rs1] in bits 31:24, the value derived from 15:0 of r[rs1] in bits 23:16, the value derived from 31:16 of r[rs2] in bits 15:8, and the value derived from 15:0 of r[rs2] in bits 7:0.
- the suggested assembler syntax is: pack rs1,rs2,rs3,rd
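- the pack operation follows directly from the bit positions given above and can be sketched as below. The function name mirrors the mnemonic and takes the three source register values as plain integers, which is a simplification for illustration.

```python
def pack(rs1, rs2, rs3):
    shift = rs3 & 0xF                            # low-order 4 bits of r[rs3]
    halves = [
        (rs1 >> 16) & 0xFFFF,                    # bits 31:16 of r[rs1] -> result bits 31:24
        rs1 & 0xFFFF,                            # bits 15:0 of r[rs1]  -> result bits 23:16
        (rs2 >> 16) & 0xFFFF,                    # bits 31:16 of r[rs2] -> result bits 15:8
        rs2 & 0xFFFF,                            # bits 15:0 of r[rs2]  -> result bits 7:0
    ]
    result = 0
    for half in halves:
        # Unsigned shift right, then keep the low-order 8 bits.
        result = (result << 8) | ((half >> shift) & 0xFF)
    return result
```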
- padd and padds are integer instructions that compute “r[rs1]+r[rs2]”, where each of the sources is treated as a pair of independent 16-bit quantities yielding a pair of independent 16-bit sums in r[rd].
- padd also can have a 14-bit immediate as its second source operand, in which case the sign-extended immediate is added to each of the 16-bit numbers in r[rs1] and the two 16-bit sums are left in r[rd].
- padd produces ordinary two's complement integer results, and padds produces saturated results.
- the suggested assembler syntax is: padd rs1, reg_or_imm14, rd padds rs1, rs2, rd
- padd and padds are UFU operations that use the UFU 2-source format.
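- the difference between the wrapping and saturating forms can be sketched as follows. Signed saturation to the range -32768 to 32767 is an assumption, since the text does not say whether padds saturates as signed or unsigned; the helper names are hypothetical and the immediate form of padd is not modeled.

```python
def _halves(value):
    """Split a 32-bit value into its (high, low) 16-bit fields."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def _signed16(field):
    """Interpret a 16-bit field as a two's complement number."""
    return field - 0x10000 if field & 0x8000 else field


def _join(high, low):
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def padd(a, b):
    (ah, al), (bh, bl) = _halves(a), _halves(b)
    return _join(ah + bh, al + bl)               # each sum wraps within its 16-bit field


def padds(a, b):
    def saturate(x, y):
        total = _signed16(x) + _signed16(y)
        return max(-0x8000, min(0x7FFF, total))  # assumed: signed saturation
    (ah, al), (bh, bl) = _halves(a), _halves(b)
    return _join(saturate(ah, bh), saturate(al, bl))
```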
- pcmovenz is an integer instruction that uses two 16-bit flags in r[rs1] to control whether the corresponding 16-bit fields of r[rs2] are copied to the same positions of r[rd].
- a field of r[rs2] is copied to r[rd] if the corresponding flag field of r[rs1] is non-zero.
- the flags in r[rs1] will most likely be the result of a preceding pcmpeq or pcmplt.
- the suggested assembler syntax is: pcmovenz rs1,rs2,rd
- pcmovenz is a UFU operation that uses the UFU 2-source format.
- pcmovez is an integer instruction that uses two 16-bit flags in r[rs1] to control whether the corresponding 16-bit fields of r[rs2] are copied to the same positions of r[rd].
- a field of r[rs2] is copied to r[rd] if the corresponding flag field of r[rs1] is zero.
- the flags in r[rs1] will most likely be the result of a preceding pcmpeq or pcmplt.
- the suggested assembler syntax is: pcmovez rs1,rs2,rd
- pcmovez is a UFU operation that uses the UFU 2-source format.
- pcmpeq is an integer instruction that compares for equality the pair of shorts in its first source register with either a pair of shorts in its second source register or with a signed 14-bit immediate.
- the suggested assembler syntax is: pcmpeq rs1,reg_or_imm14,rd
- pcmpeq is a UFU operation that uses the UFU 2-source format.
- pcmplt is an integer instruction that does a “compare less than” of a pair of shorts in its first source register with either a pair of shorts in its second source register or with a signed 14-bit immediate.
- the short in bits 31:16 of r[rd] is set to one if “r[rs1]<31:16> < r[rs2]<31:16>” and zero otherwise, and the short in bits 15:0 of r[rd] is set to one if “r[rs1]<15:0> < r[rs2]<15:0>” and zero otherwise.
- the short in bits 31:16 of r[rd] is set to one if “r[rs1]<31:16> < sign_ext(imm14)” and zero otherwise, and the short in bits 15:0 of r[rd] is set to one if “r[rs1]<15:0> < sign_ext(imm14)” and zero otherwise.
- the suggested assembler syntax is: pcmplt rs1,reg_or_imm14,rd
- pcmplt is a UFU operation that uses the UFU 2-source format.
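- The packed comparisons can be modeled with the short Python sketch below. Two points are assumptions rather than statements from the text: the shorts are compared as signed values, and pcmpeq is taken to write its flags in the same one-or-zero form described for pcmplt.

```python
def _lanes(word):
    """Return the (high, low) 16-bit fields of a 32-bit word."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF

def _signed16(x):
    """Interpret a 16-bit field as a signed short (assumed signed comparison)."""
    return x - 0x10000 if x & 0x8000 else x

def pcmplt(rs1, rs2):
    """Set each 16-bit result field to 1 where the rs1 short is less than the rs2 short."""
    (a_hi, a_lo), (b_hi, b_lo) = _lanes(rs1), _lanes(rs2)
    hi = int(_signed16(a_hi) < _signed16(b_hi))
    lo = int(_signed16(a_lo) < _signed16(b_lo))
    return (hi << 16) | lo

def pcmpeq(rs1, rs2):
    """Set each 16-bit result field to 1 where the shorts are equal (assumed flag format)."""
    (a_hi, a_lo), (b_hi, b_lo) = _lanes(rs1), _lanes(rs2)
    return (int(a_hi == b_hi) << 16) | int(a_lo == b_lo)
```

- The flag words produced this way have the shape expected by pcmovenz and pcmovez, so a compare followed by a conditional move selects fields without a branch.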
- pdist is a pixel instruction that treats each of its two source registers, r[rs1] and r[rs2], as four unsigned 8-bit values, subtracts the corresponding pairs, and adds the sum of the absolute values of those differences to the value in the register specified by r[rd].
- the suggested assembler syntax is: pdist rs1,rs2,rd
- pdist is a UFU operation that uses the UFU 2-source format.
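- The pdist accumulation is a sum of absolute differences over four bytes, which the following Python sketch illustrates. The wrap of the accumulated value to 32 bits is an assumption; the text does not say how overflow of r[rd] is handled.

```python
def pdist(rs1, rs2, rd):
    """Add the sum of absolute differences of four unsigned bytes to the accumulator rd."""
    total = rd
    for shift in (24, 16, 8, 0):
        a = (rs1 >> shift) & 0xFF
        b = (rs2 >> shift) & 0xFF
        total += abs(a - b)
    return total & 0xFFFFFFFF  # assumption: accumulator wraps at 32 bits
```

- Because r[rd] is both an input and the result, a loop of pdist operations over successive pixel words accumulates a block distance in a single register.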
- pmean is a UFU operation that uses the UFU 2-source format.
- pmul is an integer instruction that multiplies the pair of 16-bit operands in r[rs1] with either a pair of 16-bit operands in r[rs2] or with a sign-extended 14-bit immediate, placing a pair of independent 16-bit products in r[rd].
- bits 31:16 of r[rd] are set to the product of bits 31:16 of r[rs1] and bits 31:16 of r[rs2], and bits 15:0 of r[rd] are set to the product of bits 15:0 of r[rs1] and bits 15:0 of r[rs2].
- bits 31:16 of r[rd] are set to the product of bits 31:16 of r[rs1] and sign_ext(imm14), and bits 15:0 of r[rd] are set to the product of bits 15:0 of r[rs1] and sign_ext(imm14).
- the suggested assembler syntax is:
- pmul is a UFU operation that uses the UFU 2-source format.
- pmuladd and pmuladds are integer instructions that compute “(r[rs1]*r[rs2])+r[rs3]”, where each of the sources is treated as a pair of independent 16-bit quantities yielding a pair of independent 16-bit results in r[rd].
- the format of the operands and results is indicated by the saturation field of the Processor Status Register, but pmuladd does not saturate and pmuladds does.
- the suggested assembler syntax is: pmuladd rs1,rs2,rs3,rd pmuladds rs1,rs2,rs3,rd
- pmuladd and pmuladds are UFU operations that use the UFU 3-source format.
- pmulsub and pmulsubs are integer instructions that compute “(r[rs1]*r[rs2])-r[rs3]”, where each of the sources is treated as a pair of independent 16-bit quantities yielding a pair of independent 16-bit results in r[rd].
- the format of the operands and results is indicated by the saturation field of the Processor Status Register, but pmulsub does not saturate and pmulsubs does.
- the suggested assembler syntax is: pmulsub rs1,rs2,rs3,rd pmulsubs rs1,rs2,rs3,rd
- pmulsub and pmulsubs are UFU operations that use the UFU 3-source format.
- ppower is a fixed-point instruction that computes “r[rs1]**r[rs2]”, where each of the sources is treated as a pair of independent 16-bit quantities yielding a pair of independent 16-bit powers in r[rd].
- ppower is a SFU operation that uses the SFU compute format.
- precsqrt is a fixed-point instruction that computes a pair of fixed-point reciprocal square roots of the 16-bit values of r[rs1]. The results are delivered in r[rd].
- the suggested assembler syntax is: precsqrt rs1,rd
- precsqrt is a SFU operation that uses the SFU compute format.
- the second source operand for the format is unused by this instruction.
- pshll is a logical instruction that shifts each of the pair of 16-bit operands in r[rs1] left by either the low-order 4 bits of the corresponding half of r[rs2] or the low-order 4 bits of its immediate. The shifted results are left in r[rd].
- the suggested assembler syntax is: pshll rs1,reg_or_imm14,rd
- pshll is a UFU operation that uses the UFU 2-source format.
- pshra is a logical instruction that performs a right arithmetic shift of the pair of 16-bit operands in r[rs1].
- the shift count of each is either the low-order 4 bits of the corresponding half of r[rs2] or the low-order 4 bits of its immediate.
- the shifted results are left in r[rd].
- the suggested assembler syntax is: pshra rs1,reg_or_imm14,rd
- pshra is a UFU operation that uses the UFU 2-source format.
- pshrl is a logical instruction that performs a right logical shift of the pair of 16-bit operands in r[rs1].
- the shift count of each is either the low-order 4 bits of the corresponding half of r[rs2] or the low-order 4 bits of its immediate.
- the shifted results are left in r[rd].
- the suggested assembler syntax is: pshrl rs1,reg_or_imm14,rd
- pshrl is a UFU operation that uses the UFU 2-source format.
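- The three packed shifts differ only in direction and in how vacated bits are filled. The Python sketch below models the register form, taking each shift count from the low-order 4 bits of the corresponding half of r[rs2]; the helper names are hypothetical.

```python
def _lanes(word):
    """Return the (high, low) 16-bit fields of a 32-bit word."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF

def _pack(hi, lo):
    """Truncate two lane results to 16 bits and pack them into one word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def _signed16(x):
    """Interpret a 16-bit field as a signed short."""
    return x - 0x10000 if x & 0x8000 else x

def pshll(rs1, rs2):
    """Packed left shift; bits shifted out of a lane are discarded."""
    (a_hi, a_lo), (c_hi, c_lo) = _lanes(rs1), _lanes(rs2)
    return _pack(a_hi << (c_hi & 0xF), a_lo << (c_lo & 0xF))

def pshrl(rs1, rs2):
    """Packed logical right shift; vacated bits are filled with zeros."""
    (a_hi, a_lo), (c_hi, c_lo) = _lanes(rs1), _lanes(rs2)
    return _pack(a_hi >> (c_hi & 0xF), a_lo >> (c_lo & 0xF))

def pshra(rs1, rs2):
    """Packed arithmetic right shift; vacated bits copy the lane's sign bit."""
    (a_hi, a_lo), (c_hi, c_lo) = _lanes(rs1), _lanes(rs2)
    return _pack(_signed16(a_hi) >> (c_hi & 0xF), _signed16(a_lo) >> (c_lo & 0xF))
```

- Shifting 0x8000 right by 15 gives 0x0001 under pshrl and 0xFFFF under pshra, which shows the difference between the logical and arithmetic forms.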
- psub and psubs are integer instructions that compute “r[rs1]-r[rs2]”, where each of the sources is treated as a pair of independent 16-bit quantities yielding a pair of independent 16-bit differences in r[rd].
- psub also can have a 14-bit immediate as its second source operand, in which case the sign-extended immediate is subtracted from each of the 16-bit numbers in r[rs1] and the two 16-bit differences are left in r[rd].
- the suggested assembler syntax is: psub rs1,reg_or_imm14,rd psubs rs1,rs2,rd
- psub and psubs are UFU operations that use the UFU 2-source format.
- rem is an integer instruction that computes “r[rs1] % r[rs2]” or “r[rs1] % sign_ext(imm14)”.
- the use of an immediate for the second source operand sets the i-bit of the opcode. The result is left in r[rd].
- the suggested assembler syntax is: rem rs1,reg_or_imm14,rd
- rem is an SFU operation that uses the SFU compute format.
- retry is a control flow instruction that causes a control transfer from a trap handler to the instruction word that caused the trap. Please refer to the Traps chapter of the Café Microarchitecture specification for the complete description.
- the suggested assembler syntax is:
- retry is a SFU operation that uses the SFU compute format, but it has no use for any operand.
- the suggested assembler syntax is: return rs1
- s2ib, s2is, and s2iw are memory access instructions that store an 8-bit byte, a 16-bit short, or a 32-bit word from the register specified by r[rd] to address.
- the use of an immediate for the second component of the address sets the i-bit of the opcode. The intent is that these are the instructions to be used to modify code on the fly, so these stores guarantee instruction cache consistency.
- the suggested assembler syntax is: s2ib rd,[address] s2is rd,[address] s2iw rd,[address]
- s2i[b, s, w] are SFU operations that use the SFU memory format.
- sethi is an integer instruction that places its immediate operand in the high-order 22 bits of r[rd] and clears the low-order 10 bits. It is frequently used with the %hi operator to form base addresses for subsequent memory references.
- the suggested assembler syntax is: sethi imm22,rd sethi %hi(label),rd
- sethi is an SFU operation that uses a format of its own.
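- The way sethi and the %hi operator split a 32-bit address can be sketched in Python as follows. The hi helper models %hi as the upper 22 bits of the address. The lo helper is a hypothetical companion for the remaining 10 bits, which this text does not name, and is shown only to illustrate how a following instruction could supply them.

```python
def sethi(imm22):
    """Place a 22-bit immediate in bits 31:10 and clear bits 9:0."""
    return (imm22 & 0x3FFFFF) << 10

def hi(address):
    """Model of %hi: the high-order 22 bits of a 32-bit address."""
    return (address >> 10) & 0x3FFFFF

def lo(address):
    """Hypothetical companion operator: the low-order 10 bits of the address."""
    return address & 0x3FF

def form_address(address):
    """Rebuild the full address from a sethi result plus the low-order bits."""
    base = sethi(hi(address))
    return base | lo(address)
```

- For the address 0x12345678, sethi of the %hi value gives the base 0x12345400, and the remaining offset 0x278 fits in an immediate field of a subsequent memory reference.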
- setir is an integer instruction that sets the internal register the ordinal of which is its r[rd] operand to the value in its r[rs1] operand.
- the suggested assembler syntax is: setir rs1,rd
- setir is an SFU operation that uses the SFU compute format in an irregular way. It has no use for the second source operand, and the destination register is an internal register number, NOT one of the general purpose registers.
- shll is a logical instruction that computes “r[rs1]<<r[rs2]” or “r[rs1]<<imm”.
- the use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The result is left in r[rd]. Only the low-order 5 bits of the second source operand are used.
- the suggested assembler syntax is: shll rs1,reg_or_imm,rd
- the SFU version of shll uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- shra is a logical instruction that computes “r[rs1]>>r[rs2]” or “r[rs1]>>imm”.
- the use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The result is left in r[rd].
- the first source operand is treated as a signed integer, so the result is sign-extended. Only the low-order 5 bits of the second source operand are used.
- the suggested assembler syntax is: shra rs1,reg_or_imm,rd
- the SFU version of shra uses the SFU compute format
- the UFU version uses the UFU 2-source format.
- shrl is a logical instruction that computes “r[rs1]>>r[rs2]” or “r[rs1]>>imm”.
- the use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The result is left in r[rd].
- the first source operand is treated as an unsigned integer, so the result is not sign-extended. Only the low-order 5 bits of the second source operand are used.
- the suggested assembler syntax is:
- the SFU version of shrl uses the SFU compute format
- the UFU version uses the UFU 2-source format.
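- The three 32-bit shifts can be compared with a short Python model. Python integers are unbounded, so the sketch masks results to 32 bits explicitly. The function names follow the mnemonics, and only the low-order 5 bits of the count are used, as stated above.

```python
MASK32 = 0xFFFFFFFF

def shll(rs1, count):
    """Logical left shift; only the low-order 5 bits of the count are used."""
    return (rs1 << (count & 0x1F)) & MASK32

def shrl(rs1, count):
    """Logical right shift; the source is treated as unsigned, so zeros shift in."""
    return (rs1 & MASK32) >> (count & 0x1F)

def shra(rs1, count):
    """Arithmetic right shift; the source is treated as signed, so the sign bit shifts in."""
    value = rs1 & MASK32
    if value & 0x80000000:
        value -= 1 << 32  # reinterpret as a negative two's complement integer
    return (value >> (count & 0x1F)) & MASK32
```

- A count of 33 behaves like a count of 1 because bits above the low-order 5 are ignored.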
- sir is a control flow instruction that resets the machine.
- sir is a privileged instruction; executing it when the Supervisor Mode flag of %psr is clear causes a privileged instruction trap.
- the suggested assembler syntax is:
- sir is an SFU operation that uses the SFU compute format, but has no use for any of that format's operands.
- softtrap is a control flow instruction that generates a trap.
- the ordinal of the trap is specified by r[rs1] ⁇ r[rs2] or r[rs1] ⁇ sign_ext(imm14). For details on how traps work, see the Traps chapter of the Café Microarchitecture specification.
- the suggested assembler syntax is: softtrap rs1,reg_or_imm14
- softtrap is an SFU operation that uses the SFU compute format but has no use for the destination register field of that format.
- stb, sts, and stw are memory access instructions that store an 8-bit byte, a 16-bit short, or a 32-bit word from the register specified by r[rd] to address.
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- the suggested assembler syntax is: stb rd,[address] sts rd,[address] stw rd,[address]
- st[b, s, w] are SFU operations that use the SFU memory format.
- stso and stwo are memory access instructions that store a 16-bit short or a 32-bit word from the register specified by r[rd] to address using the opposite endianness from that indicated by the endian-bit of %psr.
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- the suggested assembler syntax is: stso rd,[address] stwo rd,[address]
- stso and stwo are SFU operations that use the SFU memory format.
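- The effect of the opposite-endian stores can be illustrated with a small Python model of memory. The boolean psr_big_endian stands in for the endian-bit of %psr. Its name and polarity are assumptions, since this text does not say which bit value selects which byte order.

```python
import struct

def _store_word(memory, address, value, big_endian):
    """Write a 32-bit value into a bytearray using the requested byte order."""
    fmt = ">I" if big_endian else "<I"
    memory[address:address + 4] = struct.pack(fmt, value & 0xFFFFFFFF)

def stw(memory, address, value, psr_big_endian):
    """Store a word using the byte order selected by the (modeled) endian bit."""
    _store_word(memory, address, value, psr_big_endian)

def stwo(memory, address, value, psr_big_endian):
    """Store a word using the opposite byte order from the (modeled) endian bit."""
    _store_word(memory, address, value, not psr_big_endian)
```

- Storing the same register value with stw and then with stwo leaves the two words byte-reversed relative to each other, which allows data in the other byte order to be written without a separate swap.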
- stpair is a memory access instruction that performs a store of a pair of adjacent registers beginning at the register specified by r[rd] to address.
- the use of an immediate for the second component of the address sets the i-bit of the opcode.
- r[rd] must be an even-numbered register.
- the suggested assembler syntax is: stpair rd,[address]
- stpair is an SFU operation that uses the SFU memory format.
- sub is an integer instruction that computes r[rs1]-r[rs2] or r[rs1]-sign_ext(imm14).
- the use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The result is left in r[rd].
- the suggested assembler syntax is: sub rs1,reg_or_imm14,rd
- the SFU version of sub uses the SFU compute format
- the UFU version uses the UFU 2-source format.
- xor is a logical instruction that computes “r[rs1]^r[rs2]” or “r[rs1]^imm14”.
- the use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The bit-wise logical result is left in r[rd].
- the suggested assembler syntax is: xor rs1,reg_or_imm14,rd
- the SFU version of xor uses the SFU compute format
- the UFU version uses the UFU 2-source format.
Abstract
A processor has an improved architecture for multiple-thread operation on the basis of a highly parallel structure including multiple independent parallel execution paths for executing in parallel across threads and a multiple-instruction parallel pathway within a thread. The multiple independent parallel execution paths include functional units that execute an instruction set including special data-handling instructions that are advantageous in a multiple-thread environment.
Description
- 1. Field of the Invention
- The present invention relates to a processor architecture. More specifically, the present invention relates to a single-chip processor architecture including structures for multiple-thread operation.
- 2. Description of the Related Art
- For various processing applications, an automated system may handle multiple events or processes concurrently. A single process is termed a thread of control, or “thread”, and is the basic unit of operation of independent dynamic action within the system. A program has at least one thread. A system performing concurrent operations typically has many threads, some of which are transitory and others enduring. Systems that execute among multiple processors allow for true concurrent threads. Single-processor systems can only have illusory concurrent threads, typically attained by time-slicing of processor execution, shared among a plurality of threads.
- Some programming languages are particularly designed to support multiple-threading. One such language is the Java™ programming language that is advantageously executed using an abstract computing machine, the Java Virtual Machine™. A Java Virtual Machine™ is capable of supporting multiple threads of execution at one time. The multiple threads independently execute Java code that operates on Java values and objects residing in a shared main memory. The multiple threads may be supported using multiple hardware processors, by time-slicing a single hardware processor, or by time-slicing many hardware processors. In 1990, programmers at Sun Microsystems developed a universal programming language, eventually known as “the Java™ programming language”. Java™, Sun, Sun Microsystems and the Sun Logo are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks, including UltraSPARC I and UltraSPARC II, are used under license and are trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
- Java™ supports the coding of programs that, though concurrent, exhibit deterministic behavior, by including techniques and structures for synchronizing the concurrent activity of threads. To synchronize threads, Java™ uses monitors, high-level constructs that allow only a single thread at one time to execute a region of code protected by the monitor. Monitors use locks associated with executable objects to control thread execution.
- A thread executes code by performing a sequence of actions. A thread may use the value of a variable or assign the variable a new value. If two or more concurrent threads act on a shared variable, the actions on the variable may produce a timing-dependent result, an inherent consequence of concurrent programming.
- Each thread has a working memory that may store copies of the values of master copies of variables from main memory that are shared among all threads. A thread usually accesses a shared variable by obtaining a lock and flushing the working memory of the thread, guaranteeing that shared values are thereafter loaded from the shared memory to the working memory of the thread. By unlocking a lock, a thread guarantees that the values held by the thread in the working memory are written back to the main memory.
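- The role of a lock in making a shared-variable update deterministic can be sketched as follows. Python's threading.Lock is used here only as a stand-in for a Java™ monitor, and the function name and parameters are hypothetical.

```python
import threading

def run_counter(num_threads=2, increments=10000):
    """Increment a shared counter from several threads, serializing updates with a lock."""
    counter = {"value": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            # Only one thread at a time may execute the protected region.
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

- With the lock in place, the final count is always the number of threads multiplied by the number of increments, regardless of how the threads are scheduled.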
- Several rules of execution order constrain the order in which certain events may occur. For example, actions performed by one thread are totally ordered so that for any two actions performed by a thread, one action precedes the other. Actions performed by the main memory for any one variable are totally ordered so that for any two actions performed by the main memory on the same variable, one action precedes the other. Actions performed by the main memory for any one lock are totally ordered so that for any two actions performed by the main memory on the same lock, one action precedes the other. Also, an action is not permitted to follow itself. Threads do not interact directly but rather only communicate through the shared main memory.
- The relationships among the actions of a thread and the actions of main memory are also constrained by rules. For example, each lock or unlock is performed jointly by some thread and the main memory. Each load action by a thread is uniquely paired with a read action by the main memory such that the load action follows the read action. Each store action by a thread is uniquely paired with a write action by the main memory such that the write action follows the store action.
- An implementation of threading incurs some overhead. For example, a single processor system incurs overhead in time-slicing between threads. Additional overhead is incurred in allocating and handling accessing of main memory and local thread working memory.
- What is needed is a processor architecture that supports multiple-thread operation and reduces the overhead associated with multiple-thread operation.
- A processor has an improved architecture for multiple-thread operation on the basis of a highly parallel structure including multiple independent parallel execution paths for executing in parallel across threads and a multiple-instruction parallel pathway within a thread. The multiple independent parallel execution paths include functional units that execute an instruction set including special data-handling instructions that are advantageous in a multiple-thread environment.
- In accordance with one embodiment of the present invention, a general-purpose processor includes two independent processor elements in a single integrated circuit die. The dual independent processor elements advantageously execute two independent threads concurrently during multiple-threading operation. When only a single thread is executed on a first of the two processor elements, the second processor element is advantageously used to perform garbage collection, Just-In-Time (JIT) compilation, and the like. Illustratively, the independent processor elements are Very Long Instruction Word (VLIW) processors. For example, one illustrative processor includes two independent Very Long Instruction Word (VLIW) processor elements, each of which executes an instruction group or instruction packet that includes up to four instructions, otherwise termed subinstructions. Each of the instructions in an instruction group executes on a separate functional unit.
- The two threads execute independently on the respective VLIW processor elements, each of which includes a plurality of powerful functional units that execute in parallel. In the illustrative embodiment, the VLIW processor elements have four functional units including three media functional units and one general functional unit. All of the illustrative media functional units include an instruction that executes both a multiply and an add in a single cycle, either floating point or fixed point.
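- The peak floating point rate quoted later in this description (twelve operations per cycle, or 6 gigaflops at 500 MHz) follows from simple arithmetic, sketched here in Python. The function and parameter names are hypothetical. The counts come from the text: two processor elements, three media functional units each, and a multiply plus an add per unit per cycle.

```python
def peak_flops(processor_elements=2, media_units_per_element=3,
               flops_per_multiply_add=2, clock_hz=500e6):
    """Return (floating point operations per cycle, operations per second)."""
    per_cycle = processor_elements * media_units_per_element * flops_per_multiply_add
    return per_cycle, per_cycle * clock_hz
```

- The figure excludes any work done by the general functional units, so it is a lower bound on the peak rate of the whole processor.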
- In accordance with an aspect of the present invention, an individual independent parallel execution path has operational units including instruction supply blocks and instruction preparation blocks, functional units, and a register file that are separate and independent from the operational units of other paths of the multiple independent parallel execution paths. The instruction supply blocks include a separate instruction cache for the individual independent parallel execution paths, however the multiple independent parallel execution paths share a single data cache since multiple threads sometimes share data. The data cache is dual-ported, allowing data access in both execution paths in a single cycle.
- In addition to the instruction cache, the instruction supply blocks in an execution path include an instruction aligner and an instruction buffer that precisely format and align the full instruction group to prepare to access the register file. An individual execution path has a single register file that is physically split into multiple register file segments, each of which is associated with a particular functional unit of the multiple functional units. At any point in time, the register file segments allocated to the functional units each contain the same content. A multi-ported register file is typically metal-limited, so that the area consumed by the circuit is proportional to the square of the number of ports. It has been discovered that a processor having a register file structure divided into a plurality of separate and independent register files forms a layout structure with an improved layout efficiency. The read ports of the total register file structure are allocated among the separate and individual register files. Each of the separate and individual register files has write ports that correspond to the total number of write ports in the total register file structure. Writes are fully broadcast so that all of the separate and individual register files are coherent.
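- The layout benefit of splitting the register file can be illustrated with a rough area model in Python. The model takes the area of a register file as the square of its port count, per the proportionality stated above. The port counts used in the example (12 read ports, 4 write ports, 4 segments) are hypothetical values chosen for illustration, not figures taken from this passage.

```python
def monolithic_area(read_ports, write_ports):
    """Relative area of one register file carrying every read and write port."""
    return (read_ports + write_ports) ** 2

def split_area(read_ports, write_ports, segments):
    """Relative area when read ports are divided among segments and writes are broadcast."""
    if read_ports % segments:
        raise ValueError("read ports must divide evenly among segments")
    # Each segment keeps all write ports so that every copy stays coherent.
    ports_per_segment = read_ports // segments + write_ports
    return segments * ports_per_segment ** 2
```

- Under these assumed numbers the monolithic file has a relative area of 256 and the four segments together have a relative area of 196, even though each segment stores a full copy of the register contents.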
- The features of the described embodiments are specifically set forth in the appended claims. However, embodiments of the invention relating to both structure and method of operation, may best be understood by referring to the following description and accompanying drawings.
- FIG. 1 is a schematic block diagram illustrating a single integrated circuit chip implementation of a processor in accordance with an embodiment of the present invention.
- FIG. 2 is a schematic block diagram showing the core of the processor.
- FIG. 3 is a schematic block diagram that illustrates an embodiment of the split register file that is suitable for usage in the processor.
- FIG. 4 is a schematic block diagram that shows a logical view of the register file and functional units in the processor.
- FIG. 5 is a pictorial schematic diagram depicting an example of instruction execution among a plurality of media functional units.
- FIG. 6 illustrates a schematic block diagram of an SRAM array used for the multi-port split register file.
- FIGS. 7A and 7B are, respectively, a schematic block diagram and a pictorial diagram that illustrate the register file and a memory array insert of the register file.
- FIG. 8 is a schematic block diagram showing an arrangement of the register file into the four register file segments.
- FIG. 9 is a schematic timing diagram that illustrates timing of the processor pipeline.
- FIGS. 10A, 10B, 10C, 10D, 10E and 10F illustrate instruction formats.
- FIG. 11 illustrates operation of a bitext instruction.
- The use of the same reference symbols in different drawings indicates similar or identical items.
- Referring to FIG. 1, a schematic block diagram illustrates a processor 100 having an improved architecture for multiple-thread operation on the basis of a highly parallel structure including multiple independent parallel execution paths, shown herein as two media processing units 110 and 112.
- The multiple-threading architecture of the processor 100 is advantageous for usage in executing multiple-threaded applications using a language such as the Java™ language running under a multiple-threaded operating system on a multiple-threaded Java Virtual Machine™. The illustrative processor 100 includes two independent processor elements, the media processing units 110 and 112.
- A single integrated circuit chip implementation of a processor 100 includes a memory interface 102, a geometry decompressor 104, the two media processing units 110 and 112, a shared data cache 106, and several interface controllers. The interface controllers support an interactive graphics environment with real-time constraints by integrating fundamental components of memory, graphics, and input/output bridge functionality on a single die. The components are mutually linked and closely linked to the processor core with high bandwidth, low-latency communication channels to manage multiple high-bandwidth data streams efficiently and with a low response time. The interface controllers include an UltraPort Architecture Interconnect (UPA) controller 116 and a peripheral component interconnect (PCI) controller 120. The illustrative memory interface 102 is a direct Rambus dynamic RAM (DRDRAM) controller. The shared data cache 106 is a dual-ported storage that is shared among the media processing units 110 and 112. The data cache 106 is four-way set associative, follows a write-back protocol, and supports hits in the fill buffer (not shown). The data cache 106 allows fast data sharing and eliminates the need for a complex, error-prone cache coherency protocol between the media processing units 110 and 112.
- The UPA controller 116 is a custom interface that attains a suitable balance between high-performance computational and graphic subsystems. The UPA is a cache-coherent, processor-memory interconnect. The UPA attains several advantageous characteristics including a scaleable bandwidth through support of multiple bused interconnects for data and addresses, packets that are switched for improved bus utilization, higher bandwidth, and precise interrupt processing. The UPA performs low latency memory accesses with high throughput paths to memory. The UPA includes a buffered cross-bar memory interface for increased bandwidth and improved scaleability. The UPA supports high-performance graphics with two-cycle single-word writes on the 64-bit UPA interconnect. The UPA interconnect architecture utilizes point-to-point packet switched messages from a centralized system controller to maintain cache coherence. Packet switching improves bus bandwidth utilization by removing the latencies commonly associated with transaction-based designs.
- The PCI controller 120 is used as the primary system I/O interface for connecting standard, high-volume, low-cost peripheral devices, although other standard interfaces may also be used. The PCI bus effectively transfers data among high bandwidth peripherals and low bandwidth peripherals, such as CD-ROM players, DVD players, and digital cameras.
- With the two media processing units 110 and 112, the illustrative processor 100 is an eight-wide machine with eight execution units for executing instructions. A typical “general-purpose” processing code has an instruction level parallelism of about two so that, on average, most (about six) of the eight execution units would be idle at any time. The illustrative processor 100 employs thread level parallelism and operates on two independent threads, possibly attaining twice the performance of a processor having the same resources and clock rate but utilizing traditional non-thread parallelism.
- Thread level parallelism is particularly useful for Java™ applications, which are bound to have multiple threads of execution. Java™ methods including “suspend”, “resume”, “sleep”, and the like include effective support for threaded program code. In addition, Java™ class libraries are thread-safe to promote parallelism. Furthermore, the thread model of the processor 100 supports a dynamic compiler which runs as a separate thread using one media processing unit 110 while the second media processing unit 112 is used by the current application. In the illustrative system, the compiler applies optimizations based on “on-the-fly” profile feedback information while dynamically modifying the executing code to improve execution on each subsequent run. For example, a “garbage collector” may be executed on a first media processing unit 110, copying objects or gathering pointer information, while the application is executing on the other media processing unit 112.
- Although the processor 100 shown in FIG. 1 includes two processing units on an integrated circuit chip, the architecture is highly scaleable so that one to several closely-coupled processors may be formed in a message-based coherent architecture and resident on the same die to process multiple threads of execution. Thus, in the processor 100, a limitation on the number of processors formed on a single die arises from capacity constraints of integrated circuit technology rather than from architectural constraints relating to the interactions and interconnections between processors.
- The processor 100 is a general-purpose processor that includes the media processing units 110 and 112 as two independent processor elements. When only a single thread executes on the processor 100, one of the two processor elements executes the thread and the second processor element is advantageously used to perform garbage collection, Just-In-Time (JIT) compilation, and the like. In the illustrative processor 100, the independent processor elements 110 and 112 are Very Long Instruction Word (VLIW) processors. For example, one illustrative processor 100 includes two independent VLIW processor elements, each of which executes an instruction group or instruction packet that includes up to four instructions. Each of the instructions in an instruction group executes on a separate functional unit.
- The usage of a VLIW processor advantageously reduces complexity by avoiding usage of various structures such as schedulers or reorder buffers that are used in superscalar machines to handle data dependencies. A VLIW processor typically uses software scheduling and software checking to avoid data conflicts and dependencies, greatly simplifying hardware control circuits.
- The two threads execute independently on the respective
VLIW processor elements FIG. 2 , theVLIW processor elements functional units 220 and one generalfunctional unit 222. All of the illustrative mediafunctional units 220 include an instruction that executes both a multiply and an add in a single cycle, either floating point or fixed point. Thus, a processor with two VLIW processor elements can execute twelve floating point operations each cycle. At a 500 MHz execution rate, for example, the processor runs at an 6 gigaflop rate, even without accounting for general functional unit operation. - Referring to
FIG. 2 , a schematic block diagram shows the core of theprocessor 100. Themedia processing units instruction cache 210, aninstruction aligner 212, aninstruction buffer 214, apipeline control unit 226, asplit register file 216, a plurality of execution units, and a load/store unit 218. In theillustrative processor 100, themedia processing units media processing unit 110 include three media functional units (MFU) 220 and one general functional unit (GFU) 222. - An individual independent
parallel execution path functional units register file 216 that are separate and independent from the operational units of other paths of the multiple independent parallel execution paths. The instruction supply blocks include aseparate instruction cache 210 for the individual independent parallel execution paths, however the multiple independent parallel execution paths share asingle data cache 106 since multiple threads sometimes share data. Thedata cache 106 is dual-ported, allowing data access in bothexecution paths data cache 106 amongindependent processor elements - In addition to the
instruction cache 210, the instruction supply blocks in an execution path include theinstruction aligner 212, and theinstruction buffer 214 that precisely format and align a full instruction group of four instructions to prepare to access theregister file 216. An individual execution path has asingle register file 216 that is physically split into multiple register file segments, each of which is associated with a particular functional unit of the multiple functional units. At any point in time, the register file segments as allocated to each functional unit each contain the same content. A multi-ported register file is typically metal limited to the area consumed by the circuit proportional with the square of the number of ports. Theprocessor 100 has a register file structure divided into a plurality of separate and independent register files to form a layout structure with an improved layout efficiency. The read ports of the totalregister file structure 216 are allocated among the separate and individual register files. Each of the separate and individual register files has write ports that correspond to the total number of write ports in the total register file structure. Writes are fully broadcast so that all of the separate and individual register files are coherent. - The media
functional units 220 are multiple single-instruction-multiple-datapath (MSIMD) media functional units. Each of the media functional units 220 is capable of processing parallel 16-bit components. Various parallel 16-bit operations supply the single-instruction-multiple-datapath capability for the processor 100, including add, multiply-add, shift, compare, and the like. The media functional units 220 operate in combination as tightly coupled digital signal processors (DSPs). Each media functional unit 220 has a separate and individual sub-instruction stream, but all three media functional units 220 execute synchronously so that the subinstructions progress in lock-step through pipeline stages. - The general
functional unit 222 is a RISC processor capable of executing arithmetic logic unit (ALU) operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power operations, reciprocal square root operations, and many others. The generalfunctional unit 222 supports less common parallel operations such as the parallel reciprocal square root instruction: - The
illustrative instruction cache 210 is two-way set-associative, has a 16 Kbyte capacity, and includes hardware support to maintain coherence, allowing dynamic optimizations through self-modifying code. Software is used to indicate that the instruction storage is being modified when modifications occur. The 16 Kbyte capacity is suitable for performing graphic loops, other multimedia tasks or processes, and general-purpose Java™ code. Coherency is maintained by hardware that supports write-through, non-allocating caching. Self-modifying code is supported through explicit use of the “store-to-instruction-space” instruction store2i. Software uses the store2i instruction to maintain coherency with the instruction cache 210 so that the instruction caches 210 do not have to be snooped on every single store operation issued by the media processing unit 110. - The
pipeline control unit 226 is connected between theinstruction buffer 214 and the functional units and schedules the transfer of instructions to the functional units. Thepipeline control unit 226 also receives status signals from the functional units and the load/store unit 218 and uses the status signals to perform several control functions. Thepipeline control unit 226 maintains a scoreboard, generates stalls and bypass controls. Thepipeline control unit 226 also generates traps and maintains special registers. - Each
media processing unit split register file 216, a single logical register file including 128 thirty-two bit registers. Thesplit register file 216 is split into a plurality ofregister file segments 224 to form a multi-ported structure that is replicated to reduce the integrated circuit die area and to reduce access time. A separateregister file segment 224 is allocated to each of the mediafunctional units 220 and the generalfunctional unit 222. In the illustrative embodiment, eachregister file segment 224 has 128 32-bit registers. The first 96 registers (0-95) in theregister file segment 224 are global registers. All functional units can write to the 96 global registers. The global registers are coherent across all functional units (MFU and GFU) so that any write operation to a global register by any functional unit is broadcast to all registerfile segments 224. Registers 96-127 in theregister file segments 224 are local registers. Local registers allocated to a functional unit are not accessible or “visible” to other functional units. - The
media processing units - Instructions are executed in-order in the
processor 100 but loads can finish out-of-order with respect to other instructions and with respect to other loads, allowing loads to be moved up in the instruction stream so that data can be streamed from main memory. The execution model eliminates the usage and overhead resources of an instruction window, reservation stations, a re-order buffer, or other blocks for handling instruction ordering. Elimination of the instruction ordering structures and overhead resources is highly advantageous since the eliminated blocks typically consume a large portion of an integrated circuit die. For example, the eliminated blocks consume about 30% of the die area of a Pentium II processor. - To avoid software scheduling errors, the
media processing units media processing units processor 100 is essentially equivalent to scheduling for a simple 2-scalar execution engine for each of the twomedia processing units - The
processor 100 supports full bypasses between the first two execution units within themedia processing unit functional unit 222 for load operations so that the compiler does not need to handle nondeterministic latencies due to cache misses. Theprocessor 100 scoreboards long latency operations that are executed in the generalfunctional unit 222, for example a reciprocal square-root operation, to simplify scheduling across execution units. The scoreboard (not shown) operates by tracking a record of an instruction packet or group from the time the instruction enters a functional unit until the instruction is finished and the result becomes available. A VLIW instruction packet contains one GFU instruction and from zero to three MFU instructions. The source and destination registers of all instructions in an incoming VLIW instruction packet are checked against the scoreboard. Any true dependencies or output dependencies stall the entire packet until the result is ready. Use of a scoreboarded result as an operand causes instruction issue to stall for a sufficient number of cycles to allow the result to become available. If the referencing instruction that provokes the stall executes on the generalfunctional unit 222 or the first mediafunctional unit 220, then the stall only endures until the result is available for intra-unit bypass. For the case of a load instruction that hits in thedata cache 106, the stall may last only one cycle. If the referencing instruction is on the second or third mediafunctional units 220, then the stall endures until the result reaches the writeback stage in the pipeline where the result is bypassed in transmission to thesplit register file 216. - The scoreboard automatically manages load delays that occur during a load hit. In an illustrative embodiment, all loads enter the scoreboard to simplify software scheduling and eliminate NOPs in the instruction stream.
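- The scoreboard check described above can be sketched in a few lines of Python. This is a minimal illustration only, not the hardware design: the class and function names (Instr, Scoreboard, cycles_stalled) are hypothetical, latencies are supplied by the caller, and the sketch ignores the per-unit bypass timing that shortens or lengthens a stall depending on which functional unit issues the referencing instruction.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    """One sub-instruction of a VLIW packet (hypothetical model)."""
    dest: int            # destination register number
    srcs: tuple = ()     # source register numbers


@dataclass
class Scoreboard:
    """Tracks registers whose results are still pending."""
    pending: dict = field(default_factory=dict)  # register -> cycles left

    def record(self, dest: int, latency: int) -> None:
        # Enter a load or long-latency GFU result into the scoreboard.
        self.pending[dest] = latency

    def tick(self) -> None:
        # Advance one cycle; entries that reach zero become available.
        self.pending = {r: c - 1 for r, c in self.pending.items() if c > 1}

    def must_stall(self, packet) -> bool:
        # A true dependency (pending source) or an output dependency
        # (pending destination) on any instruction stalls the whole packet.
        for instr in packet:
            if any(src in self.pending for src in instr.srcs):
                return True
            if instr.dest in self.pending:
                return True
        return False


def cycles_stalled(sb: Scoreboard, packet) -> int:
    # Count how many cycles the packet waits before it can issue.
    stalled = 0
    while sb.must_stall(packet):
        sb.tick()
        stalled += 1
    return stalled
```

For example, after `sb.record(10, 3)`, a packet containing an instruction that reads register 10 waits three cycles in this model, while a packet that touches neither pending sources nor pending destinations issues immediately.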
- The scoreboard is used to manage most interlock conditions between the general
functional unit 222 and the media functional units 220. All loads and non-pipelined long-latency operations of the general functional unit 222 are scoreboarded. The long-latency operations include the division instructions idiv and fdiv, the reciprocal square root instructions frecsqrt and precsqrt, and the power instruction ppower. None of the results of the media functional units 220 is scoreboarded. Non-scoreboarded results are available to subsequent operations on the functional unit that produces the results following the latency of the instruction. - The
illustrative processor 100 has a rendering rate of over fifty million triangles per second without accounting for operating system overhead. Therefore, data feeding specifications of theprocessor 100 are far beyond the capabilities of cost-effective memory systems. Sufficient data bandwidth is achieved by rendering of compressed geometry using thegeometry decompressor 104, an on-chip real-time geometry decompression engine. Data geometry is stored in main memory in a compressed format. At render time, the data geometry is fetched and decompressed in real-time on the integrated circuit of theprocessor 100. Thegeometry decompressor 104 advantageously saves memory space and memory transfer bandwidth. The compressed geometry uses an optimized generalized mesh structure that explicitly calls out most shared vertices between triangles, allowing theprocessor 100 to transform and light most vertices only once. In a typical compressed mesh, the triangle throughput of the transform-and-light stage is increased by a factor of four or more over the throughput for isolated triangles. For example, during processing of triangles, multiple vertices are operated upon in parallel so that the utilization rate of resources is high, achieving effective spatial software pipelining. Thus operations are overlapped in time by operating on several vertices simultaneously, rather than overlapping several loop iterations in time. For other types of applications with high instruction level parallelism, high trip count loops are software-pipelined so that most mediafunctional units 220 are fully utilized. - Referring to
FIG. 3 , a schematic block diagram illustrates an embodiment of thesplit register file 216 that is suitable for usage in theprocessor 100. Thesplit register file 216 supplies all operands of processor instructions that execute in the mediafunctional units 220 and the generalfunctional units 222 and receives results of the instruction execution from the execution units. Thesplit register file 216 operates as an interface to thegeometry decompressor 104. Thesplit register file 216 is the source and destination of store and load operations, respectively. - In the
illustrative processor 100, thesplit register file 216 in each of themedia processing units split register file 216 so that performance is not limited by loads and stores or handling of intermediate results including graphics “fills” and “spills”. The illustrativesplit register file 216 includes twelve read ports and five write ports, supplying total data read and write capacity between the central registers of thesplit register file 216 and all mediafunctional units 220 and the generalfunctional unit 222. The five write ports include one 64-bit write port that is dedicated to load operations. The remaining four write ports are 32 bits wide and are used to write operations of the generalfunctional unit 222 and the mediafunctional units 220. - A large total read and write capacity promotes flexibility and facility in programming both of hand-coded routines and compiler-generated code.
- Large, multiple-ported register files are typically metal-limited so that the register area is proportional with the square of the number of ports. A sixteen port file is roughly proportional in size and speed to a value of 256. The illustrative
split register file 216 is divided into fourregister file segments split register file 216 is no different from a single central register file. However, from the perspective of layout efficiency, thesplit register file 216 is highly advantageous, allowing for reduced size and improved performance. - The new media data that is operated upon by the
processor 100 is typically heavily compressed. Data transfers are communicated in a compressed format from main memory and input/output devices to pins of theprocessor 100, subsequently decompressed on the integrated circuit holding theprocessor 100, and passed to thesplit register file 216. - Splitting the register file into multiple segments in the
split register file 216 in combination with the character of data accesses in which multiple bytes are transferred to the plurality of execution units concurrently, results in a high utilization rate of the data supplied to the integrated circuit chip and effectively leads to a much higher data bandwidth than is supported on general-purpose processors. The highest data bandwidth requirement is therefore not between the input/output pins and the central processing units, but is rather between the decompressed data source and the remainder of the processor. For graphics processing, the highest data bandwidth requirement is between thegeometry decompressor 104 and thesplit register file 216. For video decompression, the highest data bandwidth requirement is internal to thesplit register file 216. Data transfers between thegeometry decompressor 104 and thesplit register file 216 and data transfers between various registers of thesplit register file 216 can be wide and run at processor speed, advantageously delivering a large bandwidth. - The
register file 216 is a focal point for attaining the very large bandwidth of theprocessor 100. Theprocessor 100 transfers data using a plurality of data transfer techniques. In one example of a data transfer technique, cacheable data is loaded into thesplit register file 216 through normal load operations at a low rate of up to eight bytes per cycle. In another example, streaming data is transferred to thesplit register file 216 through group load operations, which transfer thirty-two bytes from memory directly into eight consecutive 32-bit registers. Theprocessor 100 utilizes the streaming data operation to receive compressed video data for decompression. - Compressed graphics data is received via a direct memory access (DMA) unit in the
geometry decompressor 104. The compressed graphics data is decompressed by thegeometry decompressor 104 and loaded at a high bandwidth rate into thesplit register file 216 via group load operations that are mapped to thegeometry decompressor 104. - Load operations are non-blocking and scoreboarded so that early scheduling can hide a long latency inherent to loads.
- General purpose applications often fail to exploit the
large register file 216. Statistical analysis shows that compilers do not effectively use the large number of registers in the split register file 216. However, aggressive in-lining techniques that have traditionally been restricted due to the limited number of registers in conventional systems may be advantageously used in the processor 100 to exploit the large number of registers in the split register file 216. In a software system that exploits the large number of registers in the processor 100, the complete set of registers is saved upon the event of a thread (context) switch. When only a few registers of the entire set are used, saving all registers in the full thread switch is wasteful. Waste is avoided in the processor 100 by supporting individual marking of registers. Octants of the thirty-two registers can be marked as “dirty” if used, and are consequently saved conditionally. - In various embodiments, dedicating fields for globals, trap registers, and the like leverages the
split register file 216. - Referring to
FIG. 4 , a schematic block diagram shows a logical view of theregister file 216 and functional units in theprocessor 100. The physical implementation of thecore processor 100 is simplified by replicating a single functional unit to form the three mediafunctional units 220. The mediafunctional units 220 include circuits that execute various arithmetic and logical operations including general-purpose code, graphics code, and video-image-speech (VIS) processing. VIS processing includes video processing, image processing, digital signal processing (DSP) loops, speech processing, and voice recognition algorithms, for example. - Referring to
FIG. 5, a simplified pictorial schematic diagram depicts an example of instruction execution among a plurality of media functional units 220. Results generated by various internal function blocks within a first individual media functional unit are immediately accessible internally to the first media functional unit 510 but are only accessible globally by other media functional units functional unit 510, regardless of the actual latency of the instruction. Therefore, instructions executing within a functional unit can be scheduled by software to execute immediately, taking into consideration the actual latency of the instruction. In contrast, software that schedules instructions executing in different functional units is expected to account for the five cycle latency. In the diagram, the shaded areas represent the stage at which the pipeline completes execution of an instruction and generates final result values. A result is not available internal to a functional unit until a final shaded stage completes. In the example, media processing unit instructions have three different latencies: four cycles for instructions such as fmuladd and fadd, two cycles for instructions such as pmuladd, and one cycle for instructions like padd and xor. - Although internal bypass logic within a media
functional unit 220 forwards results to execution units within the same mediafunctional unit 220, the internal bypass logic does not detect incorrect attempts to reference a result before the result is available. - Software that schedules instructions for which a dependency occurs between a particular media functional unit, for example 512, and other media
functional units functional unit 512 and the generalfunctional unit 222, is to account for the five cycle latency between entry of an instruction to the mediafunctional unit 512 and the five cycle pipeline duration. - Referring to
FIG. 6 , a schematic block diagram depicts an embodiment of themultiport register file 216. A plurality of read address buses RA1 through RAN carry read addresses that are applied to decoder ports 616-1 through 616-N, respectively. Decoder circuits are well known to those of ordinary skill in the art, and any of several implementations could be used as the decoder ports 616-1 through 616-N. When an address is presented to any of decoder ports 616-1 through 616-N, the address is decoded and a read address signal is transmitted by adecoder port 616 to a register in amemory cell array 618. Data from thememory cell array 618 is output usingoutput data drivers 622. Data is transferred to and from thememory cell array 618 under control of control signals carried on some of the lines of the buses of the plurality of read address buses RA1 through RAN. - Referring to
FIGS. 7A and 7B , a schematic block diagram and a pictorial diagram, respectively, illustrate theregister file 216 and amemory array insert 710. Theregister file 216 is connected to a fourfunctional units illustrative register file 216 has twelve readports 730 and four writeports 732. The twelve readports 730 are illustratively allocated with three ports connected to each of the four functional units. The four writeports 732 are connected to receive data from all of the four functional units. - The
register file 216 includes a decoder, as is shown in FIG. 6, for each of the sixteen read and write ports. The register file 216 includes a memory array 740 that is partially shown in the insert 710 illustrated in FIG. 7B and includes a plurality of word lines 744 and bit lines 746. The word lines 744 and bit lines 746 are simply a set of wires that connect transistors (not shown) within the memory array 740. The word lines 744 select registers so that a particular word line selects a register of the register file 216. The bit lines 746 are a second set of wires that connect the transistors in the memory array 740. Typically, the word lines 744 and bit lines 746 are laid out at right angles. In the illustrative embodiment, the word lines 744 and the bit lines 746 are constructed of metal laid out in different planes such as a metal 2 layer for the word lines 744 and a metal 3 layer for the bit lines 746. In other embodiments, bit lines and word lines may be constructed of other materials, such as polysilicon, or can reside at different levels than are described in the illustrative embodiment, that are known in the art of semiconductor manufacture. In the illustrative example, a distance of about 1 μm separates the word lines 744 and a distance of approximately 1 μm separates the bit lines 746. Other circuit dimensions may be constructed for various processes. The illustrative example shows one bit line per port; other embodiments may use multiple bit lines per port. - When a particular functional unit reads a particular register in the
register file 216, the functional unit sends an address signal via the readports 730 that activates the appropriate word lines to access the register. In a register file having a conventional structure and twelve read ports, each cell, each storing a single bit of information, is connected to twelve word lines to select an address and twelve bit lines to carry data read from the address. - The four write
ports 732 address registers in the register file using four word lines 744 and fourbit lines 746 connected to each cell. The four word lines 744 address a cell and the fourbit lines 746 carry data to the cell. - Thus, if the
illustrative register file 216 were laid out in a conventional manner with twelve read ports 730 and four write ports 732 for a total of sixteen ports and the ports were 1 μm apart, one memory cell would have an integrated circuit area of 256 μm² (16×16). The area is proportional to the square of the number of ports. - The
register file 216 is alternatively implemented to perform single-ended reads and/or single-ended writes utilizing a single bit line per port per cell, or implemented to perform differential reads and/or differential writes using two bit lines per port per cell. - However, in this embodiment the
register file 216 is not laid out in the conventional manner and instead is split into a plurality of separate and individual register file segments 224. Referring to FIG. 8, a schematic block diagram shows an arrangement of the register file 216 into the four register file segments 224. The register file 216 remains operational as a single logical register file in the sense that the four register file segments 224 contain the same number of registers and the same register values as a conventional register file of the same capacity that is not split. The separated register file segments 224 differ from a register file that is not split through elimination of lines that would otherwise connect ports to the memory cells. Accordingly, each register file segment 224 has connections to only three of the twelve read ports 730; lines connecting a register file segment to the other nine read ports are eliminated. All writes are broadcast so that each of the four register file segments 224 has connections to all four write ports 732. Thus each of the four register file segments 224 has three read ports and four write ports for a total of seven ports. The individual cells are connected to seven word lines and seven bit lines so that a memory array with a spacing of 1 μm between lines has an area of approximately 49 μm². In the illustrative embodiment, each of the four register file segments 224 has an area proportional to seven squared. The total area of the four register file segments 224 is therefore proportional to 49 times 4, a total of 196. - The split register file thus advantageously reduces the area of the memory array by a ratio of approximately 256/196 (1.3× or 30%). The reduction in area further advantageously corresponds to an improvement in speed performance due to a reduction in the length of the word lines 744 and the
bit lines 746 connecting the array cells that reduces the time for a signal to pass on the lines. The improvement in speed performance is highly advantageous due to strict time budgets that are imposed by the specification of high-performance processors and also to attain a large capacity register file that is operational at high speed. For example, the operation of reading the register file 216 typically takes place in a single clock cycle. For a processor that executes at 500 MHz, a cycle time of two nanoseconds is imposed for accessing the register file 216. Conventional register files typically only have up to about 32 registers in comparison to the 128 registers in the illustrative register file 216 of the processor 100. A register file 216 substantially larger than the register file in conventional processors is highly advantageous in high-performance operations such as video and graphic processing. The reduced size of the register file 216 is highly useful for complying with time budgets in a large capacity register file. - Referring to
FIG. 9 , a simplified schematic timing diagram illustrates timing of theprocessor pipeline 900. Thepipeline 900 includes nine stages including three initiating stages, a plurality of execution phases, and two terminating stages. The three initiating stages are optimized to include only those operations necessary for decoding instructions so that jump and call instructions, which are pervasive in the Java™ language, execute quickly. Optimization of the initiating stages advantageously facilitates branch prediction since branches, jumps, and calls execute quickly and do not introduce many bubbles. - The first of the initiating stages is a fetch
stage 910 during which theprocessor 100 fetches instructions from the 16 Kbyte two-way set-associative instruction cache 210. The fetched instructions are aligned in theinstruction aligner 212 and forwarded to theinstruction buffer 214 in analign stage 912, a second stage of the initiating stages. The aligning operation properly positions the instructions for storage in a particular segment of the fourregister file segments functional units 220 and one generalfunctional unit 222. In a third stage, adecoding stage 914 of the initiating stages, the fetched and aligned VLIW instruction packet is decoded and the scoreboard (not shown) is read and updated in parallel. The fourregister file segments - Following the
decoding stage 914, the execution stages are performed. The two terminating stages include a trap-handling stage 960 and a write-back stage 962 during which result data is written-back to thesplit register file 216. - While the invention has been described with reference to various embodiments, it will be understood that these embodiments are illustrative and that the scope of the invention is not limited to them. Many variations, modifications, additions and improvements of the embodiments described are possible. For example, those skilled in the art will readily implement the steps necessary to provide the structures and methods disclosed herein, and will understand that the process parameters, materials, and dimensions are given by way of example only and can be varied to achieve the desired structure as well as modifications which are within the scope of the invention. Variations and modifications of the embodiments disclosed herein may be made based on the description set forth herein, without departing from the scope and spirit of the invention as set forth in the following claims.
- For example, while the illustrative embodiment specifically discusses advantages gained in using the Java™ programming language with the described system, any suitable programming language is also supported. Other programming languages that support multiple-threading are generally more advantageously used in the described system. Also, while the illustrative embodiment specifically discusses advantages attained in using Java Virtual Machines with the described system, any suitable processing engine is also supported. Other processing engines that support multiple-threading are generally more advantageously used in the described system.
- Furthermore, although the illustrative register file has one bit line per port, in other embodiments more bit lines may be allocated for a port. The described word lines and bit lines are formed of a metal. In other examples, other conductive materials such as doped polysilicon may be employed for interconnects. The described register file uses single-ended reads and writes so that a single bit line is employed per bit and per port. In other processors, differential reads and writes with dual-ended sense amplifiers may be used so that two bit lines are allocated per bit and per port, resulting in a bigger pitch. Dual-ended sense amplifiers improve memory fidelity but greatly increase the size of a memory array, imposing a heavy burden on speed performance. Thus the advantages attained by the described register file structure are magnified for a memory using differential reads and writes. The spacing between bit lines and word lines is described to be approximately 1 μm. In some processors, the spacing may be greater than 1 μm. In other processors the spacing between lines is less than 1 μm.
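- The port-count area argument made for the split register file can be checked with a short calculation. The following Python sketch is illustrative only: the function name cell_area and the bit_lines_per_port parameter are hypothetical, the area is expressed in relative units (equal to μm² only at the 1 μm line spacing of the example), and the treatment of differential bit lines as simply doubling the bit-line count is an assumption rather than a statement from the description.

```python
def cell_area(ports: int, bit_lines_per_port: int = 1) -> int:
    """Relative cell area: word lines (one per port) times bit lines."""
    word_lines = ports
    bit_lines = ports * bit_lines_per_port  # assumption for differential case
    return word_lines * bit_lines


READ_PORTS, WRITE_PORTS, SEGMENTS = 12, 4, 4

# Conventional layout: every cell sees all sixteen ports.
conventional = cell_area(READ_PORTS + WRITE_PORTS)

# Split layout: each segment keeps 3 read ports plus all 4 write ports.
per_segment = cell_area(READ_PORTS // SEGMENTS + WRITE_PORTS)
split_total = SEGMENTS * per_segment

print(conventional, per_segment, split_total)   # 256 49 196
print(round(conventional / split_total, 2))     # 1.31
```

Under these assumptions the conventional array is about 1.3 times the size of the four segments combined, which matches the 256/196 ratio given in the description.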
- Exemplary Instruction Set Architecture
- The material that follows provides a detailed description of an exemplary instruction set suitable for use in a processor architecture such as illustrated in the above-referenced drawings and described elsewhere herein.
- Except for symbols reserved for registers, Café assembler symbols are just like those for the SPARC assembler. See the SPARC assembler manual for those details.
- The Café assembler uses the .proc pseudo-op similar to SPARC's, but defines no operand for it. This pseudo-op should be used to mark the beginning of a function so the assembler can know to require the beginning of an instruction word. This only makes a practical difference if an immediately preceding function ends with an instruction that does not appear to consummate an instruction word, but using the .proc pseudo-op is a good habit in any case.
- General Purpose Registers
- Café has 256 32-bit general purpose registers, numbered 0 through 255. The assembler reserves symbols of the form:
[Rr]<digit><digit>*
for general purpose register specifiers. That is, the letter R in either case followed by a non-empty string of decimal digits denotes a general purpose register. Strings of digits indicating values greater than 255 are diagnosed as errors. - In addition to these canonical register names, a few symbols are reserved for aliases of registers that have special uses.
sp   Stack Pointer, an alias for r1.
lp   Link Pointer. The call instruction puts the return address in lp, which is an alias for r2.
fp   Frame Pointer, an alias for r3.
gmr  GPU-to-MPU by-pass register, an alias for r6.
mgr  MPU-to-GPU by-pass register, an alias for r7.
Other register alias symbols may be defined as register use conventions evolve. None of these symbols is case-sensitive.
- In addition to register symbols, a general purpose register can be denoted by a register expression of the form:
%%[Rr]?<constant-expression>
- That is, a double percent sign, optionally followed by the letter R in either case, followed by a first-pass (no forward references, no relocations) constant expression in the range of zero to 255. The optional R is of no value to the assembler, but some users believe it's useful to see it in the source.
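- The register naming rules above can be illustrated with a small recognizer. This Python sketch is not the Café assembler: the function name parse_register is hypothetical, the link-pointer alias is assumed to be spelled lp, and the constant expression of a register expression is simplified to a plain decimal literal, since the description does not define the expression grammar.

```python
import re

MAX_REG = 255

# Alias symbols and the registers they name (case-insensitive).
ALIASES = {"sp": 1, "lp": 2, "fp": 3, "gmr": 6, "mgr": 7}

# [Rr]<digit><digit>*  : canonical register symbol
_SYMBOL = re.compile(r"[Rr]([0-9]+)")
# %%[Rr]?<constant>    : register expression (decimal literal only here)
_EXPR = re.compile(r"%%[Rr]?([0-9]+)")


def parse_register(token: str) -> int:
    """Return the register number named by token, or raise ValueError."""
    alias = ALIASES.get(token.lower())
    if alias is not None:
        return alias
    match = _SYMBOL.fullmatch(token) or _EXPR.fullmatch(token)
    if match is None:
        raise ValueError(f"not a register specifier: {token!r}")
    number = int(match.group(1))
    if number > MAX_REG:
        # Values greater than 255 are diagnosed as errors.
        raise ValueError(f"register number out of range: {token!r}")
    return number
```

With these assumptions, `parse_register("R17")` and `parse_register("%%r17")` both yield 17, `parse_register("SP")` yields 1, and `parse_register("r256")` raises an error.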
-
- r0 is a fiat-zero source operand and a result-sink register.
Control and Status Registers
- In order to provide an extensible namespace for control and status registers, symbols denoting them begin with %. None of these symbols is case-sensitive.
- Program Counters
- Documentation refers to a program counter, % pc, and its sidekick % npc, but it's not apparent that either is used by the assembler.
- Processor Status Register
- Café has a program status register, for which this assembler uses the symbol % psr. The layout of % psr follows. % psr can be read and modified by the getir and setir instructions using its internal register ordinal, 1.
- Bit 24 of % psr specifies the processor ID. Clear denotes cpu0, and set denotes cpu1.
-
Bits 23 and 22 of % psr specify the current Trap Level. - Bit 21 of % psr determines the endianness of loads and stores. The initial state of this bit is clear, which means big-endian. When set it means little-endian.
- Bit 20 of % psr is the Instruction Address Check Enable flag. Its description is yet to be supplied.
- Bit 19 of % psr is the Data Address Check Enable flag. Its description is yet to be supplied.
- Bit 18 of % psr is the Garbage Check Enable flag. Its description is yet to be supplied.
- Bit 17 of % psr is the Data Cache Enable flag. When set, the data cache is enabled; when clear, data cache is disabled.
- Bit 16 of % psr is the Instruction Cache Enable flag. When set, the instruction cache is enabled; when clear, instruction cache is disabled.
- Bit 15 of % psr is the Supervisor Mode flag. When set it indicates the processor is in supervisor mode, which allows certain privileged activities. Among these privileged activities is the ability to change all but the two right-most fields of % psr. The Supervisor Mode flag is most often set during trap handling, which is explained in the Traps section of the microarchitecture manual.
- Bit 14 of % psr is the Interrupt Enable flag. When set, interrupts are enabled. Its use is explained in the Traps section of the microarchitecture manual.
- Bits 13 through 10 of % psr are the Processor Interrupt Level, which is explained in the Traps section of the microarchitecture manual.
- The % psr fields that can be set when the Supervisor Mode flag is clear are grouped together at the low-order end. They follow.
- A 2-bit field of % psr specifies the mode in effect for the saturated arithmetic performed by some of the parallel integer operations. The bounds for saturation are given in the adjacent table. Modes 00 and 01 are expressed as two's-complement 16-bit integers. Mode 10 is expressed in S.15 fixed-point. Mode 11 is S2.13 fixed-point. The simulator is using bits 8 and 9 of % psr for this specification.
        bounds
mode    low                  high
00      000000000 . . . 0    011111111 . . . 1
01      100000000 . . . 0    011111111 . . . 1
10      100000000 . . . 0    011111111 . . . 1
11      111000000 . . . 0    001000000 . . . 0
- (For those of us habituated to the common floating-point representation, the Si.f notation in the preceding paragraph can be confusing. In these fixed point formats, the sign-bit S is one bit of the integer part of the number. For example, in S2.13 format, the integer part is a two's complement 3-bit number. There is no “dedicated” sign-bit as with the floating-point representation, and thus no negative zero to worry about.)
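- The saturation bounds can be made concrete with a short sketch. It assumes that every bound in the table is a 16-bit pattern (the table elides the middle bits) and writes each bound as a signed integer; the names used are illustrative only.

```python
# Saturation bounds per mode, as signed 16-bit integer values.
# Assumption: each elided bit pattern in the table is 16 bits wide.
SATURATION_BOUNDS = {
    0b00: (0x0000, 0x7FFF),
    0b01: (-0x8000, 0x7FFF),
    0b10: (-0x8000, 0x7FFF),  # S.15 fixed-point
    0b11: (-0x2000, 0x2000),  # S2.13 fixed-point
}


def saturate(value, mode):
    """Clamp an integer result to the bounds of the given mode."""
    low, high = SATURATION_BOUNDS[mode]
    return max(low, min(high, value))


def s2_13_to_float(raw):
    """Interpret a signed 16-bit value as S2.13 (13 fraction bits)."""
    return raw / (1 << 13)
```

Read this way, the mode 11 bounds correspond to -1.0 and +1.0 in S2.13.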
- The low-order eight bits of % psr are used as dirty bits for octants of the general purpose register file. A new process begins with all the dirty bits clear, and an octant's dirty bit is set when a register in that octant is written. Café's large register file is a formidable lot of state to manage during a context-switch. An isolated region of a long-running program that causes dirty bits to be set should clear them when it's safe to do so. Within this 8-bit field a given register number N corresponds to the bit (1<<(N>>5)).
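- The mapping from register number to dirty bit is a one-liner. The sketch below uses hypothetical helper names and models only the low-order eight bits of the register.

```python
def dirty_bit_mask(register_number):
    """Return the dirty bit for register N, that is 1 << (N >> 5)."""
    if not 0 <= register_number <= 255:
        raise ValueError("register number must be in 0..255")
    return 1 << (register_number >> 5)


def mark_written(psr, register_number):
    """Return psr with the dirty bit of the written register's octant set."""
    return psr | dirty_bit_mask(register_number)
```

Registers 0 through 31 share bit 0, registers 32 through 63 share bit 1, and so on up to registers 224 through 255, which share bit 7.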
- Trap Base Register
- A vector of trap handler addresses is pointed-to by a trap base register, for which the assembler uses the symbol % tbr. Only the high-order 19 bits of % tbr are used to address the vector, so the vector must be positioned at an 8192-byte boundary. Details for use of the vector are described in the “Traps” chapter of the Café Architecture Manual.
- The assembler's only concern is the ability to read or set % tbr using the getir and setir instructions. Reads of the low-order 13 bits of % tbr always return zero, and writes to the low-order 13 bits of % tbr are always ignored.
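- The read and write behavior of the low-order 13 bits can be modeled as a mask. This sketch assumes a 32-bit register (19 high-order bits plus 13 low-order bits); the function names are hypothetical.

```python
TBR_LOW_MASK = (1 << 13) - 1                 # low-order 13 bits
TBR_HIGH_MASK = 0xFFFFFFFF & ~TBR_LOW_MASK   # high-order 19 bits (32-bit assumption)


def write_tbr(value):
    """Model a write: the low-order 13 bits are ignored, so they read as zero."""
    return value & TBR_HIGH_MASK


def is_valid_vector_base(address):
    """A trap vector must sit on an 8192-byte boundary."""
    return address % 8192 == 0
```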
- No other control or status registers are described yet. Those that will be defined that can be read or set will be accessible using the getir and setir instructions.
- Instruction Set
- Instruction Formats
- Note: Since this document was written the term “SFU” has evolved to “GFU”, and the term “UFU” has evolved to “MFU”. Until there's no problem more important than changing all occurrences of the old terms here, the author apologizes for the inconvenience.
- Café instructions are issued in instruction words composed of one SFU instruction and zero to three UFU instructions. An SFU instruction begins with a 2-bit header field that is a count of the UFU instructions that follow in the instruction word. All of the instructions in an instruction word are issued in the same cycle.
- When there isn't useful work to do on all the UFUs, UFU instructions need not be present. However, the UFU on which an instruction executes is determined by the position of the instruction in the instruction word. To cause an instruction to execute on the second or third UFU, there must have been instructions in the previous slots of the instruction word. This is an issue when trying to avoid the latency of propagating a result from one FU to another.
- The assembler infers the beginning of an instruction word from the presence of an SFU instruction. UFU instructions that follow form the rest of the instruction word. More than three consecutive UFU instructions are reported as a fatal error, since the assembler cannot create a well-formed Café instruction word from that.
- Several mnemonics denote instructions implemented both as SFU and UFU operations. These mnemonics indicate an SFU instruction only when used at the beginning of an instruction word. An instruction word boundary is established when the immediately preceding instruction word uses all three of its UFU slots or by the presence of two adjacent semicolons (;;), the instruction word delimiter.
- Algebraically, if not lexically, the double semicolon is a full colon, meaning it's time to flush the instruction word. For example:
- ;; add r6,1,r6; add r7,1,r7;;
- is a single instruction word beginning with an SFU add operation and having a single UFU operation, also an add. But the similar pattern:
- ;; add r6,1,r6;; add r7,1,r7;;
- is two instruction words, each consisting of an SFU add operation.
SFU Instruction Formats
- An SFU instruction begins with a 2-bit header (labeled hdr in the instruction format diagrams appearing later in this section) that gives the number of UFU instructions that follow the SFU instruction in the instruction word. That is, the header values and instruction word contents they indicate are:
header value   instructions in instruction word
00             SFU only
01             SFU + UFU1
10             SFU + UFU1 + UFU2
11             SFU + UFU1 + UFU2 + UFU3
- The first two bits of an SFU opcode determine the class of the operation. The values and classes are:
00   Call and branch
01   Compute
10   Memory (uncacheable)
11   Memory (cacheable)
- Generally, the third bit of an SFU opcode, the i-bit, is set when an operation uses an immediate for its second source operand and clear when it does not. SFU opcodes beginning with 00 (call and branch) are 6 bits, and all others are 8 bits.
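- The class, i-bit, and opcode-length rules can be summarized in a small decoder. This is a sketch only: it takes the opcode as an 8-bit integer, and the function and dictionary names are illustrative rather than anything defined by the assembler.

```python
SFU_CLASSES = {
    0b00: "call and branch",
    0b01: "compute",
    0b10: "memory (uncacheable)",
    0b11: "memory (cacheable)",
}


def describe_sfu_opcode(opcode8):
    """Classify an SFU opcode given as an 8-bit integer."""
    cls = (opcode8 >> 6) & 0b11          # first two bits select the class
    return {
        "class": SFU_CLASSES[cls],
        # Call-and-branch opcodes are 6 bits; all others are 8 bits.
        "opcode_bits": 6 if cls == 0b00 else 8,
        # The third bit is the i-bit; it is not reported here for
        # call-and-branch opcodes, which do not generally use it.
        "i_bit": None if cls == 0b00 else (opcode8 >> 5) & 1,
    }


def ufu_count(header):
    """The 2-bit header is the count of UFU instructions that follow."""
    return header & 0b11
```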
- Opcodes for the memory operations can be shown in a matrix where the bits usually indicate cacheability, signedness, size, and direction:
Memory (cacheable, leading 11) Opcodes
              opcode[2:0]
              0xx (unsigned)                       1xx (signed)
opcode[7:3]   byte    short   word    long         byte    short   word    long
11ixx         000     001     010     011          100     101     110     111
11i00         ldub    ldus    lduw    ldpair       ldb     lds     ldw     (ldg)
11i01         -       lduso   lduwo   ld_diag      -       ldso    ldwo    prefetch
11i10         stb     sts     stw     stpair       cstb    csts    cstw    -
11i11         s2ib    stso    stwo    st_diag      -       -       cas     -
Memory (uncacheable, leading 10) Opcodes
              opcode[2:0]
              0xx (unsigned)                       1xx (signed)
opcode[7:3]   byte    short    word     long       byte    short   word    long
10ixx         000     001      010      011        100     101     110     111
10i00         ncldub  ncldus   nclduw   ncldg      ncldb   nclds   ncldw   -
10i01         -       nclduso  nclduwo  -          -       ncldso  ncldwo  -
10i10         ncstb   ncsts    ncstw    ncstpair   -       -       -       -
10i11         -       ncstso   ncstwo   -          -       -       -       -
- Opcodes in the compute (leading 01) quadrant of the SFU opcode space generally are not assigned in ways where the bit patterns reveal much other than where the i-bit is used. Mnemonics in both the upper and lower halves of this table are those for opcodes that are the same except for a clear or set i-bit. Note that there are only three free spaces in the upper half and sixteen free in the lower half.
Compute (leading 01) Opcodes
              opcode[2:0]
opcode[7:3]
01ixx         000      001      010      011       100     101        110         111
i = 0
01000         add      -        sub      -         not     or         and         xor
01001         idiv     rem      ppower   cmovenz   sethi   cmovez     blockaddr   -
01010         shll     shrl     shra     -         cmpeq   cmplt      frecsqrt    -
01011         fcmpeq   fcmplt   fcmple   cmpult    fdiv    precsqrt   getir       setir
i = 1
01100         add      -        sub      -         -       or         and         xor
01101         idiv     rem      -        cmovenz   -       cmovez     -           -
01110         shll     shrl     shra     -         cmpeq   cmplt      -           -
01111         -        -        -        cmpult    -       -          -           -
- Opcodes in the call and branch (leading 00) quadrant of the SFU opcode space have some irregularities compared to other SFU opcodes. Since call and nop opcodes must be unique in their higher-order six bits, they have effective footprints of four opcodes each. Similarly, bz and bnz, with their prediction qualifiers, each use up four opcode slots. This quadrant does not use the i-bit as the other three do.
Call and branch (leading 00) Opcodes
              opcode[2:0]
opcode[7:3]
00xxx         000        001       010       011          100     101      110      111
00000         call       n/a       n/a       n/a          bz      bz,pt    bz,ph    bz,ph,pt
00001         bnz        bnz,pt    bnz,ph    bnz,ph,pt    jmpl    done     retry    sir
00010         softtrap   iflush    -         -            -       -        -        -
00011         -          -         -         -            -       -        -        -
00100         nop        nop       nop       nop          -       -        -        -
00101         -          -         -         -            -       -        -        -
00110         softtrap   membar    -         -            -       -        -        -
00111         -          -         -         -            -       -        -        -
The SFU instruction formats are:
- memory and compute (See FIG. 10A)
- sethi (See FIG. 10B)
- Only the sethi instruction uses this format.
- call (See FIG. 10C)
- branch (See FIG. 10D)
- Only the bz and bnz instructions use this format.
- UFU Instruction Formats
- three-source operand (See FIG. 10E)
- two-source operand (See FIG. 10F)
- The immediate field of the UFU two-source instruction is 14 bits, to be consistent with very similar operations using the SFU compute format.
- Instruction Categories
Integer Instructions
mnemonic   argument list                    opcode                L  operation
add        rs1,reg_or_imm14,rd              S-01i00000 U-000000   1  Add
addc       rs1,rs2,rs3,rd                   U-000001              1  Add with carry
cmovenz    rs1,reg_or_imm14,rd              S-01i01011            1  Move if not zero
           rs1,reg_or_imm8,rd               U-pseudo-op
cmovez     rs1,reg_or_imm14,rd              S-01i01101            1  Move if zero
           rs1,reg_or_imm8,rd               U-pseudo-op
cmpeq      rs1,reg_or_imm14,rd              S-01i10100 U-101100   1  Compare equals
cmplt      rs1,reg_or_imm14,rd              S-01i10101 U-101110   1  Compare less than
cpicknz    rs1,reg_or_imm8,reg_or_imm8,rd   U-101010              1  Conditionally (non-zero) pick
cpickz     rs1,reg_or_imm8,reg_or_imm8,rd   U-101011              1  Conditionally (zero) pick
cmpult     rs1,reg_or_imm14,rd              S-01i11011 U-101111   1  Compare unsigned less than
genborrow  rs1,reg_or_imm14,rd              U-101000              1  Generate borrow
gencarry   rs1,reg_or_imm14,rd              U-101001              1  Generate carry
getir      rs1,rd                           S-01011110            ?  Get internal register
idiv       rs1,reg_or_imm14,rd              S-01i01000            ?  Division
moveind    rs1,[rs2]                        U-100010              2  Move indirect
mul        rs1,reg_or_imm14,rd              U-000110              2  Multiply
muladd     rs1,reg_or_imm8,reg_or_imm8,rd   U-001000              2  Multiply add
mulsub     rs1,reg_or_imm8,reg_or_imm8,rd   U-001001              2  Multiply subtract
padd       rs1,reg_or_imm14,rd              U-000100              1  Parallel add
padds      rs1,rs2,rd                       U-110100              1  Saturating parallel add
pcmovenz   rs1,rs2,rd                       U-111100              1  Parallel move if not zero
pcmovez    rs1,rs2,rd                       U-111101              1  Parallel move if zero
pcmpeq     rs1,reg_or_imm14,rd              U-110001              1  Parallel compare equal
pcmplt     rs1,reg_or_imm14,rd              U-110011              1  Parallel compare less than
pmul       rs1,reg_or_imm14,rd              U-000111              2  Parallel multiply
pmuladd    rs1,rs2,rs3,rd                   U-001010              2  Parallel multiply add
pmuladds   rs1,rs2,rs3,rd                   U-111010              2  Saturating parallel multiply add
pmulsub    rs1,rs2,rs3,rd                   U-001011              2  Parallel multiply subtract
pmulsubs   rs1,rs2,rs3,rd                   U-111011              2  Saturating parallel multiply subtract
psub       rs1,reg_or_imm14,rd              U-000101              1  Parallel subtract
psubs      rs1,rs2,rd                       U-110101              1  Saturating parallel subtract
rem        rs1,reg_or_imm14,rd              S-01i01001            ?  Remainder
sethi      imm22,rd                         S-01001100            1  Sethi
setir      rs1,rd                           S-01011111            1  Set internal register
sub        rs1,reg_or_imm14,rd              S-01i00010 U-000010   1  Subtract
subc       rs1,rs2,rs3,rd                   U-000011              1  Subtract with carry
Logical Instructions
mnemonic   argument list         opcode                L  operation
and        rs1,reg_or_imm14,rd   S-01i00110 U-010110   1  And
cccb       rs1,reg_or_imm14,rd   U-111110              1  Count consecutive clear bits
not        rs1,rd                S-01i00100 U-010100   1  Not
or         rs1,reg_or_imm14,rd   S-01i00101 U-010101   1  Or
pshll      rs1,reg_or_imm14,rd   U-011100              1  Parallel shift left logical
pshra      rs1,reg_or_imm14,rd   U-011110              1  Parallel shift right arithmetic
pshrl      rs1,reg_or_imm14,rd   U-011101              1  Parallel shift right logical
shll       rs1,reg_or_imm14,rd   S-01i10000 U-011000   1  Shift left logical
shra       rs1,reg_or_imm14,rd   S-01i10010 U-011010   1  Shift right arithmetic
shrl       rs1,reg_or_imm14,rd   S-01i10001 U-011001   1  Shift right logical
xor        rs1,reg_or_imm14,rd   S-01i00111 U-010111   1  Exclusive or
Floating Point Instructions
mnemonic   argument list    opcode       L  operation
clip       rs1,rs2,rs3,rd   U-011111     1  Clip
fadd       rs1,rs2,rd       U-001100     4  Single precision addition
fcmpeq     rs1,rs2,rd       S-01011000   1  FP compare equals
fcmple     rs1,rs2,rd       S-01011010   1  FP compare less than or equals
fcmplt     rs1,rs2,rd       S-01011001   1  FP compare less than
fdiv       rs1,rs2,rd       S-01011100   6  Single precision division
fmul       rs1,rs2,rd       U-001110     4  Single precision multiplication
fmuladd    rs1,rs2,rs3,rd   U-010000     4  Single precision multiply-add
fmulsub    rs1,rs2,rs3,rd   U-010001     4  Single precision multiply-subtract
frecsqrt   rs1,rd           S-01010110   6  Single precision reciprocal square root
fsub       rs1,rs2,rd       U-001101     4  Single precision subtraction
- Fixed-Point Instructions
- The fixed-point operands of these instructions are in S2.13 format; that is, these instructions are unaffected by the fixed-point mode bits of the psr. Precision may be lost when a result is rendered in that format. Overflows saturate.
mnemonic   argument list   opcode       L  operation
ppower     rs1,rs2,rd      S-01i01010   6  Parallel exponentiation
precsqrt   rs1,rd          S-01011101   6  Parallel reciprocal square root
Convert Instructions
mnemonic   argument list         opcode     L  operation
fix2flt    rs1,reg_or_imm14,rd   U-100111   4  Fixed point to single precision
flt2fix    rs1,reg_or_imm14,rd   U-100110   4  Single precision to fixed point
Control Flow Instructions
mnemonic   argument list                 opcode       L  operation
bndck      rs1,reg_or_imm8,reg_or_imm8   U-011011     ?  Bound check
bnz        rd,label                      S-000010ht   1  Branch if not zero
bz         rd,label                      S-000001ht   1  Branch if zero
call       label                         S-000000     1  Call
done       no arguments                  S-00001101   ?  Skip trapped instruction
jmpl       rs1,rd                        S-00001100   2  Jump and link
nop        no arguments                  S-001000xx   1  Null operation
retry      no arguments                  S-00001110   ?  Retry trapped instruction
return     rs1                           pseudo-op    2  Return
sir        no arguments                  S-00001111   ?  Software-initiated reset
softtrap   rs1,reg_or_imm14              S-00i10000   ?  Software-initiated trap
- Memory Access Instructions
- Café memory accesses can be done either big- or little-endian, with big-endian being the default. Endianness is controlled by a bit in % psr.
- Instructions are stored big-endian. The assembler and linker assume that relocations and other initializations in sections other than code sections should also be treated as big-endian.
- Most memory access instructions use a two-component effective address specification, denoted in this document by [address]. [address] may be [rs1+rs2], [rs1+simm14], or [rs1]. When [rs1] is specified, the assembler infers an immediate zero for the second address component of the instruction.
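- The three forms of [address] reduce to one computation. The following sketch assumes 32-bit addresses and models the register file as a dictionary; the helper names are hypothetical.

```python
def sign_extend(value, bits):
    """Sign-extend the low-order `bits` bits of value."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def effective_address(regs, rs1, rs2=None, simm14=None):
    """Compute [rs1+rs2], [rs1+simm14], or [rs1] (an implied immediate zero)."""
    if rs2 is not None and simm14 is not None:
        raise ValueError("use either a register or an immediate, not both")
    if rs2 is not None:
        offset = regs[rs2]
    else:
        offset = sign_extend(simm14 or 0, 14)
    return (regs[rs1] + offset) & 0xFFFFFFFF  # assumption: 32-bit addresses
```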
mnemonic   argument list   opcode       L  operation
cas        rs1,[rs2],rd    S-11011110   ?  Compare and swap (atomic)
cstb       rd,rs1,[rs2]    S-11010100   ?  Conditional store byte
csts       rd,rs1,[rs2]    S-11010101   ?  Conditional store short
cstw       rd,rs1,[rs2]    S-11010110   ?  Conditional store word
iflush     no arguments    S-00010001   ?  Flush instruction pipe
ldb        [address],rd    S-11i00100   2  Load byte
ldpair     [address],rd    S-11i00011   ?  Load pair
lds        [address],rd    S-11i00101   2  Load short
ldso       [address],rd    S-11i01101   2  Load short other-endian
ldub       [address],rd    S-11i00000   2  Load unsigned byte
ldus       [address],rd    S-11i00001   2  Load unsigned short
lduso      [address],rd    S-11i01001   2  Load unsigned short other-endian
lduw       [address],rd    S-11i00010   2  Load unsigned word
lduwo      [address],rd    S-11i01010   2  Load unsigned word other-endian
ldw        [address],rd    S-11i00110   2  Load word
ldwo       [address],rd    S-11i01110   2  Load word other-endian
membar     no arguments    S-00110001   ?  Memory barrier
ncldb      [address],rd    S-10i00100   ?  Non-cacheable load byte
ncldg      [address],rd    S-10i00011   ?  Non-cacheable load group
nclds      [address],rd    S-10i00101   ?  Non-cacheable load short
ncldso     [address],rd    S-10i01101   ?  Non-cacheable load short other-endian
ncldub     [address],rd    S-10i00000   ?  Non-cacheable load unsigned byte
ncldus     [address],rd    S-10i00001   ?  Non-cacheable load unsigned short
nclduso    [address],rd    S-10i01001   ?  Non-cacheable load unsigned short other-endian
nclduw     [address],rd    S-10i00010   ?  Non-cacheable load unsigned word
nclduwo    [address],rd    S-10i01010   ?  Non-cacheable load unsigned word other-endian
ncldw      [address],rd    S-10i00110   ?  Non-cacheable load word
ncldwo     [address],rd    S-10i01110   ?  Non-cacheable load word other-endian
ncstb      rd,[address]    S-10i10000   ?  Non-cacheable store byte
ncstpair   rd,[address]    S-10i10011   ?  Non-cacheable store pair
ncsts      rd,[address]    S-10i10001   ?  Non-cacheable store short
ncstso     rd,[address]    S-10i11001   ?  Non-cacheable store short other-endian
ncstw      rd,[address]    S-10i10010   ?  Non-cacheable store word
ncstwo     rd,[address]    S-10i11010   ?  Non-cacheable store word other-endian
s2ib       rd,[address]    S-11i11000   2  Store byte to instruction
stb        rd,[address]    S-11i10000   1  Store byte
stpair     rd,[address]    S-11i10011   ?  Store pair
sts        rd,[address]    S-11i10001   1  Store short
stso       rd,[address]    S-11i11001   1  Store short other-endian
stw        rd,[address]    S-11i10010   1  Store word
stwo       rd,[address]    S-11i11010   1  Store word other-endian
[still missing prefetch, ld_diag and st_diag, at least]
-
Pixel Instructions
mnemonic      argument list    opcode     L  operation
bitext        rs1,rs2,rs3,rd   U-111111   2  Bit extract
byteshuffle   rs1,rs2,rs3,rd   U-100001   2  Byte extract
pack          rs1,rs2,rs3,rd   U-100000   1  Pack
pdist         rs1,rs2,rd       U-100101   4  Pixel distance
pmean         rs1,rs2,rd       U-100100   1  Parallel mean
more as I learn more . . .
Scheduling
- Note: This section is superseded by the “Code Scheduling Guidelines” section of the Café Microprocessor Architecture Manual. Please feel free to recommend improvements or corrections to this material, but refer to the manual for the definitive treatment.
- None of the results of UFU operations is scoreboarded. All loads and a few non-pipelined long-latency SFU operations such as idiv, fdiv, frecsqrt, precsqrt, and ppower (the complete list is not determined) are scoreboarded.
- Non-scoreboarded results are available to subsequent operations on the unit that produces them after their latencies; earlier use is erroneous. Latencies are shown in the columns labeled “L” in the tables in the preceding section.
- A result produced on one unit is available as an operand on another unit when it is being written to the register file (that is, when it reaches pipe stage W1). Earlier use is erroneous. This takes 5 cycles on a UFU. On the SFU this takes one cycle more than the latency of the producing operation. The difference is because UFU results have to pass through all 4 E-stages of a pipe before reaching W1, and SFU results do not.
- Use of a scoreboarded result register as an operand causes instruction issue to stall for as many cycles as it takes for that result to become available. If the referencing instruction that provokes the stall is also on the SFU, the stall is only until the result is available for intra-unit bypass. In the case of a load that hits in the cache, the stall could be as short as a single cycle. If the referencing instruction is on a UFU, the stall lasts until the result reaches stage W1, where it can be bypassed on its way to the register file.
- To help improve the latency of feeding operations not available on all units, special bypass registers are available. A completed (that is, finished but not yet in W1) instruction's result can be bypassed from the first UFU to the SFU if its destination register is r4. Likewise, a completed instruction's result can be bypassed from the SFU to the first UFU if its destination register is r5.
- Multiple writes to the same register of the register file in the same cycle are an erroneous condition with an undefined result. More generally, any time a given register is the destination of more than one operation in progress, the program is erroneous. Suppose that a unit's 4-E-stage pipe sees a sequence of operations like:
cycle   operation
0       4-cycle op → r10
1       4-cycle op → r10
2       anything
3       anything
4       any op using cycle zero's r10 as a source
5       any op using cycle one's r10 as a source
- Before being too awestruck by this tight scheduling, consider what happens if issue has to stall, say for icache fill, in cycle two or three: both uses of r10 will get the result produced by the op that started in cycle one. Never issue a redefinition of a register's value between a definition and a use. The value that will reach the use is not deterministic.
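- The hazard can be illustrated with a deliberately simplified timing model. The model assumes that operations already in the pipe keep advancing while issue is stalled, and that a use sees the most recent result that has become ready; both are assumptions of this sketch, not statements from the architecture manual.

```python
def value_seen(use_cycle, writes):
    """Return the newest value whose ready cycle is at or before use_cycle."""
    ready = [(cycle, value) for cycle, value in writes if cycle <= use_cycle]
    return max(ready)[1] if ready else None


def run(stall_cycles):
    """Model the six-cycle sequence with a stall inserted in cycle two or three."""
    # The 4-cycle ops issued in cycles 0 and 1 both target r10.
    writes = [(0 + 4, "first"), (1 + 4, "second")]
    # The uses were scheduled for cycles 4 and 5; a stall delays both of them.
    uses = [4 + stall_cycles, 5 + stall_cycles]
    return [value_seen(cycle, writes) for cycle in uses]
```

With no stall the two uses see the first and second results respectively. With any stall of one cycle or more, both uses see the second result.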
- Instruction Details
- add
- add is an integer instruction that computes “r[rs1]+r[rs2]” or “r[rs1]+sign_ext(imm14)”. The use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The resulting sum is left in r[rd].
- The suggested assembler syntax is:
-
- add rs1, reg_or_imm14, rd
- The SFU version of this instruction uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- addc, subc
- addc and subc are integer instructions that compute “r[rs1]+r[rs2]+r[rs3]” or “r[rs1]−r[rs2]−r[rs3]”, respectively. Only the least significant bit of r[rs3], which is expected to be the result of a gencarry or genborrow, is used. The result is left in r[rd].
- The suggested assembler syntax is:
-
- addc rs1, rs2, rs3, rd
- subc rs1, rs2, rs3, rd
- addc and subc are UFU instructions that use the UFU 3-source format.
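- A minimal model of the two operations follows. It assumes 32-bit registers with wrap-around results, which this section does not state explicitly; the Python function names simply mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def addc(rs1_value, rs2_value, rs3_value):
    """r[rs1] + r[rs2] + carry, where only bit 0 of r[rs3] is used."""
    return (rs1_value + rs2_value + (rs3_value & 1)) & MASK32


def subc(rs1_value, rs2_value, rs3_value):
    """r[rs1] - r[rs2] - borrow, where only bit 0 of r[rs3] is used."""
    return (rs1_value - rs2_value - (rs3_value & 1)) & MASK32
```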
- and
- and is a logical instruction that computes “r[rs1] & r[rs2]” or “r[rs1] & imm14”. The use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The result is left in r[rd].
- The suggested assembler syntax is:
-
- and rs1, reg_or_imm14, rd
- The SFU version of and uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- bitext
- bitext is a pixel instruction that extracts bits from the pair of registers r[rs1] and r[rs2]. The extracted field is described by a 6-bit length in bits 21 . . . 16 of r[rs3] and a 5-bit skip count in bits 4 . . . 0 of r[rs3]. The skip count is applied at the left-most (high-order) end of r[rs1]. The extracted field is right-justified in r[rd] without sign-extension.
- (The illustration below is appropriated from someone else's slide. I hope the stylistic change is not too jarring, but prose description has not been lucid for some readers.)
- See FIG. 11
- The suggested assembler syntax is:
-
- bitext rs1, rs2, rs3, rd
- bitext is a UFU operation that uses the UFU 3-source format.
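- For readers who prefer code to pictures, here is a sketch of the extraction. It assumes 32-bit registers, treats r[rs1] as the high-order half of the 64-bit pair, and rejects lengths above 32 because the behavior for such lengths is not described here.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def bitext(rs1_value, rs2_value, rs3_value):
    """Extract a bit field from the register pair r[rs1]:r[rs2]."""
    length = (rs3_value >> 16) & 0x3F   # bits 21..16 of r[rs3]
    skip = rs3_value & 0x1F             # bits 4..0 of r[rs3]
    if length > 32:
        raise ValueError("lengths above 32 are not modeled in this sketch")
    pair = ((rs1_value & MASK32) << 32) | (rs2_value & MASK32)
    shift = 64 - skip - length          # bits remaining to the right of the field
    # Right-justify the field; no sign-extension is applied.
    return (pair >> shift) & ((1 << length) - 1)
```

For example, skipping 28 bits and extracting 8 takes the low-order 4 bits of r[rs1] followed by the high-order 4 bits of r[rs2].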
- bndck
- bndck is a control flow instruction that causes a trap if “r[rs1]==0”, if the second source operand is less than zero, or if the second source operand is greater than or equal to the third source operand. What it means to trap is not yet defined.
- The second source operand can be either r[rs2] or sign_ext(imm8), and the third source operand can be either r[rs3] or sign_ext(imm8). The use of an immediate for either causes the appropriate header bit to be set in the instruction.
- The suggested assembler syntax is:
-
- bndck rs1, reg_or_imm8, reg_or_imm8
- bndck is a UFU operation that uses the UFU 3-source format. The operation has no use for this format's r[rd], which the assembler will make appear to be zero.
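- The trap condition of bndck is compact enough to state as a predicate. The sketch treats the second and third operands as signed integers, which is an assumption, and the function name is hypothetical.

```python
def bndck_traps(rs1_value, second, third):
    """Return True when bndck would trap.

    Traps when r[rs1] is zero, when the second operand is negative,
    or when the second operand is not below the third operand.
    """
    return rs1_value == 0 or second < 0 or second >= third
```

One plausible reading is that r[rs1] holds a reference, the second operand an index, and the third operand a length, so a single operation covers both a null check and a range check.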
bnz
- bnz is a control flow instruction for a branch to the offset implied by the difference between the program counter and the specified label if the value in “r[rd]” is not equal to the integer zero. “label” is a label at the branch target; the assembler will either determine the displacement or generate relocation information so a linker can determine it. Whenever the displacement is determined, it must be expressible in a signed 22-bit field.
- The mnemonic may be followed optionally by the qualifier ,pt, which means this conditional branch is statically predicted to be taken. The use of this qualifier sets the T-bit in the instruction.
- The suggested assembler syntax is:
-
- bnz rd, label
- bnz,pt rd, label
- bnz is an SFU operation that uses the branch instruction format.
byteshuffle
- byteshuffle is a pixel instruction that copies the bytes from its sources r[rs1] and r[rs2] to byte positions of r[rd] according to the pattern described by the bits of the least significant two bytes of r[rs3].
- Each group of four contiguous bits in the low-order two bytes of r[rs3] is the ordinal of the byte position of the eight bytes of the register pair r[rs1]−r[rs2] from which a byte is copied to the corresponding byte of r[rd]. An out-of-range byte ordinal (that is, a value greater than 7) means the corresponding byte of r[rd] will be zeroed.
- The suggested assembler syntax is:
-
- byteshuffle rs1, rs2, rs3, rd
- byteshuffle is a UFU operation that uses the UFU 3-source format.
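- A sketch of byteshuffle follows. Two details are assumptions made only for illustration, since this section does not fix them: byte ordinal 0 is taken to be the most significant byte of r[rs1] (so ordinal 7 is the least significant byte of r[rs2]), and the k-th group of four bits, counting from the least significant end of r[rs3], is taken to select the source for the k-th byte of r[rd], also counting from the least significant end.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def byteshuffle(rs1_value, rs2_value, rs3_value):
    """Build r[rd] from bytes of the pair r[rs1]-r[rs2] selected by r[rs3]."""
    pair = ((rs1_value & MASK32) << 32) | (rs2_value & MASK32)
    source = pair.to_bytes(8, "big")        # assumed ordinal order: 0 is high byte
    result = 0
    for k in range(4):                      # one 4-bit selector per result byte
        ordinal = (rs3_value >> (4 * k)) & 0xF
        byte = source[ordinal] if ordinal <= 7 else 0  # out of range zeroes
        result |= byte << (8 * k)
    return result
```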
- bz
- bz is a control flow instruction for a branch to the offset implied by the difference between the program counter and the specified label if the value in “r[rd]” is equal to the integer zero. “label” is a label at the branch target; the assembler will either determine the displacement or generate relocation information so a linker can determine it. Whenever the displacement is determined, it must be expressible in a signed 22-bit field.
- The mnemonic may be followed optionally by the qualifier ,pt, which means this conditional branch is statically predicted to be taken. The use of this qualifier sets the T-bit in the instruction.
- Unconditional branches are commonly coded using bz with r0 for the register operand. When the assembler sees this “unconditional conditional” branch without prediction, it will infer the “,pt” qualification, which can improve instruction prefetching.
- The suggested assembler syntax is:
-
- bz rd, label
- bz,pt rd, label
- bz is an SFU operation that uses the branch instruction format.
- call
- call is a control flow instruction that causes a control transfer to the address specified by its label operand. A call to address zero is an illegal instruction.
- The return address (% npc at the time of the call) is left in r2, an implicit operand of this instruction. For that reason the assembler has the alias lp (“link pointer”) for r2.
- The suggested assembler syntax is:
-
- call label
- call is an SFU operation that uses a format of its own.
- cas
- cas is a memory access instruction that compares the content of register r[rs1] with the content of the 32-bit word in memory addressed by r[rs2]. If those values are equal, the content of register r[rd] is swapped with the word addressed by r[rs2]. Otherwise, the content of the addressed memory word is unchanged, but the value at that memory address replaces the content of register r[rd].
- The effective address from which to load must be word-aligned, but the consequences of failing to do that are not defined.
- Note that cas uses dcache, which makes it unsuitable for thread synchronization in a multi-Café configuration.
- The suggested assembler syntax is:
-
- cas rs1, [rs2], rd
- cas is an SFU operation that uses the 2-source register variant of the SFU memory format. Since it always uses two source registers, the i-bit of its opcode is always clear.
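- The register and memory effects of cas can be modeled as follows. The register file and memory are plain dictionaries in this sketch, and neither atomicity nor the alignment requirement is modeled.

```python
def cas(regs, memory, rs1, rs2, rd):
    """Compare r[rs1] with the word at r[rs2]; swap r[rd] in if they match."""
    address = regs[rs2]
    old = memory[address]
    if regs[rs1] == old:
        memory[address] = regs[rd]   # equal: the word receives r[rd]
    regs[rd] = old                   # either way, r[rd] receives the old word
```

In both outcomes r[rd] ends up holding the value that was in memory, so a caller can tell whether the swap happened by comparing r[rd] with r[rs1] afterwards.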
- cccb
- cccb is a logical instruction that counts consecutive clear bits in its first source operand, r[rs1], beginning from the high-order bit, first skipping the number of bits specified by the second source operand. The second source operand may be either a register or an immediate. In either case, only the low-order 5 bits of the skip-count are used. The count of clear bits is left in r[rd].
- The suggested assembler syntax is:
-
- cccb rs1, reg_or_imm14, rd
- cccb is a UFU operation that uses the UFU 2-source format.
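- The counting rule of cccb can be sketched with a simple loop. It assumes a 32-bit first operand, and the loop is written for clarity rather than speed.

```python
def cccb(rs1_value, skip_operand):
    """Count consecutive clear bits from the high-order end after a skip."""
    skip = skip_operand & 0x1F            # only the low-order 5 bits are used
    count = 0
    for bit in range(31 - skip, -1, -1):  # assumption: 32-bit operand
        if (rs1_value >> bit) & 1:
            break
        count += 1
    return count
```

With a skip count of zero this behaves like a count of leading zeros.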
- clip
- clip is a floating point instruction that computes:
-
- ((r[rs1] > r[rs2] ? 1:0) <<1) |
- (r[rs1] < −r[rs2] ? 1:0) |
- (r[rs3] << 2)
- and leaves the result in r[rd]. The practical effect of this operation is to return a copy of r[rs3] shifted left 2 bits with the two least significant bits occupied by indications of how the single-precision floating point numbers in r[rs1] and r[rs2] compare.
- The suggested assembler syntax is:
-
- clip rs1, rs2, rs3, rd
- clip is a UFU operation that uses the UFU 3-source format, but the variants of that format that allow immediates for the second and third source operand are not used.
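- The expression computed by clip transcribes directly into code. In this sketch Python floats stand in for the single-precision operands, and the result is masked to 32 bits, which is an assumption about the register width.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def clip(rs1_float, rs2_float, rs3_value):
    """Shift r[rs3] left by 2 and append the two comparison bits."""
    above = 1 if rs1_float > rs2_float else 0    # r[rs1] > r[rs2]
    below = 1 if rs1_float < -rs2_float else 0   # r[rs1] < -r[rs2]
    return ((above << 1) | below | (rs3_value << 2)) & MASK32
```

Feeding each result back in as r[rs3] accumulates two comparison bits per call, so several range tests can be packed into one register.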
- cmovenz
- cmovenz is an integer instruction that copies the value of the second source operand, specified by “r[rs2] ” or “sign_ext(imm)”, to the result register r[rd] only if the first source operand, r[rs1], is non-zero. Note that cmovenz allows a 14-bit immediate on the SFU but only an 8-bit immediate on a UFU.
- The suggested assembler syntax is:
cmovenz rs1,reg_or_imm14,rd ! SFU
cmovenz rs1,reg_or_imm8,rd ! UFU
- The SFU version of cmovenz uses the SFU compute format. The UFU version of cmovenz is a pseudo-op for the cpicknz instruction with r[rd] replicated in the r[rs3] field.
- cmovez
- cmovez is an integer instruction that copies the value of the second source operand, specified by “r[rs2]” or “sign_ext(imm)”, to the result register r[rd] only if the first source operand, r[rs1], is zero. Note that cmovez allows a 14-bit immediate on the SFU but only an 8-bit immediate on a UFU.
- The suggested assembler syntax is:
cmovez rs1,reg_or_imm14,rd ! SFU
cmovez rs1,reg_or_imm8,rd ! UFU
- The SFU version of cmovez uses the SFU compute format. The UFU version of cmovez is a pseudo-op for the cpickz instruction with r[rd] replicated in the r[rs3] field.
- cmpeq
- cmpeq is an integer instruction that computes “r[rs1]==r[rs2]” or “r[rs1]==sign_ext(imm14)”. The use of an immediate for the second source operand sets the i-bit of the opcode. The resulting zero or one is left in r[rd].
- The suggested assembler syntax is:
cmpeq rs1,reg_or_imm14,rd
- The SFU version of this instruction uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- cmplt
- cmplt is an integer instruction that computes “r[rs1]<r[rs2]” or “r[rs1]<sign_ext(imm14)”. The use of an immediate for the second source operand sets the i-bit of the opcode. The resulting zero or one is left in r[rd].
- The suggested assembler syntax is:
cmplt rs1,reg_or_imm14,rd
- The SFU version of this instruction uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- cmpult
- cmpult is an integer instruction that computes “(unsigned) r[rs1]<(unsigned) r[rs2]” or “(unsigned) r[rs1]<(unsigned) imm14”. The use of an immediate for the second source operand sets the i-bit of the opcode. The resulting zero or one is left in r[rd].
- The suggested assembler syntax is:
cmpult rs1,reg_or_imm14,rd
- The SFU version of this instruction uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- cpicknz
- cpicknz is an integer instruction that assigns its second source operand to r[rd] if the first source operand, r[rs1], is non-zero; otherwise, the third source operand is assigned to r[rd]. Each of the second and third source operands can be either a register or a sign-extended 8-bit immediate.
- The suggested assembler syntax is:
cpicknz rs1,reg_or_imm8,reg_or_imm8,rd
- cpicknz is a UFU operation that uses the UFU 3-source format.
- cpickz
- cpickz is an integer instruction that assigns its second source operand to r[rd] if the first source operand, r[rs1], is zero; otherwise, the third source operand is assigned to r[rd]. Each of the second and third source operands can be either a register or a sign-extended 8-bit immediate.
- The suggested assembler syntax is:
cpickz rs1,reg_or_imm8,reg_or_imm8,rd
- cpickz is a UFU operation that uses the UFU 3-source format.
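- The relationship between the pick operations and the UFU conditional moves can be shown in a few lines. The functions take operand values directly; the `_ufu` names are hypothetical and stand for the pseudo-op expansion described for cmovenz and cmovez, where r[rd] is replicated as the third source operand.

```python
def cpicknz(rs1_value, second, third):
    """Pick the second operand when r[rs1] is non-zero, else the third."""
    return second if rs1_value != 0 else third


def cpickz(rs1_value, second, third):
    """Pick the second operand when r[rs1] is zero, else the third."""
    return second if rs1_value == 0 else third


def cmovenz_ufu(rs1_value, second, rd_value):
    """UFU cmovenz: cpicknz with the old r[rd] as the third operand."""
    return cpicknz(rs1_value, second, rd_value)


def cmovez_ufu(rs1_value, second, rd_value):
    """UFU cmovez: cpickz with the old r[rd] as the third operand."""
    return cpickz(rs1_value, second, rd_value)
```

When the condition fails, the pick returns the old value of r[rd], which is the same as leaving the destination untouched.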
- cstb, csts, cstw
- cstb, csts, and cstw are memory access instructions that, if the value in the register r[rs1] is non-zero, store the value in the register r[rd] at the address in register r[rs2].
- The effective address to which to store must be aligned to a natural boundary for the size of the store, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
cstb rd rs1,[rs2] csts rd,rs1,[rs2] cstw rd,rs1,[rs2] - cst[b, s, w]are SFU operations that use the 2-source register variant of the SFU memory format. Since they always use two source registers, the i-bit of their opcodes is always clear.
- done
- done is a control flow instruction that causes a control transfer from a trap handler to the next instruction word after the instruction that caused the trap. Please refer to the Traps chapter of the Café Microarchitecture specification for the complete description.
- The suggested assembler syntax is:
done
- done is an SFU operation that uses the SFU compute format, but it has no use for any operand.
- fadd
- fadd is a floating point instruction that computes “r[rs1]+r[rs2]”, where the values of the source operands are IEEE single-precision floating point numbers. The result is delivered in r[rd].
- The suggested assembler syntax is:
fadd rs1,rs2,rd
- fadd is a UFU operation that uses the UFU 2-source format.
- fcmpeq
- fcmpeq is a floating point instruction that compares the single-precision floating point operands in its source registers r[rs1] and r[rs2] for equality. The destination register r[rd] is set to the integer value 1 if the source operands are equal and zero otherwise. If either source operand is a NaN, they are not equal.
- The suggested assembler syntax is:
fcmpeq rs1,rs2,rd
- fcmpeq is an SFU operation that uses the SFU compute format.
- fcmple
- fcmple is a floating point instruction that sets its destination register r[rd] to the integer value 1 if the single-precision floating point value in r[rs1] is less than or equal to the single-precision floating point value in r[rs2] and to zero otherwise. If the value of either source operand is a NaN, the result is zero.
- The suggested assembler syntax is:
fcmple rs1,rs2,rd
- fcmple is an SFU operation that uses the SFU compute format.
- fcmplt
- fcmplt is a floating point instruction that sets its destination register r[rd] to the integer value 1 if the single-precision floating point value in r[rs1] is less than the single-precision floating point value in r[rs2] and to zero otherwise. If the value of either source operand is a NaN, the result is zero.
- The suggested assembler syntax is:
fcmplt rs1,rs2,rd
- fcmplt is an SFU operation that uses the SFU compute format.
- fdiv
- fdiv is a floating point instruction that computes “r[rs1]/r[rs2]”, where the values of the source operands are IEEE single-precision floating point numbers. The result is delivered in r[rd].
- The suggested assembler syntax is:
fdiv rs1,rs2,rd
- fdiv is an SFU operation that uses the SFU compute format.
- fix2flt
- fix2flt is a convert instruction that converts a fixed point value in r[rs1], with its binary point specified by the low-order 5 bits of r[rs2] or imm14, to a single precision floating point result in r[rd].
- The suggested assembler syntax is:
fix2flt rs1,reg_or_imm14,rd
- fix2flt is a UFU operation that uses the UFU 2-source format.
- flt2fix
- flt2fix is a convert instruction that converts a single precision floating point value in r[rs1] to a fixed point result in r[rd] with the binary point as specified by the low-order 5 bits of r[rs2] or imm14.
- The suggested assembler syntax is:
flt2fix rs1,reg_or_imm14,rd
- flt2fix is a UFU operation that uses the UFU 2-source format.
- fmul
- fmul is a floating point instruction that computes “r[rs1]*r[rs2]”, where the values of the source operands are IEEE single-precision floating point numbers. The result is delivered in r[rd].
- The suggested assembler syntax is:
fmul rs1,rs2,rd
- fmul is a UFU operation that uses the UFU 2-source format.
- fmuladd
- fmuladd is a floating point instruction that computes “(r[rs1]*r[rs2])+r[rs3]”, where the values of the source operands are IEEE single-precision floating point numbers. The result is delivered in r[rd].
- The suggested assembler syntax is:
fmuladd rs1,rs2,rs3,rd
- fmuladd is a UFU operation that uses the UFU 3-source format.
- fmulsub
- fmulsub is a floating point instruction that computes “(r[rs1]*r[rs2])−r[rs3]”, where the values of the source operands are IEEE single-precision floating point numbers. The result is delivered in r[rd].
- The suggested assembler syntax is:
fmulsub rs1,rs2,rs3,rd
- fmulsub is a UFU operation that uses the UFU 3-source format.
- frecsqrt
- frecsqrt is a floating point instruction that computes the reciprocal square root of the single-precision floating-point number in r[rs1] and puts that result in r[rd]. What will this do with an argument less than or equal to zero?
- The suggested assembler syntax is:
frecsqrt rs1,rd
- frecsqrt is an SFU operation that uses the SFU compute format, but has no use for the second source operand of that format.
- fsub
- fsub is a floating point instruction that computes “r[rs1]−r[rs2]”, where the values of the source operands are IEEE single-precision floating point numbers. The result is delivered in r[rd].
- The suggested assembler syntax is:
fsub rs1,rs2,rd
- fsub is a UFU operation that uses the UFU 2-source format.
- genborrow, gencarry
- genborrow and gencarry are integer instructions that generate a one or zero in r[rd] if subtracting or adding, respectively, the source operands generates a borrow or carry, respectively. The result would be useful as the third source operand of subsequent addc and subc operations.
- The suggested assembler syntax is:
genborrow rs1,reg_or_imm14,rd
gencarry rs1,reg_or_imm14,rd
- genborrow and gencarry are UFU operations that use the UFU 2-source format.
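To illustrate why a separately generated carry is useful, here is a minimal Python sketch of a 64-bit addition built from 32-bit halves. It assumes the carry and borrow are those of unsigned 32-bit arithmetic, and it assumes that addc adds its three source operands; neither detail is spelled out in this section, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def gencarry(a: int, b: int) -> int:
    """1 if the unsigned 32-bit sum a + b carries out of bit 31, else 0 (assumed semantics)."""
    return ((a & MASK32) + (b & MASK32)) >> 32


def genborrow(a: int, b: int) -> int:
    """1 if the unsigned 32-bit difference a - b needs a borrow, else 0 (assumed semantics)."""
    return 1 if (a & MASK32) < (b & MASK32) else 0


def add64(a_hi: int, a_lo: int, b_hi: int, b_lo: int) -> tuple:
    """64-bit add from 32-bit halves: the carry of the low words feeds the high-word add."""
    carry = gencarry(a_lo, b_lo)
    lo = (a_lo + b_lo) & MASK32
    hi = (a_hi + b_hi + carry) & MASK32  # stands in for an addc with carry as third operand
    return hi, lo
```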
- getir
- getir is an integer instruction that gets the value of the internal register the ordinal of which is its r[rs1] operand and puts that value in the register specified by r[rd].
- The suggested assembler syntax is:
getir rs1,rd
- getir is an SFU operation that uses the SFU compute format, though in an irregular way. It has no use for the second source field, and the first source operand is an internal register number, NOT one of the general purpose registers.
- idiv
- idiv is an integer instruction that computes “r[rs1]/r[rs2]” or “r[rs1]/sign_ext(imm14)”. The use of an immediate for the second source operand sets the i-bit of the opcode. The result is left in r[rd].
- If the second source operand is zero, idiv will trap.
- The suggested assembler syntax is:
idiv rs1,reg_or_imm14,rd
- idiv is an SFU operation that uses the SFU compute format.
- iflush
- iflush is a memory access instruction that is used to make sure that modifications to code space are visible to the processor executing the iflush. iflush invalidates all younger instructions that have already entered the pipe.
- The suggested assembler syntax is:
iflush
- iflush is an SFU operation that uses the compute instruction format, but uses none of that format's operands.
- jmpl
- jmpl is a control flow instruction that causes a register-indirect control transfer to the address in r[rs1]. The current value of %npc is left in r[rd].
- If the branch target is an entry-point that might also expect to be reached by a call instruction, a register other than lp is a poor choice for r[rd].
- The suggested assembler syntax is:
jmpl rs1,rd
- jmpl is an SFU operation that uses the compute instruction format. The second source operand of that format is not used by jmpl.
- ldb, lds, ldw
- ldb, lds, and ldw are memory access instructions that load an 8-bit byte, a 16-bit short, or a 32-bit word from address into the destination register r[rd]. The value loaded is sign-extended in the destination register. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- The effective address from which to load must be aligned to a natural boundary for the size of the load, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
ldb [address],rd
lds [address],rd
ldw [address],rd
- ld[b, s, w] instructions are SFU operations that use the SFU memory format.
- ldso, ldwo
- ldso and ldwo are memory access instructions that load a 16-bit short or a 32-bit word from address into the destination register r[rd] using the opposite endianness from that indicated by the endian-bit of %psr. The value loaded is sign-extended in the destination register. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- The effective address from which to load must be aligned to a natural boundary for the size of the load, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
ldso [address],rd
ldwo [address],rd
- ldso and ldwo are SFU operations that use the SFU memory format.
- ldpair
- ldpair is a memory access instruction that performs a load into a pair of adjacent registers beginning at the register specified by r[rd] from address. The use of an immediate for the second component of the address sets the i-bit of the opcode. r[rd] must be an even-numbered register.
- The effective address from which to load must be even word-aligned, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
ldpair rd,[address]
- ldpair is an SFU operation that uses the SFU memory format.
- ldub, ldus, lduw
- ldub, ldus, and lduw are memory access instructions that load an unsigned 8-bit byte, an unsigned 16-bit short, or an unsigned 32-bit word from address into the destination register r[rd]. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- The effective address from which to load must be aligned to a natural boundary for the size of the load, but the consequences of failing to do that are not defined.
- While Café registers are 32 bits wide, the behavior of lduw and ldw is identical. A future extension is liable to change that, so appropriate consideration should be used when choosing between these instructions.
- The suggested assembler syntax is:
ldub [address],rd
ldus [address],rd
lduw [address],rd
- ldu[b, s, w] instructions are SFU operations that use the SFU memory format.
- lduso, lduwo
- lduso and lduwo are memory access instructions that load an unsigned 16-bit short or an unsigned 32-bit word from address into the destination register r[rd] using the opposite endianness from that indicated by the endian-bit of %psr. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- The effective address from which to load must be aligned to a natural boundary for the size of the load, but the consequences of failing to do that are not defined.
- While Café registers are 32 bits wide, the behavior of lduwo and ldwo is identical. A future extension is liable to change that, so appropriate consideration should be used when choosing between these instructions.
- The suggested assembler syntax is:
lduso [address],rd
lduwo [address],rd
- lduso and lduwo are SFU operations that use the SFU memory format.
- membar
- membar is a memory access instruction that specifies that all memory reference instructions already issued must be performed before any subsequent memory reference instruction may be initiated.
- The suggested assembler syntax is:
membar
- membar is an SFU operation that uses the compute instruction format, but uses none of that format's operands.
- moveind
- moveind is an integer instruction that copies the content of its first source operand, r[rs1], to the register indicated by the least significant eight bits of its second source register, r[rs2].
- The suggested assembler syntax is:
moveind rs1,[rs2]
- moveind is a UFU operation that uses the UFU 2-source format, but it has no use for the destination register field and does not accept an immediate for the second source operand.
- mul
- mul is an integer instruction that computes “r[rs1]*r[rs2]” or “r[rs1]* sign_ext(imm14)”. The use of an immediate for the second source operand sets the first bit of the instruction header. The result is left in r[rd].
- The suggested assembler syntax is:
mul rs1,reg_or_imm14,rd
- mul is a UFU operation that uses the UFU 2-source format.
- muladd
- muladd is an integer instruction that computes “(r[rs1]*r[rs2])+r[rs3]”, “(r[rs1]*r[rs2])+sign_ext(imm8)”, “(r[rs1]*sign_ext(imm8))+r[rs3]”, or “(r[rs1]*sign_ext(imm8))+sign_ext(imm8)” and puts the result in r[rd]. The use of an immediate for the second or third source operand sets the first or second bit, respectively, of the instruction header.
- The suggested assembler syntax is:
muladd rs1,reg_or_imm8,reg_or_imm8,rd
- muladd is a UFU operation that uses the UFU 3-source format.
- mulsub
- mulsub is an integer instruction that computes “(r[rs1]*r[rs2])−r[rs3]”, “(r[rs1]*r[rs2])−sign_ext(imm8)”, “(r[rs1]*sign_ext(imm8))−r[rs3]”, or “(r[rs1]*sign_ext(imm8))−sign_ext(imm8)” and puts the result in r[rd]. The use of an immediate for the second or third source operand sets the first or second bit, respectively, of the instruction header.
- The suggested assembler syntax is:
mulsub rs1,reg_or_imm8,reg_or_imm8,rd
- mulsub is a UFU operation that uses the UFU 3-source format.
- ncldb, nclds, ncldw
- ncldb, nclds, and ncldw are memory access instructions that perform a non-cacheable load of an 8-bit byte, a 16-bit short, or a 32-bit word from address into the destination register r[rd]. The value loaded is sign-extended in the destination register. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- The effective address from which to load must be aligned to a natural boundary for the size of the load, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
ncldb [address],rd
nclds [address],rd
ncldw [address],rd
- ncld[b, s, w] instructions are SFU operations that use the SFU memory format.
- ncldg
- ncldg is a memory access instruction that does an uncached load of a group of eight consecutive 32-bit words from address into eight consecutive registers beginning with the one specified by r[rd]. The use of an immediate for the second component of the address sets the i-bit of the opcode. r[rd] must be 8-register aligned.
- The effective address from which to load must be eight-word-aligned, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
ncldg [address],rd
- ncldg was formerly known as ldg. The assembler temporarily knows the former name as an alias for the new name to ease the transition.
- Note that there is no complementary ncstg instruction.
- ncldg is an SFU operation that uses the SFU memory format.
- ncldso, ncldwo
- ncldso and ncldwo are memory access instructions that perform a non-cacheable load of a 16-bit short or a 32-bit word from address into the destination register r[rd] using the opposite endianness from that indicated by the endian-bit of %psr. The value loaded is sign-extended in the destination register. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- The effective address from which to load must be aligned to a natural boundary for the size of the load, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
ncldso [address],rd
ncldwo [address],rd
- ncldso and ncldwo are SFU operations that use the SFU memory format.
- ncldub, ncldus, nclduw
- ncldub, ncldus, and nclduw are memory access instructions that perform a non-cacheable load of an unsigned 8-bit byte, an unsigned 16-bit short, or an unsigned 32-bit word from address into the destination register r[rd]. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- The effective address from which to load must be aligned to a natural boundary for the size of the load, but the consequences of failing to do that are not defined.
- Note that while Café registers are 32 bits wide, the behavior of nclduw and ncldw is identical. A future extension is liable to change that, so appropriate consideration should be used when choosing between these instructions.
- The suggested assembler syntax is:
ncldub [address],rd
ncldus [address],rd
nclduw [address],rd
- ncldu[b, s, w] instructions are SFU operations that use the SFU memory format.
- nclduso, nclduwo
- nclduso and nclduwo are memory access instructions that perform a non-cacheable load of an unsigned 16-bit short or an unsigned 32-bit word from address into the destination register r[rd] using the opposite endianness from that indicated by the endian-bit of %psr. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- The effective address from which to load must be aligned to a natural boundary for the size of the load, but the consequences of failing to do that are not defined.
- Note that while Café registers are 32 bits wide, the behavior of nclduwo and ncldwo is identical. A future extension is liable to change that, so appropriate consideration should be used when choosing between these instructions.
- The suggested assembler syntax is:
nclduso [address],rd
nclduwo [address],rd
- nclduso and nclduwo are SFU operations that use the SFU memory format.
- ncstb, ncsts, ncstw
- ncstb, ncsts, and ncstw are memory access instructions that perform a non-cacheable store of an 8-bit byte, a 16-bit short, or a 32-bit word from the register specified by r[rd] to address. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- For ncsts and ncstw the effective address to which to store must be 2-byte aligned or 4-byte aligned, respectively, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
ncstb rd,[address]
ncsts rd,[address]
ncstw rd,[address]
- ncst[b, s, w] are SFU operations that use the SFU memory format.
- ncstso, ncstwo
- ncstso and ncstwo are memory access instructions that perform a non-cacheable store of a 16-bit short or a 32-bit word from the register specified by r[rd] to address using the opposite endianness from that indicated by the endian-bit of %psr. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- For ncstso and ncstwo the effective address to which to store must be 2-byte aligned or 4-byte aligned, respectively, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
ncstso rd,[address]
ncstwo rd,[address]
- ncstso and ncstwo are SFU operations that use the SFU memory format.
- ncstpair
- ncstpair is a memory access instruction that performs an uncached store of a pair of adjacent registers beginning at the register specified by r[rd] to address. The use of an immediate for the second component of the address sets the i-bit of the opcode. r[rd] must be an even-numbered register.
- The effective address to which to store must be even word-aligned, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
ncstpair rd,[address]
- ncstpair is an SFU operation that uses the SFU memory format.
- nop
- nop is a control flow instruction that does nothing. It has the special property of being unique in its leading byte with its remaining bytes ignored.
- The suggested assembler syntax is:
- nop
- nop is an SFU operation that uses the branch instruction format, but only the leading 6 bits of its opcode are significant and none of the other fields is used.
- not
- not is a logical instruction that computes the bit-wise complement of r[rs1], leaving the result in r[rd].
- The suggested assembler syntax is:
not rs1,rd
- The SFU version of not uses the SFU compute format, and the UFU version uses the UFU 2-source format. This instruction has no use for a second source operand; the assembler infers r0 in its place for purely neurotic reasons.
- or
- or is a logical instruction that computes “r[rs1]|r[rs2]” or “r[rs1]|imm14”. The use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The bit-wise logical result is left in r[rd].
- The suggested assembler syntax is:
or rs1,reg_or_imm14,rd
- The SFU version of or uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- pack
- pack is a pixel instruction that treats its first two source operands, r[rs1] and r[rs2], as two pairs of unsigned 16-bit operands. Each 16-bit operand is shifted right by the value of the low-order 4 bits of the third source operand, r[rs3].
- The low-order 8 bits of the resulting values are packed into the result register r[rd], with the value derived from bits 31:16 of r[rs1] in bits 31:24, the value derived from bits 15:0 of r[rs1] in bits 23:16, the value derived from bits 31:16 of r[rs2] in bits 15:8, and the value derived from bits 15:0 of r[rs2] in bits 7:0.
- The suggested assembler syntax is:
pack rs1,rs2,rs3,rd
- pack is a UFU operation that uses the UFU 3-source format.
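The byte placement performed by pack is easiest to follow in code. The following Python sketch models the description above on plain integers; the function name and argument representation are illustrative, not part of the instruction set definition.

```python
def pack(rs1: int, rs2: int, rs3: int) -> int:
    """Model of pack: shift four 16-bit lanes right, keep the low 8 bits of each."""
    shift = rs3 & 0xF  # only the low-order 4 bits of rs3 are used
    lanes = [
        (rs1 >> 16) & 0xFFFF,  # lands in result bits 31:24
        rs1 & 0xFFFF,          # lands in result bits 23:16
        (rs2 >> 16) & 0xFFFF,  # lands in result bits 15:8
        rs2 & 0xFFFF,          # lands in result bits 7:0
    ]
    result = 0
    for lane in lanes:
        result = (result << 8) | ((lane >> shift) & 0xFF)
    return result
```

With a shift of 8, the high byte of each 16-bit lane is kept, so `pack(0x12345678, 0x9ABCDEF0, 8)` gives `0x12569ADE`.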
- padd, padds
- padd and padds are integer instructions that compute “r[rs1]+r[rs2]”, where each of the sources is treated as a pair of independent 16-bit quantities yielding a pair of independent 16-bit sums in r[rd]. padd also can have a 14-bit immediate as its second source operand, in which case the sign-extended immediate is added to each of the 16-bit numbers in r[rs1] and the two 16-bit sums are left in r[rd].
- padd produces ordinary two's complement integer results, and padds produces saturated results.
- The suggested assembler syntax is:
padd rs1,reg_or_imm14,rd
padds rs1,rs2,rd
- padd and padds are UFU operations that use the UFU 2-source format.
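The difference between the wrapping and saturating forms can be shown with a short Python sketch. It assumes that padds saturates to the signed 16-bit range (-32768 to 32767); the text above does not state the saturation bounds, so that choice and the helper names are assumptions.

```python
def _lanes(x: int) -> tuple:
    """Split a 32-bit value into its (high, low) 16-bit lanes."""
    return (x >> 16) & 0xFFFF, x & 0xFFFF


def _signed16(h: int) -> int:
    """Interpret a 16-bit lane as a two's complement integer."""
    return h - 0x10000 if h & 0x8000 else h


def _sat16(v: int) -> int:
    """Clamp to the signed 16-bit range (assumed saturation bounds)."""
    return max(-0x8000, min(0x7FFF, v))


def padd(a: int, b: int) -> int:
    """Lane-wise add with wrap-around; no carry crosses between lanes."""
    ah, al = _lanes(a)
    bh, bl = _lanes(b)
    return (((ah + bh) & 0xFFFF) << 16) | ((al + bl) & 0xFFFF)


def padds(a: int, b: int) -> int:
    """Lane-wise add with signed saturation (assumption, see lead-in)."""
    ah, al = _lanes(a)
    bh, bl = _lanes(b)
    hi = _sat16(_signed16(ah) + _signed16(bh)) & 0xFFFF
    lo = _sat16(_signed16(al) + _signed16(bl)) & 0xFFFF
    return (hi << 16) | lo
```

For example, adding 1 to a lane holding 0x7FFF wraps to 0x8000 under padd but stays at 0x7FFF under this model of padds.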
- pcmovenz
- pcmovenz is an integer instruction that uses two 16-bit flags in r[rs1] to control whether the corresponding 16-bit fields of r[rs2] are copied to the same positions of r[rd]. A field of r[rs2] is copied to r[rd] if the corresponding flag field of r[rs1] is non-zero. The flags in r[rs1] will most likely be the result of a preceding pcmpeq or pcmplt.
- The suggested assembler syntax is:
pcmovenz rs1,rs2,rd
- pcmovenz is a UFU operation that uses the UFU 2-source format.
- pcmovez
- pcmovez is an integer instruction that uses two 16-bit flags in r[rs1] to control whether the corresponding 16-bit fields of r[rs2] are copied to the same positions of r[rd]. A field of r[rs2] is copied to r[rd] if the corresponding flag field of r[rs1] is zero. The flags in r[rs1] will most likely be the result of a preceding pcmpeq or pcmplt.
- The suggested assembler syntax is:
pcmovez rs1,rs2,rd
- pcmovez is a UFU operation that uses the UFU 2-source format.
- pcmpeq
- pcmpeq is an integer instruction that compares for equality the pair of shorts in its first source register with either a pair of shorts in its second source register or with a signed 14-bit immediate.
- That is, when the second source operand is a register, the short in bits 31:16 of r[rd] is set to one if “r[rs1]<31:16>==r[rs2]<31:16>” and zero otherwise, and the short in bits 15:0 of r[rd] is set to one if “r[rs1]<15:0>==r[rs2]<15:0>” and zero otherwise. When the second source operand is an immediate, the short in bits 31:16 of r[rd] is set to one if “r[rs1]<31:16>==sign_ext(imm14)” and zero otherwise, and the short in bits 15:0 of r[rd] is set to one if “r[rs1]<15:0>==sign_ext(imm14)” and zero otherwise.
- The suggested assembler syntax is:
pcmpeq rs1,reg_or_imm14,rd
- pcmpeq is a UFU operation that uses the UFU 2-source format.
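The lane-wise comparison can be sketched as follows. This Python model assumes that the 14-bit immediate is sign-extended to 16 bits before being compared with each lane; the function names are hypothetical.

```python
def sign_ext14_to16(imm: int) -> int:
    """Sign-extend a 14-bit immediate to a 16-bit lane value (assumed width)."""
    imm &= 0x3FFF
    return (imm | 0xC000) if imm & 0x2000 else imm


def pcmpeq(rs1: int, rs2: int) -> int:
    """Register form: each 16-bit result lane is 1 if the source lanes match, else 0."""
    hi = 1 if ((rs1 >> 16) & 0xFFFF) == ((rs2 >> 16) & 0xFFFF) else 0
    lo = 1 if (rs1 & 0xFFFF) == (rs2 & 0xFFFF) else 0
    return (hi << 16) | lo


def pcmpeq_imm(rs1: int, imm14: int) -> int:
    """Immediate form: compare both lanes of rs1 against the same extended immediate."""
    lane = sign_ext14_to16(imm14)
    return pcmpeq(rs1, (lane << 16) | lane)
```

The result has the shape that pcmovenz and pcmovez expect as their flag operand: each 16-bit field is either zero or non-zero.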
- pcmplt
- pcmplt is an integer instruction that does a “compare less than” of the pair of shorts in its first source register with either a pair of shorts in its second source register or with a signed 14-bit immediate.
- That is, when the second source operand is a register, the short in bits 31:16 of r[rd] is set to one if “r[rs1]<31:16><r[rs2]<31:16>” and zero otherwise, and the short in bits 15:0 of r[rd] is set to one if “r[rs1]<15:0><r[rs2]<15:0>” and zero otherwise. When the second source operand is an immediate, the short in bits 31:16 of r[rd] is set to one if “r[rs1]<31:16><sign_ext(imm14)” and zero otherwise, and the short in bits 15:0 of r[rd] is set to one if “r[rs1]<15:0><sign_ext(imm14)” and zero otherwise.
- The suggested assembler syntax is:
pcmplt rs1,reg_or_imm14,rd
- pcmplt is a UFU operation that uses the UFU 2-source format.
- pdist
- pdist is a pixel instruction that treats each of its two source registers, r[rs1] and r[rs2], as four unsigned 8-bit values, subtracts the corresponding pairs, and adds the sum of the absolute values of those differences to the value in the register specified by r[rd].
- The suggested assembler syntax is:
pdist rs1,rs2,rd
- pdist is a UFU operation that uses the UFU 2-source format.
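pdist is a sum-of-absolute-differences accumulator, which the following Python sketch models. The previous content of r[rd] is passed in as `acc`; wrapping the accumulated value to 32 bits is an assumption, since overflow behavior is not described here.

```python
def pdist(rs1: int, rs2: int, acc: int) -> int:
    """Add the sum of absolute byte differences of rs1 and rs2 to acc (old r[rd])."""
    total = 0
    for shift in (0, 8, 16, 24):
        a = (rs1 >> shift) & 0xFF  # unsigned 8-bit lane of rs1
        b = (rs2 >> shift) & 0xFF  # matching lane of rs2
        total += abs(a - b)
    return (acc + total) & 0xFFFFFFFF  # 32-bit wrap is an assumption
```

Calling it repeatedly with the running result as `acc` accumulates a distance over a whole block of pixels.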
- pmean
- pmean is a pixel instruction that treats its source operands r[rs1] and r[rs2] as pairs of unsigned 16-bit integers and computes a pair of mean values in r[rd]. These means are rounded high according to the formula: “r[rd]<15:0>=(r[rs1]<15:0>+r[rs2]<15:0>+1)>>1” and likewise for the other halves.
- The suggested assembler syntax is:
pmean rs1,rs2,rd
- pmean is a UFU operation that uses the UFU 2-source format.
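The rounding formula quoted above translates directly into code. This Python sketch applies it to both halves; the function name is illustrative.

```python
def pmean(rs1: int, rs2: int) -> int:
    """Lane-wise mean of unsigned 16-bit values, rounded up: (a + b + 1) >> 1."""
    lo = ((rs1 & 0xFFFF) + (rs2 & 0xFFFF) + 1) >> 1
    hi = (((rs1 >> 16) & 0xFFFF) + ((rs2 >> 16) & 0xFFFF) + 1) >> 1
    return (hi << 16) | lo
```

The intermediate sum needs 17 bits, but the shifted result always fits back into 16 bits, so no masking of the result is required.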
- pmul
- pmul is an integer instruction that multiplies the pair of 16-bit operands in r[rs1] with either a pair of 16-bit operands in r[rs2] or a sign-extended 14-bit immediate, placing a pair of independent 16-bit products in r[rd].
- That is, when the second source operand is a register, bits 31:16 of r[rd] are set to the product of bits 31:16 of r[rs1] and bits 31:16 of r[rs2], and bits 15:0 of r[rd] are set to the product of bits 15:0 of r[rs1] and bits 15:0 of r[rs2]. When the second source operand is an immediate, bits 31:16 of r[rd] are set to the product of bits 31:16 of r[rs1] and sign_ext(imm14), and bits 15:0 of r[rd] are set to the product of bits 15:0 of r[rs1] and sign_ext(imm14).
- The format of the operands and results is indicated by the saturation field of the Processor Status Register, but this operation does not saturate.
- The suggested assembler syntax is:
pmul rs1,reg_or_imm14,rd
- pmul is a UFU operation that uses the UFU 2-source format.
- pmuladd, pmuladds
- pmuladd and pmuladds are integer instructions that compute “(r[rs1]*r[rs2])+r[rs3]”, where each of the sources is treated as a pair of independent 16-bit quantities yielding a pair of independent 16-bit results in r[rd]. The format of the operands and results is indicated by the saturation field of the Processor Status Register, but pmuladd does not saturate and pmuladds does.
- The suggested assembler syntax is:
pmuladd rs1,rs2,rs3,rd
pmuladds rs1,rs2,rs3,rd
- pmuladd and pmuladds are UFU operations that use the UFU 3-source format.
- pmulsub, pmulsubs
- pmulsub and pmulsubs are integer instructions that compute “(r[rs1]*r[rs2])−r[rs3]”, where each of the sources is treated as a pair of independent 16-bit quantities yielding a pair of independent 16-bit results in r[rd]. The format of the operands and results is indicated by the saturation field of the Processor Status Register, but pmulsub does not saturate and pmulsubs does.
- The suggested assembler syntax is:
pmulsub rs1,rs2,rs3,rd
pmulsubs rs1,rs2,rs3,rd
- pmulsub and pmulsubs are UFU operations that use the UFU 3-source format.
- ppower
- ppower is a fixed-point instruction that computes “r[rs1]**r[rs2]”, where each of the sources is treated as a pair of independent 16-bit quantities yielding a pair of independent 16-bit powers in r[rd].
- By stipulation, zero to any power is zero.
- The suggested assembler syntax is:
ppower rs1,rs2,rd
- ppower is an SFU operation that uses the SFU compute format.
- precsqrt
- precsqrt is a fixed-point instruction that computes a pair of fixed-point reciprocal square roots of the 16-bit values of r[rs1]. The results are delivered in r[rd].
- The suggested assembler syntax is:
precsqrt rs1,rd
- precsqrt is an SFU operation that uses the SFU compute format. The second source operand for the format is unused by this instruction.
- pshll
- pshll is a logical instruction that shifts each of the pair of 16-bit operands in r[rs1] left by either the low-order 4 bits of the corresponding half of r[rs2] or the low-order 4 bits of its immediate. The shifted results are left in r[rd].
- The suggested assembler syntax is:
pshll rs1,reg_or_imm14,rd
- pshll is a UFU operation that uses the UFU 2-source format.
- pshra
- pshra is a logical instruction that performs a right arithmetic shift of the pair of 16-bit operands in r[rs1]. The shift count of each is either the low-order 4 bits of the corresponding half of r[rs2] or the low-order 4 bits of its immediate. The shifted results are left in r[rd].
- The suggested assembler syntax is:
pshra rs1,reg_or_imm14,rd
- pshra is a UFU operation that uses the UFU 2-source format.
- pshrl
- pshrl is a logical instruction that performs a right logical shift of the pair of 16-bit operands in r[rs1]. The shift count of each is either the low-order 4 bits of the corresponding half of r[rs2] or the low-order 4 bits of its immediate. The shifted results are left in r[rd].
- The suggested assembler syntax is:
pshrl rs1,reg_or_imm14,rd
- pshrl is a UFU operation that uses the UFU 2-source format.
- psub, psubs
- psub and psubs are integer instructions that compute “r[rs1]−r[rs2]”, where each of the sources is treated as a pair of independent 16-bit quantities yielding a pair of independent 16-bit differences in r[rd]. psub also can have a 14-bit immediate as its second source operand, in which case the sign-extended immediate is subtracted from each of the 16-bit numbers in r[rs1] and the two 16-bit differences are left in r[rd].
- psub produces ordinary two's-complement integer results, and psubs produces saturated results.
- The suggested assembler syntax is:
psub rs1,reg_or_imm14,rd
psubs rs1,rs2,rd
- psub and psubs are UFU operations that use the UFU 2-source format.
- rem
- rem is an integer instruction that computes “r[rs1] % r[rs2]” or “r[rs1] % sign_ext(imm14)”. The use of an immediate for the second source operand sets the i-bit of the opcode. The result is left in r[rd].
- If the second source operand is zero, rem will trap.
- The suggested assembler syntax is:
rem rs1,reg_or_imm14,rd
- rem is an SFU operation that uses the SFU compute format.
- retry
- retry is a control flow instruction that causes a control transfer from a trap handler to the instruction word that caused the trap. Please refer to the Traps chapter of the Café Microarchitecture specification for the complete description.
- The suggested assembler syntax is:
retry
- retry is an SFU operation that uses the SFU compute format, but it has no use for any operand.
- return
- return is a control flow instruction that causes a register-indirect control transfer to the address in “r[rs1]”.
- The suggested assembler syntax is:
return rs1
- return is a pseudo-op. What it really means is “jmpl r[rs1]+0, r0”.
- s2ib, s2is, s2iw
- Note: s2is and s2iw might be removed. This section will be rewritten when the matter is settled.
- s2ib, s2is, and s2iw are memory access instructions that store an 8-bit byte, a 16-bit short, or a 32-bit word from the register specified by r[rd] to address. The use of an immediate for the second component of the address sets the i-bit of the opcode. The intent is that these are the instructions to be used to modify code on the fly, so these stores guarantee instruction cache consistency.
- For s2is and s2iw the effective address to which to store must be 2-byte aligned or 4-byte aligned, respectively, but the consequences of failing to do that are not defined. Because of this alignment requirement and the fact that Café instructions can be at any byte boundary, care must be taken in choosing the right instruction.
- The suggested assembler syntax is:
s2ib rd,[address]
s2is rd,[address]
s2iw rd,[address]
- s2i[b, s, w] are SFU operations that use the SFU memory format.
- sethi
- sethi is an integer instruction that places its immediate operand in the high-order 22 bits of r[rd] and clears the low-order 10 bits. It is frequently used with the %hi operator to form base addresses for subsequent memory references.
- The suggested assembler syntax is:
sethi imm22,rd
sethi %hi(label),rd
- sethi is an SFU operation that uses a format of its own.
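The bit placement performed by sethi, and its pairing with the %hi operator, can be illustrated with a short Python sketch. It assumes 32-bit registers (22 + 10 bits). The helper `lo` is hypothetical and only shows how the remaining 10 bits could be supplied by a later instruction, for example as the immediate component of an address.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit general purpose registers


def hi(value):
    """Model of the %hi operator: the high-order 22 bits of a 32-bit value."""
    return (value & MASK32) >> 10


def lo(value):
    """Hypothetical helper: the low-order 10 bits that sethi clears."""
    return value & 0x3FF


def sethi(imm22):
    """Place imm22 in the high-order 22 bits and clear the low-order 10."""
    return (imm22 & 0x3FFFFF) << 10


label = 0x12345678            # hypothetical address of a label
base = sethi(hi(label))       # 0x12345400: low-order 10 bits cleared
address = base + lo(label)    # 0x12345678: full address recovered
```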
- setir
- setir is an integer instruction that sets the internal register whose ordinal is given by its r[rd] operand to the value in its r[rs1] operand.
- The suggested assembler syntax is:
setir rs1,rd
- setir is an SFU operation that uses the SFU compute format in an irregular way. It has no use for the second source operand, and the destination register is an internal register number, NOT one of the general purpose registers.
- shll
- shll is a logical instruction that computes “r[rs1]<<r[rs2]” or “r[rs1]<<imm”. The use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The result is left in r[rd]. Only the low-order 5 bits of the second source operand are used.
- The suggested assembler syntax is:
shll rs1,reg_or_imm,rd
- The SFU version of shll uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- shra
- shra is a logical instruction that computes “r[rs1]>>r[rs2]” or “r[rs1]>>imm”. The use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The result is left in r[rd]. The first source operand is treated as a signed integer, so the result is sign-extended. Only the low-order 5 bits of the second source operand are used.
- The suggested assembler syntax is:
shra rs1,reg_or_imm,rd
- The SFU version of shra uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- shrl
- shrl is a logical instruction that computes “r[rs1]>>r[rs2]” or “r[rs1]>>imm”. The use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The result is left in r[rd]. The first source operand is treated as an unsigned integer, so the result is not sign-extended. Only the low-order 5 bits of the second source operand are used.
- The suggested assembler syntax is:
shrl rs1,reg_or_imm,rd
- The SFU version of shrl uses the SFU compute format, and the UFU version uses the UFU 2-source format.
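The difference between the three shift instructions (shll, shra, shrl) can be made concrete with a small Python model. It is a sketch that assumes 32-bit registers, which is consistent with only the low-order 5 bits of the shift count being used; the function names simply mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit general purpose registers


def shift_count(src2):
    """Only the low-order 5 bits of the second source operand are used."""
    return src2 & 0x1F


def shll(a, src2):
    """Logical shift left; bits shifted past bit 31 are discarded."""
    return (a << shift_count(src2)) & MASK32


def shrl(a, src2):
    """Logical shift right; the operand is unsigned, so zeros shift in."""
    return (a & MASK32) >> shift_count(src2)


def shra(a, src2):
    """Arithmetic shift right; the operand is signed, so the sign bit shifts in."""
    a &= MASK32
    signed = a - (1 << 32) if a & 0x80000000 else a
    return (signed >> shift_count(src2)) & MASK32
```

For example, with the operand 0x80000000 and a count of 4, `shrl` yields 0x08000000 while `shra` yields 0xF8000000. A count of 33 behaves like a count of 1 because the upper bits of the count are ignored.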
- sir
- sir is a control flow instruction that resets the machine. sir is a privileged instruction; executing it when the Supervisor Mode flag of %psr is clear causes a privileged instruction trap.
- The suggested assembler syntax is:
sir
- sir is an SFU operation that uses the SFU compute format, but has no use for any of that format's operands.
- softtrap
- softtrap is a control flow instruction that generates a trap. The ordinal of the trap is specified by r[rs1]−r[rs2] or r[rs1]−sign_ext(imm14). For details on how traps work, see the Traps chapter of the Café Microarchitecture specification.
- The suggested assembler syntax is:
softtrap rs1,reg_or_imm14
- softtrap is an SFU operation that uses the SFU compute format but has no use for the destination register field of that format.
- stb, sts, stw
- stb, sts, and stw are memory access instructions that store an 8-bit byte, a 16-bit short, or a 32-bit word from the register specified by r[rd] to address. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- For sts and stw the effective address to which to store must be 2-byte aligned or 4-byte aligned, respectively, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
stb rd,[address]
sts rd,[address]
stw rd,[address]
- st[b, s, w] are SFU operations that use the SFU memory format.
- stg
- stg would be the mnemonic for a store group instruction if there were one, but there is not. Most careful readers notice this asymmetry and ask whether it is an oversight, so this note hopes to explain and forestall that question.
- Since the register file read port can deliver no more than 64 bits per cycle, a store of a group would induce a 3-cycle stall. Doing the stores with four stpair instructions would make better use of the issue bandwidth because that would allow work to proceed on the other units.
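The stall figure quoted above follows from simple arithmetic, sketched here in Python. The group size of eight 32-bit registers is an assumption inferred from the statement that four stpair instructions would cover a group; it is not stated directly in this section. The name `read_cycles` is hypothetical.

```python
PORT_BITS_PER_CYCLE = 64  # from the text: read port delivers at most 64 bits per cycle
REGISTER_BITS = 32        # assumption: 32-bit registers
GROUP_REGISTERS = 8       # assumption: a group is four register pairs


def read_cycles(total_bits, port_bits=PORT_BITS_PER_CYCLE):
    """Cycles needed to read total_bits through the register file read port."""
    return -(-total_bits // port_bits)  # ceiling division


group_bits = GROUP_REGISTERS * REGISTER_BITS   # 256 bits
cycles_for_group = read_cycles(group_bits)     # 4 cycles
stall_cycles = cycles_for_group - 1            # 3 cycles beyond the issue cycle
cycles_per_stpair = read_cycles(2 * REGISTER_BITS)  # 1 cycle, so no stall
```

Under these assumptions a single store-group instruction would occupy the read port for four cycles, whereas each stpair needs only one, leaving the other units free to issue work in the same cycles.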
- stso, stwo
- stso and stwo are memory access instructions that store a 16-bit short or a 32-bit word from the register specified by r[rd] to address using the opposite endianness from that indicated by the endian-bit of %psr. The use of an immediate for the second component of the address sets the i-bit of the opcode.
- For stso and stwo the effective address to which to store must be 2-byte aligned or 4-byte aligned, respectively, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
stso rd,[address]
stwo rd,[address]
- stso and stwo are SFU operations that use the SFU memory format.
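The effect of the opposite-endian stores can be sketched in Python using the standard `struct` module. The model below is illustrative only: memory is a `bytearray`, the flag `psr_big_endian` is a hypothetical stand-in for the endian-bit of %psr (which bit value selects which byte order is not stated here), and no alignment check is made because the consequences of misalignment are not defined.

```python
import struct

_CODES = {2: "H", 4: "I"}  # struct codes for a 16-bit short and a 32-bit word


def _store(mem, addr, value, size, psr_big_endian, opposite):
    """Store `size` bytes of `value` at `addr`, optionally flipping byte order."""
    big_endian = psr_big_endian != opposite  # opposite=True inverts the %psr setting
    fmt = (">" if big_endian else "<") + _CODES[size]
    struct.pack_into(fmt, mem, addr, value & ((1 << (8 * size)) - 1))


def sts(mem, addr, value, psr_big_endian):
    _store(mem, addr, value, 2, psr_big_endian, opposite=False)


def stw(mem, addr, value, psr_big_endian):
    _store(mem, addr, value, 4, psr_big_endian, opposite=False)


def stso(mem, addr, value, psr_big_endian):
    _store(mem, addr, value, 2, psr_big_endian, opposite=True)


def stwo(mem, addr, value, psr_big_endian):
    _store(mem, addr, value, 4, psr_big_endian, opposite=True)


memory = bytearray(8)
stw(memory, 0, 0x11223344, psr_big_endian=True)   # bytes 11 22 33 44
stwo(memory, 4, 0x11223344, psr_big_endian=True)  # bytes 44 33 22 11
```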
- stpair
- stpair is a memory access instruction that performs a store of a pair of adjacent registers beginning at the register specified by r[rd] to address. The use of an immediate for the second component of the address sets the i-bit of the opcode. r[rd] must be an even-numbered register.
- The effective address to which to store must be even word-aligned, but the consequences of failing to do that are not defined.
- The suggested assembler syntax is:
stpair rd,[address]
- stpair is an SFU operation that uses the SFU memory format.
- sub
- sub is an integer instruction that computes r[rs1]−r[rs2] or r[rs1]−sign_ext(imm14). The use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The result is left in r[rd].
- The suggested assembler syntax is:
sub rs1,reg_or_imm14,rd
- The SFU version of sub uses the SFU compute format, and the UFU version uses the UFU 2-source format.
- xor
- xor is a logical instruction that computes “r[rs1]^r[rs2]” or “r[rs1]^imm14”. The use of an immediate for the second source operand sets the i-bit of the opcode of the SFU version or the first header bit of the UFU version. The bit-wise logical result is left in r[rd].
- The suggested assembler syntax is:
xor rs1,reg_or_imm14,rd
- The SFU version of xor uses the SFU compute format, and the UFU version uses the UFU 2-source format.
Claims (1)
1. A processor comprising:
a plurality of independent processor elements in a single integrated circuit chip capable of executing a respective plurality of threads concurrently in a multiple-thread mode of operation; and
at least some of the independent processing elements including plural processing units,
wherein at least some of the threads are executable in parallel on plural ones of the processing units in accordance with an instruction set that encodes the parallel execution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/818,785 US20050010743A1 (en) | 1998-12-03 | 2004-04-06 | Multiple-thread processor for threaded software applications |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/204,480 US6718457B2 (en) | 1998-12-03 | 1998-12-03 | Multiple-thread processor for threaded software applications |
US10/818,785 US20050010743A1 (en) | 1998-12-03 | 2004-04-06 | Multiple-thread processor for threaded software applications |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/204,480 Continuation US6718457B2 (en) | 1998-12-03 | 1998-12-03 | Multiple-thread processor for threaded software applications |
Publications (1)
Publication Number | Publication Date |
---|---|
US20050010743A1 (en) | 2005-01-13 |
Family
ID=22758067
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/204,480 Expired - Lifetime US6718457B2 (en) | 1998-12-03 | 1998-12-03 | Multiple-thread processor for threaded software applications |
US09/589,039 Expired - Lifetime US7042466B1 (en) | 1998-12-03 | 2000-06-06 | Efficient clip-testing in graphics acceleration |
US10/818,785 Abandoned US20050010743A1 (en) | 1998-12-03 | 2004-04-06 | Multiple-thread processor for threaded software applications |
US11/382,203 Abandoned US20060282650A1 (en) | 1998-12-03 | 2006-05-08 | Efficient clip-testing |
Country Status (4)
Country | Link |
---|---|
US (4) | US6718457B2 (en) |
EP (1) | EP1137984B1 (en) |
DE (1) | DE69909829T2 (en) |
WO (1) | WO2000033185A2 (en) |
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030221077A1 (en) * | 2002-04-26 | 2003-11-27 | Hitachi, Ltd. | Method for controlling storage system, and storage control apparatus |
US20040103261A1 (en) * | 2002-11-25 | 2004-05-27 | Hitachi, Ltd. | Virtualization controller and data transfer control method |
US20050060507A1 (en) * | 2003-09-17 | 2005-03-17 | Hitachi, Ltd. | Remote storage disk control device with function to transfer commands to remote storage devices |
US20050060505A1 (en) * | 2003-09-17 | 2005-03-17 | Hitachi, Ltd. | Remote storage disk control device and method for controlling the same |
US20050071559A1 (en) * | 2003-09-29 | 2005-03-31 | Keishi Tamura | Storage system and storage controller |
US20050102479A1 (en) * | 2002-09-18 | 2005-05-12 | Hitachi, Ltd. | Storage system, and method for controlling the same |
US20050160222A1 (en) * | 2004-01-19 | 2005-07-21 | Hitachi, Ltd. | Storage device control device, storage system, recording medium in which a program is stored, information processing device and storage system control method |
US20050193167A1 (en) * | 2004-02-26 | 2005-09-01 | Yoshiaki Eguchi | Storage subsystem and performance tuning method |
US20050246491A1 (en) * | 2003-01-16 | 2005-11-03 | Yasutomo Yamamoto | Storage unit, installation method thereof and installation program therefore |
US20060010502A1 (en) * | 2003-11-26 | 2006-01-12 | Hitachi, Ltd. | Method and apparatus for setting access restriction information |
US20060026397A1 (en) * | 2004-07-27 | 2006-02-02 | Texas Instruments Incorporated | Pack instruction |
US20060047906A1 (en) * | 2004-08-30 | 2006-03-02 | Shoko Umemura | Data processing system |
US20060090048A1 (en) * | 2004-10-27 | 2006-04-27 | Katsuhiro Okumoto | Storage system and storage control device |
US20060195669A1 (en) * | 2003-09-16 | 2006-08-31 | Hitachi, Ltd. | Storage system and storage control device |
US20060242645A1 (en) * | 2005-04-26 | 2006-10-26 | Lucian Codrescu | System and method of executing program threads in a multi-threaded processor |
US20070067607A1 (en) * | 2005-09-19 | 2007-03-22 | Via Technologies, Inc. | Selecting multiple threads for substantially concurrent processing |
US20070143581A1 (en) * | 2005-12-21 | 2007-06-21 | Arm Limited | Superscalar data processing apparatus and method |
US20070174542A1 (en) * | 2003-06-24 | 2007-07-26 | Koichi Okada | Data migration method for disk apparatus |
US20070174508A1 (en) * | 2003-07-31 | 2007-07-26 | King Matthew E | Non-fenced list dma command mechanism |
US20080028196A1 (en) * | 2006-07-27 | 2008-01-31 | Krishnan Kunjunny Kailas | Method and apparatus for fast synchronization and out-of-order execution of instructions in a meta-program based computing system |
US20080282034A1 (en) * | 2005-09-19 | 2008-11-13 | Via Technologies, Inc. | Memory Subsystem having a Multipurpose Cache for a Stream Graphics Multiprocessor |
CN100456230C (en) * | 2007-03-19 | 2009-01-28 | 中国人民解放军国防科学技术大学 | Computing group structure for superlong instruction word and instruction flow multidata stream fusion |
US20090300621A1 (en) * | 2008-05-30 | 2009-12-03 | Advanced Micro Devices, Inc. | Local and Global Data Share |
US7665070B2 (en) | 2004-04-23 | 2010-02-16 | International Business Machines Corporation | Method and apparatus for a computing system using meta program representation |
US20110078427A1 (en) * | 2009-09-29 | 2011-03-31 | Shebanow Michael C | Trap handler architecture for a parallel processing unit |
US20110167244A1 (en) * | 2008-02-20 | 2011-07-07 | International Business Machines Corporation | Early instruction text based operand store compare reject avoidance |
US20110179197A1 (en) * | 2002-08-08 | 2011-07-21 | Ibm Corporation | Method and system for storing memory compressed data onto memory compressed disks |
US20120079247A1 (en) * | 2010-09-28 | 2012-03-29 | Anderson Timothy D | Dual register data path architecture |
US20130205298A1 (en) * | 2012-02-06 | 2013-08-08 | Samsung Electronics Co., Ltd. | Apparatus and method for memory overlay |
US20150052533A1 (en) * | 2013-08-13 | 2015-02-19 | Samsung Electronics Co., Ltd. | Multiple threads execution processor and operating method thereof |
US20150052307A1 (en) * | 2013-08-15 | 2015-02-19 | Fujitsu Limited | Processor and control method of processor |
US20150100737A1 (en) * | 2013-10-03 | 2015-04-09 | Cavium, Inc. | Method And Apparatus For Conditional Storing Of Data Using A Compare-And-Swap Based Approach |
US20160283209A1 (en) * | 2015-03-25 | 2016-09-29 | International Business Machines Corporation | Unaligned instruction relocation |
US9501243B2 (en) | 2013-10-03 | 2016-11-22 | Cavium, Inc. | Method and apparatus for supporting wide operations using atomic sequences |
US9626189B2 (en) | 2012-06-15 | 2017-04-18 | International Business Machines Corporation | Reducing operand store compare penalties |
WO2017132385A1 (en) * | 2016-01-26 | 2017-08-03 | Icat Llc | Processor with reconfigurable algorithmic pipelined core and algorithmic matching pipelined compiler |
US20190197001A1 (en) * | 2017-12-22 | 2019-06-27 | Alibaba Group Holding Limited | Centralized-distributed mixed organization of shared memory for neural network processing |
TWI682357B (en) * | 2017-04-28 | 2020-01-11 | 美商英特爾股份有限公司 | Compute optimizations for low precision machine learning operations |
US11062077B1 (en) * | 2019-06-24 | 2021-07-13 | Amazon Technologies, Inc. | Bit-reduced verification for memory arrays |
Families Citing this family (49)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6075935A (en) * | 1997-12-01 | 2000-06-13 | Improv Systems, Inc. | Method of generating application specific integrated circuits using a programmable hardware architecture |
JP3541669B2 (en) * | 1998-03-30 | 2004-07-14 | 松下電器産業株式会社 | Arithmetic processing unit |
US7587582B1 (en) | 1998-12-03 | 2009-09-08 | Sun Microsystems, Inc. | Method and apparatus for parallel arithmetic operations |
US6718457B2 (en) * | 1998-12-03 | 2004-04-06 | Sun Microsystems, Inc. | Multiple-thread processor for threaded software applications |
WO2001016702A1 (en) | 1999-09-01 | 2001-03-08 | Intel Corporation | Register set used in multithreaded parallel processor architecture |
US6968469B1 (en) | 2000-06-16 | 2005-11-22 | Transmeta Corporation | System and method for preserving internal processor context when the processor is powered down and restoring the internal processor context when processor is restored |
WO2002015000A2 (en) * | 2000-08-16 | 2002-02-21 | Sun Microsystems, Inc. | General purpose processor with graphics/media support |
US7681018B2 (en) | 2000-08-31 | 2010-03-16 | Intel Corporation | Method and apparatus for providing large register address space while maximizing cycletime performance for a multi-threaded register file set |
US7127588B2 (en) * | 2000-12-05 | 2006-10-24 | Mindspeed Technologies, Inc. | Apparatus and method for an improved performance VLIW processor |
US8762581B2 (en) * | 2000-12-22 | 2014-06-24 | Avaya Inc. | Multi-thread packet processor |
CA2346762A1 (en) * | 2001-05-07 | 2002-11-07 | Ibm Canada Limited-Ibm Canada Limitee | Compiler generation of instruction sequences for unresolved storage devices |
US6954846B2 (en) * | 2001-08-07 | 2005-10-11 | Sun Microsystems, Inc. | Microprocessor and method for giving each thread exclusive access to one register file in a multi-threading mode and for giving an active thread access to multiple register files in a single thread mode |
US7500240B2 (en) * | 2002-01-15 | 2009-03-03 | Intel Corporation | Apparatus and method for scheduling threads in multi-threading processors |
AU2003219666A1 (en) * | 2002-01-15 | 2003-07-30 | Chip Engines | Reconfigurable control processor for multi-protocol resilient packet ring processor |
US6934951B2 (en) * | 2002-01-17 | 2005-08-23 | Intel Corporation | Parallel processor with functional pipeline providing programming engines by supporting multiple contexts and critical section |
US7437724B2 (en) * | 2002-04-03 | 2008-10-14 | Intel Corporation | Registers for data transfers |
US7210127B1 (en) | 2003-04-03 | 2007-04-24 | Sun Microsystems | Methods and apparatus for executing instructions in parallel |
US7600221B1 (en) | 2003-10-06 | 2009-10-06 | Sun Microsystems, Inc. | Methods and apparatus of an architecture supporting execution of instructions in parallel |
US7380086B2 (en) * | 2003-12-12 | 2008-05-27 | International Business Machines Corporation | Scalable runtime system for global address space languages on shared and distributed memory machines |
US8643659B1 (en) | 2003-12-31 | 2014-02-04 | 3Dlabs Inc., Ltd. | Shader with global and instruction caches |
US20050210472A1 (en) * | 2004-03-18 | 2005-09-22 | International Business Machines Corporation | Method and data processing system for per-chip thread queuing in a multi-processor system |
US7941585B2 (en) * | 2004-09-10 | 2011-05-10 | Cavium Networks, Inc. | Local scratchpad and data caching system |
WO2006031551A2 (en) * | 2004-09-10 | 2006-03-23 | Cavium Networks | Selective replication of data structure |
US7594081B2 (en) | 2004-09-10 | 2009-09-22 | Cavium Networks, Inc. | Direct access to low-latency memory |
US7503368B2 (en) | 2004-11-24 | 2009-03-17 | The Boeing Company | Composite sections for aircraft fuselages and other structures, and methods and systems for manufacturing such sections |
US8732368B1 (en) | 2005-02-17 | 2014-05-20 | Hewlett-Packard Development Company, L.P. | Control system for resource selection between or among conjoined-cores |
US9003168B1 (en) * | 2005-02-17 | 2015-04-07 | Hewlett-Packard Development Company, L. P. | Control system for resource selection between or among conjoined-cores |
US8713286B2 (en) * | 2005-04-26 | 2014-04-29 | Qualcomm Incorporated | Register files for a digital signal processor operating in an interleaved multi-threaded environment |
WO2006128062A2 (en) * | 2005-05-25 | 2006-11-30 | Terracotta, Inc. | Database caching of queries and stored procedures using database provided facilities for dependency analysis and detected database updates for invalidation |
US9176741B2 (en) * | 2005-08-29 | 2015-11-03 | Invention Science Fund I, Llc | Method and apparatus for segmented sequential storage |
US8296550B2 (en) * | 2005-08-29 | 2012-10-23 | The Invention Science Fund I, Llc | Hierarchical register file with operand capture ports |
US8275976B2 (en) * | 2005-08-29 | 2012-09-25 | The Invention Science Fund I, Llc | Hierarchical instruction scheduler facilitating instruction replay |
US20070083735A1 (en) * | 2005-08-29 | 2007-04-12 | Glew Andrew F | Hierarchical processor |
US7644258B2 (en) * | 2005-08-29 | 2010-01-05 | Searete, Llc | Hybrid branch predictor using component predictors each having confidence and override signals |
US9501448B2 (en) * | 2008-05-27 | 2016-11-22 | Stillwater Supercomputing, Inc. | Execution engine for executing single assignment programs with affine dependencies |
US8255905B2 (en) | 2008-06-27 | 2012-08-28 | Microsoft Corporation | Multi-threaded processes for opening and saving documents |
US20100191911A1 (en) * | 2008-12-23 | 2010-07-29 | Marco Heddes | System-On-A-Chip Having an Array of Programmable Processing Elements Linked By an On-Chip Network with Distributed On-Chip Shared Memory and External Shared Memory |
US8428930B2 (en) * | 2009-09-18 | 2013-04-23 | International Business Machines Corporation | Page mapped spatially aware emulation of a computer instruction set |
US8447583B2 (en) | 2009-09-18 | 2013-05-21 | International Business Machines Corporation | Self initialized host cell spatially aware emulation of a computer instruction set |
US8617049B2 (en) * | 2009-09-18 | 2013-12-31 | Ethicon Endo-Surgery, Inc. | Symmetrical drive system for an implantable restriction device |
US8301434B2 (en) | 2009-09-18 | 2012-10-30 | International Buisness Machines Corporation | Host cell spatially aware emulation of a guest wild branch |
US9158566B2 (en) | 2009-09-18 | 2015-10-13 | International Business Machines Corporation | Page mapped spatially aware emulation of computer instruction set |
US8949106B2 (en) * | 2009-09-18 | 2015-02-03 | International Business Machines Corporation | Just in time compiler in spatially aware emulation of a guest computer instruction set |
US8756589B2 (en) | 2011-06-14 | 2014-06-17 | Microsoft Corporation | Selectable dual-mode JIT compiler for SIMD instructions |
US8898376B2 (en) | 2012-06-04 | 2014-11-25 | Fusion-Io, Inc. | Apparatus, system, and method for grouping data stored on an array of solid-state storage elements |
US9563425B2 (en) | 2012-11-28 | 2017-02-07 | Intel Corporation | Instruction and logic to provide pushing buffer copy and store functionality |
US9317294B2 (en) | 2012-12-06 | 2016-04-19 | International Business Machines Corporation | Concurrent multiple instruction issue of non-pipelined instructions using non-pipelined operation resources in another processing core |
KR102258414B1 (en) * | 2017-04-19 | 2021-05-28 | 상하이 캠브리콘 인포메이션 테크놀로지 컴퍼니 리미티드 | Processing apparatus and processing method |
CN110750232B (en) * | 2019-10-17 | 2023-06-20 | 电子科技大学 | SRAM-based parallel multiplication and addition device |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5179702A (en) * | 1989-12-29 | 1993-01-12 | Supercomputer Systems Limited Partnership | System and method for controlling a highly parallel multiprocessor using an anarchy based scheduler for parallel execution thread scheduling |
US5553305A (en) * | 1992-04-14 | 1996-09-03 | International Business Machines Corporation | System for synchronizing execution by a processing element of threads within a process using a state indicator |
US5574939A (en) * | 1993-05-14 | 1996-11-12 | Massachusetts Institute Of Technology | Multiprocessor coupling system with integrated compile and run time scheduling for parallelism |
US5933627A (en) * | 1996-07-01 | 1999-08-03 | Sun Microsystems | Thread switch on blocked load or store using instruction thread field |
US6205543B1 (en) * | 1998-12-03 | 2001-03-20 | Sun Microsystems, Inc. | Efficient handling of a large register file for context switching |
US6249861B1 (en) * | 1998-12-03 | 2001-06-19 | Sun Microsystems, Inc. | Instruction fetch unit aligner for a non-power of two size VLIW instruction |
US6279100B1 (en) * | 1998-12-03 | 2001-08-21 | Sun Microsystems, Inc. | Local stall control method and structure in a microprocessor |
US20010042190A1 (en) * | 1998-12-03 | 2001-11-15 | Marc Tremblay | Local and global register partitioning in a vliw processor |
US6321325B1 (en) * | 1998-12-03 | 2001-11-20 | Sun Microsystems, Inc. | Dual in-line buffers for an instruction fetch unit |
US20010052063A1 (en) * | 1998-12-03 | 2001-12-13 | Marc Tremblay | Implicitly derived register specifiers in a processor |
US6343348B1 (en) * | 1998-12-03 | 2002-01-29 | Sun Microsystems, Inc. | Apparatus and method for optimizing die utilization and speed performance by register file splitting |
US6615338B1 (en) * | 1998-12-03 | 2003-09-02 | Sun Microsystems, Inc. | Clustered architecture in a VLIW processor |
US6658447B2 (en) * | 1997-07-08 | 2003-12-02 | Intel Corporation | Priority based simultaneous multi-threading |
Family Cites Families (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5197130A (en) * | 1989-12-29 | 1993-03-23 | Supercomputer Systems Limited Partnership | Cluster architecture for a highly parallel scalar/vector multiprocessor system |
US5706415A (en) * | 1991-12-20 | 1998-01-06 | Apple Computer, Inc. | Method and apparatus for distributed interpolation of pixel shading parameter values |
US5517603A (en) * | 1991-12-20 | 1996-05-14 | Apple Computer, Inc. | Scanline rendering device for generating pixel values for displaying three-dimensional graphical images |
US5307449A (en) * | 1991-12-20 | 1994-04-26 | Apple Computer, Inc. | Method and apparatus for simultaneously rendering multiple scanlines |
US5345541A (en) * | 1991-12-20 | 1994-09-06 | Apple Computer, Inc. | Method and apparatus for approximating a value between two endpoint values in a three-dimensional image rendering device |
DE69418646T2 (en) * | 1993-06-04 | 2000-06-29 | Sun Microsystems Inc | Floating point processor for a high-performance three-dimensional graphics accelerator |
JP3676411B2 (en) | 1994-01-21 | 2005-07-27 | サン・マイクロシステムズ・インコーポレイテッド | Register file device and register file access method |
JP3547482B2 (en) * | 1994-04-15 | 2004-07-28 | 株式会社日立製作所 | Information processing equipment |
AUPM704294A0 (en) * | 1994-07-25 | 1994-08-18 | Canon Information Systems Research Australia Pty Ltd | Method and apparatus for the creation of images |
US5761475A (en) | 1994-12-15 | 1998-06-02 | Sun Microsystems, Inc. | Computer processor having a register file with reduced read and/or write port bandwidth |
US5742796A (en) * | 1995-03-24 | 1998-04-21 | 3Dlabs Inc. Ltd. | Graphics system with color space double buffering |
US5712799A (en) | 1995-04-04 | 1998-01-27 | Chromatic Research, Inc. | Method and structure for performing motion estimation using reduced precision pixel intensity values |
US5689674A (en) * | 1995-10-31 | 1997-11-18 | Intel Corporation | Method and apparatus for binding instructions to dispatch ports of a reservation station |
US5764943A (en) | 1995-12-28 | 1998-06-09 | Intel Corporation | Data path circuitry for processor having multiple instruction pipelines |
EP0976029A2 (en) * | 1996-01-24 | 2000-02-02 | Sun Microsystems, Inc. | A processor for executing instruction sets received from a network or from a local memory |
US5657291A (en) | 1996-04-30 | 1997-08-12 | Sun Microsystems, Inc. | Multiport register file memory cell configuration for read operation |
US5778248A (en) | 1996-06-17 | 1998-07-07 | Sun Microsystems, Inc. | Fast microprocessor stage bypass logic enable |
US5778243A (en) * | 1996-07-03 | 1998-07-07 | International Business Machines Corporation | Multi-threaded cell for a memory |
US5872963A (en) * | 1997-02-18 | 1999-02-16 | Silicon Graphics, Inc. | Resumption of preempted non-privileged threads with no kernel intervention |
US5974538A (en) * | 1997-02-21 | 1999-10-26 | Wilmot, Ii; Richard Byron | Method and apparatus for annotating operands in a computer system with source instruction identifiers |
US6137497A (en) * | 1997-05-30 | 2000-10-24 | Hewlett-Packard Company | Post transformation clipping in a geometry accelerator |
US6052128A (en) * | 1997-07-23 | 2000-04-18 | International Business Machines Corp. | Method and apparatus for clipping convex polygons on single instruction multiple data computers |
US6052129A (en) * | 1997-10-01 | 2000-04-18 | International Business Machines Corporation | Method and apparatus for deferred clipping of polygons |
US6212544B1 (en) * | 1997-10-23 | 2001-04-03 | International Business Machines Corporation | Altering thread priorities in a multithreaded processor |
US6092175A (en) * | 1998-04-02 | 2000-07-18 | University Of Washington | Shared register storage mechanisms for multithreaded computer systems with out-of-order execution |
JP3983394B2 (en) * | 1998-11-09 | 2007-09-26 | 株式会社ルネサステクノロジ | Geometry processor |
US6718457B2 (en) * | 1998-12-03 | 2004-04-06 | Sun Microsystems, Inc. | Multiple-thread processor for threaded software applications |
US6714197B1 (en) * | 1999-07-30 | 2004-03-30 | Mips Technologies, Inc. | Processor having an arithmetic extension of an instruction set architecture |
US6671796B1 (en) * | 2000-02-25 | 2003-12-30 | Sun Microsystems, Inc. | Converting an arbitrary fixed point value to a floating point value |
1998
- 1998-12-03 US US09/204,480 patent/US6718457B2/en not_active Expired - Lifetime
1999
- 1999-12-03 DE DE69909829T patent/DE69909829T2/en not_active Expired - Lifetime
- 1999-12-03 EP EP99963017A patent/EP1137984B1/en not_active Expired - Lifetime
- 1999-12-03 WO PCT/US1999/028821 patent/WO2000033185A2/en active IP Right Grant
2000
- 2000-06-06 US US09/589,039 patent/US7042466B1/en not_active Expired - Lifetime
2004
- 2004-04-06 US US10/818,785 patent/US20050010743A1/en not_active Abandoned
2006
- 2006-05-08 US US11/382,203 patent/US20060282650A1/en not_active Abandoned
Cited By (108)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7937513B2 (en) | 2002-04-26 | 2011-05-03 | Hitachi, Ltd. | Method for controlling storage system, and storage control apparatus |
US20030221077A1 (en) * | 2002-04-26 | 2003-11-27 | Hitachi, Ltd. | Method for controlling storage system, and storage control apparatus |
US20050235107A1 (en) * | 2002-04-26 | 2005-10-20 | Hitachi, Ltd. | Method for controlling storage system, and storage control apparatus |
US8161206B2 (en) | 2002-08-08 | 2012-04-17 | International Business Machines Corporation | Method and system for storing memory compressed data onto memory compressed disks |
US8230139B2 (en) | 2002-08-08 | 2012-07-24 | International Business Machines Corporation | Method and system for storing memory compressed data onto memory compressed disks |
US20110179197A1 (en) * | 2002-08-08 | 2011-07-21 | Ibm Corporation | Method and system for storing memory compressed data onto memory compressed disks |
US20110185132A1 (en) * | 2002-08-08 | 2011-07-28 | Ibm Corporation | Method and system for storing memory compressed data onto memory compressed disks |
US8250265B2 (en) * | 2002-08-08 | 2012-08-21 | International Business Machines Corporation | Method and system for storing memory compressed data onto memory compressed disks |
US20060036777A1 (en) * | 2002-09-18 | 2006-02-16 | Hitachi, Ltd. | Storage system, and method for controlling the same |
US20050102479A1 (en) * | 2002-09-18 | 2005-05-12 | Hitachi, Ltd. | Storage system, and method for controlling the same |
US8572352B2 (en) | 2002-11-25 | 2013-10-29 | Hitachi, Ltd. | Virtualization controller and data transfer control method |
US20070192558A1 (en) * | 2002-11-25 | 2007-08-16 | Kiyoshi Honda | Virtualization controller and data transfer control method |
US8190852B2 (en) | 2002-11-25 | 2012-05-29 | Hitachi, Ltd. | Virtualization controller and data transfer control method |
US20040103261A1 (en) * | 2002-11-25 | 2004-05-27 | Hitachi, Ltd. | Virtualization controller and data transfer control method |
US7694104B2 (en) | 2002-11-25 | 2010-04-06 | Hitachi, Ltd. | Virtualization controller and data transfer control method |
US7877568B2 (en) | 2002-11-25 | 2011-01-25 | Hitachi, Ltd. | Virtualization controller and data transfer control method |
US20040250021A1 (en) * | 2002-11-25 | 2004-12-09 | Hitachi, Ltd. | Virtualization controller and data transfer control method |
US20050246491A1 (en) * | 2003-01-16 | 2005-11-03 | Yasutomo Yamamoto | Storage unit, installation method thereof and installation program therefore |
US20060248302A1 (en) * | 2003-01-16 | 2006-11-02 | Yasutomo Yamamoto | Storage unit, installation method thereof and installation program therefore |
US20070174542A1 (en) * | 2003-06-24 | 2007-07-26 | Koichi Okada | Data migration method for disk apparatus |
US7444435B2 (en) * | 2003-07-31 | 2008-10-28 | International Business Machines Corporation | Non-fenced list DMA command mechanism |
US20070174508A1 (en) * | 2003-07-31 | 2007-07-26 | King Matthew E | Non-fenced list dma command mechanism |
US20060195669A1 (en) * | 2003-09-16 | 2006-08-31 | Hitachi, Ltd. | Storage system and storage control device |
US20070192554A1 (en) * | 2003-09-16 | 2007-08-16 | Hitachi, Ltd. | Storage system and storage control device |
US20050114599A1 (en) * | 2003-09-17 | 2005-05-26 | Hitachi, Ltd. | Remote storage disk control device with function to transfer commands to remote storage devices |
US8255652B2 (en) | 2003-09-17 | 2012-08-28 | Hitachi, Ltd. | Remote storage disk control device and method for controlling the same |
US20070150680A1 (en) * | 2003-09-17 | 2007-06-28 | Hitachi, Ltd. | Remote storage disk control device with function to transfer commands to remote storage devices |
US20080172537A1 (en) * | 2003-09-17 | 2008-07-17 | Hitachi, Ltd. | Remote storage disk control device and method for controlling the same |
US20050166023A1 (en) * | 2003-09-17 | 2005-07-28 | Hitachi, Ltd. | Remote storage disk control device and method for controlling the same |
US20050060507A1 (en) * | 2003-09-17 | 2005-03-17 | Hitachi, Ltd. | Remote storage disk control device with function to transfer commands to remote storage devices |
US7707377B2 (en) | 2003-09-17 | 2010-04-27 | Hitachi, Ltd. | Remote storage disk control device and method for controlling the same |
US20050138313A1 (en) * | 2003-09-17 | 2005-06-23 | Hitachi, Ltd. | Remote storage disk control device with function to transfer commands to remote storage devices |
US7975116B2 (en) | 2003-09-17 | 2011-07-05 | Hitachi, Ltd. | Remote storage disk control device and method for controlling the same |
US20050060505A1 (en) * | 2003-09-17 | 2005-03-17 | Hitachi, Ltd. | Remote storage disk control device and method for controlling the same |
US20050071559A1 (en) * | 2003-09-29 | 2005-03-31 | Keishi Tamura | Storage system and storage controller |
US7373670B2 (en) | 2003-11-26 | 2008-05-13 | Hitachi, Ltd. | Method and apparatus for setting access restriction information |
US8156561B2 (en) | 2003-11-26 | 2012-04-10 | Hitachi, Ltd. | Method and apparatus for setting access restriction information |
US8806657B2 (en) | 2003-11-26 | 2014-08-12 | Hitachi, Ltd. | Method and apparatus for setting access restriction information |
US20060010502A1 (en) * | 2003-11-26 | 2006-01-12 | Hitachi, Ltd. | Method and apparatus for setting access restriction information |
US20050160222A1 (en) * | 2004-01-19 | 2005-07-21 | Hitachi, Ltd. | Storage device control device, storage system, recording medium in which a program is stored, information processing device and storage system control method |
US20060190550A1 (en) * | 2004-01-19 | 2006-08-24 | Hitachi, Ltd. | Storage system and controlling method thereof, and device and recording medium in storage system |
US20050193167A1 (en) * | 2004-02-26 | 2005-09-01 | Yoshiaki Eguchi | Storage subsystem and performance tuning method |
US8281098B2 (en) | 2004-02-26 | 2012-10-02 | Hitachi, Ltd. | Storage subsystem and performance tuning method |
US20070055820A1 (en) * | 2004-02-26 | 2007-03-08 | Hitachi, Ltd. | Storage subsystem and performance tuning method |
US8046554B2 (en) | 2004-02-26 | 2011-10-25 | Hitachi, Ltd. | Storage subsystem and performance tuning method |
US7809906B2 (en) | 2004-02-26 | 2010-10-05 | Hitachi, Ltd. | Device for performance tuning in a system |
US7665070B2 (en) | 2004-04-23 | 2010-02-16 | International Business Machines Corporation | Method and apparatus for a computing system using meta program representation |
US20060026397A1 (en) * | 2004-07-27 | 2006-02-02 | Texas Instruments Incorporated | Pack instruction |
US8843715B2 (en) | 2004-08-30 | 2014-09-23 | Hitachi, Ltd. | System managing a plurality of virtual volumes and a virtual volume management method for the system |
US7840767B2 (en) | 2004-08-30 | 2010-11-23 | Hitachi, Ltd. | System managing a plurality of virtual volumes and a virtual volume management method for the system |
US20060047906A1 (en) * | 2004-08-30 | 2006-03-02 | Shoko Umemura | Data processing system |
US20070245062A1 (en) * | 2004-08-30 | 2007-10-18 | Shoko Umemura | Data processing system |
US8122214B2 (en) | 2004-08-30 | 2012-02-21 | Hitachi, Ltd. | System managing a plurality of virtual volumes and a virtual volume management method for the system |
US20080016303A1 (en) * | 2004-10-27 | 2008-01-17 | Katsuhiro Okumoto | Storage system and storage control device |
US20060090048A1 (en) * | 2004-10-27 | 2006-04-27 | Katsuhiro Okumoto | Storage system and storage control device |
US7673107B2 (en) | 2004-10-27 | 2010-03-02 | Hitachi, Ltd. | Storage system and storage control device |
US7814487B2 (en) * | 2005-04-26 | 2010-10-12 | Qualcomm Incorporated | System and method of executing program threads in a multi-threaded processor |
US20060242645A1 (en) * | 2005-04-26 | 2006-10-26 | Lucian Codrescu | System and method of executing program threads in a multi-threaded processor |
US20070067607A1 (en) * | 2005-09-19 | 2007-03-22 | Via Technologies, Inc. | Selecting multiple threads for substantially concurrent processing |
US7454599B2 (en) * | 2005-09-19 | 2008-11-18 | Via Technologies, Inc. | Selecting multiple threads for substantially concurrent processing |
US20080282034A1 (en) * | 2005-09-19 | 2008-11-13 | Via Technologies, Inc. | Memory Subsystem having a Multipurpose Cache for a Stream Graphics Multiprocessor |
US20070143581A1 (en) * | 2005-12-21 | 2007-06-21 | Arm Limited | Superscalar data processing apparatus and method |
US7734897B2 (en) * | 2005-12-21 | 2010-06-08 | Arm Limited | Allocation of memory access operations to memory access capable pipelines in a superscalar data processing apparatus and method having a plurality of execution threads |
US8301870B2 (en) * | 2006-07-27 | 2012-10-30 | International Business Machines Corporation | Method and apparatus for fast synchronization and out-of-order execution of instructions in a meta-program based computing system |
US20080028196A1 (en) * | 2006-07-27 | 2008-01-31 | Krishnan Kunjunny Kailas | Method and apparatus for fast synchronization and out-of-order execution of instructions in a meta-program based computing system |
CN100456230C (en) * | 2007-03-19 | 2009-01-28 | 中国人民解放军国防科学技术大学 | Computing group structure for superlong instruction word and instruction flow multidata stream fusion |
US20110167244A1 (en) * | 2008-02-20 | 2011-07-07 | International Business Machines Corporation | Early instruction text based operand store compare reject avoidance |
US8195924B2 (en) * | 2008-02-20 | 2012-06-05 | International Business Machines Corporation | Early instruction text based operand store compare reject avoidance |
CN102047241A (en) * | 2008-05-30 | 2011-05-04 | 先进微装置公司 | Local and global data share |
US10140123B2 (en) | 2008-05-30 | 2018-11-27 | Advanced Micro Devices, Inc. | SIMD processing lanes storing input pixel operand data in local register file for thread execution of image processing operations |
EP3413206A1 (en) * | 2008-05-30 | 2018-12-12 | Advanced Micro Devices, Inc. | Local and global data share |
JP2011522325A (en) * | 2008-05-30 | 2011-07-28 | アドバンスト・マイクロ・ディバイシズ・インコーポレイテッド | Local and global data sharing |
US9619428B2 (en) * | 2008-05-30 | 2017-04-11 | Advanced Micro Devices, Inc. | SIMD processing unit with local data share and access to a global data share of a GPU |
KR101474478B1 (en) | 2008-05-30 | 2014-12-19 | 어드밴스드 마이크로 디바이시즈, 인코포레이티드 | Local and global data share |
EP2289001A4 (en) * | 2008-05-30 | 2012-12-05 | Advanced Micro Devices Inc | Local and global data share |
US20090300621A1 (en) * | 2008-05-30 | 2009-12-03 | Advanced Micro Devices, Inc. | Local and Global Data Share |
WO2009145917A1 (en) * | 2008-05-30 | 2009-12-03 | Advanced Micro Devices, Inc. | Local and global data share |
EP2289001A1 (en) * | 2008-05-30 | 2011-03-02 | Advanced Micro Devices, Inc. | Local and global data share |
WO2011041330A1 (en) * | 2009-09-29 | 2011-04-07 | Nvidia Corporation | Trap handler architecture for a parallel processing unit |
US20110078427A1 (en) * | 2009-09-29 | 2011-03-31 | Shebanow Michael C | Trap handler architecture for a parallel processing unit |
CN102648449A (en) * | 2009-09-29 | 2012-08-22 | 辉达公司 | Trap handler architecture for a parallel processing unit |
US8522000B2 (en) | 2009-09-29 | 2013-08-27 | Nvidia Corporation | Trap handler architecture for a parallel processing unit |
US8880855B2 (en) * | 2010-09-28 | 2014-11-04 | Texas Instruments Incorporated | Dual register data path architecture with registers in a data file divided into groups and sub-groups |
US20120079247A1 (en) * | 2010-09-28 | 2012-03-29 | Anderson Timothy D | Dual register data path architecture |
US20130205298A1 (en) * | 2012-02-06 | 2013-08-08 | Samsung Electronics Co., Ltd. | Apparatus and method for memory overlay |
US9703593B2 (en) * | 2012-02-06 | 2017-07-11 | Samsung Electronics Co., Ltd. | Apparatus and method for memory overlay |
US9626189B2 (en) | 2012-06-15 | 2017-04-18 | International Business Machines Corporation | Reducing operand store compare penalties |
US20150052533A1 (en) * | 2013-08-13 | 2015-02-19 | Samsung Electronics Co., Ltd. | Multiple threads execution processor and operating method thereof |
US20150052307A1 (en) * | 2013-08-15 | 2015-02-19 | Fujitsu Limited | Processor and control method of processor |
US9501243B2 (en) | 2013-10-03 | 2016-11-22 | Cavium, Inc. | Method and apparatus for supporting wide operations using atomic sequences |
US9390023B2 (en) * | 2013-10-03 | 2016-07-12 | Cavium, Inc. | Method and apparatus for conditional storing of data using a compare-and-swap based approach |
US20150100737A1 (en) * | 2013-10-03 | 2015-04-09 | Cavium, Inc. | Method And Apparatus For Conditional Storing Of Data Using A Compare-And-Swap Based Approach |
US20160283211A1 (en) * | 2015-03-25 | 2016-09-29 | International Business Machines Corporation | Unaligned instruction relocation |
US20160283209A1 (en) * | 2015-03-25 | 2016-09-29 | International Business Machines Corporation | Unaligned instruction relocation |
US9792098B2 (en) * | 2015-03-25 | 2017-10-17 | International Business Machines Corporation | Unaligned instruction relocation |
US9875089B2 (en) * | 2015-03-25 | 2018-01-23 | International Business Machines Corporation | Unaligned instruction relocation |
US10223091B2 (en) * | 2015-03-25 | 2019-03-05 | International Business Machines Corporation | Unaligned instruction relocation |
WO2017132385A1 (en) * | 2016-01-26 | 2017-08-03 | Icat Llc | Processor with reconfigurable algorithmic pipelined core and algorithmic matching pipelined compiler |
CN108885543A (en) * | 2016-01-26 | 2018-11-23 | Icat有限责任公司 | Processor with reconfigurable algorithm pipeline kernel and algorithmic match assembly line compiler |
TWI682357B (en) * | 2017-04-28 | 2020-01-11 | 美商英特爾股份有限公司 | Compute optimizations for low precision machine learning operations |
US10726514B2 (en) | 2017-04-28 | 2020-07-28 | Intel Corporation | Compute optimizations for low precision machine learning operations |
US10853906B2 (en) | 2017-04-28 | 2020-12-01 | Intel Corporation | Compute optimizations for low precision machine learning operations |
US11138686B2 (en) | 2017-04-28 | 2021-10-05 | Intel Corporation | Compute optimizations for low precision machine learning operations |
US11468541B2 (en) | 2017-04-28 | 2022-10-11 | Intel Corporation | Compute optimizations for low precision machine learning operations |
US11948224B2 (en) | 2017-04-28 | 2024-04-02 | Intel Corporation | Compute optimizations for low precision machine learning operations |
US20190197001A1 (en) * | 2017-12-22 | 2019-06-27 | Alibaba Group Holding Limited | Centralized-distributed mixed organization of shared memory for neural network processing |
US10922258B2 (en) * | 2017-12-22 | 2021-02-16 | Alibaba Group Holding Limited | Centralized-distributed mixed organization of shared memory for neural network processing |
US11062077B1 (en) * | 2019-06-24 | 2021-07-13 | Amazon Technologies, Inc. | Bit-reduced verification for memory arrays |
Also Published As
Publication number | Publication date |
---|---|
EP1137984A2 (en) | 2001-10-04 |
US6718457B2 (en) | 2004-04-06 |
WO2000033185A3 (en) | 2000-10-12 |
US7042466B1 (en) | 2006-05-09 |
DE69909829T2 (en) | 2004-05-27 |
DE69909829D1 (en) | 2003-08-28 |
EP1137984B1 (en) | 2003-07-23 |
WO2000033185A2 (en) | 2000-06-08 |
US20010042188A1 (en) | 2001-11-15 |
US20060282650A1 (en) | 2006-12-14 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US20050010743A1 (en) | Multiple-thread processor for threaded software applications | |
US6279100B1 (en) | Local stall control method and structure in a microprocessor | |
Waterman et al. | The RISC-V instruction set manual, volume I: User-level ISA, version 2.0 | |
US7490228B2 (en) | Processor with register dirty bit tracking for efficient context switch | |
EP0782071B1 (en) | Data processor | |
JP2931890B2 (en) | Data processing device | |
US6671796B1 (en) | Converting an arbitrary fixed point value to a floating point value | |
US20040193837A1 (en) | CPU datapaths and local memory that executes either vector or superscalar instructions | |
WO2000033183A9 (en) | Method and structure for local stall control in a microprocessor | |
US5701442A (en) | Method of modifying an instruction set architecture of a computer processor to maintain backward compatibility | |
US8539399B1 (en) | Method and apparatus for providing user-defined interfaces for a configurable processor | |
US20010042187A1 (en) | Variable issue-width vliw processor | |
US7117342B2 (en) | Implicitly derived register specifiers in a processor | |
US6615338B1 (en) | Clustered architecture in a VLIW processor | |
US6341348B1 (en) | Software branch prediction filtering for a microprocessor | |
US20040193838A1 (en) | Vector instructions composed from scalar instructions | |
US6625634B1 (en) | Efficient implementation of multiprecision arithmetic | |
US7779231B2 (en) | Pipelined processing using option bits encoded in an instruction | |
US7861061B2 (en) | Processor instruction including option bits encoding which instructions of an instruction packet to execute | |
JP2927281B2 (en) | Parallel processing unit | |
Grow et al. | Evaluation of the Pentium 4 for imaging applications | |
Jeroen van Straten | ρ-VEX user manual | |
JP2785820B2 (en) | Parallel processing unit | |
JP2001216154A (en) | Method and device for reducing size of code with exposed pipeline by encoding nop operation as instruction operand | |
Pilz et al. | Code optimization techniques of data-intensive tasks onto statically scheduled architectures: Optimal performance on the TigerSharc | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |