US20060282650A1 - Efficient clip-testing - Google Patents
- Publication number
- US20060282650A1 (application No. US 11/382,203)
- Authority
- US
- United States
- Prior art keywords
- value
- instruction
- register
- clip
- processor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/3826—Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
-
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
-
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30123—Organisation of register space, e.g. banked or distributed register file according to context, e.g. thread buffers
-
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
-
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
-
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3877—Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor
-
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3888—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel
Definitions
- the present invention relates generally to processors and, more particularly to instructions for use with processors.
- a series of mathematical transformations are applied to the data stored in the memory of the computer to transform the three-dimensional representation of the object into a two-dimensional image that can be displayed on a screen of the computer.
- One of the operations required as part of these transformations is a determination of which triangles or portions of the triangles are visible from the viewpoint chosen for the displayed image. This operation is known as clip-testing.
- An important element of a clip-testing operation is determining whether a point at a given set of coordinates is within the eye space visible on the screen.
- the present invention provides a method and apparatus for performing fast clip-testing operations in a general purpose processor.
- the fast clip-testing operations are accomplished by executing a single instruction for comparing a first value x to a second value y and, as a result of the comparison, determining whether x is less than y and whether x is less than negative y.
- the values x and y are stored in respective source registers of the processor specified by the instruction.
- one or more binary values representing the results of the determination are inserted into a destination register of the processor also specified by the instruction.
- the invention advantageously provides a general purpose processor with the ability to execute a clip-testing function with a single instruction compared with prior art general purpose processors that require multiple instructions to perform the same function.
- the general purpose processor of the present invention allows for more efficient and faster clip-testing operations.
- FIG. 1A is a schematic block diagram illustrating a single integrated circuit chip implementation of a processor in accordance with an embodiment of the present invention.
- FIG. 1B is a schematic block diagram showing the core of the processor.
- FIG. 2A is a diagrammatic block diagram of a register file of the processor of FIG. 1B.
- FIG. 2B is a diagrammatic block diagram of a register of the register file of FIG. 2A.
- FIG. 3A is a diagrammatic block diagram showing instruction formats for four-operand instructions supported by the processor of FIG. 1B.
- FIG. 3B is a diagrammatic block diagram showing an instruction format for a clip-testing instruction supported by the processor of FIG. 1B.
- FIG. 4 is a diagrammatic block diagram showing the relationship between the instruction format of FIG. 3B and the register file of FIG. 2A.
- FIG. 5 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of FIG. 1B for performing the clip-testing instruction of FIG. 3B.
- FIG. 6 is a block diagram of an alternative implementation of the circuitry within MFUs 222 of the processor of FIG. 1B for performing the clip-testing instruction of FIG. 3B.
- A processor in accordance with the principles of the present invention is illustrated in FIGS. 1A and 1B.
- Referring to FIG. 1A, a schematic block diagram illustrates a single integrated circuit chip implementation of a processor 100 that includes a memory interface 102, a geometry preprocessor 104, two media processing units 110 and 112, a shared data cache 106 and several interface controllers.
- the components are mutually linked and closely linked to the processor core with high bandwidth, low-latency communication channels to manage multiple high-bandwidth data streams efficiently and with a low response time.
- Illustrative memory interface 102 is a direct Rambus Dynamic RAM (DRDRAM) controller.
- Shared data cache 106 is a dual-ported storage that is shared among media processing units 110 and 112 with one port allocated to each of media processing unit 110 and 112 .
- Media processing units 110 and 112 are included in a single integrated circuit chip to support an execution environment exploiting thread level parallelism in which two independent threads can execute simultaneously.
- the threads may arise from any source such as the same application, different applications, the operating system, or the runtime environment.
- Parallelism is exploited at the thread level since parallelism is rare beyond four, or even two, instructions per cycle in general purpose code.
- illustrative processor 100 is an eight-wide machine with eight execution units for executing instructions.
- a typical “general-purpose” processing code has an instruction level parallelism of about two so that, on average, most (about six) of the eight execution units would be idle at any time.
- Illustrative processor 100 employs thread level parallelism and operates on two independent threads, possibly attaining twice the performance of a processor having the same resources and clock rate but utilizing traditional non-thread parallelism.
- Although processor 100 shown in FIG. 1A includes two processing units on an integrated circuit chip, the architecture is highly scalable so that one to several closely-coupled processors may be formed in a cache-based coherent architecture and resident on the same die to process multiple threads of execution.
- Thus, in processor 100, a limitation on the number of processors formed on a single die arises from capacity constraints of integrated circuit technology rather than from architectural constraints relating to the interactions and interconnections between processors.
- Media processing units 110 and 112 each includes an instruction cache 210 , an instruction aligner 212 , an instruction buffer 214 , a split register file 216 , a plurality of execution units, and a load/store unit 218 .
- media processing units 110 and 112 use a plurality of execution units for executing instructions.
- the execution units for media processing units 110 and 112 include three media functional units (MFU) 222 and one general functional unit (GFU) 220 .
- the media functional units 222 are single-instruction-multiple-data (SIMD) media functional units.
- Each media functional unit 222 is capable of processing parallel 16-bit components, in addition to 32-bit operands.
- Various parallel 16-bit operations supply the single-instruction-multiple-data capability for processor 100 including add, multiply-add, shift, compare, and the like.
- Media functional units 222 operate in combination as tightly-coupled digital signal processors (DSPs).
- Each media functional unit 222 has a separate and individual sub-instruction stream, but all three media functional units 222 execute synchronously so that the subinstructions progress lock-step through pipeline stages.
- General functional unit 220 is a RISC processor capable of executing arithmetic logic unit (ALU) operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power operations, reciprocal square root operations, and many others.
- General functional unit 220 supports less common parallel operations such as the parallel reciprocal square root instruction.
- Each media processing unit 110 and 112 includes a split register file 216 , which forms a single logical register file including 256 thirty-two bit registers.
- Split register file 216 is split into a plurality of register file segments 214 to form a multi-ported structure that is replicated to reduce the integrated circuit die area and to reduce access time.
- Media processing units 110 and 112 are highly structured computation blocks that execute software-scheduled data computation operations with fixed, deterministic and relatively short instruction latencies, operational characteristics yielding simplification in both function and cycle time.
- the operational characteristics support multiple instruction issue through a pragmatic very large instruction word (VLIW) approach.
- A VLIW instruction word always includes one instruction that executes in general functional unit (GFU) 220 and from zero to three instructions that execute in media functional units (MFU) 222.
- An MFU instruction field within the VLIW instruction word includes an operation code (opcode) field, two or three source register (or immediate) fields, and one destination register field.
- Instructions are executed in-order in processor 100 but loads can finish out-of-order with respect to other instructions and with respect to other loads, allowing loads to be moved up in the instruction stream so that data can be streamed from main memory.
- Processor 100 is further described in co-pending application Ser. No. 09/204,480, entitled “A Multiple-Thread Processor for Threaded Software Applications” by Marc Tremblay and William Joy, filed on Dec. 3, 1998, which is herein incorporated by reference in its entirety.
- The structure of a register file of the processor of FIG. 1B is illustrated in FIG. 2A.
- The register file is made up of an arbitrary number of registers R0, R1, R2 . . . Rn.
- Each of registers R0, R1, R2 . . . Rn in turn has an arbitrary number of bits, as shown in FIG. 2B.
- In one embodiment, the number of bits in each of registers R0, R1, R2 . . . Rn is 32.
- those skilled in the art realize that the principles of the present invention can be applied to an arbitrary number of registers each having an arbitrary number of bits. Accordingly, the present invention is not limited to any particular number of registers or bits per register.
- FIG. 3A illustrates four instruction formats for four-operand instructions supported by the processor of FIG. 1B .
- Each instruction format has an 8-bit opcode and four 8-bit operands.
- the first of the operands is a reference to a destination register (RD) for the instruction.
- the second operand is a reference to a first source register for the instruction (RS1).
- the third and fourth operands can be references to a second (RS2) and a third source register (RS3), an immediate value to be used in the instruction or any combination thereof.
- FIG. 3B illustrates an instruction format for a clip-testing instruction (clip) supported by the processor of FIG. 1B, in accordance with the present invention.
- All operands are references to registers in the register file of the processor, as shown in FIG. 4.
- The RD operand represents a clip mask indicating whether vertices of a triangle fall outside a range of homogeneous coordinates in the eye space of an image to be clipped.
- The RS1 operand represents the coefficient defining the homogeneous eye space.
- The RS2 operand represents the x, y and z values of the vertex examined by the clip-testing instruction.
- The RS3 operand represents the value of the clip mask prior to the execution of the clip-testing instruction.
- Each of the operands of the clip-testing instruction refers to an arbitrary register of the register file of FIG. 2A in which the represented value is stored.
- For example, the operand RD contains a reference to the R2 register, the operand RS1 contains a reference to the R3 register, the operand RS2 contains a reference to the R5 register, and the operand RS3 contains a reference to the R7 register.
- FIG. 5 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of FIG. 1B for performing the clip-testing operation.
- The clip-testing operation compares the value stored in register RS1 to the value stored in register RS2 and to the negative of the value stored in register RS2.
- The values in RS1 and RS2 are IEEE single precision floating point values.
- The value stored in register RS3 is shifted left by two bits. The shifted bits are then copied into register RD, and two bits representing the results of the comparisons are inserted in the two least significant bits (LSBs) of the value stored in register RD.
- When executing the clip-testing instruction, the processor routes the values stored in registers RS1 and RS2 to respective input ports of comparator 510.
- The value stored in register RS1 is also routed to an input port of comparator 530.
- The most significant bit (MSB) of the value stored in register RS2 is routed to an input line of inverter 520.
- The value on an output line of inverter 520, together with the 31 LSBs of the value stored in register RS2, is then routed to the other input port of comparator 530.
- The 30 LSBs of the value stored in register RS3 are written into the 30 MSBs of register RD, effectively performing a two-bit logical shift left of the value stored in register RS3.
- The values on respective output ports of comparators 510 and 530 are then written into the 2 LSBs of register RD. Accordingly, the value that is stored in register RD represents a clip mask indicating whether a vertex of a triangle falls outside a homogeneous eye space defined by the value stored in register RS1.
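The data flow described for FIG. 5 can be modeled in software. The following is a minimal sketch, not the patented circuit: the function names are hypothetical, the comparators are assumed to compute "RS1 less than the other input" (following the "x is less than y" wording of the summary), and the placement of the comparator 510 result in bit 1 and the comparator 530 result in bit 0 is an assumption, since the text only says that both results go into the 2 LSBs. Issuing the instruction once per coordinate is likewise an assumption.

```python
import struct

MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def float_to_bits(value: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of value."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits & MASK32))[0]


def clip(rs1: int, rs2: int, rs3: int) -> int:
    """Software model of the clip instruction; returns the new RD value.

    rs1, rs2: 32-bit patterns of single-precision floats.
    rs3: previous 32-bit clip mask.
    """
    limit = bits_to_float(rs1)
    coord = bits_to_float(rs2)
    # Inverter 520: negate RS2 by flipping only its sign bit (MSB).
    neg_coord = bits_to_float(rs2 ^ SIGN_BIT)
    bit_510 = int(limit < coord)      # comparator 510 (assumed direction)
    bit_530 = int(limit < neg_coord)  # comparator 530 (assumed direction)
    # The 30 LSBs of RS3 become the 30 MSBs of RD (logical shift left by 2).
    shifted = (rs3 << 2) & MASK32
    # Assumed bit order: comparator 510 in bit 1, comparator 530 in bit 0.
    return shifted | (bit_510 << 1) | bit_530


def clip_vertex(w: float, x: float, y: float, z: float, mask: int = 0) -> int:
    """Accumulate a clip mask for one vertex, one clip call per coordinate."""
    w_bits = float_to_bits(w)
    for coordinate in (x, y, z):
        mask = clip(w_bits, float_to_bits(coordinate), mask)
    return mask
```

Under these assumptions, `clip_vertex(1.0, 2.0, 0.5, -3.0)` returns `0b100001`: x exceeds w, y lies inside the range, and z is below negative w. Because each call shifts the previous mask left by two bits, results for successive coordinates or vertices accumulate in one register without separate shift and OR instructions.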
- FIG. 6 is a block diagram of an alternative implementation of the circuitry within MFUs 222 of the processor of FIG. 1B for performing the clip-testing instruction.
- the absolute values i.e., the 31 LSBs
- a value on an output line of comparator 510 is routed to respective control lines of multiplexers 610 and 620 .
- the sign bits i.e., the MSBs
- the values stored in registers RS1 and RS2 are routed to respective input lines of multiplexer 620.
- the MSB of the value stored in register RS2 is also routed to an input line of inverter 520.
- An output line of inverter 520 and the MSB of the value stored in register RS1 are, in turn, routed to respective input lines of multiplexer 610.
- the 30 LSBs of the value stored in register RS3 are written into the 30 MSBs of register RD, effectively performing a two-bit logical shift left of the value stored in register RS3.
- the values on respective output lines of multiplexers 610 and 620 are routed to respective input ports of multiplexers 650 and 660.
- a logical 0 value is provided on the remaining input ports of multiplexers 650 and 660.
- Respective control ports of multiplexers 650 and 660 are, in turn, driven by output lines of gates 630 and 640.
- the values stored in registers RS1 and RS2 are provided to respective input ports of comparator 670.
- the input lines of gate 630 are connected to the output port of comparator 670 and the sign bits of the values stored in registers RS1 and RS2.
- the input lines of gate 640 are connected to the output port of comparator 670, the sign bit of the value stored in register RS1 and the complement of the sign bit (generated by inverter 635) of the value stored in register RS2.
- the output lines of gates 630 and 640 are connected to respective control ports of multiplexers 650 and 660. Finally, the values on respective output ports of multiplexers 650 and 660 are written in the 2 LSBs of register RD.
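The alternative implementation works on sign bits and 31-bit magnitudes rather than on whole floating point values. The sketch below is not a gate-level model of FIG. 6; it only illustrates, under stated assumptions, why that decomposition is sufficient: for IEEE single precision values that are not NaN, ordering can be decided from the two sign bits and an integer comparison of the 31 LSBs. The function names are hypothetical, and NaN inputs are not handled.

```python
import struct

SIGN_BIT = 0x80000000
MAGNITUDE_MASK = 0x7FFFFFFF


def float_to_bits(value: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of value."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def less_than_sign_magnitude(a_bits: int, b_bits: int) -> bool:
    """Decide a < b from sign bits and 31-bit magnitudes (NaN not handled)."""
    a_sign, b_sign = a_bits >> 31, b_bits >> 31
    a_mag, b_mag = a_bits & MAGNITUDE_MASK, b_bits & MAGNITUDE_MASK
    if a_mag == 0 and b_mag == 0:
        return False              # +0 and -0 compare equal
    if a_sign != b_sign:
        return a_sign == 1        # the negative operand is the smaller one
    if a_sign == 0:
        return a_mag < b_mag      # both positive: smaller magnitude is less
    return a_mag > b_mag          # both negative: larger magnitude is less


def clip_bits(rs1: int, rs2: int) -> tuple:
    """Return (RS1 < RS2, RS1 < -RS2) using only integer operations."""
    return (
        less_than_sign_magnitude(rs1, rs2),
        # Negating RS2 only requires inverting its sign bit.
        less_than_sign_magnitude(rs1, rs2 ^ SIGN_BIT),
    )
```

Since the magnitudes of RS2 and its negation are identical, a single magnitude comparison can serve both tests, with only the sign handling differing between them.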
- Embodiments described above illustrate but do not limit the invention.
- the invention is not limited by any number of registers specified by the instructions.
- the invention is not limited to any particular hardware implementation. Those skilled in the art realize that alternative hardware implementations can be employed in lieu of the one described herein in accordance with the principles of the present invention. Other embodiments and variations are within the scope of the invention, as defined by the following claims.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Advance Control (AREA)
- Executing Machine-Instructions (AREA)
Abstract
A method and apparatus for performing fast clip-testing operations in a general purpose processor are provided. This is accomplished by executing a single instruction for comparing a first value x to a second value y and, as a result of the comparison, determining whether x is less than y and whether x is less than negative y. The values x and y are stored in respective source registers of the processor specified by the instruction. Finally, as a result of the determination, one or more binary values representing the results of the determination are inserted into a destination register of the processor also specified by the instruction. Accordingly, the invention advantageously provides a general purpose processor with the ability to execute a clip-testing function with a single instruction compared with prior art general purpose processors that require multiple instructions to perform the same function. Thus, the general purpose processor of the present invention allows for more efficient and faster clip-testing operations.
Description
- This patent application is a continuation of co-pending U.S. patent application Ser. No. 09/589,039, filed Jun. 6, 2000, naming as inventors Jeffrey Meng Wah Chan, Michael F. Deering, and Marc Tremblay, which in turn is a continuation of U.S. Pat. No. 6,718,457, filed Dec. 3, 1998, naming as inventors Marc Tremblay and William Joy, which is incorporated by reference herein in its entirety.
- 1. Field of the Invention
- The present invention relates generally to processors and, more particularly to instructions for use with processors.
- 2. Related Art
- The increasing popularity of multimedia and 3-D graphics display has created a substantial demand for current microprocessors to support graphics operations. Typically, this is done by means of surface graphics techniques, where an object is represented as a collection of very small primitives, simple geometric shapes such as triangles that approximate the shape of the object. Each of the triangles is represented by a set of vertices whose coordinates are stored in the memory of a computer. In addition to the coordinates of the vertices, information pertaining to color, lighting and other properties of the triangles is also stored in the memory of the computer. In order to display the objects represented by the triangles, a series of mathematical transformations is applied to the data stored in the memory of the computer to transform the three-dimensional representation of the object into a two-dimensional image that can be displayed on a screen of the computer. One of the operations required as part of these transformations is a determination of which triangles or portions of the triangles are visible from the viewpoint chosen for the displayed image. This operation is known as clip-testing. An important element of a clip-testing operation is determining whether a point at a given set of coordinates is within the eye space visible on the screen.
- While dedicated graphics processors such as DSPs provide varying levels of hardware support for clip-testing operations, general purpose processors typically provide only limited support for clip-testing operations, thereby requiring these operations to be performed by software executing on the processor. Since hardware implementations are inherently faster than software implementations, there is a need for a general purpose processor that supports faster clip-testing operations.
- The present invention provides a method and apparatus for performing fast clip-testing operations in a general purpose processor. The fast clip-testing operations are accomplished by executing a single instruction for comparing a first value x to a second value y and, as a result of the comparison, determining whether x is less than y and whether x is less than negative y. The values x and y are stored in respective source registers of the processor specified by the instruction. As a result of the determination, one or more binary values representing the results of the determination are inserted into a destination register of the processor also specified by the instruction.
- Accordingly, the invention advantageously provides a general purpose processor with the ability to execute a clip-testing function with a single instruction compared with prior art general purpose processors that require multiple instructions to perform the same function. Thus, the general purpose processor of the present invention allows for more efficient and faster clip-testing operations.
-
FIG. 1A is a schematic block diagram illustrating a single integrated circuit chip implementation of a processor in accordance with an embodiment of the present invention. -
FIG. 1B is a schematic block diagram showing the core of the processor. -
FIG. 2A is a diagrammatic block diagram of a register file of the processor ofFIG. 1B . -
FIG. 2B is a diagrammatic block diagram of a register of the register file ofFIG. 2A . -
FIG. 3A is a diagrammatic block diagram showing instruction formats for four operand instructions supported by the processor ofFIG. 1B . -
FIG. 3B is a diagrammatic block diagram showing an instruction format for a clip-testing instruction supported by the processor ofFIG. 1B . -
FIG. 4 is a diagrammatic block diagram showing the relationship between the instruction format ofFIG. 3B and the register file ofFIG. 2A . -
FIG. 5 is a block diagram of one implementation of the circuitry withinMFUs 222 of the processor ofFIG. 1B for performing the clip-testing instruction ofFIG. 3B . -
FIG. 6 is a block diagram of an alternative implementation of the circuitry withinMFUs 222 of the processor ofFIG. 1B for performing the clip-testing instruction ofFIG. 3B . - A processor in accordance to the principles of the present invention is illustrated in
FIG. 1A and 1B . - Referring to
FIG. 1A , a schematic block diagram illustrates a single integrated circuit chip implementation of aprocessor 100 that includes amemory interface 102, ageometry preprocessor 104, twomedia processing units data cache 106 and several interface controllers. The components are mutually linked and closely linked to the processor core with high bandwidth, low-latency communication channels to manage multiple high-bandwidth data streams efficiently and with a low response time. -
Illustrative memory interface 102 is a direct Rambus Dynamic RAM (DRDRAM) controller. Shareddata cache 106 is a dual-ported storage that is shared amongmedia processing units media processing unit -
Media processing units illustrative processor 100 is an eight-wide machine with eight execution units for executing instructions. A typical “general-purpose” processing code has an instruction level parallelism of about two so that, on average, most (about six) of the eight execution units would be idle at any time.Illustrative processor 100 employs thread level parallelism and operates on two independent threads, possibly attaining twice the performance of a processor having the same resources and clock rate but utilizing traditional non-thread parallelism. - Although
processor 100 shown inFIG. 1A includes two processing units on an integrated circuit chip, the architecture is highly scalable so that one to several closely-coupled processors may be formed in a cache-based coherent architecture and resident on the same die to process multiple threads of execution. Thus, inprocessor 100, a limitation on the number of processors formed on a single die arises from capacity constraints of integrated circuit technology rather than from architectural constraints relating to the interactions and interconnections between processors. - Referring to
FIG. 1B , a schematic block diagram shows the core ofprocessor 100.Media processing units instruction cache 210, aninstruction aligner 212, aninstruction buffer 214, asplit register file 216, a plurality of execution units, and a load/store unit 218. Inillustrative processor 100,media processing units media processing units functional units 222 are single-instruction-multiple-data (SIMD) media functional units. Each mediafunctional unit 222 is capable of processing parallel 16-bit components, in addition to 32-bit operands. Various parallel 16-bit operations supply the single-instruction-multiple-data capability forprocessor 100 including add, multiply-add, shift, compare, and the like. Mediafunctional units 222 operate in combination as tightly-coupled digital signal processors (DSPs). Each mediafunctional unit 222 has a separate and individual sub-instruction stream, but all three mediafunctional units 222 execute synchronously so that the subinstructions progress lock-step through pipeline stages. - General
functional unit 220 is a RISC processor capable of executing arithmetic logic unit (ALU) operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power operations, reciprocal squareroot operations, and many others. Generalfunctional unit 220 supports less common parallel operations such as the parallel reciprocal square root instruction. - Each
media processing unit split register file 216, which forms a single logical register file including 256 thirty-two bit registers.Split register file 216 is split into a plurality ofregister file segments 214 to form a multi-ported structure that is replicated to reduce the integrated circuit die area and to reduce access time. -
Media processing units - Instructions are executed in-order in
processor 100 but loads can finish out-of-order with respect to other instructions and with respect to other loads, allowing loads to be moved up in the instruction stream so that data can be streamed from main memory. - For example, during processing of triangles, multiple vertices are operated upon in parallel so that the utilization rate of resources is high, achieving effective spatial software pipelining. Thus operations are overlapped in time by operating on several vertices simultaneously, rather than overlapping several loop iterations in time. For other types of applications with high instruction level parallelism, high trip count loops are software-pipelined so that most media
functional units 222 are fully utilized. -
Processor 100 is further described in co-pending application Ser. No. 09/204,480, entitled “A Multiple-Thread Processor for Threaded Software Applications” by Marc Tremblay and William Joy, filed on Dec. 3, 1998, which is herein incorporated by reference in its entirety. - The structure of a register file of the processor of
FIG. 1B is illustrated in FIG. 2A. The register file is made up of an arbitrary number of registers R0, R1, R2 . . . Rn. Each of registers R0, R1, R2 . . . Rn, in turn, has an arbitrary number of bits, as shown in FIG. 2B. In one embodiment, the number of bits in each of registers R0, R1, R2 . . . Rn is 32. However, those skilled in the art realize that the principles of the present invention can be applied to an arbitrary number of registers each having an arbitrary number of bits. Accordingly, the present invention is not limited to any particular number of registers or bits per register. -
FIG. 3A illustrates four instruction formats for four-operand instructions supported by the processor of FIG. 1B. Each instruction format has an 8-bit opcode and four 8-bit operands. The first of the operands is a reference to a destination register (RD) for the instruction. The second operand, in turn, is a reference to a first source register for the instruction (RS1). Finally, the third and fourth operands can be references to a second (RS2) and a third source register (RS3), an immediate value to be used in the instruction or any combination thereof. -
FIG. 3B illustrates an instruction format for a clip-testing instruction (clip) supported by the processor of FIG. 1, in accordance with the present invention. All operands are references to registers in the register file of the processor, as shown in FIG. 4. The RD operand represents a clip mask indicating whether vertices of a triangle fall outside a range of homogeneous coordinates in the eye space of an image to be clipped. The RS1 operand represents the coefficient defining the homogeneous eye space. The RS2 operand represents the x, y and z values of the vertex examined by the clip-testing instruction. The RS3 operand represents the value of the clip mask prior to the execution of the clip-testing instruction. - In
FIG. 4, each of the operands of the clip-testing instruction refers to an arbitrary register of the register file of FIG. 2A in which the represented value is stored. For example, the operand RD contains a reference to the R2 register, the operand RS1 contains a reference to the R3 register, the operand RS2 contains a reference to the R5 register and the operand RS3 contains a reference to the R7 register. -
FIG. 5 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of FIG. 1B for performing the clip-testing operation. The clip-testing operation compares a value stored in register RS1 to the value stored in register RS2 and to the negative of the value stored in RS2. The values in RS1 and RS2 are IEEE single precision floating point values. Additionally, the value stored in register RS3 is shifted left by two bits. The shifted bits are then copied into register RD and two bits representing the results of the comparisons are inserted in the two least significant bits (LSBs) of the value stored in register RD. Thus the value that is stored in register RD represents a bit mask indicating which vertices of a triangle fall outside a homogeneous eye space defined by the coefficient stored in RS1. - In the implementation shown in
FIG. 5, when executing the clip-testing instruction, the processor routes the values stored in registers RS1 and RS2 to respective input ports of comparator 510. The value stored in register RS1 is also routed to an input port of comparator 530. The most significant bit (MSB) of the value stored in register RS2 is routed to an input line of inverter 520. A value on an output line of inverter 520, together with the 31 LSBs of the value stored in register RS2, is then routed to the other input port of comparator 530. - More specifically, when the value stored in register RS1 is less than the value stored in register RS2, then a “1” is provided to the second least significant bit of register RD. When the value stored in register RS1 is greater than or equal to the value stored in register RS2, then a “0” is provided to the second least significant bit of register RD. Also, when the value stored in register RS1 is less than the negative of the value stored in register RS2, then a “1” is provided to the least significant bit of register RD. When the value stored in register RS1 is greater than or equal to the negative of the value stored in RS2, then a “0” is provided to the least significant bit of register RD.
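The clip-testing behavior described above can be modeled in software. The Python sketch below is an illustration only, not the circuit of FIG. 5: the names `f32_bits`, `bits_f32`, `clip` and `clip_vertex` are hypothetical, registers are assumed to be 32 bits wide as in the embodiment above, and the negative of the RS2 value is formed by flipping its sign bit, mirroring the role of inverter 520.

```python
import struct

MASK32 = 0xFFFFFFFF    # assumed 32-bit register width
SIGN_BIT = 0x80000000  # MSB of an IEEE single precision value


def f32_bits(x: float) -> int:
    """Return the IEEE single precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE single precision value."""
    return struct.unpack("<f", struct.pack("<I", bits & MASK32))[0]


def clip(rs1_bits: int, rs2_bits: int, rs3: int) -> int:
    """Model of the clip-testing operation; returns the value written to RD."""
    rs1 = bits_f32(rs1_bits)
    rs2 = bits_f32(rs2_bits)
    # Flip only the sign bit and keep the 31 LSBs to obtain -rs2.
    neg_rs2 = bits_f32(rs2_bits ^ SIGN_BIT)
    second_lsb = 1 if rs1 < rs2 else 0   # result of comparing RS1 with RS2
    lsb = 1 if rs1 < neg_rs2 else 0      # result of comparing RS1 with -RS2
    # Two-bit logical shift left of RS3, then insert the two result bits.
    return ((rs3 << 2) & MASK32) | (second_lsb << 1) | lsb


def clip_vertex(coefficient: float, xyz, mask: int = 0) -> int:
    """Apply clip once per coordinate, accumulating the clip mask."""
    for value in xyz:
        mask = clip(f32_bits(coefficient), f32_bits(value), mask)
    return mask
```

With a coefficient of 1.0 and coordinates (0.5, 2.0, -3.0), `clip_vertex` returns `0b001001`: x sets no bits, y sets the second least significant bit of its pair, and z sets the least significant bit. A mask of zero means that no tested coordinate fell outside the range.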
- The 30 LSBs of the value stored in register RS3 are written into the 30 MSBs of register RD, effectively performing a two bit logical shift left of the value stored in register RS3. The values on respective output ports of
comparators -
FIG. 6 is a block diagram of an alternative implementation of the circuitry within MFUs 222 of the processor of FIG. 1B for performing the clip-testing instruction. In the implementation of FIG. 6, the absolute values (i.e., the 31 LSBs) of the values stored in registers RS1 and RS2 are routed to respective input ports of comparator 510. A value on an output line of comparator 510 is routed to respective control lines of multiplexers multiplexer 620. In addition, the MSB of the value stored in register RS2 is also routed to an input line of inverter 520. An output line of inverter 520 and the MSB of the value stored in register RS1 are, in turn, routed to respective input lines of multiplexer 610. - As a result, the value on the output line of
multiplexer 610 effectively represents the value of the comparison rs1<rs2, as illustrated in Table 1 below.
TABLE 1
Sign RS1 | Sign RS2 | \|rs1\| < \|rs2\| | rs1 < rs2 |
---|---|---|---|
1 | 1 | 1 | 0 |
1 | 0 | 1 | 1 |
0 | 1 | 1 | 0 |
0 | 0 | 1 | 1 |
1 | 1 | 0 | 1 |
1 | 0 | 0 | 1 |
0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 |
- Similarly, the value on the output line of multiplexer 620 effectively represents the value of the comparison rs1<−rs2, as illustrated in Table 2 below.
TABLE 2
Sign RS1 | Sign RS2 | \|rs1\| < \|rs2\| | rs1 < −rs2 |
---|---|---|---|
1 | 1 | 1 | 1 |
1 | 0 | 1 | 0 |
0 | 1 | 1 | 1 |
0 | 0 | 1 | 0 |
1 | 1 | 0 | 1 |
1 | 0 | 0 | 1 |
0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 |
- The 30 LSBs of the value stored in register RS3 are written into the 30 MSBs of register RD, effectively performing a two bit logical shift left of the value stored in register RS3. The values on respective output lines of
multiplexers multiplexers multiplexers multiplexers gates comparator 670. The input lines of gates 630 are connected to the output port of comparator 670 and the sign bits of the values stored in registers RS1 and RS2. The input lines of gates 640 are connected to the output port of comparator 670, the sign bit of the value stored in register RS1 and the complement of the sign bit (generated by inverter 635) of the value stored in register RS2. The output lines of gates multiplexers multiplexers - While a three-source-register implementation is described, those skilled in the art realize that the principles of the present invention can be applied to instructions having an arbitrary number of source and destination registers. Accordingly, the present invention is not limited to any particular number of source or destination registers.
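The sign-and-magnitude selection summarized in Tables 1 and 2 can be checked against a direct floating point comparison. The Python sketch below is a hypothetical software model, not the circuit of FIG. 6: the names `TABLE_1`, `TABLE_2`, `f32_bits` and `compare_via_tables` are illustrative, and each dictionary transcribes the eight table rows keyed by (sign of RS1, sign of RS2, |rs1| < |rs2|), with the last column of Table 2 read as rs1 < −rs2. It assumes single precision operands that are not NaN and whose magnitudes differ, since the tables do not distinguish equal magnitudes.

```python
import struct

# Rows of Table 1: (sign RS1, sign RS2, |rs1| < |rs2|) -> rs1 < rs2
TABLE_1 = {
    (1, 1, 1): 0, (1, 0, 1): 1, (0, 1, 1): 0, (0, 0, 1): 1,
    (1, 1, 0): 1, (1, 0, 0): 1, (0, 1, 0): 0, (0, 0, 0): 0,
}

# Rows of Table 2: (sign RS1, sign RS2, |rs1| < |rs2|) -> rs1 < -rs2
TABLE_2 = {
    (1, 1, 1): 1, (1, 0, 1): 0, (0, 1, 1): 1, (0, 0, 1): 0,
    (1, 1, 0): 1, (1, 0, 0): 1, (0, 1, 0): 0, (0, 0, 0): 0,
}


def f32_bits(x: float) -> int:
    """Return the IEEE single precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def compare_via_tables(rs1: float, rs2: float):
    """Return (rs1 < rs2, rs1 < -rs2) using only sign bits and magnitudes."""
    b1, b2 = f32_bits(rs1), f32_bits(rs2)
    sign1, sign2 = b1 >> 31, b2 >> 31
    # The 31 LSBs are the magnitude; for non-NaN values an unsigned
    # integer compare of these bits orders the magnitudes correctly.
    mag_less = 1 if (b1 & 0x7FFFFFFF) < (b2 & 0x7FFFFFFF) else 0
    key = (sign1, sign2, mag_less)
    return TABLE_1[key], TABLE_2[key]
```

For instance, `compare_via_tables(0.5, -3.0)` yields `(0, 1)`: 0.5 is not less than -3.0, but it is less than 3.0. A single magnitude comparison therefore serves both results, with the sign bits selecting the final values.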
- Embodiments described above illustrate but do not limit the invention. In particular, the invention is not limited by any number of registers specified by the instructions. In addition, the invention is not limited to any particular hardware implementation. Those skilled in the art realize that alternative hardware implementations can be employed in lieu of the one described herein in accordance with the principles of the present invention. Other embodiments and variations are within the scope of the invention, as defined by the following claims.
Claims (4)
1. A method of performing an instance of a clip instruction that indicates a first value, a second value, a third value, and a destination, the method comprising:
comparing the first value against the second value, and
writing to the destination the third value and indications of the comparing.
2. The method of claim 1, wherein the comparing comprises:
determining whether the first value is greater than the second value; and
determining whether the first value is less than a negative of the second value.
3. A sequence of executable instruction instances encoded in one or more machine-readable media, the sequence comprising:
an instance of a clip instruction that indicates a first value, a second value, a third value, and a destination register, the clip instruction instance executable to,
compare the first value against the second value to determine whether the first value is greater than the second value,
compare the first value against a negative of the second value to determine whether the first value is less than the negative of the second value, and
write to the destination the third value and indications of the comparisons.
4. An apparatus comprising:
at least one processing unit; and
means for performing clip testing with a single instruction instance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/382,203 US20060282650A1 (en) | 1998-12-03 | 2006-05-08 | Efficient clip-testing |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/204,480 US6718457B2 (en) | 1998-12-03 | 1998-12-03 | Multiple-thread processor for threaded software applications |
US09/589,039 US7042466B1 (en) | 1998-12-03 | 2000-06-06 | Efficient clip-testing in graphics acceleration |
US11/382,203 US20060282650A1 (en) | 1998-12-03 | 2006-05-08 | Efficient clip-testing |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/589,039 Continuation US7042466B1 (en) | 1998-12-03 | 2000-06-06 | Efficient clip-testing in graphics acceleration |
Publications (1)
Publication Number | Publication Date |
---|---|
US20060282650A1 (en) | 2006-12-14 |
Family
ID=22758067
Family Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/204,480 Expired - Lifetime US6718457B2 (en) | 1998-12-03 | 1998-12-03 | Multiple-thread processor for threaded software applications |
US09/589,039 Expired - Lifetime US7042466B1 (en) | 1998-12-03 | 2000-06-06 | Efficient clip-testing in graphics acceleration |
US10/818,785 Abandoned US20050010743A1 (en) | 1998-12-03 | 2004-04-06 | Multiple-thread processor for threaded software applications |
US11/382,203 Abandoned US20060282650A1 (en) | 1998-12-03 | 2006-05-08 | Efficient clip-testing |
Country Status (4)
Country | Link |
---|---|
US (4) | US6718457B2 (en) |
EP (1) | EP1137984B1 (en) |
DE (1) | DE69909829T2 (en) |
WO (1) | WO2000033185A2 (en) |
Citations (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5197130A (en) * | 1989-12-29 | 1993-03-23 | Supercomputer Systems Limited Partnership | Cluster architecture for a highly parallel scalar/vector multiprocessor system |
US5307449A (en) * | 1991-12-20 | 1994-04-26 | Apple Computer, Inc. | Method and apparatus for simultaneously rendering multiple scanlines |
US5345541A (en) * | 1991-12-20 | 1994-09-06 | Apple Computer, Inc. | Method and apparatus for approximating a value between two endpoint values in a three-dimensional image rendering device |
US5517603A (en) * | 1991-12-20 | 1996-05-14 | Apple Computer, Inc. | Scanline rendering device for generating pixel values for displaying three-dimensional graphical images |
US5517611A (en) * | 1993-06-04 | 1996-05-14 | Sun Microsystems, Inc. | Floating-point processor for a high performance three dimensional graphics accelerator |
US5574939A (en) * | 1993-05-14 | 1996-11-12 | Massachusetts Institute Of Technology | Multiprocessor coupling system with integrated compile and run time scheduling for parallelism |
US5657291A (en) * | 1996-04-30 | 1997-08-12 | Sun Microsystems, Inc. | Multiport register file memory cell configuration for read operation |
US5689674A (en) * | 1995-10-31 | 1997-11-18 | Intel Corporation | Method and apparatus for binding instructions to dispatch ports of a reservation station |
US5706415A (en) * | 1991-12-20 | 1998-01-06 | Apple Computer, Inc. | Method and apparatus for distributed interpolation of pixel shading parameter values |
US5712799A (en) * | 1995-04-04 | 1998-01-27 | Chromatic Research, Inc. | Method and structure for performing motion estimation using reduced precision pixel intensity values |
US5721868A (en) * | 1994-01-21 | 1998-02-24 | Sun Microsystems, Inc. | Rapid register file access by limiting access to a selectable register subset |
US5742782A (en) * | 1994-04-15 | 1998-04-21 | Hitachi, Ltd. | Processing apparatus for executing a plurality of VLIW threads in parallel |
US5742796A (en) * | 1995-03-24 | 1998-04-21 | 3Dlabs Inc. Ltd. | Graphics system with color space double buffering |
US5761475A (en) * | 1994-12-15 | 1998-06-02 | Sun Microsystems, Inc. | Computer processor having a register file with reduced read and/or write port bandwidth |
US5764943A (en) * | 1995-12-28 | 1998-06-09 | Intel Corporation | Data path circuitry for processor having multiple instruction pipelines |
US5778248A (en) * | 1996-06-17 | 1998-07-07 | Sun Microsystems, Inc. | Fast microprocessor stage bypass logic enable |
US5778243A (en) * | 1996-07-03 | 1998-07-07 | International Business Machines Corporation | Multi-threaded cell for a memory |
US5872963A (en) * | 1997-02-18 | 1999-02-16 | Silicon Graphics, Inc. | Resumption of preempted non-privileged threads with no kernel intervention |
US5925123A (en) * | 1996-01-24 | 1999-07-20 | Sun Microsystems, Inc. | Processor for executing instruction sets received from a network or from a local memory |
US5974538A (en) * | 1997-02-21 | 1999-10-26 | Wilmot, Ii; Richard Byron | Method and apparatus for annotating operands in a computer system with source instruction identifiers |
US6014147A (en) * | 1994-07-25 | 2000-01-11 | Canon Information Systems Research Australia Pty Ltd | Computer machine architecture for creating images from graphical elements and a method of operating the architecture |
US6052129A (en) * | 1997-10-01 | 2000-04-18 | International Business Machines Corporation | Method and apparatus for deferred clipping of polygons |
US6052128A (en) * | 1997-07-23 | 2000-04-18 | International Business Machines Corp. | Method and apparatus for clipping convex polygons on single instruction multiple data computers |
US6092175A (en) * | 1998-04-02 | 2000-07-18 | University Of Washington | Shared register storage mechanisms for multithreaded computer systems with out-of-order execution |
US6137497A (en) * | 1997-05-30 | 2000-10-24 | Hewlett-Packard Company | Post transformation clipping in a geometry accelerator |
US6212544B1 (en) * | 1997-10-23 | 2001-04-03 | International Business Machines Corporation | Altering thread priorities in a multithreaded processor |
US20010042188A1 (en) * | 1998-12-03 | 2001-11-15 | Marc Tremblay | Multiple-thread processor for threaded software applications |
US6603481B1 (en) * | 1998-11-09 | 2003-08-05 | Mitsubishi Denki Kabushiki Kaisha | Geometry processor capable of executing input/output and high speed geometry calculation processing in parallel |
US6671796B1 (en) * | 2000-02-25 | 2003-12-30 | Sun Microsystems, Inc. | Converting an arbitrary fixed point value to a floating point value |
US6714197B1 (en) * | 1999-07-30 | 2004-03-30 | Mips Technologies, Inc. | Processor having an arithmetic extension of an instruction set architecture |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5179702A (en) * | 1989-12-29 | 1993-01-12 | Supercomputer Systems Limited Partnership | System and method for controlling a highly parallel multiprocessor using an anarchy based scheduler for parallel execution thread scheduling |
US5553305A (en) * | 1992-04-14 | 1996-09-03 | International Business Machines Corporation | System for synchronizing execution by a processing element of threads within a process using a state indicator |
US5933627A (en) * | 1996-07-01 | 1999-08-03 | Sun Microsystems | Thread switch on blocked load or store using instruction thread field |
US6658447B2 (en) * | 1997-07-08 | 2003-12-02 | Intel Corporation | Priority based simultaneous multi-threading |
US7114056B2 (en) * | 1998-12-03 | 2006-09-26 | Sun Microsystems, Inc. | Local and global register partitioning in a VLIW processor |
US6249861B1 (en) * | 1998-12-03 | 2001-06-19 | Sun Microsystems, Inc. | Instruction fetch unit aligner for a non-power of two size VLIW instruction |
US6205543B1 (en) * | 1998-12-03 | 2001-03-20 | Sun Microsystems, Inc. | Efficient handling of a large register file for context switching |
US6279100B1 (en) * | 1998-12-03 | 2001-08-21 | Sun Microsystems, Inc. | Local stall control method and structure in a microprocessor |
US7117342B2 (en) * | 1998-12-03 | 2006-10-03 | Sun Microsystems, Inc. | Implicitly derived register specifiers in a processor |
US6321325B1 (en) * | 1998-12-03 | 2001-11-20 | Sun Microsystems, Inc. | Dual in-line buffers for an instruction fetch unit |
US6615338B1 (en) * | 1998-12-03 | 2003-09-02 | Sun Microsystems, Inc. | Clustered architecture in a VLIW processor |
US6343348B1 (en) * | 1998-12-03 | 2002-01-29 | Sun Microsystems, Inc. | Apparatus and method for optimizing die utilization and speed performance by register file splitting |
1998
- 1998-12-03: US, application US09/204,480, patent US6718457B2, Expired - Lifetime (not active)

1999
- 1999-12-03: EP, application EP99963017A, patent EP1137984B1, Expired - Lifetime (not active)
- 1999-12-03: WO, application PCT/US1999/028821, patent WO2000033185A2, IP Right Grant (active)
- 1999-12-03: DE, application DE69909829T, patent DE69909829T2, Expired - Lifetime (not active)

2000
- 2000-06-06: US, application US09/589,039, patent US7042466B1, Expired - Lifetime (not active)

2004
- 2004-04-06: US, application US10/818,785, patent US20050010743A1, Abandoned (not active)

2006
- 2006-05-08: US, application US11/382,203, patent US20060282650A1, Abandoned (not active)
Patent Citations (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5197130A (en) * | 1989-12-29 | 1993-03-23 | Supercomputer Systems Limited Partnership | Cluster architecture for a highly parallel scalar/vector multiprocessor system |
US5706415A (en) * | 1991-12-20 | 1998-01-06 | Apple Computer, Inc. | Method and apparatus for distributed interpolation of pixel shading parameter values |
US5307449A (en) * | 1991-12-20 | 1994-04-26 | Apple Computer, Inc. | Method and apparatus for simultaneously rendering multiple scanlines |
US5345541A (en) * | 1991-12-20 | 1994-09-06 | Apple Computer, Inc. | Method and apparatus for approximating a value between two endpoint values in a three-dimensional image rendering device |
US5517603A (en) * | 1991-12-20 | 1996-05-14 | Apple Computer, Inc. | Scanline rendering device for generating pixel values for displaying three-dimensional graphical images |
US5574939A (en) * | 1993-05-14 | 1996-11-12 | Massachusetts Institute Of Technology | Multiprocessor coupling system with integrated compile and run time scheduling for parallelism |
US5517611A (en) * | 1993-06-04 | 1996-05-14 | Sun Microsystems, Inc. | Floating-point processor for a high performance three dimensional graphics accelerator |
US5721868A (en) * | 1994-01-21 | 1998-02-24 | Sun Microsystems, Inc. | Rapid register file access by limiting access to a selectable register subset |
US5742782A (en) * | 1994-04-15 | 1998-04-21 | Hitachi, Ltd. | Processing apparatus for executing a plurality of VLIW threads in parallel |
US6014147A (en) * | 1994-07-25 | 2000-01-11 | Canon Information Systems Research Australia Pty Ltd | Computer machine architecture for creating images from graphical elements and a method of operating the architecture |
US5761475A (en) * | 1994-12-15 | 1998-06-02 | Sun Microsystems, Inc. | Computer processor having a register file with reduced read and/or write port bandwidth |
US5742796A (en) * | 1995-03-24 | 1998-04-21 | 3Dlabs Inc. Ltd. | Graphics system with color space double buffering |
US5712799A (en) * | 1995-04-04 | 1998-01-27 | Chromatic Research, Inc. | Method and structure for performing motion estimation using reduced precision pixel intensity values |
US5689674A (en) * | 1995-10-31 | 1997-11-18 | Intel Corporation | Method and apparatus for binding instructions to dispatch ports of a reservation station |
US5764943A (en) * | 1995-12-28 | 1998-06-09 | Intel Corporation | Data path circuitry for processor having multiple instruction pipelines |
US5925123A (en) * | 1996-01-24 | 1999-07-20 | Sun Microsystems, Inc. | Processor for executing instruction sets received from a network or from a local memory |
US5657291A (en) * | 1996-04-30 | 1997-08-12 | Sun Microsystems, Inc. | Multiport register file memory cell configuration for read operation |
US5778248A (en) * | 1996-06-17 | 1998-07-07 | Sun Microsystems, Inc. | Fast microprocessor stage bypass logic enable |
US5778243A (en) * | 1996-07-03 | 1998-07-07 | International Business Machines Corporation | Multi-threaded cell for a memory |
US5872963A (en) * | 1997-02-18 | 1999-02-16 | Silicon Graphics, Inc. | Resumption of preempted non-privileged threads with no kernel intervention |
US5974538A (en) * | 1997-02-21 | 1999-10-26 | Wilmot, Ii; Richard Byron | Method and apparatus for annotating operands in a computer system with source instruction identifiers |
US6137497A (en) * | 1997-05-30 | 2000-10-24 | Hewlett-Packard Company | Post transformation clipping in a geometry accelerator |
US6052128A (en) * | 1997-07-23 | 2000-04-18 | International Business Machines Corp. | Method and apparatus for clipping convex polygons on single instruction multiple data computers |
US6052129A (en) * | 1997-10-01 | 2000-04-18 | International Business Machines Corporation | Method and apparatus for deferred clipping of polygons |
US6212544B1 (en) * | 1997-10-23 | 2001-04-03 | International Business Machines Corporation | Altering thread priorities in a multithreaded processor |
US6092175A (en) * | 1998-04-02 | 2000-07-18 | University Of Washington | Shared register storage mechanisms for multithreaded computer systems with out-of-order execution |
US6603481B1 (en) * | 1998-11-09 | 2003-08-05 | Mitsubishi Denki Kabushiki Kaisha | Geometry processor capable of executing input/output and high speed geometry calculation processing in parallel |
US20030206173A1 (en) * | 1998-11-09 | 2003-11-06 | Mitsubishi Denki Kabushiki Kaisha | Geometry processor capable of executing input/output and high speed geometry calculation processing in parallel |
US20010042188A1 (en) * | 1998-12-03 | 2001-11-15 | Marc Tremblay | Multiple-thread processor for threaded software applications |
US6718457B2 (en) * | 1998-12-03 | 2004-04-06 | Sun Microsystems, Inc. | Multiple-thread processor for threaded software applications |
US6714197B1 (en) * | 1999-07-30 | 2004-03-30 | Mips Technologies, Inc. | Processor having an arithmetic extension of an instruction set architecture |
US6671796B1 (en) * | 2000-02-25 | 2003-12-30 | Sun Microsystems, Inc. | Converting an arbitrary fixed point value to a floating point value |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230334008A1 (en) * | 2008-05-27 | 2023-10-19 | Stillwater Supercomputing, Inc. | Execution engine for executing single assignment programs with affine dependencies |
Also Published As
Publication number | Publication date |
---|---|
US20050010743A1 (en) | 2005-01-13 |
US7042466B1 (en) | 2006-05-09 |
WO2000033185A3 (en) | 2000-10-12 |
US20010042188A1 (en) | 2001-11-15 |
WO2000033185A2 (en) | 2000-06-08 |
DE69909829T2 (en) | 2004-05-27 |
EP1137984A2 (en) | 2001-10-04 |
DE69909829D1 (en) | 2003-08-28 |
US6718457B2 (en) | 2004-04-06 |
EP1137984B1 (en) | 2003-07-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7042466B1 (en) | | Efficient clip-testing in graphics acceleration |
US7114056B2 (en) | | Local and global register partitioning in a VLIW processor |
US6757820B2 (en) | | Decompression bit processing with a general purpose alignment tool |
US7028170B2 (en) | | Processing architecture having a compare capability |
US8106914B2 (en) | | Fused multiply-add functional unit |
US6343348B1 (en) | | Apparatus and method for optimizing die utilization and speed performance by register file splitting |
CN115454501A (en) | | Method and apparatus for performing reduction operations on multiple data element values |
CN112148251A (en) | | System and method for skipping meaningless matrix operations |
US6341300B1 (en) | | Parallel fixed point square root and reciprocal square root computation unit in a processor |
US11726912B2 (en) | | Coupling wide memory interface to wide write back paths |
US7117342B2 (en) | | Implicitly derived register specifiers in a processor |
US6615338B1 (en) | | Clustered architecture in a VLIW processor |
JP5326314B2 (en) | | Processor and information processing device |
US7558816B2 (en) | | Methods and apparatus for performing pixel average operations |
JPH07244589A (en) | | Computer system and method to solve predicate and boolean expression |
US6678710B1 (en) | | Logarithmic number system for performing calculations in a processor |
US7587582B1 (en) | | Method and apparatus for parallel arithmetic operations |
US20230129750A1 (en) | | Performing a floating-point multiply-add operation in a computer implemented environment |
US6625634B1 (en) | | Efficient implementation of multiprecision arithmetic |
CN111813447B (en) | | Processing method and processing device for data splicing instruction |
EP1367485B1 (en) | | Pipelined processing |
WO2002015000A2 (en) | | General purpose processor with graphics/media support |
EP1367484B1 (en) | | Instruction encoding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |