WO1999052040A1 - Architecture pour traitement graphique (Architecture for graphics processing) - Google Patents

Architecture for graphics processing


Publication number
WO1999052040A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
processor
block
memory blocks
embedded dram
Application number
PCT/US1999/007771
Other languages
English (en)
Inventor
Richard Rubinstein
Original Assignee
Stellar Technologies, Ltd.
Application filed by Stellar Technologies, Ltd.
Priority to AU36386/99A
Publication of WO1999052040A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00: General purpose image data processing
    • G06T 1/20: Processor architectures; processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/80: Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data [SIMD] processors
    • G06F 15/8007: Single instruction multiple data [SIMD] multiprocessors
    • G06F 15/8015: One-dimensional arrays, e.g. rings, linear arrays, buses

Definitions

  • the present invention is generally in the field of digital computer architectures and, more specifically, is directed to circuits, systems and methodologies for digital signal processing utilizing shared, reconfigurable memory elements.
  • DSP Digital Signal Processing
  • DSP cores are specially configured to efficiently process DSP algorithms.
  • One example of a known DSP chip is the DSP56002 processor, commercially available from Motorola.
  • the general purpose processor carries out Input/Output (I/O) tasks, logic functions, address generation, etc. This is a workable but costly solution.
  • I/O Input/Output
  • evolving new applications require increasing amounts of memory and the use of multiple conventional digital signal processors. Additionally, power dissipation becomes a limiting factor in hardware of this type.
  • one object of the present invention is to provide an improved computer architecture that utilizes available memory more efficiently in DSP systems. Another object is to reduce the power consumption and size of DSP systems.
  • a further object of the present invention is to provide for shared and reconfigurable memory in order to reduce I/O processor requirements for digital signal processing in processor and co-processor architectures.
  • a further object is to extend a shared, reconfigurable memory architecture to multiple memory blocks and execution units.
  • Another object of the invention is utilization of novel "bit configuration tables" in connection with shared and reconfigurable memory to use memory more efficiently, and to improve performance by making new data continually available so that the execution unit is never idle.
  • a further object of the invention is to provide improvements in memory addressing methods, architectures and circuits for continuous DSP execution together with simultaneous, continuous Direct Memory Access (DMA).
  • DMA Direct Memory Access
  • a still further object of the invention is to provide improvements in execution units for DSP operations, for example execution unit architectures that feature deep-pipeline structures and local registers, and that support parallel operations. Modified execution units can be used to improve efficiency of operation in conjunction with reconfigurable memory.
  • Yet another object of the present invention is a "virtual two port" memory structure based on a conventional, single-port memory cell. Yet another object is to provide for implementation in both Static Random Access Memory (SRAM) and Dynamic Random Access Memory (DRAM) configurations of the virtual two-port memory.
  • SRAM Static Random Access Memory
  • DRAM Dynamic Random Access Memory
  • the present invention is directed to improved hardware architectures for digital signal processing, and more specifically, is directed to "memory-centric" methods and apparatus for improved performance in digital signal processing or "DSP" systems.
  • improvements in DSP performance have been achieved by providing special arithmetic units or “execution units” that are optimized to carry out the arithmetic operations that are commonly required in DSP — mainly multiplication and addition — at very high speed.
  • execution unit is the "DAU” (data execution unit) provided in the WE DSP32C chip from AT&T.
  • the AT&T execution unit, and others like it, provide relatively fast, floating point arithmetic operations to support computation-intensive applications such as speech, graphics and image processing.
  • While many improvements have been made in floating-point execution units, pipelined architectures, decreased cycle times, etc., known DSP systems generally work with standard memory systems. For example, DRAM integrated circuits are used for reading input data and, on the output side, for storing output data. DSP data is moved into and out of the DRAM memory systems using known techniques such as multiple-ported memory, DMA hardware, buffers, and the like. While such systems benefit from improvements in memory speed and density, data transfer remains a relative bottleneck. I have reconsidered these known techniques and discovered that significant gains in performance and flexibility can be achieved by focusing on the memory, in addition to the execution unit, and by providing improvements in methods and circuits for moving data efficiently among data sources (such as a host processor bus or I/O channel), memory subsystems, and execution units. Since the focus is on the memory, I coined the term "memory-centric" computing.
  • One aspect of the invention is a memory subsystem that is partitioned into two or more blocks of memory space.
  • One block of the memory communicates with an I/O or DMA channel to load data, while the other block of memory simultaneously communicates with one or more execution units that carry out arithmetic operations on data in the second block. Results are written back to the second block of memory.
  • the memory blocks are effectively "swapped" so that the second block, now holding processed (output) data, communicates with the I/O channel to output that data, while the execution unit communicates with the first block, which by then has been filled with new input data.
  • Methods and apparatus are shown for implementing this memory swapping technique in real time so that the execution unit is never idle.
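By way of illustration only (this sketch is not part of the patent disclosure, and the class and function names are invented), the ping-pong discipline described above can be modeled in a few lines of Python:

```python
# Behavioral sketch of the two-block memory swap: while the execution
# unit processes one block, the DMA channel fills the other; at the end
# of each cycle the blocks exchange roles, so the execution unit is
# never left waiting for data.

class PingPongMemory:
    def __init__(self, size):
        self.block_a = [0] * size
        self.block_b = [0] * size
        # Initially: A faces the DMA/I-O channel, B faces the execution unit.
        self.io_block, self.exec_block = self.block_a, self.block_b

    def swap(self):
        # "Swapping" only re-routes the address/data paths; no data is copied.
        self.io_block, self.exec_block = self.exec_block, self.io_block

def run_cycle(mem, incoming, compute):
    # DMA loads new input data into one block...
    mem.io_block[:len(incoming)] = incoming
    # ...while the execution unit operates in place on the other block.
    mem.exec_block[:] = [compute(x) for x in mem.exec_block]
    mem.swap()

mem = PingPongMemory(4)
run_cycle(mem, [1, 2, 3, 4], lambda x: x * 2)   # T0: DMA fills A, exec runs on B
run_cycle(mem, [5, 6, 7, 8], lambda x: x * 2)   # T1: exec now sees the data loaded in T0
print(mem.io_block)                             # prints [2, 4, 6, 8]
```

Because the swap only re-routes which address generator drives which block, no data is ever copied between blocks; the execution unit finds fresh input waiting at the start of every cycle.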
  • Another aspect of the invention provides for interfacing two or more address generators to the same block of memory, so that memory block swapping can be accomplished without the use of larger multi-ported memory cells.
  • the present invention is useful in a wide variety of signal processing applications including programmable MPEG encode and decode, graphics, speech processing, image processing, array processors, etc.
  • the invention can be used, for example, for switching applications in which multiple I/O channels are operated simultaneously.
  • a further aspect of the invention provides for partitioning the memory space into two or more memory "segments" - with the ability to selectively assign one or more such segments to form a required block of memory.
  • one block of memory can be configured to include say, four segments, and be associated with an execution unit, while another block of memory is configured to include only one segment and may be assigned to an I/O or DMA channel.
  • This flexibility is useful in matching the memory block size to the requirements of an associated execution unit for a particular operation, say a recursive type of digital filter such as an Infinite Impulse Response (IIR) filter.
  • Memory segments can be of arbitrary size as will be shown in more detail later.
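As a sketch of the segment-to-block idea (illustrative only; the segment size, class, and block names here are assumptions, not taken from the patent), a pool of equal-size segments can be composed into blocks of different sizes:

```python
# A pool of memory segments from which logical blocks are composed,
# e.g. four segments for an execution-unit block and one segment for
# an I/O (DMA) block, as in the IIR-filter example above.

SEGMENT_WORDS = 256          # assumed segment size, for illustration only

class SegmentedMemory:
    def __init__(self, n_segments):
        self.segments = [[0] * SEGMENT_WORDS for _ in range(n_segments)]
        self.free = list(range(n_segments))
        self.blocks = {}     # block name -> list of assigned segment indices

    def configure_block(self, name, n_segments):
        # Assign n free segments to form one logical block.
        taken, self.free = self.free[:n_segments], self.free[n_segments:]
        self.blocks[name] = taken

    def block_size(self, name):
        return len(self.blocks[name]) * SEGMENT_WORDS

mem = SegmentedMemory(8)
mem.configure_block("exec", 4)   # large block for an execution unit
mem.configure_block("dma", 1)    # small block for the I/O channel
print(mem.block_size("exec"), mem.block_size("dma"))   # prints 1024 256
```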
  • the memory is readily “reconfigurable” so that it can adapt to the particular calculations required.
  • the memory reconfiguration is controlled by configuration control signals.
  • the configuration control signals may be generated based upon configuration bits, which can be downloaded from a core processor, instruction decoder, or durable memory, for reconfiguring the memory responsive to the task at hand.
  • the configuration bits are stored in extended bit positions in the regular memory, so that pointers or traps can be used in software to reconfigure the hardware.
  • a further aspect of the invention provides for generating configuration control signals in an improved address generator or in a novel Memory-centric DSP Controller ("MDSPC").
  • MDSPC Memory-centric DSP Controller
  • the new address generation techniques include both reconfiguring and addressing the memory to support a particular computation in the execution unit.
  • memory blocks can be reconfigured both in depth, i.e. number of words or rows, as well as in width (word size).
  • the memory word size can be easily configured to match that of the I/O channel currently in use.
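The depth/width trade-off can be illustrated with Python's `memoryview`, which lets the same underlying storage be addressed as narrow or wide words (a software analogy only; the patent describes a hardware mechanism):

```python
# The same physical storage viewed either as many narrow words or
# fewer wide words, so the word size can be matched to the active
# I/O channel.

storage = bytearray(64)                 # 64 bytes of raw memory

narrow = memoryview(storage).cast('H')  # 16-bit words: depth 32
wide   = memoryview(storage).cast('I')  # 32-bit words: depth 16

narrow[0] = 0x1234                      # write through the 16-bit view
narrow[1] = 0x5678
print(len(narrow), len(wide))           # prints 32 16
print(hex(wide[0]))                     # the same bits read as one 32-bit word
```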
  • the invention can be implemented in both von Neumann as well as Harvard architectures.
  • a further aspect of the invention provides ways to interface multiple blocks of memory, in combination with one or more execution units.
  • parallel execution units can be used to advantage.
  • Another aspect of the invention is a system that is readily configurable to take advantage of the available execution resources.
  • the configuration bits described above also can include controls for directing data to and from multiple execution units as and when appropriate.
  • the invention further anticipates an execution unit that can be reconfigured in several ways, including selectable depth (number of pipeline stages) and width (i.e. multiple word sizes concurrently).
  • the pipelined execution unit(s) includes internal register files with feedback.
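A behavioral sketch of a depth-configurable multiply-accumulate pipeline with a local accumulator as the feedback register (the class name and structure are illustrative assumptions, not the patent's circuit):

```python
from collections import deque

class ConfigurablePipeline:
    """Multiply-accumulate pipeline with a selectable number of stages."""
    def __init__(self, depth):
        self.stages = deque([None] * depth, maxlen=depth)  # None = bubble
        self.acc = 0                       # local register file / feedback path

    def clock(self, a=None, b=None):
        # Each clock: a new product enters the pipe and the oldest
        # stage retires into the accumulator (feedback).
        product = a * b if a is not None else None
        retired = self.stages[0]           # value about to fall off the pipe
        self.stages.append(product)        # maxlen evicts the retired stage
        if retired is not None:
            self.acc += retired
        return retired

pipe = ConfigurablePipeline(depth=3)       # depth is a configuration parameter
for a, b in [(1, 2), (3, 4), (5, 6)]:
    pipe.clock(a, b)                       # fill the pipeline
for _ in range(3):                         # drain: results emerge `depth` clocks later
    pipe.clock()
print(pipe.acc)                            # prints 44, i.e. 1*2 + 3*4 + 5*6
```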
  • the execution unit configuration and operation also can be controlled by execution unit configuration control signals.
  • the execution unit configuration control signals can be determined by "configuration bits" stored in the memory, or stored in a separate "configuration table".
  • the configuration table can be downloaded by the host core processor, and/or updated under software control.
  • the configuration control signals are generated by the MDSPC controller mentioned above executing microcode.
  • FIG. 1 is a system level block diagram of an architecture for digital signal processing (DSP) using shared memory according to the present invention.
  • DSP digital signal processing
  • FIG. 2 illustrates circuitry for selectively coupling two or more address generators to a single block of memory.
  • FIG. 3 is a block diagram illustrating portions of the memory circuitry and address generators of Fig. 1 in a fixed-partition memory configuration.
  • FIG. 4 shows more detail of address and bit line connections in a two-port memory system of the type described.
  • FIGS. 5A-5C illustrate selected address and control signals in a Processor Implementation of a DSP system, i.e. a complete DSP system integrated on a single chip.
  • FIG. 6A illustrates an alternative embodiment in which a separate DSP program counter is provided for accessing the memory.
  • FIG. 6B illustrates an alternative embodiment in which an MDSPC accesses the memory.
  • FIGS. 7A-B are block diagrams that illustrate embodiments of the invention in a Harvard architecture.
  • FIG. 8 is a conceptual diagram that illustrates a shared, reconfigurable memory architecture according to the present invention.
  • FIG. 9 illustrates connection of address lines to a shared, reconfigurable memory with selectable (granular) partitioning of the reconfigurable portion of the memory.
  • FIG. 10 illustrates a system that implements a reconfigurable segment of memory under bit selection table control.
  • FIG. 11A is a block diagram illustrating an example of using single-ported RAM in a DSP computing system according to the present invention.
  • FIG. 11B is a table illustrating a pipelined timing sequence for addressing and accessing the one-port memory so as to implement a "virtual two-port" memory.
  • FIG. 12 illustrates a block of memory having at least one reconfigurable segment with selectable write and read data paths.
  • FIG. 13A is a schematic diagram showing detail of one example of the write selection circuitry of the reconfigurable memory of Fig. 12.
  • FIG. 13B illustrates transistor pairs arranged for propagating or isolating bit lines as an alternative to transistors 466 in Fig. 13A or as an alternative to the bit select transistors 462, 464 of Fig. 13A.
  • FIG. 14 is a block diagram illustrating extension of the shared, reconfigurable memory architecture to multiple segments of memory.
  • FIG. 15 is a simplified block diagram illustrating multiple reconfigurable memory segments with multiple sets of sense amps.
  • FIGS. 16A-16D are simplified block diagrams illustrating various examples of memory segment configurations to form memory blocks of selectable size.
  • FIG. 17 is a block diagram of a DSP architecture illustrating a multiple memory block to multiple execution unit interface scheme in which configuration is controlled via specialized address generators.
  • FIGS. 18A-18C are simplified block diagrams illustrating various configurations of segments of a memory block into association with multiple execution units.
  • FIG. 19 is a simplified block diagram illustrating a shared, reconfigurable memory system utilizing common sense amps.
  • FIG. 20 is a simplified block diagram illustrating a shared, reconfigurable memory system utilizing multiple sense amps for each memory segment.
  • FIG. 21 is a timing diagram illustrating memory swapping cycles.
  • FIG. 22A is a block diagram illustrating memory swapping under bit table control.
  • FIG. 22B is a block diagram illustrating memory swapping under MDSPC control.
  • FIG. 23 is a simplified block diagram of an MPEG encoder/decoder architecture according to the present invention.
  • FIG. 24 is a simplified block diagram of a digital signal processing ring topology architecture according to the invention.
  • Fig. 1 is a system-level block diagram of an architecture for memory and computing-intensive applications such as digital signal processing.
  • a microprocessor interface 40 includes a DMA port 42 for moving data into a memory via path 46 and reading data from the memory via path 44.
  • the microprocessor interface 40 generically represents an interface to any type of controller or microprocessor.
  • the interface partition indicated by the dashed line 45 in Fig. 1 may be a physical partition, where the microprocessor is in a separate integrated circuit, or it can merely indicate a functional partition in an implementation in which all of the memory and circuitry represented in the diagram of Fig. 1 is implemented on board a single integrated circuit.
  • the microprocessor interface also includes control signals indicated at 52.
  • the microprocessor or controller can also provide microcode (not shown) for memory control and address generation, as well as control signals for configuration and operation of the functional execution units, as described later.
  • Since the present invention may be integrated into an existing processor or controller core design, so that both the core processor and the present invention reside in the same integrated circuit, reference will be made herein to the "core processor," meaning the processor to which the present invention has been attached or with which it has been integrated.
  • a two-port memory comprises a first memory block 50, labeled "A," and a second memory block 60, labeled "B." The memory is addressed by a source address generator 70 and a destination address generator 80.
  • a functional execution unit 90 also is coupled to the two-port memory through left and right I/O channels, as illustrated at block B. Preferably, these are not conventional two-port memory I/O ports; rather, they have novel structures described later.
  • the interface 44, 46 to the two-port memory block A is a DMA interface that is in communication with the host processor or controller 40.
  • Block A receives data coefficients and optionally other parameters from the controller, and also returns completed data to the controller that results from various DSP, graphics, MPEG encode/decode or other operations carried out in the execution unit 90.
  • This output data can include, for example, FFT results, or convolution data, or graphics rendering data, etc.
  • the single memory can alternately act as both a graphics frame buffer and a graphics computation buffer memory.
  • the memory block "B" (60) interfaces with the functional execution unit 90.
  • the functional execution unit 90 receives data from the two-port memory block B and executes on it, and then returns results ("writeback") to the memory.
  • the source address generator 70 supplies source or input data to the functional execution unit while the destination address generator 80 supplies addresses for writing results (or intermediate data) from the execution unit to the memory.
  • source address generator 70 provides addressing while the functional execution unit is reading input data from memory block B
  • the destination address generator 80 provides addressing to the same memory block B while the functional execution unit 90 is writing results into the memory.
  • the memory effectively "swaps" blocks A and B, so that block B is in communication with the DMA channel 42 to read out the results of the execution.
  • This "swapping" of memory blocks includes several aspects, the first of which is switching the memory address generator lines so as to couple them to the appropriate physical block of memory.
  • the system can be configured so that the entire memory space (blocks A and B in the illustration) is accessed first by an I/O channel, and then the entire memory is swapped to be accessed by the processor or execution unit.
  • any or all of the memory can be reconfigured as described.
  • the memory can be SRAM, DRAM or any other type of random access semiconductor memory or functionally equivalent technology. DRAM refresh is provided by address generators, or may not be required where the speed of execution and updating the memory (access frequency) is sufficient to obviate refresh.
  • FIGURE 2 Figure 2 illustrates one way of addressing a memory block with two (or more) address generators.
  • one address generator is labeled “DMA” and the other "ADDR GEN” although they are functionally similar.
  • one of the address generators 102 has a series of output lines, corresponding to memory word lines. Each output line is coupled to a corresponding buffer (or word line driver or the like), 130 to 140. Each driver has an enable input coupled to a common enable control line 142.
  • the other address generator 104 similarly has a series of output lines coupled to respective drivers 150 to 160.
  • the number of word lines is at least equal to the number of rows of the memory block 200.
  • the second set of drivers also has enable inputs coupled to the common enable control line 142, but note the inverter "bubbles" on drivers 130 to 140, indicating active-low enables, in contrast to the active-high enables of drivers 150 to 160.
  • when the enable line 142 is in one state, the DMA address generator 102 is coupled to the memory 200 row address inputs; when the line is in the opposite state, the ADDR GEN 104 is coupled to the memory 200 row address inputs. In this way, the address inputs are "swapped" under control of a single bit.
  • Alternative circuitry can be used to achieve the equivalent effect.
  • the devices illustrated can be tri-state output devices, or open collector or open drain structures can be used where appropriate.
  • Other alternatives include transmission gates or simple pass transistors for coupling the selected address generator outputs to the memory address lines. The same strategy can be extended to more than two address sources, as will be apparent to those skilled in the art in view of this disclosure.
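Behaviorally, the circuit of Fig. 2 reduces to a two-way selector on the row-address source, controlled by one enable bit (a Python model for illustration; the patent describes driver-level circuits, not software):

```python
def select_row_address(enable, dma_addr, exec_addr):
    # enable = 0: the DMA address generator's drivers are active
    #             (one of the two complementary-enabled driver sets);
    # enable = 1: the execution address generator's drivers are active.
    return exec_addr if enable else dma_addr

# One control bit swaps which generator drives the memory's row lines:
print(select_row_address(0, dma_addr=0x10, exec_addr=0x80))  # prints 16  (DMA wins)
print(select_row_address(1, dma_addr=0x10, exec_addr=0x80))  # prints 128 (ADDR GEN wins)
```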
  • Figure 3 is a block diagram illustrating a physical design of portions of the memory circuitry and address generators of Fig. 1 in a fixed-partition configuration.
  • fixed partition I mean that the size of memory block A and the size of memory block B cannot change dynamically.
  • the memory block A (50) and block B (60) correspond to the same memory blocks of Fig. 1.
  • the memory itself preferably is dynamic RAM, although static RAM or other solid state memory technologies could be used as well.
  • in memory block B, just two bits or memory cells 62 and 64 are shown by way of illustration.
  • the memory block is likely to include thousands or even millions of rows, each row (or word) being perhaps 64 or more bits wide.
  • a typical memory block using today's technology is likely to be one or two megabytes.
  • the memory blocks need not be of equal size. Neither memory depth nor word size is critical to the invention.
  • the source address generator 70 is coupled to both memory blocks A and B.
  • the top row includes a series of cells including bit cell 62.
  • the source address generator preferably has output lines coupled to all of the rows of not only block B, but block A as well, although only one row line is illustrated in block A.
  • corresponding address lines from the AG 70 and the DMA 102 are shown as connected in common, e.g. at line 69. However, in practice, these address lines are selectable as described above with reference to Fig. 2.
  • a destination address generator 80 similarly is coupled to the row lines of both blocks of memory.
  • Memory cells 62 and 64 are full two-ported cells on the same column in this example.
  • a write select multiplexer 106 directs data either from the DMA (42 in Fig. 1) (or another block of memory) or from the execution unit 90, responsive to a control signal 108.
  • the control signal is provided by the controller or microprocessor of Fig. 1, by a configuration bit, or by an MDSPC.
  • the selected write data is provided to column amplifiers 110, 112 which in turn are connected to corresponding memory cell bit lines. 110 and 112 are bit and /bit ("bit bar") drivers.
  • Below cell 64 is a one-bit sense amplifier 116. A bit output from the sense amp 116 is directed, for example, to a latch 72.
  • Both the DMA and the execution unit are coupled to receive data from latch 72, depending on appropriate control, enable and clock signals (not shown here). Or, both the DMA and the execution path may have separate latches, the specifics being a matter of design choice. Only one sense amp is shown for illustration, while in practice there will be at least one sense amp for each column.
  • Fig. 4 shows more detail of the connection of cells of the memory to source and destination address lines. This drawing shows how the source address lines are connected.
  • Fig. 21 is a conceptual diagram illustrating an example for the timing of operation of the architecture illustrated in Fig. 1.
  • T0A, T1A, etc. are specific instances of two operating time cycles T0 and T1.
  • the cycle length can be predetermined, or can be a parameter downloaded to the address generators.
  • T0 and T1 are not necessarily the same length and are defined as alternating and mutually exclusive, i.e. a first cycle T1 starts at the end of T0, and a second cycle T0 starts at the end of the first period T1, and so on. Both T0 and T1 are generally longer than the basic clock or memory cycle time.
  • Fig. 22A is a block diagram of a single-port architecture which will be used to illustrate an example of functional memory swapping in the present invention during repeating T0 and T1 cycles.
  • Execution address generator 70 addresses memory block A (50) during T0 cycles. This is indicated by the left (T0) portion of AG 70. During T1 cycles, execution address generator 70 addresses memory block B (60), as indicated by the right portion of 70.
  • AG 70 also receives setup or configuration data in preparation for again addressing Mem Block A during the next T0 cycle.
  • AG 70 also receives configuration data in preparation for again addressing Mem Block B during the next T1 cycle.
  • DMA address generator 102 addresses memory block B (60) during T0 cycles. This is indicated by the left (T0) portion of DMA AG 102. During T1 cycles, DMA address generator 102 addresses memory block A (50), as indicated by the right portion of 102. During T1, DMA AG 102 also receives setup or configuration data in preparation for again addressing Mem Block B during the next T0 cycle.
  • DMA 102 also receives configuration data in preparation for again addressing Mem Block A during the next T1 cycle.
  • the functional execution unit (90 in Fig. 1) is operating continuously on data in memory block A 50 under control of execution address generator 70.
  • DMA address generator 102 is streaming data into memory block B 60.
  • memory blocks A and B effectively swap such that execution unit 90 will process the data in memory block B 60 under control of execution address generator 70 and data will stream into memory block A 50 under control of DMA address generator 102.
  • memory blocks A and B again effectively swap such that execution unit 90 will process the data in memory block A 50 under control of execution address generator 70 and data will stream into memory block B 60 under control of DMA address generator 102.
  • the functions of the execution address generator and DMA address generator are performed by the MDSPC 172 under microcode control.
  • FIGURES 5A-C Processor Implementation
  • a two-port memory again comprises a block A (150) and a block B (160).
  • Memory block B is coupled to a DSP execution unit 130.
  • An address generator 170 is coupled to memory block B 160 via address lines 162.
  • the address generator unit is executing during a first cycle T0 and during time T0 is loading parameters for subsequent execution in cycle T1.
  • the lower memory block A is accessed via core processor data address register 142A or core processor instruction address register 142B.
  • the data memory and the instructional program memory are located in the same physical memory.
  • a microprocessor system of the Harvard architecture has separate physical memory for program instructions and for data.
  • the present invention can be used to advantage in the Harvard architecture environment as well, as described below with reference to Figs. 7 A and 7B.
  • Fig. 5A also includes a bit configuration table 140.
  • the bit configuration table can receive and store information from the memory 150 or from the core processor, via bus 180, or from an instruction fetched via the core processor instruction address register 142B. Information is stored in the bit configuration table during cycle T0 for controlling execution during the next subsequent cycle T1.
  • the bit configuration table can be loaded by a series of operations, reading information from the memory block A via bus 180 into the bit configuration tables. This information includes address generation parameters and opcodes. Examples of some of the address parameters are starting address, modulo-address counting, and the length of timing cycles T0 and T1. Examples of op codes for controlling the execution unit are the multiply and accumulate operations necessary to perform an FFT.
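The kind of parameters such a table might hold can be sketched as follows (the field names are invented for illustration; only the parameter categories come from the description above). The generator shows the modulo-address counting named as one of those parameters:

```python
# Illustrative contents of a bit configuration table: a starting
# address, a modulo for circular addressing, the T0/T1 cycle lengths,
# and op codes for the execution unit.

CONFIG = {
    "start_addr": 8,      # first address of the buffer
    "modulo": 4,          # circular-buffer length
    "t0_len": 16,         # cycle lengths T0 / T1, in clocks
    "t1_len": 16,
    "opcodes": ["MAC"],   # e.g. multiply-accumulate steps of an FFT
}

def modulo_addresses(cfg, n):
    """Yield n successive addresses, wrapping within the modulo region."""
    base, mod = cfg["start_addr"], cfg["modulo"]
    for i in range(n):
        yield base + (i % mod)

print(list(modulo_addresses(CONFIG, 6)))   # prints [8, 9, 10, 11, 8, 9]
```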
  • the bit configuration table is used to generate configuration control signal 152 which determines the position of virtual boundary 136 and, therefore, the configuration of memory blocks A and B. It also provides the configuration information necessary for operation of the address generator 170 and the DSP execution unit 130 during the T1 execution cycle time.
  • Path 174 illustrates the execution unit/memory interface control signals from the bit configuration table 140 to the DSP execution unit 130.
  • Path 176 illustrates the configuration control signal to the execution unit to reconfigure the execution unit.
  • Path 178 illustrates the op codes sent to execution unit 130, which cause the execution unit to perform the operations necessary to process data.
  • Path 188 shows configuration information loaded from the configuration tables into the address generator 170.
  • the architecture illustrated in Fig. 5A preferably would utilize the extended instructions of a given processor architecture to allow the address register from the instruction memory to create the information flow into the bit configuration table.
  • special instructions or extended instructions in the controller or microprocessor architecture can be used to enable this mechanism to operate as described above. Such an implementation would provide tight coupling to the microprocessor architecture.
  • Fig. 5B illustrates an embodiment of the present invention wherein the functions of address generator 170 and bit configuration table 140 of Fig. 5A are performed by memory-centric DSP controller (MDSPC) 172.
  • MDSPC memory-centric DSP controller
  • the core processor writes microcode for MDSPC 172 along with address parameters into memory block A (150).
  • the microcode and address parameters are downloaded into local memory within MDSPC 172.
  • a DSP process initiated in MDSPC 172 then generates the appropriate memory configuration control signals 152 and execution unit configuration control signals 176 based upon the downloaded microcode to control the position of virtual boundary 136 and structure execution unit 130 to optimize performance for the process corresponding to the microcode.
  • MDSPC 172 As the DSP process executes, MDSPC 172 generates addresses for memory block B 160 and controls the execution unit/memory interface to load operands from memory into the execution unit 130 which are then processed by execution unit 130 responsive to op codes 178 sent from MDSPC 172 to execution unit 130.
  • virtual boundary 136 may be adjusted responsive to microcode during process execution in order to dynamically optimize the memory and execution unit configurations.
  • the MDSPC 172 supplies the timing and control for the interfaces between memory and the execution unit. Further, algorithm coefficients to the execution unit may be supplied directly from the MDSPC. The use of microcode in the MDSPC results in execution of the DSP process that is more efficient than the frequent downloading of bit configuration tables and address parameters associated with the bit configuration table approach of Fig. 5A.
  • the microcoded method represented by the MDSPC results in fewer bits to transfer from the core processor to memory for the DSP process and less frequent updates of this information from the core processor. Thus, the core processor bandwidth is conserved along with the amount of bits required to store the control information.
  • Fig. 5C illustrates an embodiment of the present invention wherein the reconfigurability of memory in the present invention is used to allocate an additional segment of memory, memory block C 190, which permits MDSPC 172 to execute microcode and process address parameters out of memory block C 190 rather than local memory.
• This embodiment requires an additional set of address 192 and data 194 lines to provide the interface between memory block C 190 and MDSPC 172, and address bus control circuitry 144 under control of MDSPC 172 to disable the appropriate address bits from core processor register file 142.
  • This configuration permits simultaneous access of MDSPC 172 to memory block C 190, DSP execution unit 130 to memory block B and the core processor to memory block A.
  • virtual boundaries 136A and 136B are dynamically reconfigurable to optimize the memory configuration for the DSP process executing in MDSPC 172.
  • the bit tables and microcode discussed above may alternatively reside in durable store, such as ROM or flash memory.
• the durable store may be part of memory block A or may reside outside of memory block A, wherein the content of the durable store is transferred to memory block A, to the address generators, or to the MDSPC as appropriate.
• the DSP process may be triggered by decoding a preselected bit pattern corresponding to a DSP function into an address in memory block A containing the bit tables or microcode required for execution of the DSP function.
• Yet another approach to triggering the DSP process is to place the bit tables or microcode for the DSP function at a particular location in memory block A; the DSP process is then triggered by the execution of a jump instruction to that particular location.
• a DSP function, such as a Fast Fourier Transform (FFT) or an Infinite Impulse Response (IIR) filter, may be triggered in this manner.
• FIGURES 6A and 6B: Referring now to Fig. 6A, in an alternative embodiment, a separate program counter 190 is provided for DSP operations.
  • the core controller or processor (not shown) loads information into the program counter 190 for the DSP operation and then that program counter in turn addresses the memory block 150 to start the process for the DSP.
  • Information required by the DSP operations would be stored in memory.
• any register of the core processor, such as data address register 142A or instruction address register 142B, can be used for addressing memory 150.
  • Bit Configuration Table 140 in addition to generating memory configuration signal 152, produces address enable signal 156 to control address bus control circuitry 144 in order to select the address register which accesses memory block A and also to selectively enable or disable address lines of the registers to match the memory configuration (i.e. depending on the position of virtual boundary 136, address bits are enabled if the bit is needed to access all of memory block A and disabled if block A is smaller than the memory space accessed with the address bit).
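The address-line enable/disable rule just described can be illustrated with a short sketch. The following Python function is a hypothetical model, not the patent's circuitry: it computes which low-order address lines must remain enabled so that a register can reach all of block A but cannot address past the virtual boundary.

```python
def enabled_address_mask(block_a_words, register_width):
    """Return a bit mask of the register's address lines to enable.

    An address bit is enabled only if it is needed to access every word of
    block A; higher-order bits are disabled when block A is smaller than the
    memory space those bits would address.
    """
    needed = min(max(1, (block_a_words - 1).bit_length()), register_width)
    return (1 << needed) - 1


# Block A occupies 1024 words: only the low 10 of 16 address lines are enabled.
mask = enabled_address_mask(1024, 16)
```

Shrinking block A to 256 words would enable only the low 8 lines, modeling how the enable signal 156 tracks the position of virtual boundary 136.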
• Fig. 6A shows the DSP program counter 190 being loaded by the processor with an address to move into memory block A. In that case, the other address sources in register file 142 are disabled, at least with respect to addressing memory 150.
• three different alternative mechanisms are illustrated for accessing the memory 150 in order to fetch the bit configurations and other parameters 140. The selection of which addressing mechanism is most advantageous depends upon the particular application.
  • Fig. 6B shows an embodiment wherein MDSPC 172 is used to generate addresses for memory block A in place of DSP PC 190.
  • Address enable signal 156 selects between the address lines of MDSPC 172 and those of register file 142 in response to the microcode executed by MDSPC 172.
  • MDSPC 172 will be executing out of memory block A and therefore requires access to the content of memory block A.
  • memory blocks A (150) and B (160) are separated by "virtual boundary" 136.
  • block A and block B are portions of a single, common memory, in a preferred embodiment.
  • the location of the "virtual boundary" is defined by the configuration control signal generated responsive to the bit configuration table parameters.
  • the memory is reconfigurable under software control.
• this memory has a variable boundary; the memory preferably is part of the processor memory, and it is not contemplated as a separate memory distinct from the processor architecture.
• the memory as shown and described is essentially reconfigurable memory integrated directly into the microprocessor itself.
• memory block B 160, duly configured, interfaces to the DSP execution unit as shown in Fig. 5.
  • virtual boundary 136 is controlled based on the microcode downloaded to MDSPC 172.
  • microcode determines the position of both virtual boundary 136A and 136B to create memory block C 190.
• Fig. 7A illustrates an alternative embodiment, corresponding to Fig. 5A, of the present invention in a Harvard-type architecture, comprising a data memory block A 206 and block B 204 and a separate core processor instruction memory 200.
• the instruction memory 200 is addressed by a program counter 202. Instructions fetched from the instruction memory 200 pass via path 220 to a DSP instruction decoder 222. The instruction decoder in turn provides addresses for DSP operations, table configurations, etc., to an address register 230. Address register 230 in turn addresses the data memory block A 206. Data from the memory passes via path 240 to load the bit configuration tables etc. 242, which in turn configure the address generator for addressing the data memory block B during the next execution cycle of the DSP execution unit 250.
• Fig. 7A thus illustrates an alternative approach to accessing the data memory A to fetch bit configuration data.
  • a special instruction is fetched from the instruction memory that includes an opcode field that indicates a DSP operation, or more specifically, a DSP configuration operation, and includes address information for fetching the appropriate configuration for the subroutine.
• MDSPC 246 replaces AG 244 and Bit Configuration Table 242. Instructions in core processor instruction memory 200 that correspond to functions to be executed by DSP Execution Unit 250 are replaced with a preselected bit pattern which is not recognized as a valid instruction by the core processor.
  • DSP Instruction Decode 222 decodes the preselected bit patterns and generates an address for DSP operations and address parameters stored in data memory A and also generates a DSP control signal which triggers the DSP process in MDSPC 246.
• DSP Instruction Decode 222 can also be structured to be responsive to output data from data memory A 206 in producing the addresses latched in address register 230.
  • the DSP Instruction Decode 222 may be reduced or eliminated if the DSP process is initiated by an instruction causing a jump to the bit table or microcode in memory block A pertaining to the execution of the DSP process.
  • the present invention includes an architecture that features shared, reconfigurable memory for efficient operation of one or more processors together with one or more functional execution units such as DSP execution units.
• Fig. 6A shows an implementation of a sequence of operations, much like a subroutine.
  • Fig. 6B shows an implementation wherein the DSP function is executed under the control of an MDSPC under microcode control.
  • the invention is illustrated as integrated with a von Neumann microprocessor architecture.
  • Figs. 7 A and. 7B illustrate applications of the present invention in the context of a Harvard- type architecture.
  • the system of Fig. 1 illustrates an alternative stand-alone or coprocessor implementation.
  • Next is a description of how to implement a shared, reconfigurable memory system.
  • Fig. 8 is a conceptual diagram illustrating a reconfigurable memory architecture for DSP according to another aspect of the present invention.
  • a memory or a block of memory includes rows from 0 through Z.
  • a first portion of the memory 266, addresses 0 to X, is associated, for example, with an execution unit (not shown).
• a second (hatched) portion of the memory 280 extends from addresses X+1 to Y.
• a third portion of the memory 262, extending from addresses Y+1 to Z, is associated, for example, with a DMA or I/O channel.
• by "associated" here we mean that a given memory segment can be accessed directly by the designated DMA or execution unit, as further explained herein.
  • the second segment 280 is reconfigurable in that it can be switched so as to form a part of the execution segment 266 or become part of the DMA segment 262 as required.
  • each memory word or row includes data and/or coefficients, as indicated on the right side of the figure.
  • configuration control bits can include write enable, read enable, and other control information. So, for example, when the execution segment 266 is swapped to provide access by the DMA channel, configuration control bits in 266 can be used to couple the DMA channel to the I/O port of segment 266 for data transfer. In this way, a memory access or software trap can be used to reconfigure the system without delay.
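As a hypothetical illustration of such configuration control bits (the actual bit assignments are not specified here, and the constant names are invented), a segment swap between the execution unit and the DMA channel might be modeled as follows:

```python
# Illustrative per-segment configuration control bits (Fig. 8).
WRITE_ENABLE = 0b001   # segment may be written
READ_ENABLE  = 0b010   # segment may be read
DMA_COUPLED  = 0b100   # segment's I/O port is routed to the DMA channel

def swap_to_dma(config_bits):
    """Reconfigure a segment for DMA access without copying any data:
    couple its I/O port to the DMA channel and permit reads and writes."""
    return config_bits | DMA_COUPLED | READ_ENABLE | WRITE_ENABLE

def swap_to_execution(config_bits):
    """Return the segment to the execution unit by uncoupling the DMA channel."""
    return config_bits & ~DMA_COUPLED

cfg = swap_to_dma(0)
```

The point of the model is that "moving" data between the DMA channel and the execution unit is a matter of flipping a few control bits, not of transferring the data itself.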
  • the configuration control bits shown in Fig. 8 are one method of effecting memory reconfiguration that relates to the use of a separate address generator and bit configuration table as shown in Figs. 5 A and 7A. This approach effectively drives an address configuration state machine and requires considerable overhead processing to maintain the configuration control bits in a consistent and current state.
  • the configuration control bits are unnecessary because the MDSPC modifies the configuration of memory algorithmically based upon the microcode executed by the MDSPC. Therefore, the MDSPC of Figs. 5B, 5C and 7B is used.
  • MDSPC maintains the configuration of the memory internally rather than as part of the reconfigured memory words themselves.
• FIGURE 9: Fig. 9 illustrates connection of address and data lines to a memory of the type described in Fig. 8.
  • a DMA or I/O channel address port 102 provides sufficient address lines for accessing both the rows of the DMA block of memory 262, indicated as bus 270, as well as the reconfigurable portion of the memory 280, via additional address lines indicated as bus 272.
• when block 280 is configured as a part of the DMA portion of the memory, the DMA memory effectively occupies the memory space indicated by the brace 290 and the address lines 272 are controlled by the DMA channel 102.
  • Fig. 9 also shows an address generator 104 that addresses the execution block of memory 266 via bus 284. Address generator 104 also provides additional address lines for controlling the reconfigurable block 280 via bus 272.
  • the execution block of memory has a total size indicated by brace 294, while the DMA portion is reduced to the size of block 262.
  • Fig. 9 indicates data access ports 110 and 120.
  • the upper data port 110 is associated with the DMA block of memory, which, as described, is of selectable size.
  • port 120 accesses the execution portion of the memory. Circuitry for selection of input (write) data sources and output (read) data destinations for a block of memory was described earlier. Alternative structures and implementation of multiple reconfigurable memory segments are described below.
  • the entire block need not be switched in toto to one memory block or the other.
  • the reconfigurable block preferably is partitionable so that a selected portion (or all) of the block can be switched to join the upper or lower block.
  • the granularity of this selection is a matter of design choice, at a cost of additional hardware, e.g. sense amps, as the granularity increases, as further explained later.
  • Fig. 10 illustrates a system that implements a reconfigurable segment of memory 280 under bit selection table control.
  • a reconfigurable memory segment 280 receives a source address from either the AG or DMA source address generator 274 and it receives a destination address from either the AG or DMA destination address generator 281.
• Write control logic 270, for example a word-wide multiplexer, selects write input data from either the DMA channel or the execution unit according to a control signal 272.
• the source address generator 274 includes configuration control circuitry 276.
• the configuration control circuitry 276, either driven by a bit table or under microcode control, generates the write select signal 272.
• the configuration control circuitry also determines which source and destination address lines are coupled to the memory: either the "AG" (address generator) lines, when block 280 is configured as part of an "AG" memory block for access by the execution unit, or the "DMA" address lines, when block 280 is configured as part of the DMA or I/O channel memory block.
• the configuration control logic provides enable and/or clock controls to the execution unit 282 and to the DMA channel 284 for controlling which destination receives read data from the memory data output port 290.
  • Fig. 11 is a partial block/partial schematic diagram illustrating the use of a single ported RAM in a DSP computing system according to the present invention.
  • a single-ported RAM 300 includes a column of memory cells 302, 304, etc. Only a few cells of the array are shown for clarity.
  • a source address generator 310 and destination address generator 312 are arranged for addressing the memory 300. More specifically, the address generators are arranged to assert a selected one address line at a time to a logic high state.
  • the term "address generator” in this context is not limited to a conventional DSP address generator. It could be implemented in various ways, including a microprocessor core, microcontroller, programmable sequencer, etc. Address generation can be provided by a micro-coded machine. Other implementations that provide DSP type of addressing are deemed equivalents. However, known address generators do not provide control and configuration functions such as those illustrated in Fig. 10 — configuration bits 330.
• a multiplexer 320 selects data either from the DMA or from the execution unit, according to a control signal 322 responsive to the configuration bits in the source address generator 310.
  • the selected data is applied through drivers 326 to the corresponding column of the memory array 300 (only one column, i.e. one pair of bit lines, is shown in the drawing).
• the bit lines also are coupled to a sense amplifier 324, which in turn provides output or read data to the execution unit.
  • the execution unit 326 is enabled by an execution enable control signal responsive to the configuration bits 330 in the destination address block 312. Configuration bits 330 also provide a DMA control enable signal at 332.
  • the key here is to eliminate the need for a two-ported RAM cell by using a logical OR of the last addresses from the destination and source registers (located in the corresponding destination or source address generators). Source and destination operations are not simultaneous, but operation is still fast. A source write cycle followed by a destination read cycle would take only a total time of two memory cycles.
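A behavioral sketch of this single-ported arrangement, with invented names, may clarify the timing claim. Only one address generator drives the decoder per cycle, so the logical OR of the two one-hot address sources always selects exactly one word line; a source write followed by a destination read takes two memory cycles in total.

```python
class SinglePortedRAM:
    """One address decoder shared by the source and destination address
    generators. The asserted word line is the logical OR of the two address
    sources; since accesses are not simultaneous, exactly one side drives
    an address in any given cycle."""
    def __init__(self, words):
        self.cells = [0] * words
        self.cycles = 0

    def _select(self, src_addr, dst_addr):
        # Exactly one generator may assert an address per cycle.
        assert (src_addr is None) != (dst_addr is None), "not simultaneous"
        self.cycles += 1
        return src_addr if src_addr is not None else dst_addr

    def source_write(self, addr, value):
        self.cells[self._select(addr, None)] = value

    def destination_read(self, addr):
        return self.cells[self._select(None, addr)]


ram = SinglePortedRAM(64)
ram.source_write(5, 99)          # cycle 1: source write
value = ram.destination_read(5)  # cycle 2: destination read
```

The model confirms the two-cycle cost of a write/read pair while using only a single port, which is the stated motivation for avoiding a two-ported RAM cell.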
  • Fig. 12 illustrates a first segment of memory 400 and a second memory segment 460.
  • first segment 400 only a few rows and a few cells are shown for purposes of illustration.
  • One row of the memory begins at cell 402, a second row of the memory begins at cell 404, etc. Only a single bit line pair, 410, is shown for illustration.
  • a first write select circuit such as a multiplexer 406 is provided for selecting a source of write input data.
• one input to the select circuit 406 may be coupled to a DMA channel or memory block M1.
  • a second input to the MUX 406 may be coupled to an execution unit or another memory block M2.
• We use the designations M1, M2, etc., to refer generically not only to other blocks of memory, but also to execution units or other functional parts of a DSP system in general.
  • the multiplexer 406 couples a selected input source to the bit lines in the memory segment 400.
• the select circuit couples all of the bit lines, for example 64 or 128, into the memory. Preferably, the select circuit provides the same number of bits as the word size.
  • the bit lines for example bit line pair 410, extend through the memory array segment to a second write select circuit 420. This circuit selects the input source to the second memory segment 460. If the select circuit 420 selects the bit lines from memory segment 400, the result is that memory segment 400 and the second memory segment 460 are effectively coupled together to form a single block of memory. Alternatively, the second select circuit 420 can select write data via path 422 from an alternative input source.
• a source select circuit 426, for example a similar multiplexer circuit, can be used to select this input from various other sources, indicated as M2 and M1.
  • memory segment 460 is effectively isolated from the first memory segment 400.
  • the bit lines of memory segment 400 are directed via path 430 to sense amps 440 for reading data out of the memory segment 400.
• sense amps 440 can be set to a disabled or low-power standby state, since they need not be used.
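The joining and isolating of segments through the write select circuits can be modeled abstractly. The following sketch (names invented) computes the resulting block sizes from per-boundary join controls: a True flag means the select circuit passes the previous segment's bit lines through, coupling the two segments into one block.

```python
class Segment:
    """A memory segment with a fixed number of rows."""
    def __init__(self, rows):
        self.rows = rows


def resolve_blocks(segments, joined):
    """Given per-boundary 'joined' flags (joined[i] is True when the write
    select circuit couples segment i's bit lines into segment i+1), return
    the sizes of the resulting logical memory blocks."""
    blocks, current = [], segments[0].rows
    for seg, j in zip(segments[1:], joined):
        if j:
            current += seg.rows    # bit lines pass through: one larger block
        else:
            blocks.append(current) # select circuit picks the alternate source
            current = seg.rows
    blocks.append(current)
    return blocks


# Two 256-row segments joined, then a 512-row segment kept separate.
sizes = resolve_blocks([Segment(256), Segment(256), Segment(512)],
                       joined=[True, False])
```

This reflects the description above: selecting the bit lines from segment 400 merges it with segment 460 into a single block, while selecting the alternate input isolates the segments.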
  • Fig. 13 shows detail of the input selection logic for interfacing multiple memory segments.
  • the first memory segment bit line pair 410 is coupled to the next memory segment 460, or conversely isolated from it, under control of pass devices 466.
• the input select logic 426 includes a first pair of pass transistors 462 for connecting bit lines from source M1 to bit line drivers 470.
• a second pair of pass transistors 464 controllably couples an alternative input source M2 bit lines to drivers 470.
• the pass devices 462, 464, and 466 are all controllable by control bits originating, for example, in a bit configuration table or in the MDSPC.
• Fig. 14 is a high-level block diagram illustrating extension of the architectures of Figs. 12 and 13 to a plurality of memory segments. Details of the selection logic and sense amps are omitted from this drawing for clarity. In general, this drawing illustrates how any available input source can be directed to any segment of the memory under control of the configuration bits.
  • Fig. 15 is another block diagram illustrating a plurality of configurable memory segments with selectable input sources, as in Fig. 14.
  • multiple sense amps 482, 484, 486, are coupled to a common data output latch 480.
• when two segments are configured as a combined block, sense amp 484 provides read bits from that combined block, and sense amp 482 can be idle.
  • Figs. 16A through 16D are block diagrams illustrating various configurations of multiple, reconfigurable blocks of memory.
• the designations M1, M2, M3, etc. refer generically to other blocks of memory, execution units, I/O channels, etc.
• in Fig. 16A, four segments of memory are coupled together to form a single, large block associated with input source M1.
• a single sense amp 500 can be used to read data from this common block of memory (to a destination associated with M1).
• the first block of memory is associated with resource M1, and its output is provided through sense amp 502.
• the other three blocks of memory, designated M2, are configured together to form a single block of memory, three segments long, associated with resource M2.
  • sense amp 508 provides output from the common block (3xM2), while sense amps 504 and 506 can be idle.
• Figs. 16C and 16D provide additional examples that are self-explanatory in view of the foregoing description. This illustration is not intended to imply that all memory segments are of equal size. To the contrary, they can have various sizes, as explained elsewhere herein.
  • Fig. 17 is a high-level block diagram illustrating a DSP system according to the present invention in which multiple memory blocks are interfaced to multiple execution units so as to optimize performance of the system by reconfiguring it as necessary to execute a given task.
• a first block of memory M1 provides read data via path 530 to a first execution unit ("EXEC A") and via path 532 to a second execution unit ("EXEC B").
  • Execution unit A outputs results via path 534 which in turn is provided both to a first multiplexer or select circuit MUX-1 and to a second select circuit MUX-2.
• MUX-1 provides selected write data into memory M1.
  • a second segment of memory M2 provides read data via path 542 to execution unit A and via path 540 to execution unit B.
  • Output data or results from execution unit B are provided via path 544 to both MUX-1 and to MUX-2.
  • MUX-2 provides selected write data into the memory block M2. In this way, data can be read from either memory block into either execution unit, and results can be written from either execution unit into either memory block.
• a first source address generator S1 provides source addressing to memory block M1.
• Source address generator S1 also includes a selection table for determining read/write configurations. Thus, S1 provides a "Select A" control bit to MUX-1 in order to select execution unit A as the data source for writing into memory M1.
  • SI also provides a "Select A" control bit to MUX-2 in order to select execution unit A as the data source for writing into memory M2.
• a destination address generator D1 provides destination addressing to memory block M1.
• D1 also includes selection tables which provide a "Read 1" control signal to execution unit A and a second "Read 1" control signal to execution unit B. By asserting a selected one of these control signals, the selection bits in D1 direct a selected one of the execution units to read data from memory M1.
• a second source address generator S2 provides source addressing to memory segment M2. Address generator S2 also provides a "Select B" control bit to MUX-1 in order to select execution unit B as the data source for writing into memory M1.
• a second destination address generator D2 provides destination addressing to memory block M2 via path 560. Address generator D2 also provides control bits for configuring the system: D2 provides a "Read 2" signal to execution unit A via path 562 and a "Read 2" signal to execution unit B via path 564 for selectively causing the corresponding execution unit to read data from memory block M2.
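The read/write cross-connection of Fig. 17 can be modeled as a simple routing function: either execution unit may read from either memory, and either unit's result may be written to either memory through MUX-1 or MUX-2. This is an illustrative sketch with invented names, not the described hardware.

```python
def route(read_from, write_to, memories, f):
    """Model one transfer through the Fig. 17 interconnect: read an operand
    from one memory block, apply the execution unit's function f, and write
    the result into the selected memory block via the appropriate MUX."""
    operand = memories[read_from]
    result = f(operand)
    memories[write_to] = result
    return result


mems = {"M1": 3, "M2": 10}
route("M1", "M2", mems, lambda x: x * 2)   # EXEC A: read M1, write M2
route("M2", "M1", mems, lambda x: x + 1)   # EXEC B: read M2, write M1
```

Any read source can thus be paired with any write destination, which is the reconfigurability the select circuits MUX-1 and MUX-2 provide.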
  • Fig. 18A illustrates at a high level the parallelism of memory and execution units that becomes available utilizing the reconfigurable architecture described herein.
  • a memory block comprising for example 1,000 rows, may have, say, 256 bits and therefore 256 outputs from respective sense amplifiers, although the word size is not critical. 64 bits may be input to each of four parallel execution units El - E4.
  • the memory block thus is configured into four segments, each segment associated with a respective one of the execution units, as illustrated in Fig. 18B. As suggested in the figure, these memory segments need not be of equal size.
• Fig. 18C shows a further segmentation, and reconfiguration, so that a portion of segment M2 is joined with segment M1 so as to form a block of memory associated with execution unit E1.
• a portion of memory segment M3, designated "M3/2", is joined together with the remainder of segment M2, designated "M2/2", to form a memory block associated with execution unit E2, and so on.
  • Segmentation of the memory may be designed to permit reconfigurability down to the granularity of words or bits if necessary.
• Fig. 19 illustrates an alternative embodiment in which the read bit lines from multiple memory segments, e.g., read bit lines 604, are directed to a multiplexer circuit 606, or its equivalent, which in turn has an output coupled to a shared or common set of sense amps 610.
  • Sense amps 610 in turn provide output to a data output latch 612, I/O bus or the like.
• the multiplexer or selection circuitry 606 is responsive to control signals (not shown) which select which memory segment output is "tapped" to the sense amps. This architecture reduces the number of sense amps in exchange for the addition of selection circuitry 606.
• Fig. 20 is a block diagram illustrating a memory system of multiple configurable memory segments having multiple sense amps for each segment. This alternative can be used to improve speed of "swapping" read data paths and reduce interconnect overhead in some applications.
• RAMDSP Design for MPEG-2 Encode/Decode: Introduction. We use "RAMDSP" as shorthand for embedded DRAM solutions tailored for digital signal processing applications. Full-frame video compression and decompression requires massive amounts of computational power combined with special operations to support efficient use of digital processing engines. Successful single-chip designs (without embedded DRAM), including the Chromatic Research Mpact2 and C-Cube DVx, are in the 3 to 6 million transistor range. The RAMDSP design includes the necessary 8 MB DRAM in the same package and still uses fewer than 2 million logic transistors.
  • the RAMDSP solution combines a standard Execution Unit (EU) and three specialized EUs in a multiprocessor arrangement with four separate memories to support MIMD style multiprocessing.
  • the standard EU serves to coordinate activities and is sufficiently powerful to handle the complete decode operation.
• the specialized EUs are designed to handle the especially compute-intensive motion-estimation phase of the MPEG-2 encoder, as well as general signal processing kernels.
• the MPEG-2 encode requirements are much more demanding than decode requirements.
  • Each frame must be transformed using the Discrete Cosine Transform (DCT) and reference frames (R-frame) must be inverse transformed to give accurate error estimates for encoding intermediate frames (I-frame).
• Specialized MPEG-2 encoder chips often implement highly parallel Motion Estimation (ME) engines, which can perform at rates of 6 to 12 Billion Ops Per Second (Bops).
  • the RAMDSP architecture has been enhanced to directly compute absolute differences in highly parallel SIMD instructions (described below).
  • FIG. 23 shows an illustrative RAMDSP MPEG-2 configuration.
• One general RAMDSP block, with 2 MB DRAM, serves as the supervisor and does general Code/Decode ("Codec") tasks.
  • the other three RAMDSP blocks are specialized for Motion Estimation ("ME Unit"), but can also serve other compute intensive functions.
  • An internal 64-bit bus connects the separate engine/memory units. This bus is also interfaced to the outside world through a custom Bus Interface Unit (BIU).
  • the BIU can include high-speed SRAM buffering and certain specialized functions such as table look-up for bit serial encode/decode.
  • the general purpose RAMDSP engine has standard 64-bit data paths from the register file through the two ALUs.
• the specialized ME units have 128-bit data paths. This allows twice the throughput of computational operations, without increasing the code density or complexity. Further specialization includes a reduced need for local SRAM and microcode store due to the simplified processing requirements of the ME kernels.
  • the RAMDSP design preferably includes several extensions to the Instruction Set Architecture (ISA) that support MPEG-2 algorithms.
  • An absolute difference instruction (PABDIF) allows computing up to 16 absolute differences for 8-bit data in parallel on the standard engine and up to 32 absolute differences on the specialized ME engine, all in a single clock period. These differences can then be accumulated in 16-bit precision in two additional clock periods.
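A behavioral model of the PABDIF operation, written here in Python rather than hardware lanes, illustrates the computation. The function names are invented; the lane count (16 lanes standard, 32 on the ME engine) and 16-bit accumulation follow the description above.

```python
def pabdif(a, b):
    """Parallel absolute difference: models one PABDIF 'instruction' applied
    across packed 8-bit lanes (16 lanes on the standard engine, 32 lanes on
    the specialized ME engine), all notionally in one clock period."""
    return [abs(x - y) for x, y in zip(a, b)]


def accumulate16(diffs, acc=0):
    """Accumulate the per-lane differences in 16-bit precision (modeled by
    wrapping the running sum to 16 bits)."""
    return (acc + sum(diffs)) & 0xFFFF


# Sum of absolute differences (SAD) between a reference block and a candidate
# block, the core operation of motion estimation.
ref  = [10, 200, 30, 40]
cand = [12, 190, 30, 45]
sad = accumulate16(pabdif(ref, cand))
```

The SAD here is 2 + 10 + 0 + 5 = 17; in hardware, the same result would cost one PABDIF clock plus two accumulation clocks rather than a loop.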
• Other, more traditional ISAs, e.g., Intel's MMX technology, do not provide such a parallel absolute difference operation directly. The extensions also include a parallel add, round and shift (PARS) instruction.
• RAMDSP Ring Topology: There are numerous ways of arranging and connecting multiple RAMDSP engines, a RISC controller, and one or more Bus Interface Units on a single chip. Many of these methods involve the use of a bus to move data between functional units. The disadvantage of using a bus to transfer data is that it requires both time and power. Another method to transfer data between functional units would be to merely switch the memory access configuration such that a different processor has access to the data. Note that this is not really transferring the data at all, but just transferring the right to access it. However, if every processor could access every memory segment, then it does not take very many processors and memory segments before the interconnect would become unwieldy. Thus, it would be of great benefit to have a design that allows data to be "transferred" by switching memory access and, at the same time, keeps the interconnect simple. Hence the ring topology described below.
• The example of FIG. 24 uses four processors and four memory blocks; this number is arbitrary and can be scaled.
  • Each memory block consists of, e.g. eight single-ported segments such that multiple segments can be accessed simultaneously in parallel.
  • a configuration register Through the use of a configuration register, multiplexers, and demultiplexers, several units can access each segment as described above in the description of reconfigurable memory blocks and segments.
  • Figure 24 shows four bi-directional connections to each memory block: one to each of its two neighboring processors, one to a RISC controller in the center of the ring, and one pointing away from the ring (labeled N, E, S, and W).
• the four paths pointing away from the ring could either all go to the Bus Interface Unit (BIU) or connect to separate off-chip interfaces.
• each segment of each memory block need not be able to be switched to all four connections at each block, although that would be desirable. Each segment need only be attachable to two units, so long as all six possible pairs are covered. If the processor in the clockwise direction is denoted CW, the counter-clockwise processor CCW, the RISC controller RISC, and off-board OFF, then the six combinations are: 1. RISC-OFF, 2. CW-CCW, 3. CW-RISC, 4. CW-OFF, 5. CCW-RISC, 6. CCW-OFF.
  • Another configuration that should be considered is to rotate the processors 45 degrees counter clockwise on the ring such that a processor and a memory block are integrated (just as in a normal single engine RAMDSP).
  • one connection to each memory segment is always to its own engine, which will be denoted as LOCAL.
  • the connections might be:
• Each processor in Figure 24 preferably is a complete RAMDSP engine less the DMEM, which is shown separately. Therefore, each processor has its own program in its own program store. Thus, with multiple engines on the chip, the architecture is a distributed-memory MIMD. This allows tremendous flexibility in how problems can be partitioned. Two simple examples follow:
• For a pipelined algorithm, stages are assigned to processors in a manner to balance the computational load. For simplicity, assume that there are the same number of stages as there are processors and each stage requires the same amount of computation. In this case, one stage is assigned to each processor and the stages are assigned clockwise around the ring. As the program runs, each processor reads its input data into its EU from the counter-clockwise memory, processes it, and writes out the results to the clockwise memory. Thus, all the processors work in parallel on different stages of the algorithm, with the data flowing around the ring without needing any data transfers over a bus.
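The pipelined data flow around the ring can be simulated in a few lines. This sketch uses invented names and treats each memory block as holding a single value; it models one step in which every processor consumes from its counter-clockwise memory and produces into its clockwise memory, with no bus transfer, only a change of access rights.

```python
def ring_pipeline_step(memories, stages):
    """One step of the pipelined algorithm on the ring: processor i reads its
    input from counter-clockwise memory i, applies its stage function, and
    writes the result to clockwise memory (i + 1) mod N."""
    n = len(memories)
    # All processors compute in parallel from the pre-step memory contents.
    outputs = [stages[i](memories[i]) for i in range(n)]
    for i in range(n):
        memories[(i + 1) % n] = outputs[i]
    return memories


# Four processors, each running one (toy) stage of the algorithm.
mems = [1, 0, 0, 0]
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * 10]
ring_pipeline_step(mems, stages)
```

Repeating the step moves each datum one position clockwise per step, so after N steps an input has passed through all N stages.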
• Geometry processing is responsible for converting a stored scene into a form ready to be rendered onto the screen; processing takes place on polygons that approximate the surfaces of the scene. This involves transformation of the scene data to reflect the current user viewpoint, removal of hidden surfaces and objects lying outside the field of view, and scaling objects so as to provide realistic perspective in a two-dimensional rendering.
  • Geometry processing relies heavily on floating-point operations, and as such is typically performed on the PC itself (due to the superior floating-point performance of Pentium-class processors).
  • the performance of geometry processing is chiefly a function of the number of polygons. Additionally, most operations can be made parallel.
  • DXF (Drawing Exchange Format), used for CAD drawings, is a common example. While it is possible in principle to process polygons with any number of edges, virtually all schemes exploit triangles to minimize the mathematical complexity of geometry processing operations.
  • the starting point for geometry processing is generally a list of triangles making up the scene, each triangle being defined by the 3D coordinates of its vertices (in practice, the list of vertices takes the form of a hierarchical tree, to avoid the redundancy of storing and processing shared vertices).
  • Associated with each triangle is its color information; in some cases, precomputed surface normal vectors to the triangles may also be stored in the file format.
  • the first step is to transform it into a coordinate space reflecting the desired view (this is commonly referred to as "camera coordinates").
  • There are three basic operations used in transformation: translation, scaling, and rotation.
  • Translation simply involves shifting the position of the scene within the coordinate space; it is accomplished by adding the appropriate offset to each coordinate of each vertex.
  • Scaling involves changing the size of the scene within the coordinate space to reflect different viewing distances; it is accomplished by multiplying each coordinate of each vertex by the appropriate scaling factor.
  • rotation of the scene allows any arbitrary viewpoint to be realized; this requires computation or table-lookup of the sine and cosine of the three angles that the desired viewing direction makes with the stored coordinate axes, in addition to both addition and multiplication. Since vertices are independent of each other from a transformation standpoint, each operation can be made SIMD parallel on different vertices insofar as hardware allows; similarly, the three operations can be made MIMD parallel (that is to say, pipelined).
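An illustrative sketch of the three per-vertex transformation operations described above (function names and the sample vertices are ours, not the patent's; a full viewing rotation would compose rotations about all three axes):

```python
import math

def translate(v, offset):
    # shift the vertex by adding an offset to each coordinate
    return tuple(c + o for c, o in zip(v, offset))

def scale(v, factor):
    # resize by multiplying each coordinate by a scaling factor
    return tuple(c * factor for c in v)

def rotate_z(v, angle):
    # rotation about the z axis; the full viewing rotation composes
    # three such rotations, one per coordinate axis
    s, c = math.sin(angle), math.cos(angle)
    x, y, z = v
    return (x * c - y * s, x * s + y * c, z)

# Each vertex is independent, so a SIMD implementation can apply the
# same operation to many vertices at once:
vertices = [(1.0, 0.0, 2.0), (0.0, 1.0, 2.0)]
camera = [rotate_z(scale(translate(v, (1, 1, 0)), 0.5), math.pi / 2)
          for v in vertices]
```

The list comprehension is the pipelined (MIMD) view of the same computation: translate, then scale, then rotate, vertex by vertex.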
  • Perspective projection provides 3D depth. This involves multiplying each vertex's x and y coordinates by the viewing distance, and subsequently dividing by its z coordinate.
  • the exponential falloff of illumination with distance is generally ignored, since the effects on the scene are negligible.
  • perspective projection can be made SIMD parallel insofar as hardware allows.
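A minimal sketch of the projection step described above (function and variable names are ours, not the patent's):

```python
def project(vertex, viewing_distance):
    # multiply x and y by the viewing distance, then divide by z
    x, y, z = vertex
    return (x * viewing_distance / z, y * viewing_distance / z)

# A nearer and a farther vertex with the same x, y land at different
# screen positions, which is what produces the depth effect:
near = project((1.0, 1.0, 2.0), 2.0)
far = project((1.0, 1.0, 4.0), 2.0)
```

Since each vertex is projected independently, the same function can be applied SIMD-fashion across a whole vertex list.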
  • Object space clipping is the process of discarding those triangles that fall entirely outside the viewing window of the scene as it has been transformed. While it is not strictly necessary to perform this during geometry processing (since during final rendering and rasterization pixels lying outside the field of view will in any case not be projected on the screen; this is known as image-space clipping), it is generally done to reduce the number of triangles that the rendering stage will have to process.
  • Object space clipping involves several multiplications, divides, and comparisons for each vertex; triangles are discarded if they fall outside a six-sided viewing frustum. Triangles that are partially within the viewing frustum must instead be clipped against the frustum boundaries.
  • Backface culling is the process of discarding those triangles that will be invisible from the current viewpoint, since they lie on the far side of objects in the scene. As with object space clipping, backface culling is performed during geometry processing to reduce the number of triangles that the rendering stage will have to process. Backface culling involves computing the dot product of the surface normal of a triangle face and the view direction vector; if the resulting dot product is less than zero (hence the angle between the vectors is obtuse), the triangle face is invisible from the current viewpoint. Backface culling can be made SIMD parallel on a triangle-by-triangle basis.
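A hedged sketch of the backface test described above. We assume the view vector points from the surface toward the viewer, so a negative dot product means the angle is obtuse and the face points away (names and the sample vectors are ours):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def is_backface(normal, to_viewer):
    # dot < 0: obtuse angle between surface normal and direction to the
    # viewer, so the face points away and can be culled
    return dot(normal, to_viewer) < 0

to_viewer = (0.0, 0.0, 1.0)
culled = is_backface((0.0, 0.0, -1.0), to_viewer)       # faces away
visible = not is_backface((0.0, 0.0, 1.0), to_viewer)   # faces the viewer
```

Each triangle is tested independently of all others, which is why the text notes the operation is SIMD parallel triangle by triangle.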
  • Rendering is the process of converting the polygon representation of the scene to a pixel image, ready to be displayed on the screen. Rendering chiefly involves integer operations on pixels, in contrast with the floating-point operations performed during geometry processing, and is in general far more memory-intensive than geometry processing. The bulk of rendering takes place on the graphics accelerator board in most modern PC systems. In addition to the conversion from triangles to pixels, several other operations are most conveniently performed at rendering time; these are described below. As in the geometry processing phase, there are numerous opportunities to exploit both SIMD and MIMD parallelism during rendering.
  • Shading: Associated with each triangle is a single color, selected based on the color of the stored triangle in conjunction with any lighting effects applied during geometry processing. Simply rasterizing a scene based on such triangles results in a faceted appearance.
  • Shading, in this context, is the process of generating non-uniform color across triangles so as to produce a smoother, more realistic appearance.
  • the most commonly used algorithm for animated graphics is Gouraud shading; another algorithm, Phong shading, yields better results but is computationally too costly to be used in animation.
  • Gouraud shading involves first computing the surface normals at all vertices, by averaging the surface normals of the triangles meeting at any given vertex. Then the dot product of each vertex's normal and the light source is computed, resulting in an intensity value for each vertex. Finally, the intensities of the pixels within triangles are computed by interpolating between the values at the vertices. Shading lends itself to both SIMD and MIMD parallelism, since the vertex normals can be computed in parallel and then passed down the pipeline to the interpolation phase, given appropriate hardware support.
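The Gouraud steps above can be sketched as three small helpers: average face normals into a vertex normal, light each vertex, then interpolate intensity across the triangle (a simplified sketch; names and the clamping choice are ours):

```python
def normalize(v):
    n = sum(c * c for c in v) ** 0.5
    return tuple(c / n for c in v)

def vertex_normal(face_normals):
    # average the surface normals of the faces meeting at this vertex
    summed = [sum(components) for components in zip(*face_normals)]
    return normalize(summed)

def vertex_intensity(normal, light_dir):
    # dot product of the vertex normal and the light direction
    d = sum(a * b for a, b in zip(normal, light_dir))
    return max(d, 0.0)  # clamp vertices turned away from the light

def interpolate(i0, i1, t):
    # linear interpolation of intensity along an edge or scanline
    return i0 + (i1 - i0) * t
```

The first two steps are per-vertex and can proceed in SIMD fashion; their results can then be passed down the pipeline to the per-pixel interpolation, which is the MIMD opportunity the text mentions.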
  • Z-Buffering is a technique for foolproof hidden surface removal.
  • the hidden surfaces that Z-buffering addresses are those that face the viewing direction, but are obscured by other surfaces closer to the viewer.
  • Z-buffering is generally performed at the time that triangles are converted to pixels.
  • a Z-buffer is conceptually a 2D buffer with the same number of elements as the screen display; the element size is typically 16 or 32 bits.
  • Each x-y element of the buffer is initialized to the maximum z value (depth) of the scene. Then, as each triangle is processed, the z values of the points within the triangle are compared to the contents of the Z-buffer at the appropriate x-y location; if the new value is smaller (closer to the viewer), the pixel is drawn and the buffer entry is updated.
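A minimal sketch of that Z-buffer update (buffer sizes, the sentinel depth, and the colors are hypothetical):

```python
W, H, MAX_Z = 4, 4, float("inf")
zbuf = [[MAX_Z] * W for _ in range(H)]    # initialized to maximum depth
framebuf = [[0] * W for _ in range(H)]

def plot(x, y, z, color):
    # draw the pixel only if it is closer than what the buffer holds
    if z < zbuf[y][x]:
        zbuf[y][x] = z
        framebuf[y][x] = color

plot(1, 1, 5.0, color=7)   # first surface at depth 5 is drawn
plot(1, 1, 2.0, color=9)   # a closer surface overwrites it
plot(1, 1, 8.0, color=3)   # a farther surface is discarded
```

Whatever order the triangles arrive in, each pixel ends up showing the closest surface, which is why the technique is described as foolproof.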
  • Texture mapping is a technique allowing an arbitrary amount of detail to be incorporated into a scene, while still keeping the triangle count reasonable. It also allows certain special effects that would be difficult to achieve with higher triangle density. Texture mapping involves the application of pre- generated patterns or pictures to triangle faces in the scene; a classic example is the application of a brick pattern to a polygon, although there is no requirement that the applied picture be regular like brickwork.
  • Texture maps are stored as pixel bitmaps; their elements are referred to as texels.
  • In applying a texture map to a given triangle in a scene, one or more techniques are typically used to prevent aliasing artifacts.
  • the texture map must be perspective-corrected to conform to the orientation of the triangle surface in 3D space.
  • bilinear filtering may be used to reduce the blockiness that results from adjacent display pixels being determined by a single texel; bilinear filtering simply uses a texel averaging scheme in which the four orthogonally adjacent texels also contribute to the pixel's value.
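A sketch of the averaging scheme described above: the sampled texel plus its four orthogonally adjacent texels contribute equally (a simplification of full bilinear weighting; the wrap-at-edges policy and names are our own assumptions):

```python
def filtered_texel(texmap, x, y):
    h, w = len(texmap), len(texmap[0])
    # the texel itself plus its four orthogonal neighbours
    neighbours = [(x, y), (x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    vals = [texmap[j % h][i % w] for i, j in neighbours]  # wrap at edges
    return sum(vals) / len(vals)

tex = [[0, 0, 0],
       [0, 10, 0],
       [0, 0, 0]]
```

The isolated bright texel is spread over its neighbourhood, which is exactly the smoothing that reduces blockiness when one texel would otherwise determine a whole display pixel.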
  • MIP-mapping uses pre-computed texture maps of various resolutions in lieu of bilinear filtering; the texture map applied to a given triangle is then chosen based on the triangle's size.
  • Trilinear filtering is generally employed along with MIP-mapping to smooth out textural discontinuities between adjacent triangles.
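A hedged sketch of how a MIP level might be chosen: pick the pre-scaled map whose resolution best matches the triangle's on-screen size. The footprint heuristic and the clamping are our own simplifications, not the patent's scheme:

```python
import math

def mip_level(texel_footprint, num_levels):
    """texel_footprint: roughly how many texels cover one screen pixel.
    Larger footprints (smaller on-screen triangles) select coarser maps."""
    level = int(math.log2(max(texel_footprint, 1.0)))
    return min(max(level, 0), num_levels - 1)
```

With 8 stored levels, a triangle viewed at full texture resolution gets level 0, while a very distant triangle falls back to the coarsest map.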
  • Texture mapping is memory-intensive, from both size and bandwidth standpoints.
  • In the MIP-mapping approach, a number of differently-scaled maps (typically on the order of 6 or 8) must be stored for each triangle in the scene.
  • AGP is a new bus design, based on the PCI standard, aimed at providing high bandwidth and low latency in a dedicated graphics bus, along with a new operating mode aimed specifically at reducing the texture-mapping bottleneck. It is just now finding its way onto mainstream motherboards. Ultimately, it will offer up to 266 MHz performance, although current implementations run at 66 MHz (twice the speed of the PCI bus).
  • the bus offers two operating modes: DMA and Execution.
  • DMA mode is essentially just a faster, dedicated PCI bus; it is aimed at allowing higher-speed bulk transfers of data back and forth between the host PC RAM and the accelerator board's RAM.
  • Execution mode is specifically designed to accommodate the multiple random accesses to host RAM that are typical of current texture mapping approaches.
  • the accelerator board maps an area of the host PC's RAM for dedicated usage; it can then do small transfers to and from the host via direct-addressed transfer.
  • As in DMA mode, the faster speed of the bus is really the advantage offered by AGP; the ability to direct-address the host's RAM is chiefly a programming convenience.
  • the MMX extensions to the Pentium instruction set allow an extra level of SIMD parallelism to be applied to typical graphics operations (particularly those of geometry processing).
  • the MMX architecture overloads the Pentium's floating-point registers with a second mode, in which 8-, 16-, and 32-bit parallel arithmetic operations can be performed.
  • Benchmarks: There are no industry-standard chip-level benchmarks for 3D graphics accelerator chips, because graphics performance is dependent on many things besides raw chip performance: the host PC, the accelerator board logic, the SVGA subsystem, and so forth. There are, however, a number of generally recognized benchmarks at the integrated-system level that primarily reflect the performance of the accelerator board (other things being more or less equal in the overall system configuration).
  • Two key benchmarks are: the Ziff-Davis 3D Winbench, championed by PC Magazine; and Wizmark, pushed by 3Dfx Interactive (the manufacturer of the Voodoo graphics accelerator chip). The former is important because it is vendor- independent, the latter because any new graphics accelerator approach will have to compare itself to the forthcoming Voodoo2-based accelerator boards, widely advocated as the state of the art in 1998 (though such boards are not scheduled to ship until March).
  • a suitably architected RAMDSP chip has the potential to be a uniquely powerful graphics accelerator, for two key reasons: (1) most or all of the 3D graphics pipeline could be implemented on-chip, with consequent savings in host PC cycles and memory (both size and bandwidth/latency); and (2) due to the on-chip DRAM and multi-level parallelism of RAMDSP, several of the bottleneck operations in the 3D pipeline (notably texture mapping) could be handled much more efficiently than in current competing products in the PC marketplace.
  • An additional attractive feature of RAMDSP technology is its low power consumption, making it suitable for powerful 3D graphics on laptops and even PDAs; however, it is unclear at this writing whether the demand for animated 3D is strong on such platforms.
  • the chip must contain multiple EU's and associated DRAM's.
  • the fourfold EU/DRAM ring configuration (with a central RISC executive processor) that has been discussed would likely be adequate to allow the entire pipeline to be chip-resident.

Abstract

The present invention concerns an embedded dynamic random access memory (DRAM) architecture that is specially adapted to graphics processing. The architecture comprises multiple processing engines and memory blocks arranged in a ring topology. The processor engines comprise a conventional execution unit and specialized execution units coupled to reconfigurable memory blocks in support of multiple-instruction, multiple-data (MIMD) style multiprocessing. In addition, ISA-type architecture extensions enable dramatically improved MPEG-2 performance, MPEG-2 encoding, and other digital signal processor (DSP) applications.
PCT/US1999/007771 1998-04-08 1999-04-08 Architecture for graphics processing WO1999052040A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU36386/99A AU3638699A (en) 1998-04-08 1999-04-08 Architecture for graphics processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US8126698P 1998-04-08 1998-04-08
US60/081,266 1998-04-08

Publications (1)

Publication Number Publication Date
WO1999052040A1 true WO1999052040A1 (fr) 1999-10-14

Family

ID=22163106

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/007771 WO1999052040A1 (fr) 1998-04-08 1999-04-08 Architecture pour traitement graphique

Country Status (2)

Country Link
AU (1) AU3638699A (fr)
WO (1) WO1999052040A1 (fr)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6198488B1 (en) 1999-12-06 2001-03-06 Nvidia Transform, lighting and rasterization system embodied on a single semiconductor platform
US6353439B1 (en) 1999-12-06 2002-03-05 Nvidia Corporation System, method and computer program product for a blending operation in a transform module of a computer graphics pipeline
US6417851B1 (en) 1999-12-06 2002-07-09 Nvidia Corporation Method and apparatus for lighting module in a graphics processor
US6452595B1 (en) 1999-12-06 2002-09-17 Nvidia Corporation Integrated graphics processing unit with antialiasing
US6504542B1 (en) 1999-12-06 2003-01-07 Nvidia Corporation Method, apparatus and article of manufacture for area rasterization using sense points
US6515671B1 (en) 1999-12-06 2003-02-04 Nvidia Corporation Method, apparatus and article of manufacture for a vertex attribute buffer in a graphics processor
US6573900B1 (en) 1999-12-06 2003-06-03 Nvidia Corporation Method, apparatus and article of manufacture for a sequencer in a transform/lighting module capable of processing multiple independent execution threads
US6593923B1 (en) 2000-05-31 2003-07-15 Nvidia Corporation System, method and article of manufacture for shadow mapping
US6597356B1 (en) 2000-08-31 2003-07-22 Nvidia Corporation Integrated tessellator in a graphics processing unit
US6650325B1 (en) 1999-12-06 2003-11-18 Nvidia Corporation Method, apparatus and article of manufacture for boustrophedonic rasterization
EP1364298A2 (fr) * 2000-11-12 2003-11-26 Bitboys, Inc. Three-dimensional rendering engine with embedded memory
US6697064B1 (en) 2001-06-08 2004-02-24 Nvidia Corporation System, method and computer program product for matrix tracking during vertex processing in a graphics pipeline
US6765575B1 (en) 1999-12-06 2004-07-20 Nvidia Corporation Clip-less rasterization using line equation-based traversal
US6806886B1 (en) 2000-05-31 2004-10-19 Nvidia Corporation System, method and article of manufacture for converting color data into floating point numbers in a computer graphics pipeline
US6844880B1 (en) 1999-12-06 2005-01-18 Nvidia Corporation System, method and computer program product for an improved programmable vertex processing model with instruction set
US6870540B1 (en) 1999-12-06 2005-03-22 Nvidia Corporation System, method and computer program product for a programmable pixel processing model with instruction set
US7006101B1 (en) 2001-06-08 2006-02-28 Nvidia Corporation Graphics API with branching capabilities
US7162716B2 (en) 2001-06-08 2007-01-09 Nvidia Corporation Software emulator for optimizing application-programmable vertex processing
US7170513B1 (en) 1998-07-22 2007-01-30 Nvidia Corporation System and method for display list occlusion branching
US7209140B1 (en) 1999-12-06 2007-04-24 Nvidia Corporation System, method and article of manufacture for a programmable vertex processing model with instruction set
FR2895102A1 (fr) * 2005-12-19 2007-06-22 Dxo Labs Sa Method for processing an object on a platform having processor(s) and memory(ies), and platform using the method
WO2007071883A2 (fr) * 2005-12-19 2007-06-28 Dxo Labs Method and system for processing digital data
US7286133B2 (en) 2001-06-08 2007-10-23 Nvidia Corporation System, method and computer program product for programmable fragment processing
US7456838B1 (en) 2001-06-08 2008-11-25 Nvidia Corporation System and method for converting a vertex program to a binary format capable of being executed by a hardware graphics pipeline
US8269768B1 (en) 1998-07-22 2012-09-18 Nvidia Corporation System, method and computer program product for updating a far clipping plane in association with a hierarchical depth buffer

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5659543A (en) * 1995-02-22 1997-08-19 3Com Corporation Communication method and system for aggregates of separate entities using data/management path and mapping path for identifying entities and managing a communication network
US5841444A (en) * 1996-03-21 1998-11-24 Samsung Electronics Co., Ltd. Multiprocessor graphics system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5659543A (en) * 1995-02-22 1997-08-19 3Com Corporation Communication method and system for aggregates of separate entities using data/management path and mapping path for identifying entities and managing a communication network
US5841444A (en) * 1996-03-21 1998-11-24 Samsung Electronics Co., Ltd. Multiprocessor graphics system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FUJITA et al., "A Dataflow Image Processing System Tip-4", PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON IMAGE ANALYSIS AND PROCESSING, 20-22 September 1989, pages 734-741, 000566824 *
GUTTAG K., ET AL.: "A SINGLE-CHIP MULTIPROCESSOR FOR MULTIMEDIA: THE MVP.", IEEE COMPUTER GRAPHICS AND APPLICATIONS., IEEE SERVICE CENTER, NEW YORK, NY., US, vol. 12., no. 06., 1 November 1992 (1992-11-01), US, pages 53 - 64., XP002921354, ISSN: 0272-1716, DOI: 10.1109/38.163625 *

Cited By (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8269768B1 (en) 1998-07-22 2012-09-18 Nvidia Corporation System, method and computer program product for updating a far clipping plane in association with a hierarchical depth buffer
US7170513B1 (en) 1998-07-22 2007-01-30 Nvidia Corporation System and method for display list occlusion branching
US6992669B2 (en) 1999-12-06 2006-01-31 Nvidia Corporation Integrated graphics processing unit with antialiasing
US7209140B1 (en) 1999-12-06 2007-04-24 Nvidia Corporation System, method and article of manufacture for a programmable vertex processing model with instruction set
US6778176B2 (en) 1999-12-06 2004-08-17 Nvidia Corporation Sequencer system and method for sequencing graphics processing
US6515671B1 (en) 1999-12-06 2003-02-04 Nvidia Corporation Method, apparatus and article of manufacture for a vertex attribute buffer in a graphics processor
US6573900B1 (en) 1999-12-06 2003-06-03 Nvidia Corporation Method, apparatus and article of manufacture for a sequencer in a transform/lighting module capable of processing multiple independent execution threads
US7095414B2 (en) 1999-12-06 2006-08-22 Nvidia Corporation Blending system and method in an integrated computer graphics pipeline
US7064763B2 (en) 1999-12-06 2006-06-20 Nvidia Corporation Single semiconductor graphics platform
US6650325B1 (en) 1999-12-06 2003-11-18 Nvidia Corporation Method, apparatus and article of manufacture for boustrophedonic rasterization
US6650330B2 (en) 1999-12-06 2003-11-18 Nvidia Corporation Graphics system and method for processing multiple independent execution threads
US6198488B1 (en) 1999-12-06 2001-03-06 Nvidia Transform, lighting and rasterization system embodied on a single semiconductor platform
US6417851B1 (en) 1999-12-06 2002-07-09 Nvidia Corporation Method and apparatus for lighting module in a graphics processor
US6765575B1 (en) 1999-12-06 2004-07-20 Nvidia Corporation Clip-less rasterization using line equation-based traversal
US6504542B1 (en) 1999-12-06 2003-01-07 Nvidia Corporation Method, apparatus and article of manufacture for area rasterization using sense points
US6452595B1 (en) 1999-12-06 2002-09-17 Nvidia Corporation Integrated graphics processing unit with antialiasing
US7697008B1 (en) 1999-12-06 2010-04-13 Nvidia Corporation System, method and article of manufacture for a programmable processing model with instruction set
US6870540B1 (en) 1999-12-06 2005-03-22 Nvidia Corporation System, method and computer program product for a programmable pixel processing model with instruction set
US7034829B2 (en) 1999-12-06 2006-04-25 Nvidia Corporation Masking system and method for a graphics processing framework embodied on a single semiconductor platform
US7755636B1 (en) 1999-12-06 2010-07-13 Nvidia Corporation System, method and article of manufacture for a programmable processing model with instruction set
US6992667B2 (en) 1999-12-06 2006-01-31 Nvidia Corporation Single semiconductor graphics platform system and method with skinning, swizzling and masking capabilities
US6353439B1 (en) 1999-12-06 2002-03-05 Nvidia Corporation System, method and computer program product for a blending operation in a transform module of a computer graphics pipeline
US7002588B1 (en) 1999-12-06 2006-02-21 Nvidia Corporation System, method and computer program product for branching during programmable vertex processing
US6844880B1 (en) 1999-12-06 2005-01-18 Nvidia Corporation System, method and computer program product for an improved programmable vertex processing model with instruction set
US7009607B2 (en) 1999-12-06 2006-03-07 Nvidia Corporation Method, apparatus and article of manufacture for a transform module in a graphics processor
US6593923B1 (en) 2000-05-31 2003-07-15 Nvidia Corporation System, method and article of manufacture for shadow mapping
US6806886B1 (en) 2000-05-31 2004-10-19 Nvidia Corporation System, method and article of manufacture for converting color data into floating point numbers in a computer graphics pipeline
US6906716B2 (en) 2000-08-31 2005-06-14 Nvidia Corporation Integrated tessellator in a graphics processing unit
US6597356B1 (en) 2000-08-31 2003-07-22 Nvidia Corporation Integrated tessellator in a graphics processing unit
EP1364298A2 (fr) * 2000-11-12 2003-11-26 Bitboys, Inc. Three-dimensional rendering engine with embedded memory
EP1364298A4 (fr) * 2000-11-12 2008-12-17 Bitboys Inc Three-dimensional rendering engine with embedded memory
EP2618301A1 (fr) * 2000-11-12 2013-07-24 Advanced Micro Devices, Inc. Three-dimensional rendering engine with embedded memory
US7006101B1 (en) 2001-06-08 2006-02-28 Nvidia Corporation Graphics API with branching capabilities
US7286133B2 (en) 2001-06-08 2007-10-23 Nvidia Corporation System, method and computer program product for programmable fragment processing
US7456838B1 (en) 2001-06-08 2008-11-25 Nvidia Corporation System and method for converting a vertex program to a binary format capable of being executed by a hardware graphics pipeline
US7162716B2 (en) 2001-06-08 2007-01-09 Nvidia Corporation Software emulator for optimizing application-programmable vertex processing
US6982718B2 (en) 2001-06-08 2006-01-03 Nvidia Corporation System, method and computer program product for programmable fragment processing in a graphics pipeline
US6697064B1 (en) 2001-06-08 2004-02-24 Nvidia Corporation System, method and computer program product for matrix tracking during vertex processing in a graphics pipeline
WO2007071884A3 (fr) * 2005-12-19 2007-08-16 Dxo Labs Method for processing an object on a platform having processor(s) and memory(ies), and platform using the method
WO2007071883A3 (fr) * 2005-12-19 2007-08-16 Dxo Labs Method and system for processing digital data
WO2007071883A2 (fr) * 2005-12-19 2007-06-28 Dxo Labs Method and system for processing digital data
JP2009524123A (ja) * 2005-12-19 2009-06-25 ディーエックスオー ラブズ 1つまたは複数のプロセッサとメモリとを有するプラットフォーム上でオブジェクトを処理する方法、およびこの方法を使用するプラットフォーム
US8412725B2 (en) 2005-12-19 2013-04-02 Dxo Labs Method for processing an object on a platform having one or more processors and memories, and platform using same
US8429625B2 (en) 2005-12-19 2013-04-23 Dxo Labs Digital data processing method and system
FR2895102A1 (fr) * 2005-12-19 2007-06-22 Dxo Labs Sa Method for processing an object on a platform having processor(s) and memory(ies), and platform using the method

Also Published As

Publication number Publication date
AU3638699A (en) 1999-10-25

Similar Documents

Publication Publication Date Title
WO1999052040A1 (fr) Architecture pour traitement graphique
US6807620B1 (en) Game system with graphics processor
US6798421B2 (en) Same tile method
US6819332B2 (en) Antialias mask generation
US7522171B1 (en) On-the-fly reordering of 32-bit per component texture images in a multi-cycle data transfer
US7158141B2 (en) Programmable 3D graphics pipeline for multimedia applications
US6731288B2 (en) Graphics engine with isochronous context switching
US6577317B1 (en) Apparatus and method for geometry operations in a 3D-graphics pipeline
US8074224B1 (en) Managing state information for a multi-threaded processor
US6624819B1 (en) Method and system for providing a flexible and efficient processor for use in a graphics processing system
US6791559B2 (en) Parameter circular buffers
EP0875853A2 (fr) Architecture de processeur graphique
WO2000019377A1 (fr) Processeur graphique a ombrage differe
US11551400B2 (en) Apparatus and method for optimized tile-based rendering
US7747842B1 (en) Configurable output buffer ganging for a parallel processor
WO2017127173A1 (fr) Sélection de niveau de détail pendant un traçage de rayon
US7484076B1 (en) Executing an SIMD instruction requiring P operations on an execution unit that performs Q operations at a time (Q<P)
US5914724A (en) Lighting unit for a three-dimensional graphics accelerator with improved handling of incoming color values
US7868894B2 (en) Operand multiplexor control modifier instruction in a fine grain multithreaded vector microprocessor
US20030164823A1 (en) 3D graphics accelerator architecture
WO2018052618A1 (fr) Placage d'ombres optimisé à suppression d'éléments indésirables en z hiérarchique (hiz)
Poulton et al. Breaking the frame-buffer bottleneck with logic-enhanced memories
US7489315B1 (en) Pixel stream assembly for raster operations
CA2298337C (fr) Systeme de jeu a processeur graphique
US20080055307A1 (en) Graphics rendering pipeline

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase