WO1999013397A1 - Fifo memory device using shared reconfigurable memory block - Google Patents
- Publication number
- WO1999013397A1 (PCT/US1998/019115)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- memory
- write
- read
- input
- block
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F5/00—Methods or arrangements for data conversion without changing the order or content of the data handled
- G06F5/06—Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor
- G06F5/10—Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor having a sequence of storage locations each being individually accessible for both enqueue and dequeue operations, e.g. using random access memory
Definitions
- the present invention is generally in the field of digital computing and, more specifically, is directed to a memory subsystem that combines shared, reconfigurable memory techniques together with a micro-coded controller in the context of a FIFO memory device.
- the memory systems described in the prior case are shared in the sense that a given block of memory can first be configured for access by the CPU, for example to load data, and then "swapped" so that the same block of physical memory can be directly accessed by an execution unit, for example a DSP execution unit, to carry out various calculations on that data. After the calculations are completed, the same block of memory can be "swapped" once again, so that the CPU has immediate access to the results.
- the memory is reconfigurable in a variety of ways, as described below, so as to allocate memory resources as between the CPU and the execution unit (or multiple execution units) in the most efficient manner possible.
- Reconfiguring the memory can include forming memory blocks of various sizes; selecting write (input) sources; selecting read (destination) targets; selecting word size, and so forth.
- DSP (digital signal processing) is just one example of computation-intensive calculation.
- the concepts of the prior case as well as the present invention are applicable to a wide variety of execution tasks, including but not limited to DSP and related tasks such as motion picture encoding and decoding, encryption and decryption, etc.
- the principles of the parent application can be applied advantageously in the context of a FIFO memory. Accordingly, the present specification adds additional disclosure directed to application of shared, configurable memory and tightly coupled execution units to the FIFO memory context.
- Another aspect of the prior application is a memory-centric DSP controller (MDSPC).
- the MDSPC was described as providing memory address generation and a variety of other control functions, including reconfiguring the memory as summarized above to support a particular computation in the execution unit.
- the name "MDSPC” was appropriate for that controller in the context of the parent case, in which the preferred embodiment was described for digital signal processing.
- the principles of the parent case and the present invention are not so limited. Accordingly, the present application includes description of a "controller” which is functionally similar to the "MDSPC" introduced in the parent case. Drawing Figs. 1-22 and the corresponding description herein were included in the parent application.
- the present application includes additional drawing Figs. 23-28.
- the parent application describes a memory subsystem that is partitioned into two or more blocks of memory space.
- One block of the memory communicates with an I/O or DMA channel to load data, while the other block of memory simultaneously communicates with one or more execution units that carry out arithmetic operations on data in the second block. Results are written back to the second block of memory.
- the memory blocks are effectively "swapped" so that the second block, now holding processed (output) data, communicates with the I/O channel to output that data, while the execution unit communicates with the first block, which by then has been filled with new input data.
- Methods and apparatus are shown for implementing this memory swapping technique in real time so that the execution unit is never idle.
- the present application extends these concepts to FIFO memory. Another aspect of the parent case describes interfacing two or more address generators to the same block of memory, so that memory block swapping can be accomplished without the use of larger multi-ported memory cells. The present application extends these concepts to FIFO memory as well.
- the parent application identified earlier describes the concept of reconfiguring an execution unit in several ways, including selectable depth (number of pipeline stages) and width (i.e. multiple word sizes concurrently).
- the pipelined execution unit(s) includes internal register files with feedback.
- the execution unit configuration and operation also can be controlled by execution unit configuration control signals.
- the execution unit configuration control signals can be determined by "configuration bits" stored in the memory, or stored in a separate "configuration table".
- the configuration table can be downloaded by the host core processor, and/or updated under software control.
- the configuration control signals are generated by the controller mentioned above executing microcode.
- Another object of the invention is to provide improved FIFO memory performance at reduced cost by deploying a combination of SRAM and DRAM technologies in a FIFO memory.
- Another object of the present invention is to improve performance in execution of complex operations by tightly coupling a FIFO memory to an execution unit.
- a further object of the invention is to improve effective FIFO density and reduce costs through new applications of DRAM memory cells, combined with a new local FIFO memory controller strategy.
- Yet another object of the present invention is to apply reconfigurable memory circuits and methodologies to improve performance in connection with FIFO memory applications.
- a still further object is to apply shared, reconfigurable DRAM memory circuits and methodologies to FIFO memory applications in order to build high performance computing engines.
- a further object is to provide for concurrent execution of a calculation using a tightly coupled execution unit, while allowing concurrent access to a FIFO memory subsystem by the CPU.
- One aspect of the present invention is a FIFO memory subsystem that includes a microprogrammable controller.
- the subsystem also includes address selection circuitry so as to provide a selected address to the FIFO memory from any of several address sources.
- the address sources can include the CPU address bus, the local controller, and potentially another circuit arranged for addressing the memory in connection with downloading execution parameters.
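- as a purely illustrative sketch (not part of the disclosure), the following Python model shows the idea of such address selection: whichever source is currently selected is the only one whose address reaches the FIFO memory's row decoder. The source names are hypothetical.

```python
# Illustrative sketch only (not the disclosed circuit). Source names are hypothetical.

class AddressSelector:
    """Forwards one of several candidate address sources to the FIFO memory."""

    SOURCES = ("cpu_bus", "local_controller", "param_loader")

    def __init__(self):
        self.selected = "local_controller"

    def select(self, source):
        if source not in self.SOURCES:
            raise ValueError("unknown address source: " + source)
        self.selected = source

    def address(self, cpu_addr, ctrl_addr, loader_addr):
        # Only the currently selected source's address reaches the row decoder.
        return {"cpu_bus": cpu_addr,
                "local_controller": ctrl_addr,
                "param_loader": loader_addr}[self.selected]

sel = AddressSelector()
sel.select("cpu_bus")
print(hex(sel.address(cpu_addr=0x40, ctrl_addr=0x10, loader_addr=0x200)))  # -> 0x40
```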
- the controller executes microcode which can be stored in any of at least three places.
- the microcode can be stored on board the local controller.
- the microcode can be stored in a separate read-only memory, e.g., ROM or flash memory.
- the microcode can be stored in the data portion of the FIFO memory.
- Another aspect of the invention provides for downloading microcode from the CPU into a portion of the FIFO data memory for subsequent execution by the local controller.
- the microcode can include configuration bits, op-codes for the execution unit, constants, parameters, etc. which the controller in turn provides to the execution unit.
- a further aspect of the invention includes providing an address decoder coupled to the CPU address line, for detecting assertion of a predetermined address that is used to trigger a particular execution. In response to detecting the predetermined address, the controller then configures the execution unit as appropriate, and begins execution of a microcoded sequence to carry out the corresponding calculation in the execution unit.
- the controller can notify the CPU, for example by writing control bits to a special memory address monitored by the CPU.
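- the following behavioral sketch (an assumption for illustration; the trigger and status addresses are hypothetical) shows how an address decoder on the CPU bus can start a microcoded calculation and how completion can be signaled back through a monitored memory location.

```python
# Behavioral sketch (assumption): a decoder watches the CPU address bus for a
# predetermined "trigger" address; when seen, the controller configures the
# execution unit, runs a microcoded sequence, and posts a completion flag at a
# status address that the CPU monitors. All addresses below are hypothetical.

TRIGGER_ADDR = 0x1000      # assertion of this address starts the calculation
STATUS_ADDR  = 0x1FFC      # CPU monitors this location for completion bits

memory = {}

def configure_execution_unit():
    pass                    # placeholder for execution unit configuration signals

def run_microcode_sequence():
    pass                    # placeholder for the microcoded calculation

def on_cpu_address(addr):
    if addr == TRIGGER_ADDR:
        configure_execution_unit()
        run_microcode_sequence()
        memory[STATUS_ADDR] = 0x1   # "done" control bit written for the CPU

on_cpu_address(0x1000)
print(hex(memory[STATUS_ADDR]))      # -> 0x1
```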
- FIG. 1 is a system level block diagram of an architecture for digital signal processing (DSP) using shared memory according to the present invention.
- FIG. 2 illustrates circuitry for selectively coupling two or more address generators to a single block of memory.
- FIG. 3 is a block diagram illustrating portions of the memory circuitry and address generators of Fig. 1 in a fixed-partition memory configuration.
- FIG. 4 shows more detail of address and bit line connections in a two-port memory system of the type described.
- FIGS. 5A-5C illustrate selected address and control signals in a Processor Implementation of a DSP system, i.e. a complete DSP system integrated on a single chip.
- FIG. 6A illustrates an alternative embodiment in which a separate DSP program counter is provided for accessing the memory.
- FIG. 6B illustrates an alternative embodiment in which an MDSPC accesses the memory.
- FIGS. 7A-B are block diagrams that illustrate embodiments of the invention in a Harvard architecture.
- FIG. 8 is a conceptual diagram that illustrates a shared, reconfigurable memory architecture according to the present invention.
- FIG. 9 illustrates connection of address lines to a shared, reconfigurable memory with selectable (granular) partitioning of the reconfigurable portion of the memory.
- FIG. 10 illustrates a system that implements a reconfigurable segment of memory under bit selection table control.
- FIG. 11 A is a block diagram illustrating an example of using single-ported RAM in a DSP computing system according to the present invention.
- FIG. 11B is a table illustrating a pipelined timing sequence for addressing and accessing the one-port memory so as to implement a "virtual two-port" memory.
- FIG. 12 illustrates a block of memory having at least one reconfigurable segment with selectable write and read data paths.
- FIG. 13A is a schematic diagram showing detail of one example of the write selection circuitry of the reconfigurable memory of Fig. 12.
- FIG. 13B illustrates transistor pairs arranged for propagating or isolating bit lines as an alternative to transistors 466 in Fig. 13A or as an alternative to the bit select transistors 462, 464 of Fig. 13A.
- FIG. 14 is a block diagram illustrating extension of the shared, reconfigurable memory architecture to multiple segments of memory.
- FIG. 15 is a simplified block diagram illustrating multiple reconfigurable memory segments with multiple sets of sense amps.
- FIGS. 16A-16D are simplified block diagrams illustrating various examples of memory segment configurations to form memory blocks of selectable size.
- FIG. 17 is a block diagram of a DSP architecture illustrating a multiple memory block to multiple execution unit interface scheme in which configuration is controlled via specialized address generators.
- FIGS. 18A-18C are simplified block diagrams illustrating various configurations of segments of a memory block into association with multiple execution units.
- FIG. 19 is a simplified block diagram illustrating a shared, reconfigurable memory system utilizing common sense amps.
- FIG. 20 is a simplified block diagram illustrating a shared, reconfigurable memory system utilizing multiple sense amps for each memory segment.
- FIG. 21 is a timing diagram illustrating memory swapping cycles.
- FIG. 22A is a block diagram illustrating memory swapping under bit table control.
- FIG. 22B is a block diagram illustrating memory swapping under MDSPC control.
- FIG. 23 is a block diagram illustrating an embodiment of a FIFO memory device utilizing reconfigurable memory technology under control of memory-centric read and write count controllers.
- FIG. 24 is a block diagram illustrating the function of the FIFO memory device of FIG. 23.
- FIG. 25 is a block diagram further illustrating the function of the FIFO memory device of FIG. 23.
- FIG. 26 is a block diagram illustrating a special case in the function of the FIFO memory device of FIG. 23.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT
- FIGURE 1
- Fig. 1 is a system-level block diagram of an architecture for memory and computing-intensive applications such as digital signal processing.
- a microprocessor interface 40 includes a DMA port 42 for moving data into a memory via path 46 and reading data from the memory via path 44.
- the microprocessor interface 40 generically represents an interface to any type of controller or microprocessor.
- the interface partition indicated by the dashed line 45 in Fig. 1 may be a physical partition, where the microprocessor is in a separate integrated circuit, or it can merely indicate a functional partition in an implementation in which all of the memory and circuitry represented in the diagram of Fig. 1 is implemented on board a single integrated circuit.
- the microprocessor interface (DMA 42) also includes control signals indicated at 52.
- the microprocessor or controller can also provide microcode (not shown) for memory control and address generation, as well as control signals for configuration and operation of the functional execution units, as described later.
- the present invention may be integrated into an existing processor or controller core design, so that both the core processor and the present invention reside in the same integrated circuit. Reference will be made herein to the "core processor," meaning the processor to which the present invention has been attached or with which it is integrated.
- a two-port memory comprises the first memory block 50, labeled "A", and a second memory block 60, labeled "B".
- the memory is addressed by a source address generator 70 and a destination address generator 80.
- a functional execution unit 90 also is coupled to the two-port memory via left and right I/O channels, as illustrated at block B.
- these are not conventional two-port memory I/O ports; rather, they have novel structures described later.
- the interface 44, 46 to the two-port memory block A is a DMA interface that is in communication with the host processor or controller 40.
- Block A receives data coefficients and optionally other parameters from the controller, and also returns completed data to the controller that results from various DSP, graphics, MPEG encode/decode or other operations carried out in the execution unit 90.
- This output data can include, for example, FFT results, or convolution data, or graphics rendering data, etc.
- the single memory can alternately act as both a graphics frame buffer and a graphics computation buffer memory.
- the memory block "B" interfaces with the functional execution unit 90.
- the functional execution unit 90 receives data from the two-port memory block B and executes on it, and then returns results ("writeback") to the same two-port memory structure.
- the source address generator 70 supplies source or input data to the functional execution unit while the destination address generator 80 supplies addresses for writing results (or intermediate data) from the execution unit to the memory.
- source address generator 70 provides addressing while the functional execution unit is reading input data from memory block B.
- the destination address generator 80 provides addressing to the same memory block B while the functional execution unit 90 is writing results into the memory.
- the memory effectively "swaps" blocks A and B, so that block B is in communication with the DMA channel 42 to read out the results of the execution. Conversely, and simultaneously, the execution unit proceeds to execute on the new input data in block A.
- This "swapping" of memory blocks includes several aspects, the first of which is switching the memory address generator lines so as to couple them to the appropriate physical block of memory.
- the system can be configured so that the entire memory space (blocks A and B in the illustration) is accessed first by an I/O channel, and then the entire memory is swapped to be accessed by the processor or execution unit.
- any or all of the memory can be reconfigured as described.
- the memory can be SRAM, DRAM or any other type of random access semiconductor memory or functionally equivalent technology. DRAM refresh is provided by address generators, or may not be required where the speed of execution and updating the memory (access frequency) is sufficient to obviate refresh.
- Figure 2 illustrates one way of addressing a memory block with two (or more) address generators.
- one address generator is labeled “DMA” and the other "ADDR GEN" although they are functionally similar.
- one of the address generators 102 has a series of output lines, corresponding to memory word lines. Each output line is coupled to a corresponding buffer (or word line driver or the like), 130 to 140. Each driver has an enable input coupled to a common enable line 142.
- the other address generator 104 similarly has a series of output lines coupled to respective drivers 150 to 160.
- the number of word lines is at least equal to the number of rows of the memory block 200.
- the second set of drivers also have enable inputs coupled to the common enable control line 142, but note the inverter "bubbles" on drivers 130 to 140, indicating that those drivers are enabled active-low, in contrast to the active-high enables of drivers 150 to 160. Accordingly, when the control line 142 is low, the DMA address generator 102 is coupled to the memory 200 row address inputs. When the control line 142 is high, the ADDR GEN 104 is coupled to the memory 200 row address inputs. In this way, the address inputs are "swapped" under control of a single bit.
- Alternative circuitry can be used to achieve the equivalent effect.
- the devices illustrated can be tri-state output devices, or open collector or open drain structures can be used where appropriate. Other alternatives include transmission gates or simple pass transistors for coupling the selected address generator outputs to the memory address lines. The same strategy can be extended to more than two address sources, as will be apparent to those skilled in the art in view of this disclosure.
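- a minimal behavioral sketch of this single-bit selection (illustrative only, not a circuit netlist): one enable bit determines whether the DMA address generator's row lines or the ADDR GEN's row lines reach the memory array.

```python
# Behavioral sketch (assumption): two address sources drive the same row lines
# through enable-gated drivers. One select bit couples either the DMA address
# generator (enable low) or ADDR GEN (enable high) to the memory rows, as in Fig. 2.

def row_lines(dma_rows, ag_rows, enable):
    """Return the row-line values actually seen by the memory array."""
    assert len(dma_rows) == len(ag_rows)
    if enable == 0:
        return list(dma_rows)   # active-low drivers pass the DMA addresses
    return list(ag_rows)        # active-high drivers pass the ADDR GEN addresses

dma = [1, 0, 0, 0]   # DMA asserting row 0
ag  = [0, 0, 1, 0]   # ADDR GEN asserting row 2
print(row_lines(dma, ag, enable=0))  # -> [1, 0, 0, 0]
print(row_lines(dma, ag, enable=1))  # -> [0, 0, 1, 0]
```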
- Figure 3 is a block diagram illustrating a physical design of portions of the memory circuitry and address generators of Fig. 1 in a fixed-partition configuration.
- by "fixed partition" I mean that the size of memory block A and the size of memory block B cannot change dynamically.
- the memory block A (50) and block B (60) correspond to the same memory blocks of Fig. 1.
- the memory itself preferably is dynamic RAM, although static RAM or other solid state memory technologies could be used as well.
- in memory block B, just two bits or memory cells 62 and 64 are shown by way of illustration. In a typical implementation, the memory block is likely to include thousands or even millions of rows, each row (or word) being perhaps 64 or more bits wide.
- the source address generator 70 is coupled to both memory blocks A and B.
- the top row includes a series of cells including bit cell 62.
- the source address generator preferably has output lines coupled to all of the rows of not only block B, but block A as well, although only one row line is illustrated in block A.
- corresponding address lines from the AG 70 and the DMA 102 are shown as connected in common, e.g. at line 69. However, in practice, these address lines are selectable as described above with reference to Fig. 2.
- a destination address generator 80 similarly is coupled to the row lines of both blocks of memory.
- Memory cells 62 and 64 are full two-ported cells on the same column in this example.
- either source AG 70 or DMA 102 addresses the left port
- either destination AG 80 or DMA 100 addresses the right port.
- a write select multiplexer 106 directs data either from the DMA (42 in Fig. 1) (or another block of memory) or from the execution unit 90, responsive to a control signal 108.
- the control signal is provided by the controller or microprocessor of Fig. 1, by a configuration bit, or by an MDSPC.
- the selected write data is provided to column amplifiers 110, 112 which in turn are connected to corresponding memory cell bit lines.
- these column amplifiers serve as bit and /bit ("bit bar") drivers.
- Below cell 64 is a one-bit sense amplifier 116.
- a bit output from the sense amp 116 is directed, for example, to a latch 72.
- Both the DMA and the execution unit are coupled to receive data from latch 72, depending on appropriate control, enable and clock signals (not shown here). Or, both the DMA and the execution path may have separate latches, the specifics being a matter of design choice. Only one sense amp is shown for illustration, while in practice there will be at least one sense amp for each column. Use of multiple sense amps is described later.
- Fig. 4 shows more detail of the connection of cells of the memory to source and destination address lines. This drawing shows how the source address lines (when asserted) couple the write bit line and its complement, i.e. input lines 110,112 respectively, to the memory cells.
- the destination address lines couple the cell outputs to the read bit lines 114, 115 and thence to sense amp 116. Although only one column is shown, in practice write and read bit lines are provided for each column across the full width of the memory word. The address lines extend across the full row as is conventional.
- Fig. 21 is a conceptual diagram illustrating an example for the timing of operation of the architecture illustrated in Fig. 1.
- T0A, T1A, etc. are specific instances of two operating time cycles T0 and T1.
- the cycle length can be predetermined, or can be a parameter downloaded to the address generators.
- T0 and T1 are not necessarily the same length and are defined as alternating and mutually exclusive, i.e. a first cycle T1 starts at the end of T0, and a second cycle T0 starts at the end of the first period T1, and so on. Both T0 and T1 are generally longer than the basic clock or memory cycle time.
- Fig. 22A is a block diagram of a single port architecture which will be used to illustrate an example of functional memory swapping in the present invention during repeating T0 and T1 cycles.
- Execution address generator 70 addresses memory block A (50) during T0 cycles. This is indicated by the left (T0) portion of AG 70. During T1 cycles, execution address generator 70 addresses memory block B (60), as indicated by the right portion of 70. During T1, AG 70 also receives setup or configuration data in preparation for again addressing Mem Block A during the next T0 cycle. Similarly, during T0, AG 70 also receives configuration data in preparation for again addressing Mem Block B during the next T1 cycle.
- DMA address generator 102 addresses memory block B (60) during T0 cycles. This is indicated by the left (T0) portion of DMA AG 102. During T1 cycles, DMA address generator 102 addresses memory block A (50), as indicated by the right portion of 102. During T1, DMA AG 102 also receives setup or configuration data in preparation for again addressing Mem Block B during the next T0 cycle. Similarly, during T0, DMA 102 also receives configuration data in preparation for again addressing Mem Block A during the next T1 cycle.
- the functional execution unit (90 in Fig. 1) is operating continuously on data in memory block A 50 under control of execution address generator 70.
- DMA address generator 102 is streaming data into memory block B 60.
- memory blocks A and B effectively swap such that execution unit 90 will process the data in memory block B 60 under control of execution address generator 70 and data will stream into memory block A 50 under control of DMA address generator 102.
- memory blocks A and B again effectively swap such that execution unit 90 will process the data in memory block A 50 under control of execution address generator 70 and data will stream into memory block B 60 under control of DMA address generator 102.
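- the alternation can be summarized by the following sketch (illustrative Python only; block and generator names follow the figures): each T0/T1 boundary swaps which address generator owns which memory block, so the execution unit always has a full block to work on.

```python
# Behavioral sketch (assumption): alternating T0/T1 cycles swap which address
# generator owns which memory block, so the execution unit is never idle.

def owners(cycle):
    """Map each memory block to the unit that addresses it during this cycle."""
    if cycle == "T0":
        return {"block_A": "execution_AG", "block_B": "DMA_AG"}
    if cycle == "T1":
        return {"block_A": "DMA_AG", "block_B": "execution_AG"}
    raise ValueError(cycle)

for cycle in ["T0", "T1", "T0", "T1"]:
    print(cycle, owners(cycle))
# During every cycle one block is being processed while the other is being
# filled (or drained) by the DMA channel; at the cycle boundary the roles swap.
```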
- in Fig. 22B, the functions of the execution address generator and DMA address generator are performed by the MDSPC 172 under microcode control.
- a two-port memory again comprises a block A (150) and a block B (160).
- Memory block B is coupled to a DSP execution unit 130.
- An address generator 170 is coupled to memory block B 160 via address lines 162.
- the address generator unit is executing during a first cycle T0 and during time T0 is loading parameters for subsequent execution in cycle T1.
- the lower memory block A is accessed via core processor data address register 142A or core processor instruction address register 142B.
- the data memory and the instructional program memory are located in the same physical memory.
- a microprocessor system of the Harvard architecture has separate physical memory for data and instructions. The present invention can be used to advantage in the Harvard architecture environment as well, as described below with reference to Figs. 7A and 7B.
- Fig. 5A also includes a bit configuration table 140.
- the bit configuration table can receive and store information from the memory 150 or from the core processor, via bus 180, or from an instruction fetched via the core processor instruction address register 142B. Information is stored in the bit configuration table during cycle T0 for controlling execution during the next subsequent cycle T1.
- the bit configuration table can be loaded by a series of operations, reading information from the memory block A via bus 180 into the bit configuration tables.
- This information includes address generation parameters and opcodes. Examples of some of the address parameters are starting address, modulo-address counting, and the length of timing cycles T0 and T1. Examples of op codes for controlling the execution unit are the multiply and accumulate operations necessary to perform an FFT.
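- purely for illustration, a bit configuration table of the kind just described might carry entries such as the following (field names and values are hypothetical, not taken from the disclosure):

```python
# Illustrative sketch (assumption): the kinds of entries a bit configuration
# table might carry, per the parameters named above. Field names are hypothetical.

bit_configuration_table = {
    "start_address":    0x0000,                   # starting address for the address generator
    "modulo":           1024,                     # modulo-address counting limit
    "t0_length":        256,                      # length of timing cycle T0 (in clocks)
    "t1_length":        256,                      # length of timing cycle T1 (in clocks)
    "opcodes":          ["MAC", "MAC", "MAC"],    # e.g. multiply-accumulate steps of an FFT
    "virtual_boundary": 0x2000,                   # position of the boundary between blocks A and B
}

print(bit_configuration_table["opcodes"][0])  # -> "MAC"
```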
- the bit configuration table is used to generate configuration control signal 152 which determines the position of virtual boundary 136 and, therefore, the configuration of memory blocks A and B. It also provides the configuration information necessary for operation of the address generator 170 and the DSP execution unit 130 during the T1 execution cycle time.
- Path 174 illustrates the execution unit/memory interface control signals from the bit configuration table 140 to the DSP execution unit 130.
- Path 176 illustrates the configuration control signal to the execution unit to reconfigure the execution unit.
- Path 178 illustrates the op codes sent to execution unit 130 which cause execution unit to perform the operations necessary to process data.
- Path 188 shows configuration information loaded from the configuration tables into the address generator 170.
- the architecture illustrated in Fig. 5A preferably would utilize the extended instructions of a given processor architecture to allow the address register from the instruction memory to create the information flow into the bit configuration table.
- special instructions or extended instructions in the controller or microprocessor architecture can be used to enable this mechanism to operate as described above.
- Such an implementation would provide tight coupling to the microprocessor architecture.
- Fig. 5B illustrates an embodiment of the present invention wherein the functions of address generator 170 and bit configuration table 140 of Fig. 5A are performed by memory-centric DSP controller (MDSPC) 172.
- the core processor writes microcode for MDSPC 172 along with address parameters into memory block B 150.
- the microcode and address parameters are downloaded into local memory within MDSPC 172.
- a DSP process initiated in MDSPC 172 then generates the appropriate memory configuration control signals 152 and execution unit configuration control signals 176 based upon the downloaded microcode to control the position of virtual boundary 136 and structure execution unit 130 to optimize performance for the process corresponding to the microcode.
- as the DSP process executes, MDSPC 172 generates addresses for memory block B 160 and controls the execution unit/memory interface to load operands from memory into the execution unit 130, which are then processed by execution unit 130 responsive to op codes 178 sent from MDSPC 172 to execution unit 130.
- virtual boundary 136 may be adjusted responsive to microcode during process execution in order to dynamically optimize the memory and execution unit configurations.
- the MDSPC 172 supplies the timing and control for the interfaces between memory and the execution unit. Further, algorithm coefficients to the execution unit may be supplied directly from the MDSPC.
- the use of microcode in the MDSPC results in execution of the DSP process that is more efficient than the frequent downloading of bit configuration tables and address parameters associated with the architecture of Fig. 5A.
- the microcoded method represented by the MDSPC results in fewer bits to transfer from the core processor to memory for the DSP process and less frequent updates of this information from the core processor. Thus, the core processor bandwidth is conserved, along with the number of bits required.
- Fig. 5C illustrates an embodiment of the present invention wherein the reconfigurability of memory in the present invention is used to allocate an additional segment of memory, memory block C 190, which permits MDSPC 172 to execute microcode and process address parameters out of memory block C 190 rather than local memory.
- This embodiment requires an additional set of address 192 and data 194 lines to provide the interface between memory block C 190 and MDSPC 172 and address bus control circuitry 144 under control of MDSPC 172 to disable the appropriate address bits from core processor register file 142.
- This configuration permits simultaneous access of MDSPC 172 to memory block C 190, DSP execution unit 130 to memory block B and the core processor to memory block A.
- virtual boundaries 136A and 136B are dynamically reconfigurable to optimize the memory configuration for the DSP process executing in MDSPC 172.
- the bit tables and microcode discussed above may alternatively reside in durable store, such as ROM or flash memory.
- the durable store may be part of memory block A or may reside outside of memory block A wherein the content of durable store is transferred to memory block A or to the address generators or MDSPC during system initialization.
- the DSP process may be triggered by decoding a preselected bit pattern corresponding to a DSP function into an address in memory block A containing the bit tables or microcode required for execution of the DSP function.
- Yet another approach to triggering the DSP process is to place the bit tables or microcode for the DSP function at a particular location in memory block A and the DSP process is triggered by the execution of a jump instruction to that particular location.
- the microcode to perform a DSP function such as a Fast Fourier Transform (FFT) or IIR, is loaded beginning at a specific memory location within memory block A. Thereafter, execution of a jump instruction to that specific memory location causes execution to continue at that location thus spawning the DSP process.
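- a behavioral sketch of this jump-triggered dispatch (illustrative only; the microcode base address is hypothetical):

```python
# Behavioral sketch (assumption): microcode for a DSP function (e.g. an FFT) is
# loaded at a known location in memory block A; a jump to that location spawns
# the DSP process instead of normal instruction execution. Addresses are hypothetical.

FFT_MICROCODE_BASE = 0x0800            # where the FFT microcode was loaded

def spawn_dsp_process(name):
    return "DSP process '%s' started from microcode at %s" % (name, hex(FFT_MICROCODE_BASE))

def jump(target):
    if target == FFT_MICROCODE_BASE:
        return spawn_dsp_process("FFT")  # control passes to the local controller
    return "normal execution continues at " + hex(target)

print(jump(0x0800))   # -> DSP process 'FFT' started ...
print(jump(0x0400))   # -> normal execution continues at 0x400
```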
- a separate program counter 190 is provided for DSP operations.
- the core controller or processor (not shown) loads information into the program counter 190 for the DSP operation and then that program counter in turn addresses the memory block 150 to start the process for the DSP.
- Information required by the DSP operations would be stored in memory.
- any register of the core processor such as data address register 142A or instruction address register 142B, can be used for addressing memory 150.
- Bit Configuration Table 140, in addition to generating memory configuration signal 152, produces address enable signal 156 to control address bus control circuitry 144 in order to select the address register which accesses memory block A and also to selectively enable or disable address lines of the registers to match the memory configuration (i.e. depending on the position of virtual boundary 136, address bits are enabled if the bit is needed to access all of memory block A and disabled if block A is smaller than the memory space accessed with the address bit).
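- the address-bit enabling can be modeled as masking off the high-order bits not needed to span block A, as in this illustrative sketch (the block size is an assumed example):

```python
# Behavioral sketch (assumption): enable only the address bits needed to span
# memory block A, whose size depends on the position of the virtual boundary.

def enabled_address_bits(block_a_words):
    """Number of low-order address bits enabled for accessing block A."""
    bits = 0
    while (1 << bits) < block_a_words:
        bits += 1
    return bits

def mask_address(addr, block_a_words):
    bits = enabled_address_bits(block_a_words)
    return addr & ((1 << bits) - 1)   # higher bits are disabled (forced off)

print(enabled_address_bits(4096))        # -> 12 bits enabled for a 4K-word block A
print(hex(mask_address(0xFFFF, 4096)))   # -> 0xfff
```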
- Fig. 6A shows the DSP program counter 190 being loaded by the processor with an address to move into memory block A.
- the other address sources in register file 142 are disabled, at least with respect to addressing memory 150.
- three different alternative mechanisms are illustrated for accessing the memory 150 in order to fetch the bit configurations and other parameters 140. The selection of which addressing mechanism is most advantageous may depend upon the particular processor architecture with which the present invention is implemented.
- Fig. 6B shows an embodiment wherein MDSPC 172 is used to generate addresses for memory block A in place of DSP PC 190.
- Address enable signal 156 selects between the address lines of MDSPC 172 and those of register file 142 in response to the microcode executed by MDSPC 172. As discussed above, if the microcode for MDSPC 172 resides in memory block A or a portion thereof, MDSPC 172 will be executing out of memory block A and therefore requires access to the content of memory block A.
- memory blocks A (150) and B (160) are separated by "virtual boundary" 136.
- block A and block B are portions of a single, common memory, in a preferred embodiment.
- the location of the "virtual boundary" is defined by the configuration control signal generated responsive to the bit configuration table parameters.
- the memory is reconfigurable under software control.
- while this memory has a variable boundary, the memory preferably is part of the processor memory; it is not contemplated as a separate memory distinct from the processor architecture.
- the memory as shown and described is essentially reconfigurable directly into the microprocessor itself.
- the memory block B 160, duly configured, executes into the DSP execution unit as shown in Fig. 5.
- virtual boundary 136 is controlled based on the microcode downloaded to MDSPC 172.
- microcode determines the position of both virtual boundary 136A and 136B to create memory block C 190.
- FIGURES 7A and 7B
- Fig. 7A illustrates an alternative embodiment, corresponding to Fig. 5A, of the present invention in a Harvard-type architecture, comprising a data memory block A 206 and block B 204, and a separate core processor instruction memory 200.
- the instruction memory 200 is addressed by a program counter 202. Instructions fetched from the instruction memory 200 pass via path 220 to a DSP instruction decoder 222.
- the instruction decoder in turn provides addresses for DSP operations, table configurations, etc., to an address register 230. Address register 230 in turn addresses the data memory block A 206. Data from the memory passes via path 240 to load the bit configuration tables etc.
- Fig. 7A thus illustrates an alternative approach to accessing the data memory A to fetch bit configuration data.
- a special instruction is fetched from the instruction memory that includes an opcode field that indicates a DSP operation, or more specifically, a DSP configuration operation, and includes address information for fetching the appropriate configuration for the subroutine.
- in Fig. 7B, MDSPC 246 replaces AG 244 and Bit Configuration Table 242. Instructions in core processor instruction memory 200 that correspond to functions to be executed by DSP Execution Unit 250 are replaced with a preselected bit pattern which is not recognized as a valid instruction by the core processor.
- DSP Instruction Decode 222 decodes the preselected bit patterns and generates an address for DSP operations and address parameters stored in data memory A and also generates a DSP control signal which triggers the DSP process in MDSPC 246.
- DSP Instruction Decode 222 can also be structured to be responsive to output data from data memory A 206 to produce the addresses latched in address register 230.
- the DSP Instruction Decode 222 may be reduced or eliminated if the DSP process is initiated by an instruction causing a jump to the bit table or microcode in memory block A pertaining to the execution of the DSP process.
- the present invention includes an architecture that features shared, reconfigurable memory for efficient operation of one or more processors together with one or more functional execution units such as DSP execution units.
- Fig. 6A shows an implementation of a sequence of operations, much like a subroutine, in which a core controller or processor loads address information into a DSP program counter, in order to fetch parameter information from the memory.
- Fig. 6B shows an implementation wherein the DSP function is executed under the control of an MDSPC under microcode control.
- the invention is illustrated as integrated with a von Neumann microprocessor architecture.
- Figs. 7A and 7B illustrate applications of the present invention in the context of a Harvard-type architecture.
- the system of Fig. 1 illustrates an alternative stand-alone or coprocessor implementation.
- Next is a description of how to implement a shared, reconfigurable memory system.
- FIGURE 8
- Fig. 8 is a conceptual diagram illustrating a reconfigurable memory architecture according to the present invention.
- a memory or a block of memory includes rows from 0 through Z.
- a first portion of the memory 266, addresses 0 to X, is associated, for example, with an execution unit (not shown).
- a second (hatched) portion of the memory 280 extends from addresses from X+1 to Y.
- a third portion of the memory 262, extending from addresses Y+1 to Z, is associated, for example, with a DMA or I/O channel.
- by "associated" here we mean that a given memory segment can be accessed directly by the designated DMA or execution unit as further explained herein.
- the second segment 280 is reconfigurable in that it can be switched so as to form a part of the execution segment 266 or become part of the DMA segment 262 as required.
- each memory word or row includes data and/or coefficients, as indicated on the right side of the figure.
- configuration control bits are shown to the left of dashed line 267.
- This extended portion of the memory can be used for storing a bit configuration table that provides configuration control bits as described previously with reference to the bit configuration table 140 of Figs. 5A and 6A.
- These selection bits can include write enable, read enable, and other control information. So, for example, when the execution segment 266 is swapped to provide access by the DMA channel, configuration control bits in 266 can be used to couple the DMA channel to the I/O port of segment 266 for data transfer. In this way, a memory access or software trap can be used to reconfigure the system without delay.
- the configuration control bits shown in Fig. 8 are one method of effecting memory reconfiguration that relates to the use of a separate address generator and bit configuration table as shown in Figs. 5A and 7A. This approach effectively drives an address configuration state machine and requires considerable overhead processing to maintain the configuration control bits in a consistent and current state.
- the configuration control bits are unnecessary because the MDSPC modifies the configuration of memory algorithmically based upon the microcode executed by the MDSPC. Therefore, the MDSPC maintains the configuration of the memory internally rather than as part of the reconfigured memory words themselves.
- Fig. 9 illustrates connection of address and data lines to a memory of the type described in Fig. 8.
- a DMA or I/O channel address port 102 provides sufficient address lines for accessing both the rows of the DMA block of memory 262, indicated as bus 270, as well as the reconfigurable portion of the memory 280, via additional address lines indicated as bus 272.
- when the block 280 is configured as a part of the DMA portion of the memory, the DMA memory effectively occupies the memory space indicated by the brace 290 and the address lines 272 are controlled by the DMA channel 102.
- Fig. 9 also shows an address generator 104 that addresses the execution block of memory 266 via bus 284.
- Address generator 104 also provides additional address lines for controlling the reconfigurable block 280 via bus 272.
- the execution block of memory has a total size indicated by brace 294, while the DMA portion is reduced to the size of block 262.
- Fig. 9 indicates data access ports 110 and 120.
- the upper data port 110 is associated with the DMA block of memory, which, as described, is of selectable size.
- port 120 accesses the execution portion of the memory. Circuitry for selection of input (write) data sources and output (read) data destinations for a block of memory was described earlier.
- the entire block need not be switched in toto to one memory block or the other.
- the reconfigurable block preferably is partitionable so that a selected portion (or all) of the block can be switched to join the upper or lower block.
- the granularity of this selection is a matter of design choice, at a cost of additional hardware, e.g. sense amps, as the granularity increases, as further explained later.
- Fig. 10 illustrates a system that implements a reconfigurable segment of memory 280 under bit selection table control.
- a reconfigurable memory segment 280 receives a source address from either the AG or DMA source address generator 274 and it receives a destination address from either the AG or DMA destination address generator 281.
- Write control logic 270 for example a word wide multiplexer, selects write input data from either the DMA channel or the execution unit according to a control signal 272.
- the source address generator 274 includes bit table control circuitry 276.
- the configuration control circuitry 276, either driven by a bit table or under microcode control, generates the write select signal 272.
- the configuration control circuitry also determines which source and destination address lines are coupled to the memory - either the "AG" (address generator) lines when the block 280 is configured as part of an "AG" memory block for access by the execution unit, or the "DMA" address lines when the block 280 is configured as part of the DMA or I/O channel memory block.
- the configuration control logic provides enable and/or clock controls to the execution unit 282 and to the DMA channel.
- FIGURE 11
- Fig. 11 is a partial block/partial schematic diagram illustrating the use of a single-ported RAM in a DSP computing system according to the present invention.
- a single-ported RAM 300 includes a column of memory cells 302, 304, etc. Only a few cells of the array are shown for clarity.
- a source address generator 310 and destination address generator 312 are arranged for addressing the memory 300. More specifically, the address generators are arranged to assert a selected one address line at a time to a logic high state.
- the term "address generator” in this context is not limited to a conventional DSP address generator. It could be implemented in various ways, including a microprocessor core, microcontroller, programmable sequencer, etc.
- Address generation can be provided by a micro-coded machine. Other implementations that provide DSP type of addressing are deemed equivalents. However, known address generators do not provide control and configuration functions such as those illustrated in Fig. 10 - configuration bits 330.
- the corresponding address lines from the source and destination blocks 310, 312 are logically "ORed" together, as illustrated by OR gate 316, with reference to the top row of the memory comprising memory cell 302. Only one row address line is asserted at a given time.
- a multiplexer 320 selects data either from the DMA or from the execution unit, according to a control signal 322 responsive to the configuration bits in the source address generator 310.
- the selected data is applied through drivers 326 to the corresponding column of the memory array 300 (only one column, i.e. one pair of bit lines, is shown in the drawing). For each column, the bit lines also are coupled to a sense amplifier 324, which in turn provides output or write data to the execution unit 326 and to the DMA 328 via path 325.
- the execution unit 326 is enabled by an execution enable control signal responsive to the configuration bits 330 in the destination address block 312. Configuration bits 330 also provide a DMA control enable signal at 332.
- the key here is to eliminate the need for a two-ported RAM cell by using a logical OR of the last addresses from the destination and source registers (located in the corresponding destination or source address generators). Source and destination operations are not simultaneous, but operation is still fast. A source write cycle followed by a destination read cycle would take only a total time of two memory cycles.
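- a behavioral sketch of this "virtual two-port" operation (illustrative only): row selection is the logical OR of the mutually exclusive source and destination selects, and a source write cycle followed by a destination read cycle completes in two memory cycles.

```python
# Behavioral sketch (assumption): a single-ported array behaves as a "virtual
# two-port" memory by alternating a source write cycle and a destination read
# cycle; the row actually selected is the OR of the (mutually exclusive) source
# and destination row selects, since only one is asserted per memory cycle.

ram = [0] * 8                       # a tiny single-ported array

def memory_cycle(src_row=None, dst_row=None, write_data=None):
    assert (src_row is None) != (dst_row is None), "one port per cycle"
    row = src_row if src_row is not None else dst_row   # logical OR of row selects
    if src_row is not None:
        ram[row] = write_data       # source write cycle
        return None
    return ram[row]                 # destination read cycle

memory_cycle(src_row=3, write_data=42)       # cycle 1: write
print(memory_cycle(dst_row=3))               # cycle 2: read -> 42 (two cycles total)
```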
- Fig. 12 illustrates a first segment of memory 400 and a second memory segment 460.
- in the first segment 400, only a few rows and a few cells are shown for purposes of illustration.
- One row of the memory begins at cell 402, a second row of the memory begins at cell 404, etc. Only a single bit line pair, 410, is shown for illustration.
- a first write select circuit such as a multiplexer 406 is provided for selecting a source of write input data.
- one input to the select circuit 406 may be coupled to a DMA channel or memory block M1.
- a second input to the MUX 406 may be coupled to an execution unit or another memory block M2.
- we use the designations M1, M2, etc., to refer generically, not only to other blocks of memory, but to execution units or other functional parts of a DSP system in general.
- the multiplexer 406 couples a selected input source to the bit lines in the memory segment 400.
- the select circuit couples all, say 64 or 128 bit lines, for example, into the memory. Preferably, the select circuit provides the same number of bits as the word size.
- bit lines extend through the memory array segment to a second write select circuit 420.
- This circuit selects the input source to the second memory segment 460. If the select circuit 420 selects the bit lines from memory segment 400, the result is that memory segment 400 and the second memory segment 460 are effectively coupled together to form a single block of memory.
- the second select circuit 420 can select write data via path 422 from an alternative input source.
- a source select circuit 426 for example a similar multiplexer circuit, can be used to select this input from various other sources, indicated as M2 and M1. When the alternative input source is coupled to the second memory segment 460 via path 422, memory segment 460 is effectively isolated from the first memory segment 400.
- bit lines of memory segment 400 are directed via path 430 to sense amps 440 for reading data out of the memory segment 400.
- when the two memory segments are coupled together, sense amps 440 can be set to a disabled or low-power standby state, since they need not be used.
- Fig. 13 shows detail of the input selection logic for interfacing multiple memory segments.
- the first memory segment bit line pair 410 is coupled to the next memory segment 460, or conversely isolated from it, under control of pass devices 466.
- the input select logic 426 includes a first pair of pass transistors 462 for connecting bit lines from source M1 to bit line drivers 470.
- a second pair of pass transistors 464 controllably couples an alternative input source M2 bit lines to drivers 470.
- the pass devices 462, 464, and 466 are all controllable by control bits originating, for example, in the address generator circuitry described above with reference to Fig. 9. Pass transistors, transmission gates or the like can be considered equivalents for selecting input (write data) sources.
- Fig. 14 is a high-level block diagram illustrating extension of the architectures of Figs. 12 and 13 to a plurality of memory segments. Details of the selection logic and sense amps are omitted from this drawing for clarity. In general, this drawing illustrates how any available input source can be directed to any segment of the memory under control of the configuration bits.
- Fig. 15 is another block diagram illustrating a plurality of configurable memory segments with selectable input sources, as in Fig. 14.
- multiple sense amps 482, 484, 486, are coupled to a common data output latch 480.
- sense amp 484 provides read bits from that combined block, and sense amp 482 can be idle.
- Figs. 16A through 16D are block diagrams illustrating various configurations of multiple, reconfigurable blocks of memory.
- the designations M1 , M2, M3, etc. refer generically to other blocks of memory, execution units, I/O channels, etc.
- in Fig. 16A, four segments of memory are coupled together to form a single, large block associated with input source M1.
- a single sense amp 500 can be used to read data from this common block of memory (to a destination associated with M1).
- in Fig. 16B, the first block of memory is associated with resource M1, and its output is provided through sense amp 502.
- the other three blocks of memory, designated M2, are configured together to form a single block of memory - three segments long - associated with resource M2.
- sense amp 508 provides output from the common block (3xM2), while sense amps 504 and 506 can be idle.
- Figs. 16C and 16D provide additional examples that are self explanatory in view of the foregoing description. This illustration is not intended to imply that all memory segments are of equal size. To the contrary, they can have various sizes as explained elsewhere herein.
- Fig. 17 is a high-level block diagram illustrating a DSP system according to the present invention in which multiple memory blocks are interfaced to multiple execution units so as to optimize performance of the system by reconfiguring it as necessary to execute a given task.
- a first block of memory M1 provides read data via path 530 to a first execution unit ("EXEC A") and via path 532 to a second execution unit ("EXEC B").
- Execution unit A outputs results via path 534 which in turn is provided both to a first multiplexer or select circuit MUX-1 and to a second select circuit MUX-2.
- MUX-1 provides selected write data into memory M1.
- a second segment of memory M2 provides read data via path 542 to execution unit A and via path 540 to execution unit B.
- Output data or results from execution unit B are provided via path 544 to both MUX-1 and to MUX-2.
- MUX-2 provides selected write data into the memory block M2. In this way, data can be read from either memory block into either execution unit, and results can be written from either execution unit into either memory block.
- a first source address generator S1 provides source addressing to memory block M1.
- Source address generator S1 also includes a selection table for determining read/write configurations.
- S1 provides control bit "Select A" to MUX-1 in order to select execution unit A as the input source for a write operation to memory M1.
- S1 also provides a "Select A” control bit to MUX-2 in order to select execution unit A as the data source for writing into memory M2.
- a destination address generator D1 provides destination addressing to memory block M1.
- D1 also includes selection tables which provide a "Read 1" control signal to execution unit A and a second "Read 1" control signal to execution unit B. By asserting a selected one of these control signals, the selection bits in D1 direct a selected one of the execution units to read data from memory M1.
- a second source address generator S2 provides source addressing to memory segment M2. Address generator S2 also provides a control bit "select B" to MUX-1 via path 550 and to MUX-2 via path 552. These signals cause the corresponding multiplexer to select execution unit B as the input source for write back data into the corresponding memory block.
- a second destination address generator D2 provides destination addressing to memory block M2 via path 560. Address generator D2 also provides control bits for configuring this system. D2 provides a "Read 2" control signal to execution unit A via path 562 and a "Read 2" control signal to execution unit B via path 564 for selectively causing the corresponding execution unit to read data from memory block M2.
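- for illustration, the write-back selection performed by MUX-1 and MUX-2 can be modeled as follows (a sketch, not the actual circuit): each multiplexer sees both execution unit results and passes the one chosen by the "Select A"/"Select B" control bit to its memory block.

```python
# Illustrative sketch only: write-back path selection between two execution
# units and two memory blocks, per the "Select A" / "Select B" control bits.

def mux_output(select_a, result_from_exec_a, result_from_exec_b):
    """Data word passed by MUX-1 (to M1) or MUX-2 (to M2)."""
    return result_from_exec_a if select_a else result_from_exec_b

result_a, result_b = 111, 222            # hypothetical execution unit results
m1_write = mux_output(select_a=True,  result_from_exec_a=result_a, result_from_exec_b=result_b)
m2_write = mux_output(select_a=False, result_from_exec_a=result_a, result_from_exec_b=result_b)
print(m1_write, m2_write)                # -> 111 222 (M1 gets EXEC A's result, M2 gets EXEC B's)
```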
- Fig. 18A illustrates at a high level the parallelism of memory and execution units that becomes available utilizing the reconfigurable architecture described herein.
- a memory block comprising for example 1,000 rows, may have, say, 256 bits and therefore 256 outputs from respective sense amplifiers, although the word size is not critical. 64 bits may be input to each of four parallel execution units E1 - E4.
- the memory block thus is configured into four segments, each segment associated with a respective one of the execution units, as illustrated in Fig. 18B. As suggested in the figure, these memory segments need not be of equal size.
- Fig. 18C shows a further segmentation, and reconfiguration, so that a portion of segment M2 is joined with segment M1 so as to form a block of memory associated with execution unit E1.
- a portion of memory segment M3, designated "M3/2", is joined together with the remainder of segment M2, designated "M2/2", to form a memory block associated with execution unit E2, and so on.
- the choice of one half block increments for the illustration above is arbitrary. Segmentation of the memory may be designed to permit reconfigurability down to the granularity of words or bits if necessary.
- Fig. 19 illustrates an alternative embodiment in which the read bit lines from multiple memory segments, for example read bit lines 604, are directed to a multiplexer circuit 606, or its equivalent, which in turn has an output coupled to shared or common set of sense amps 610.
- Sense amps 610 in turn provide output to a data output latch 612, I/O bus or the like.
- the multiplexer or selection circuitry 606 is responsive to control signals (not shown) which select which memory segment output is "tapped" to the sense amps. This architecture reduces the number of sense amps in exchange for the addition of selection circuitry 606.
- Fig. 20 is a block diagram illustrating a memory system of multiple configurable memory segments having multiple sense amps for each segment. This alternative can be used to improve speed of "swapping" read data paths and reduce interconnect overhead in some applications.
- FIFOs (First In First Out memories) according to the present invention use embedded logic and DRAM (Dynamic Random Access Memory) on the same chip.
- the FIFO products that result from this technology will be far less expensive to manufacture, since these products will have smaller die size than the devices manufactured using conventional SRAM (Static Random Access Memory) technology by Cypress Semiconductor, Integrated Device Technology Inc., and other semiconductor manufacturers.
- the write pointer moves from word count 0 to 1,024, and at the point that the write pointer reaches 1,024 the write pointer stops.
- The read pointer, in response to a Read 64-word command (64 is an example), starts to move from 0 to 64 and stops.
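The pointer behavior just described can be modeled, purely for illustration, by the following C sketch; the 1,024-word size is the example value above, and the function names and data types are assumptions rather than part of the disclosed device.

```c
/* Minimal sketch (not the patented circuit) of the pointer behavior described
 * above: the write pointer advances with each write until it reaches the top
 * of the buffer, and the read pointer advances only in response to an
 * explicit "read N words" command.  Sizes and names are illustrative. */
#include <stdint.h>

#define FIFO_WORDS 1024u

typedef struct {
    uint32_t mem[FIFO_WORDS];
    uint32_t wr_ptr;   /* next address to write, 0..FIFO_WORDS */
    uint32_t rd_ptr;   /* next address to read,  0..wr_ptr     */
} fifo_t;

/* Write one word; the write pointer stops when it reaches 1,024. */
static int fifo_write(fifo_t *f, uint32_t word)
{
    if (f->wr_ptr >= FIFO_WORDS)
        return -1;                 /* write pointer parked at 1,024 */
    f->mem[f->wr_ptr++] = word;
    return 0;
}

/* "Read N words" command: advance the read pointer by up to n words. */
static uint32_t fifo_read_n(fifo_t *f, uint32_t *dst, uint32_t n)
{
    uint32_t moved = 0;
    while (moved < n && f->rd_ptr < f->wr_ptr)
        dst[moved++] = f->mem[f->rd_ptr++];
    return moved;                  /* read pointer stops after n words */
}
```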
- Block A receives the write command and write clock and outputs a write pointer; write data is written into the address of the write pointer via the write port into the input FIFO A, shown in Fig. 23.
- the input FIFO A stores eight words associated with the input stream.
- the eight-word storage is related to matching the bandwidth between the input data write stream and the DRAM bandwidth; e.g., if the DRAM cycles at 40 ns internally (25 MHz) and the input word rate is 5 ns/word, then an eight-word-deep input FIFO or buffer memory is required (note that the input FIFO or buffer memory utilizes SRAM technology); thus 5 ns write times into this input FIFO or buffer are readily achieved.
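The bandwidth-matching relation above amounts to sizing the input buffer to cover one internal DRAM cycle; a minimal sketch of that calculation, assuming the 40 ns and 5 ns example figures, is given below (40 ns / 5 ns per word = 8 words). The function name is an illustrative assumption.

```c
/* Back-of-the-envelope sizing sketch for the input buffer depth described
 * above: the SRAM input FIFO must hold enough words to cover one internal
 * DRAM cycle.  The 40 ns and 5 ns figures are the example values from the
 * text; a real design would substitute its own timing. */
#include <stdio.h>

static unsigned input_fifo_depth(double dram_cycle_ns, double input_word_ns)
{
    /* round up so a full DRAM cycle never overruns the buffer */
    double d = dram_cycle_ns / input_word_ns;
    unsigned depth = (unsigned)d;
    return (d > depth) ? depth + 1 : depth;
}

int main(void)
{
    printf("%u words\n", input_fifo_depth(40.0, 5.0));   /* prints: 8 words */
    return 0;
}
```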
- the entire eight words are unloaded into a write interface to DRAM, illustrated at B, that includes the digital synchronization logic to deal with the asynchronous clocks for write and the internal DRAM clock.
- if the DRAM clock is synchronous with the write and read clocks, digital synchronization logic will not have to be included in the interface mechanisms, i.e., the write interface for DRAM B and the read interface for DRAM G.
- the write address logic receives control and write address information from the write count and control block to enable loading of the write interface B data into the correct row in the DRAM block.
- the write count and control block H, together with the read count and control block I, will enable refreshing of the DRAM.
- if new data entering the FIFO DRAM system on a chip is not written into and then read out of the FIFO sufficiently fast, refresh of the DRAM is required, and the following methods would be utilized.
- the write input FIFO size would be doubled to hold 16 words; this implies running at a high input speed of 200 MHz.
- Incoming data would take a minimum of 80 ns to fill the input FIFO buffer. If the cycle time of the DRAM is 80 ns, then 40 ns would be used to implement a refresh operation under control of the write control block and 40 ns would be used to implement a write to a row in DRAM. If information comes in very slowly or in bursts requiring refresh, the write count and control block H together with the read count and control block I will feed appropriate control signals to the refresh control block to enable refreshing of the DRAM.
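A hedged sketch of how such a controller might interleave refresh and write slots within the 80 ns DRAM cycle follows; the helper functions and the simple alternating schedule are assumptions for illustration, not the actual control logic of blocks H and I.

```c
/* Illustrative scheduling sketch (an assumption about controller behavior,
 * not taken directly from the figures): with a doubled 16-word input buffer
 * and an 80 ns DRAM cycle, each cycle is split into a 40 ns refresh slot and
 * a 40 ns row-write slot.  dram_refresh_row() and dram_write_row() are
 * hypothetical helpers standing in for the refresh control and write logic. */
#include <stdint.h>

extern void dram_refresh_row(unsigned row);                       /* hypothetical */
extern void dram_write_row(unsigned row, const uint32_t *words);  /* hypothetical */

typedef struct {
    unsigned refresh_row;   /* next row due for refresh     */
    unsigned rows;          /* total rows in the DRAM block */
} refresh_ctl_t;

void dram_cycle_80ns(refresh_ctl_t *rc,
                     int input_buffer_full,        /* 16 words accumulated? */
                     const uint32_t *buffered_words,
                     unsigned write_row)
{
    /* first 40 ns slot: refresh one row */
    dram_refresh_row(rc->refresh_row);
    rc->refresh_row = (rc->refresh_row + 1) % rc->rows;

    /* second 40 ns slot: commit buffered input words, if any, to a DRAM row */
    if (input_buffer_full)
        dram_write_row(write_row, buffered_words);
}
```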
- the read operation into the read interface block G results in not only latching of the output data from the DRAM block but also digital synchronization of the data to the output read clock. Again, under control of the read count and control block, data is parallel loaded from the read interface latches to the output FIFO. Then, under control of the cycling read pointer, information is read out of the read port. Note that the text above describes a flow of information through the FIFO-DRAM system. The flow described must not be confused with the fact that the output FIFO is always pre-loaded to immediately read data out of its read port F; the detailed mechanisms to accomplish this are described below. The construction of the FIFO block F for reading is described below with reference to Fig. 25.
- Output FIFO reads output data directly after being loaded with a multiple word load via the DRAM interface.
- the cycling of the DRAM follows the following criteria: at least one FIFO must be completely full via loading from DRAM at any given instant in time; thus when FIFO (1) starts to read out, FIFO (2) must be full. FIFO (1) does not have to receive a complete parallel load operation until it empties and read operations from FIFO (2) are initiated.
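The full/empty criterion above is essentially a ping-pong (double-buffered) output stage. The following C model is a sketch under that assumption; the buffer depth, names, and software framing are illustrative only.

```c
/* Ping-pong sketch of the criterion stated above: one output FIFO must be
 * completely full (loaded from DRAM) before reads from the other begin.
 * This is an assumed software model of the behavior, not the circuit itself. */
#include <stdint.h>
#include <string.h>

#define OUT_FIFO_WORDS 8u

typedef struct {
    uint32_t buf[2][OUT_FIFO_WORDS];
    unsigned full[2];          /* buffer completely loaded from DRAM?    */
    unsigned rd_idx;           /* word being read from the active buffer */
    unsigned active;           /* buffer currently driving the read port */
} out_fifo_t;

/* Parallel load of one whole buffer from the DRAM read interface (block G). */
void out_fifo_load(out_fifo_t *f, unsigned which, const uint32_t row[OUT_FIFO_WORDS])
{
    memcpy(f->buf[which], row, sizeof f->buf[which]);
    f->full[which] = 1;
}

/* Read one word from the read port; swap buffers only when the active one
 * empties, by which time the other must already have been filled. */
int out_fifo_read(out_fifo_t *f, uint32_t *word)
{
    if (!f->full[f->active])
        return -1;                            /* nothing staged yet */
    *word = f->buf[f->active][f->rd_idx++];
    if (f->rd_idx == OUT_FIFO_WORDS) {
        f->full[f->active] = 0;               /* now free for the next DRAM load */
        f->active ^= 1u;                      /* other buffer takes over          */
        f->rd_idx = 0;
    }
    return 0;
}
```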
- the size of the output FIFO buffers is a function of the following criteria: (1) Refresh requirements - does the application require memory refresh, or is data written into and read from the DRAM quickly enough that refresh of the DRAM is not required? No refresh implies a smaller output buffer capacity requirement. (2) Clock speed of output reads from the FIFO - higher speed clock rates will require more FIFO capacity than required by low speed clocks. (3) Continuous operation of the input FIFOs for write operations in conjunction with read operations occurring in parallel - this in effect increases buffer sizes due to the requirement to allow for sequential access of the DRAM, e.g., one read, one write, one read, etc.
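Purely as an illustration of how these criteria might combine, the following assumed heuristic adds capacity for refresh and for interleaved read/write access on top of a ping-pong baseline; it is not a formula taken from the disclosure.

```c
/* Rough sizing heuristic reflecting the criteria listed above (an assumed
 * formula, for illustration only): more capacity is budgeted when refresh is
 * required and when reads and writes are interleaved on the DRAM. */
unsigned output_fifo_words(unsigned dram_row_words,
                           int needs_refresh,
                           int interleaved_read_write)
{
    unsigned words = 2 * dram_row_words;        /* ping-pong baseline      */
    if (needs_refresh)          words += dram_row_words;
    if (interleaved_read_write) words += dram_row_words;
    return words;
}
```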
- the input FIFOs A would in effect have a read port tied to the read port of the output FIFO (F) to accomplish this operation, since the DRAM bank is empty.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Advance Control (AREA)
Abstract
A FIFO memory device is shown which uses embedded logic and DRAM on the same chip. A memory controller controls the transfer of data from input and to output ports to match the access speed of the DRAM to the input and output requirements for the FIFO system. High speed FIFO performance can thus be obtained from relatively high density, low cost and slow access DRAM. The FIFO products that result from this technology will be far less expensive to manufacture.
Description
FIFO MEMORY DEVICE USING SHARED RECONFIGURABLE MEMORY BLOCK
Related application data: This application claims priority from U.S. Pat. Appln. Ser. No. 60/058,767 filed Sept. 12, 1997.
FIELD OF THE INVENTION The present invention is generally in the field of digital computing and, more specifically, is directed to a memory subsystem that combines shared, reconfigurable memory techniques together with a micro-coded controller in the context of a FIFO memory device.
BACKGROUND OF THE INVENTION The prior application, entitled "Shared, Reconfigurable Memory Architectures for Digital Signal Processing," U.S. Pat. Appln. Ser. No. 08/821326 filed March 21, 1997, incorporated herein by reference, described the need to improve digital signal processing performance while containing or reducing cost. That application describes improved computer architectures that utilize available memory resources more efficiently by providing for shared and reconfigurable memory so as to reduce I/O processor requirements for computation intensive tasks such as digital signal processing. The memory systems described in the prior case are shared in the sense that a given block of memory can first be configured for access by the CPU, for example to load data, and then "swapped" so that the same block of physical memory can be directly accessed by an execution unit, for example a DSP execution unit, to carry out various calculations on that data. After the calculations are completed, the same block of memory can be "swapped" once again, so that the CPU has immediate access to the results. The memory is reconfigurable in a variety of ways, as described below, so as to allocate memory resources as between the CPU and the execution unit (or multiple execution units) in the
most efficient manner possible. Reconfiguring the memory can include forming memory blocks of various sizes; selecting write (input) sources; selecting read (destination) targets; selecting word size, and so forth. Various particulars and alternative embodiments are set forth below, so as to enable one skilled in the art to implement shared, reconfigurable memory architectures.
The parent case described the invention with reference to digital signal processing. However, DSP is just one example of computation-intensive calculation. The concepts of the prior case as well as the present invention are applicable to a wide variety of execution tasks, including but not limited to DSP and related tasks such as motion picture encoding, decoding, and encryption, decryption, etc. Moreover, the principles of the parent application can be applied advantageously in the context of a FIFO memory. Accordingly, the present specification adds additional disclosure directed to application of shared, configurable memory and tightly coupled execution units to the FIFO memory context. Another aspect of the prior application is a memory-centric DSP controller
("MDSPC"). The MDSPC was described as providing memory address generation and a variety of other control functions, including reconfiguring the memory as summarized above to support a particular computation in the execution unit. The name "MDSPC" was appropriate for that controller in the context of the parent case, in which the preferred embodiment was described for digital signal processing. However, the principles of the parent case and the present invention are not so limited. Accordingly, the present application includes description of a "controller" which is functionally similar to the "MDSPC" introduced in the parent case. Drawing Figs. 1-22 and the corresponding description herein were included in the parent application. The present application includes additional drawing Figs. 23-28.
SUMMARY OF THE INVENTION The parent application describes a memory subsystem that is partitioned into two or more blocks of memory space. One block of the memory communicates with an I/O or
DMA channel to load data, while the other block of memory simultaneously communicates with one or more execution units that carry out arithmetic operations on data in the second block. Results are written back to the second block of memory. Upon conclusion of that process, the memory blocks are effectively "swapped" so that the second block, now holding processed (output) data, communicates with the I/O channel to output that data, while the execution unit communicates with the first block, which by then has been filled with new input data. Methods and apparatus are shown for implementing this memory swapping technique in real time so that the execution unit is never idle. The present application extends these concepts to FIFO memory. Another aspect of the parent case describes interfacing two or more address generators to the same block of memory, so that memory block swapping can be accomplished without the use of larger multi-ported memory cells. The present application extends these concepts to FIFO memory as well.
The parent application identified earlier describes the concept of reconfiguring an execution unit in several ways, including selectable depth (number of pipeline stages) and width (i.e. multiple word sizes concurrently). Preferably the pipelined execution unit(s) includes internal register files with feedback. The execution unit configuration and operation also can be controlled by execution unit configuration control signals. The execution unit configuration control signals can be determined by "configuration bits" stored in the memory, or stored in a separate "configuration table". The configuration table can be downloaded by the host core processor, and/or updated under software control. Preferably, the configuration control signals are generated by the controller mentioned above executing microcode.
This combination of reconfigurable memory, together with reconfigurable execution units, and the associated techniques for efficiently moving data between them, provides an architecture that is highly flexible. Microcoded software can be used to take advantage of this architecture so as to achieve new levels of performance. Because the circuits described herein require only one-port or two-port memory cells, they allow higher density and the associated advantages of lowered power dissipation, reduced
capacitance, etc., in the preferred integrated circuit embodiments, whether implemented as a stand-alone coprocessor, or together with a standard processor core, or by way of modification of an existing processor core design. An important feature of the architectures described herein is that they provide a tightly coupled relationship between memory and execution units. This feature provides the advantages of reducing internal interconnect requirements, thereby lowering power consumption. In addition, the invention provides for doing useful work on virtually all clock cycles. This feature minimizes power consumption as well. The present application extends these concepts to FIFO memory systems. One object of the present invention therefore is to reduce the effective cost of
FIFO memory by utilizing DRAM technology. Another object of the invention is to provide improved FIFO memory performance at reduced cost by deploying a combination of SRAM and DRAM technologies in a FIFO memory.
Another object of the present invention is to improve performance in execution of complex operations by tightly coupling a FIFO memory to an execution unit. A further object of the invention is to improve effective FIFO density and reduce costs through new applications of DRAM memory cells, combined with a new local FIFO memory controller strategy.
Yet another object of the present invention is to apply reconfigurable memory circuits and methodologies to improve performance in connection with FIFO memory applications. A still further object is to apply shared, reconfigurable DRAM memory circuits and methodologies to FIFO memory applications in order to build high performance computing engines. A further object is to provide for concurrent execution of a calculation using a tightly coupled execution unit, while allowing concurrent access to a FIFO memory subsystem by the CPU.
One aspect of the present invention is a FIFO memory subsystem that includes a microprogrammable controller. The subsystem also includes address selection circuitry so as to provide a selected address to the FIFO memory from any of several address sources. The address sources can include the CPU address bus, the local controller, and
potentially another circuit arranged for addressing the memory in connection with downloading execution parameters.
The controller executes microcode which can be stored in any of at least three places. First, the microcode can be stored on board the local controller. Second, the microcode can be stored in a separate read-only memory, e.g., ROM or flash memory. Third, the microcode can be stored in the data portion of the FIFO memory.
Another aspect of the invention provides for downloading microcode from the CPU into a portion of the FIFO data memory for subsequent execution by the local controller. The microcode can include configuration bits, op-codes for the execution unit, constants, parameters, etc. which the controller in turn provides to the execution unit. A further aspect of the invention includes providing an address decoder coupled to the CPU address line, for detecting assertion of a predetermined address that is used to trigger a particular execution. In response to detecting the predetermined address, the controller then configures the execution unit as appropriate, and begins execution of a microcoded sequence to carry out the corresponding calculation in the execution unit. By applying the principles of shared, reconfigurable memory architecture as described in the parent case, these aspects can be implemented while allowing for concurrent access to the FIFO memory by the CPU. Upon completion of its task in the execution unit, the controller can notify the CPU, for example by writing control bits to a special memory address monitored by the CPU.
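A minimal software-level sketch of this trigger-and-notify flow is given below; the specific addresses and helper functions are hypothetical placeholders, since the actual decoder, controller, and monitored location are implementation details of the hardware described above.

```c
/* Hedged sketch of the trigger/notify flow described above.  The addresses
 * and helper functions are invented for illustration only. */
#include <stdint.h>

#define DSP_TRIGGER_ADDR 0x0000F000u   /* hypothetical "start calculation" address  */
#define DSP_DONE_ADDR    0x0000F004u   /* hypothetical status word monitored by CPU */

extern void configure_execution_unit(void);             /* hypothetical */
extern void run_microcode_sequence(void);               /* hypothetical */
extern void write_memory(uint32_t addr, uint32_t val);  /* hypothetical */

/* Called (conceptually) whenever the address decoder sees a CPU address. */
void on_cpu_address(uint32_t addr)
{
    if (addr == DSP_TRIGGER_ADDR) {
        configure_execution_unit();       /* apply configuration bits          */
        run_microcode_sequence();         /* controller executes its microcode */
        write_memory(DSP_DONE_ADDR, 1u);  /* notify the CPU when finished      */
    }
}
```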
The foregoing and other objects, features and advantages of the invention will become more readily apparent from the following detailed description of a preferred embodiment of the invention which proceeds with reference to the accompanying drawings. BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a system level block diagram of an architecture for digital signal processing (DSP) using shared memory according to the present invention.
FIG. 2 illustrates circuitry for selectively coupling two or more address generators
to a single block of memory.
FIG. 3 is a block diagram illustrating portions of the memory circuitry and address generators of Fig. 1 in a fixed-partition memory configuration.
FIG. 4 shows more detail of address and bit line connections in a two-port memory system of the type described.
FIGS. 5A-5C illustrate selected address and control signals in a Processor Implementation of a DSP system, i.e. a complete DSP system integrated on a single chip.
FIG. 6A illustrates an alternative embodiment in which a separate DSP program counter is provided for accessing the memory. FIG. 6B illustrates an alternative embodiment in which an MDSPC accesses the memory.
FIGS. 7A-B are block diagrams that illustrate embodiments of the invention in a Harvard architecture.
FIG. 8 is a conceptual diagram that illustrates a shared, reconfigurable memory architecture according to the present invention.
FIG. 9 illustrates connection of address lines to a shared, reconfigurable memory with selectable (granular) partitioning of the reconfigurable portion of the memory.
FIG. 10 illustrates a system that implements a reconfigurable segment of memory under bit selection table control. FIG. 11 A is a block diagram illustrating an example of using single-ported RAM in a DSP computing system according to the present invention.
FIG. 11B is a table illustrating a pipelined timing sequence for addressing and accessing the one-port memory so as to implement a "virtual two-port" memory.
FIG. 12 illustrates a block of memory having at least one reconfigurable segment with selectable write and read data paths.
FIG. 13A is a schematic diagram showing detail of one example of the write selection circuitry of the reconfigurable memory of Fig. 12.
FIG. 13B illustrates transistor pairs arranged for propagating or isolating bit lines as an alternative to transistors 466 in Fig. 13A or as an alternative to the bit select
transistors 462, 464 of Fig. 13A.
FIG. 14 is a block diagram illustrating extension of the shared, reconfigurable memory architecture to multiple segments of memory.
FIG. 15 is a simplified block diagram illustrating multiple reconfigurable memory segments with multiple sets of sense amps.
FIGS. 16A-16D are simplified block diagrams illustrating various examples of memory segment configurations to form memory blocks of selectable size.
FIG. 17 is a block diagram of a DSP architecture illustrating a multiple memory block to multiple execution unit interface scheme in which configuration is controlled via specialized address generators.
FIGS. 18A-18C are simplified block diagrams illustrating various configurations of segments of a memory block into association with multiple execution units.
FIG. 19 is a simplified block diagram illustrating a shared, reconfigurable memory system utilizing common sense amps. FIG. 20 is a simplified block diagram illustrating a shared, reconfigurable memory system utilizing multiple sense amps for each memory segment.
FIG. 21 is a timing diagram illustrating memory swapping cycles.
FIG. 22A is a block diagram illustrating memory swapping under bit table control.
FIG. 22B is a block diagram illustrating memory swapping under MDSPC control. FIG. 23 is a block diagram illustrating an embodiment of a FIFO memory device utilizing reconfigurable memory technology under control of memory-centric read and write count controllers.
FIG. 24 is a block diagram illustrating the function of the FIFO memory device of FIG. 23. FIG. 25 is a block diagram further illustrating the function of the FIFO memory device of FIG. 23.
FIG. 26 is a block diagram illustrating a special case in the function of the FIFO memory device of FIG. 23.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT
FIGURE 1
Fig. 1 is a system-level block diagram of an architecture for memory and computing-intensive applications such as digital signal processing. In Fig. 1, a microprocessor interface 40 includes a DMA port 42 for moving data into a memory via path 46 and reading data from the memory via path 44. Alternatively, a single, bidirectional port could be used. The microprocessor interface 40 generically represents an interface to any type of controller or microprocessor. The interface partition indicated by the dashed line 45 in Fig. 1 may be a physical partition, where the microprocessor is in a separate integrated circuit, or it can merely indicate a functional partition in an implementation in which all of the memory and circuitry represented in the diagram of Fig. 1 is implemented on board a single integrated circuit. Other types of partitioning, use of hybrid circuits, etc., can be used. The microprocessor interface (DMA 42) also includes control signals indicated at 52. The microprocessor or controller can also provide microcode (not shown) for memory control and address generation, as well as control signals for configuration and operation of the functional execution units, as described later.
Because the present invention may be integrated into an existing processor or controller core design, so that both the core processor and the present invention reside in the same integrated circuit, reference will be made herein to the "core processor," meaning the processor to which the present invention has been attached or with which it is integrated.
In Fig. 1, a two-port memory comprises the first memory block 50, labeled "A" and a second memory block 60, labeled "B." The memory is addressed by a source address generator 70 and a destination address generator 80. A functional execution unit 90 also is coupled to the two-port memory via left and right I/O channels, as illustrated at block B. Preferably, these are not conventional two-port memory I/O ports; rather, they have novel structures described later.
In operation, the interface 44, 46 to the two-port memory block A is a DMA interface that is in communication with the host processor or controller 40. Block A
receives data coefficients and optionally other parameters from the controller, and also returns completed data to the controller that results from various DSP, graphics, MPEG encode/decode or other operations carried out in the execution unit 90. This output data can include, for example, FFT results, or convolution data, or graphics rendering data, etc. Thus the single memory can alternately act as both a graphics frame buffer and a graphics computation buffer memory.
Concurrently, the memory block "B" (60) interfaces with the functional execution unit 90. The functional execution unit 90 receives data from the two-port memory block B and executes on it, and then returns results ("writeback") to the same two-port memory structure. The source address generator 70 supplies addressing for source or input data to the functional execution unit while the destination address generator 80 supplies addresses for writing results (or intermediate data) from the execution unit to the memory. Put another way, source address generator 70 provides addressing while the functional execution unit is reading input data from memory block B, and the destination address generator 80 provides addressing to the same memory block B while the functional execution unit 90 is writing results into the memory.
As mentioned above, when the execution unit has completed its work on the data in block B, the memory effectively "swaps" blocks A and B, so that block B is in communication with the DMA channel 42 to read out the results of the execution. Conversely, and simultaneously, the execution unit proceeds to execute on the new input data in block A. This "swapping" of memory blocks includes several aspects, the first of which is switching the memory address generator lines so as to couple them to the appropriate physical block of memory.
In an alternative embodiment, the system can be configured so that the entire memory space (blocks A and B in the illustration) are accessed first by an I/O channel, and then the entire memory swapped to be accessed by the processor or execution unit. In general, any or all of the memory can be reconfigured as described. The memory can be SRAM, DRAM or any other type of random access semiconductor memory or functionally equivalent technology. DRAM refresh is provided by address generators, or may not be
required where the speed of execution and updating the memory (access frequency) is sufficient to obviate refresh.
FIGURE 2
Figure 2 illustrates one way of addressing a memory block with two (or more) address generators. Here, one address generator is labeled "DMA" and the other "ADDR GEN" although they are functionally similar. As shown in Fig. 2, one of the address generators 102 has a series of output lines, corresponding to memory word lines. Each output line is coupled to a corresponding buffer (or word line driver or the like), 130 to 140. Each driver has an enable input coupled to a common enable line 142. The other address generator 104 similarly has a series of output lines coupled to respective drivers 150 to 160. The number of word lines is at least equal to the number of rows of the memory block 200. The second set of drivers also have enable inputs coupled to the common enable control line 142, but note the inverter "bubbles" on drivers 130 to 140, indicating that those drivers have active-low enables. Accordingly, when the control line 142 is low, the DMA address generator 102 is coupled to the memory 200 row address inputs. When the control line 142 is high, the ADDR GEN 104 is coupled to the memory 200 row address inputs. In this way, the address inputs are "swapped" under control of a single bit. Alternative circuitry can be used to achieve the equivalent effect. For example, the devices illustrated can be tri-state output devices, or open collector or open drain structures can be used where appropriate. Other alternatives include transmission gates or simple pass transistors for coupling the selected address generator outputs to the memory address lines. The same strategy can be extended to more than two address sources, as will be apparent to those skilled in the art in view of this disclosure.
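Behaviorally, the Fig. 2 selection scheme reduces to choosing one of two address sources with a single control bit. The following C sketch models only that behavior; in hardware the selection is performed by the complementary-enabled driver banks described above, not by software, and the function-pointer names are illustrative assumptions.

```c
/* Behavioral sketch of the Fig. 2 selection scheme: one control bit chooses
 * which address source drives the memory row lines.  In hardware this is a
 * pair of driver banks with complementary enables; here it is modeled as a
 * simple selection function with hypothetical address-generator callbacks. */
typedef unsigned (*addr_source_fn)(void);   /* hypothetical address generators */

unsigned select_row_address(addr_source_fn dma_ag, addr_source_fn addr_gen,
                            int enable_line_142)
{
    /* enable line low  -> DMA address generator drives the rows
       enable line high -> the other address generator drives the rows */
    return enable_line_142 ? addr_gen() : dma_ag();
}
```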
FIGURE 3
Figure 3 is a block diagram illustrating a physical design of portions of the memory circuitry and address generators of Fig. 1 in a fixed-partition configuration. By "fixed
partition" I mean that the size of memory block A and the size of memory block B cannot change dynamically. In Fig. 3, the memory block A (50) and block B (60) correspond to the same memory blocks of Fig. 1. The memory itself preferably is dynamic RAM, although static RAM or other solid state memory technologies could be used as well. In 5 memory block B, just two bits or memory cells 62 AND 64 are shown by way of illustration. In a typical implementation, the memory block is likely to include thousands or even millions of rows, each row (or word) being perhaps 64 or more bits wide. A typical memory block using today's technology is likely to be one or two megabytes. The memory blocks need not be of equal size. Neither memory depth nor word size is critical to the invention. o Two bits are sufficient here to illustrate the concept without unduly complicating the drawing. The source address generator 70 is coupled to both memory blocks A and B. In block B, the top row includes a series of cells including bit cell 62. In fact, the source address generator preferably has output lines coupled to all of the rows of not only block B, but block A as well, although only one row line is illustrated in block A. Note also that 5 corresponding address lines from the AG 70 and the DMA 102 are shown as connected in common, e.g. at line 69. However, in practice, these address lines are selectable as described above with reference to Fig. 2.
A destination address generator 80 similarly is coupled to the row lines of both blocks of memory. Memory cells 62 and 64 are full two-ported cells on the same column in this example. Thus, either source AG 70 or DMA 102 address the left port, while either destination AG 80 or DMA 100 address the right port. A write select multiplexer 106 directs data either from the DMA (42 in Fig. 1) (or another block of memory) or from the execution unit 90, responsive to a control signal 108. The control signal is provided by the controller or microprocessor of Fig. 1, by a configuration bit, or by an MDSPC. The selected write data is provided to column amplifiers 110, 112 which in turn are connected to corresponding memory cell bit lines. 110 and 112 are bit and /bit ("bit bar") drivers. Below cell 64 is a one-bit sense amplifier 116. A bit output from the sense amp 116 is directed, for example, to a latch 72. Both the DMA and the execution unit are coupled to receive data from latch 72, depending on appropriate control, enable and clock signals
(not shown here). Or, both the DMA and the execution path may have separate latches, the specifics being a matter of design choice. Only one sense amp is shown for illustration, while in practice there will be at least one sense amp for each column. Use of multiple sense amps is described later.
FIGURE 4
Fig. 4 shows more detail of the connection of cells of the memory to source and destination address lines. This drawing shows how the source address lines (when asserted) couple the write bit line and its complement, i.e. input lines 110,112 respectively, to the memory cells. The destination address lines couple the cell outputs to the read bit lines 114, 115 and thence to sense amp 116. Although only one column is shown, in practice write and read bit lines are provided for each column across the full width of the memory word. The address lines extend across the full row as is conventional.
FIGURES 21, 22A AND 22B Timing
Fig. 21 is a conceptual diagram illustrating an example for the timing of operation of the architecture illustrated in Fig. 1. T0A, T1A, etc., are specific instances of two operating time cycles T0 and T1. The cycle length can be predetermined, or can be a parameter downloaded to the address generators. T0 and T1 are not necessarily the same length and are defined as alternating and mutually exclusive, i.e. a first cycle T1 starts at the end of T0, and a second cycle T0 starts at the end of the first period T1, and so on. Both T0 and T1 are generally longer than the basic clock or memory cycle time. Fig. 22A is a block diagram of a single port architecture which will be used to illustrate an example of functional memory swapping in the present invention during repeating T0 and T1 cycles. Execution address generator 70 addresses memory block A (50) during T0 cycles. This is indicated by the left (T0) portion of AG 70. During T1 cycles, execution address generator 70 addresses memory block B (60), as indicated by the right portion of 70. During T1, AG 70 also receives setup or configuration data in preparation
for again addressing Mem Block A during the next T0 cycle. Similarly, during T0, AG 70 also receives configuration data in preparation for again addressing Mem Block B during the next T1 cycle.
DMA address generator 102 addresses memory block B (60) during T0 cycles. This is indicated by the left (T0) portion of DMA AG 102. During T1 cycles, DMA address generator 102 addresses memory block A (50), as indicated by the right portion of 102. During T1, DMA AG 102 also receives setup or configuration data in preparation for again addressing Mem Block B during the next T0 cycle. Similarly, during T0, DMA 102 also receives configuration data in preparation for again addressing Mem Block A during the next T1 cycle.
During a T0 cycle, the functional execution unit (90 in Fig. 1) is operating continuously on data in memory block A 50 under control of execution address generator 70. Simultaneously, DMA address generator 102 is streaming data into memory block B 60. At the beginning of a T1 cycle, memory blocks A and B effectively swap such that execution unit 90 will process the data in memory block B 60 under control of execution address generator 70 and data will stream into memory block A 50 under control of DMA address generator 102. Conversely, at the beginning of a T0 cycle, memory blocks A and B again effectively swap such that execution unit 90 will process the data in memory block A 50 under control of execution address generator 70 and data will stream into memory block B 60 under control of DMA address generator 102.
In Fig. 22B, the functions of the execution address generator and DMA address generator are performed by the MDSPC 172 under microcode control.
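The alternating T0/T1 behavior can be summarized, for illustration only, by the following C model in which the execution unit and the DMA channel exchange blocks at every cycle boundary; dma_fill() and execute_on() are hypothetical stand-ins for the DMA channel and execution unit, and only the address paths swap, never the data itself.

```c
/* Behavioral model (assumed, greatly simplified) of the T0/T1 swapping
 * described above: at each cycle boundary the execution unit and the DMA
 * channel exchange which physical block they address, so the execution unit
 * is never idle waiting for I/O. */
#include <stdint.h>

extern void dma_fill(uint32_t *block);     /* hypothetical: stream in new input data  */
extern void execute_on(uint32_t *block);   /* hypothetical: execution unit processes  */

typedef struct { uint32_t *block_a, *block_b; } mem_t;

void run_cycles(mem_t *m, unsigned n_cycles)
{
    uint32_t *exec_blk = m->block_a;   /* T0: execution unit works on block A */
    uint32_t *dma_blk  = m->block_b;   /* T0: DMA streams into block B        */

    for (unsigned i = 0; i < n_cycles; i++) {
        dma_fill(dma_blk);
        execute_on(exec_blk);

        /* cycle boundary (end of T0 or T1): swap the address paths, not the data */
        uint32_t *tmp = exec_blk;
        exec_blk = dma_blk;
        dma_blk  = tmp;
    }
}
```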
FIGURES 5A-C
Processor Implementation
The preferred architecture for implementation in a processor application, as distinguished from a coprocessor application, is illustrated in Figs. 5A-C. In Fig. 5A, a two-port memory again comprises a block A (150) and a block B (160). Memory block B is
coupled to a DSP execution unit 130. An address generator 170 is coupled to memory block B 160 via address lines 162. In operation, as before, the address generator unit is executing during a first cycle T0 and during time T0 is loading parameters for subsequent execution in cycle T1. The lower memory block A is accessed via core processor data address register 142A or core processor instruction address register 142B. Thus, in this illustration, the data memory and the instructional program memory are located in the same physical memory. A microprocessor system of the Harvard architecture has separate physical memory for data and instructions. The present invention can be used to advantage in the Harvard architecture environment as well, as described below with reference to Figs. 7A and 7B.
Bit Configuration Tables
Fig. 5A also includes a bit configuration table 140. The bit configuration table can receive and store information from the memory 150 or from the core processor, via bus 180, or from an instruction fetched via the core processor instruction address register 142B. Information is stored in the bit configuration table during cycle T0 for controlling execution during the next subsequent cycle T1. The bit configuration table can be loaded by a series of operations, reading information from the memory block A via bus 180 into the bit configuration tables. This information includes address generation parameters and opcodes. Examples of some of the address parameters are starting address, modulo-address counting, and the length of timing cycles T0 and T1. Examples of op codes for controlling the execution unit are the multiply and accumulate operations necessary to perform an FFT.
Essentially, the bit configuration table is used to generate configuration control signal 152 which determines the position of virtual boundary 136 and, therefore, the configuration of memory blocks A and B. It also provides the configuration information necessary for operation of the address generator 170 and the DSP execution unit 130 during the T1 execution cycle time. Path 174 illustrates the execution unit/memory interface control signals from the bit configuration table 140 to the DSP execution unit 130.
Path 176 illustrates the configuration control signal to the execution unit to reconfigure the execution unit. Path 178 illustrates the op codes sent to execution unit 130 which cause the execution unit to perform the operations necessary to process data. Path 188 shows configuration information loaded from the configuration tables into the address generator 170.
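For illustration, the kind of information the bit configuration table is described as holding might be organized as follows; the field names and sizes are assumptions invented for this sketch, not the actual table format.

```c
/* Illustrative (assumed) layout of the information the bit configuration
 * table is described as holding: address-generation parameters and op codes
 * for the next T1 execution cycle.  All field names and widths are invented. */
#include <stdint.h>

typedef struct {
    uint32_t start_address;          /* first operand address for the AG       */
    uint32_t modulo_mask;            /* modulo-address counting parameter      */
    uint32_t t0_length, t1_length;   /* lengths of the alternating cycles      */
    uint16_t exec_opcodes[32];       /* e.g. multiply/accumulate steps for FFT */
    uint32_t virtual_boundary_row;   /* row at which blocks A and B are split  */
} bit_config_table_t;
```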
The architecture illustrated in Fig. 5A preferably would utilize the extended instructions of a given processor architecture to allow the address register from the instruction memory to create the information flow into the bit configuration table. In other words, special instructions or extended instructions in the controller or microprocessor architecture can be used to enable this mechanism to operate as described above. Such an implementation would provide tight coupling to the microprocessor architecture.
Memory-centric DSP Controller
Fig. 5B illustrates an embodiment of the present invention wherein the functions of address generator 170 and bit configuration table 140 of Fig. 5A are performed by memory-centric DSP controller (MDSPC) 172. In the embodiment shown in Fig. 5B, the core processor writes microcode for MDSPC 172 along with address parameters into memory block B 150. Then, under core processor control, the microcode and address parameters are downloaded into local memory within MDSPC 172. A DSP process initiated in MDSPC 172 then generates the appropriate memory configuration control signals 152 and execution unit configuration control signals 176 based upon the downloaded microcode to control the position of virtual boundary 136 and structure execution unit 130 to optimize performance for the process corresponding to the microcode. As the DSP process executes, MDSPC 172 generates addresses for memory block B 160 and controls the execution unit/memory interface to load operands from memory into the execution unit 130 which are then processed by execution unit 130 responsive to op codes 178 sent from MDSPC 172 to execution unit 130. In addition, virtual boundary 136 may be adjusted responsive to microcode during process execution in order to dynamically optimize the memory and execution unit configurations.
In addition, the MDSPC 172 supplies the timing and control for the interfaces between memory and the execution unit. Further, algorithm coefficients to the execution unit may be supplied directly from the MDSPC. The use of microcode in the MDSPC results in execution of the DSP process that is more efficient than the frequent downloading of bit configuration tables and address parameters associated with the architecture of Fig. 5A. The microcoded method represented by the MDSPC results in fewer bits to transfer from the core processor to memory for the DSP process and less frequent updates of this information from the core processor. Thus, the core processor bandwidth is conserved along with the amount of bits required to store the control information.
Fig. 5C illustrates an embodiment of the present invention wherein the reconfigurability of memory in the present invention is used to allocate an additional segment of memory, memory block C 190, which permits MDSPC 172 to execute microcode and process address parameters out of memory block C 190 rather than local memory. This saves the time required for the core processor controlled download of microcode and address parameters to local memory in MDSPC 172 that takes place in the embodiment of Fig. 5B. This embodiment requires an additional set of address 192 and data 194 lines to provide the interface between memory block C 190 and MDSPC 172 and address bus control circuitry 144 under control of MDSPC 172 to disable the appropriate address bits from core processor register file 142. This configuration permits simultaneous access of MDSPC 172 to memory block C 190, DSP execution unit 130 to memory block B and the core processor to memory block A.
Similar to the embodiments shown in Figs. 5A and 5B, virtual boundaries 136A and 136B are dynamically reconfigurable to optimize the memory configuration for the DSP process executing in MDSPC 172.
The bit tables and microcode discussed above may alternatively reside in durable store, such as ROM or flash memory. The durable store may be part of memory block A or may reside outside of memory block A wherein the content of durable store is transferred to memory block A or to the address generators or MDSPC during system initialization.
Furthermore, the DSP process may be triggered by either decoding a preselected bit pattern corresponding to a DSP function into an address in memory block A containing the bit tables or microcode required for execution of the DSP function. Yet another approach to triggering the DSP process is to place the bit tables or microcode for the DSP function at a particular location in memory block A and the DSP process is triggered by the execution of a jump instruction to that particular location. For instance, at system initialization, the microcode to perform a DSP function, such as a Fast Fourier Transform (FFT) or IIR, is loaded beginning at a specific memory location within memory block A. Thereafter, execution of a jump instruction to that specific memory location causes execution to continue at that location thus spawning the DSP process.
FIGURES 6A and 6B
Referring now to Fig. 6A, in an alternative embodiment, a separate program counter 190 is provided for DSP operations. The core controller or processor (not shown) loads information into the program counter 190 for the DSP operation and then that program counter in turn addresses the memory block 150 to start the process for the DSP. Information required by the DSP operations would be stored in memory. Alternatively, any register of the core processor, such as data address register 142A or instruction address register 142B, can be used for addressing memory 150. Bit Configuration Table 140, in addition to generating memory configuration signal 152, produces address enable signal 156 to control address bus control circuitry 144 in order to select the address register which accesses memory block A and also to selectively enable or disable address lines of the registers to match the memory configuration (i.e. depending on the position of virtual boundary 136, address bits are enabled if the bit is needed to access all of memory block A and disabled if block A is smaller than the memory space accessed with the address bit).
Thus, Fig. 6A shows the DSP program counter 190 being loaded by the processor with an address to move into memory block A. In that case, the other address sources in register file 142 are disabled, at least with respect to addressing memory 150. In short,
three different alternative mechanisms are illustrated for accessing the memory 150 in order to fetch the bit configurations and other parameters 140. The selection of which addressing mechanism is most advantageous may depend upon the particular processor architecture with which the present invention is implemented. Fig. 6B shows an embodiment wherein MDSPC 172 is used to generate addresses for memory block A in place of DSP PC 190. Address enable signal 156 selects between the address lines of MDSPC 172 and those of register file 142 in response to the microcode executed by MDSPC 172. As discussed above, if the microcode for MDSPC 172 resides in memory block A or a portion thereof, MDSPC 172 will be executing out of memory block A and therefore requires access to the content of memory block A.
Memory Arrangement
Referring again to Fig. 5, memory blocks A (150) and B (160) are separated by "virtual boundary" 136. In other words, block A and block B are portions of a single, common memory, in a preferred embodiment. The location of the "virtual boundary" is defined by the configuration control signal generated responsive to the bit configuration table parameters. In this regard, the memory is reconfigurable under software control. Although this memory has a variable boundary, the memory preferably is part of the processor memory; it is not contemplated as a separate memory distinct from the processor architecture. In other words, in the processor application illustrated by Figs. 5 and 6, the memory as shown and described is essentially reconfigurable directly into the microprocessor itself. In such a preferred embodiment, the memory block B, 160, duly configured, executes into the DSP execution unit as shown in Fig. 5. In regard to Fig. 5B, virtual boundary 136 is controlled based on the microcode downloaded to MDSPC 172. Similarly, in Fig. 5C, microcode determines the position of both virtual boundary 136A and 136B to create memory block C 190.
FIGURES 7A and 7B
Fig. 7A illustrates an alternative embodiment, corresponding to Fig. 5A, of the present invention in a Harvard-type architecture, comprising a data memory block A 206 and block B 204, and a separate core processor instruction memory 200. The instruction memory 200 is addressed by a program counter 202. Instructions fetched from the instruction memory 200 pass via path 220 to a DSP instruction decoder 222. The instruction decoder in turn provides addresses for DSP operations, table configurations, etc., to an address register 230. Address register 230 in turn addresses the data memory block A 206. Data from the memory passes via path 240 to load the bit configuration tables etc. 242 which in turn configure the address generator for addressing the data memory block B during the next execution cycle of the DSP execution unit 250. Fig. 6 thus illustrates an alternative approach to accessing the data memory A to fetch bit configuration data. A special instruction is fetched from the instruction memory that includes an opcode field that indicates a DSP operation, or more specifically, a DSP configuration operation, and includes address information for fetching the appropriate configuration for the subroutine.
In the embodiment of Fig. 7B, corresponding to the embodiments in Figs. 5B and 5C, MDSPC 246 replaces AG 244 and Bit Configuration Table 242. Instructions in core processor instruction memory 200 that correspond to functions to be executed by DSP Execution Unit 250 are replaced with a preselected bit pattern which is not recognized as a valid instruction by the core processor. DSP Instruction Decode 222 decodes the preselected bit patterns and generates an address for DSP operations and address parameters stored in data memory A and also generates a DSP control signal which triggers the DSP process in MDSPC 246. DSP Instruction Decode 222 can also be structured to be responsive to output data from data memory A 206 in producing the addresses latched in address register 230.
The DSP Instruction Decode 222 may be reduced or eliminated if the DSP process is initiated by an instruction causing a jump to the bit table or microcode in memory block A pertaining to the execution of the DSP process.
To summarize, the present invention includes an architecture that features shared,
reconfigurable memory for efficient operation of one or more processors together with one or more functional execution units such as DSP execution units. Fig. 6A shows an implementation of a sequence of operations, much like a subroutine, in which a core controller or processor loads address information into a DSP program counter, in order to fetch parameter information from the memory. Fig. 6B shows an implementation wherein the DSP function is executed under the control of an MDSPC under microcode control. In Figs. 5A-C, the invention is illustrated as integrated with a von Neumann microprocessor architecture. Figs. 7A and 7B illustrate applications of the present invention in the context of a Harvard-type architecture. The system of Fig. 1 illustrates an alternative stand-alone or coprocessor implementation. Next is a description of how to implement a shared, reconfigurable memory system.
Reconfigurable Memory Architecture FIGURE 8 Fig. 8 is a conceptual diagram illustrating a reconfigurable memory architecture for
DSP according to another aspect of the present invention. In Fig. 8, a memory or a block of memory includes rows from 0 through Z. A first portion of the memory 266, addresses 0 to X, is associated, for example, with an execution unit (not shown). A second (hatched) portion of the memory 280 extends from addresses from X+1 to Y. Finally, a third portion of the memory 262, extending from addresses Y+1 to Z, is associated, for example, with a DMA or I/O channel. By the term "associated" here we mean a given memory segment can be accessed directly by the designated DMA or execution unit as further explained herein. The second segment 280 is reconfigurable in that it can be switched so as to form a part of the execution segment 266 or become part of the DMA segment 262 as required.
The large vertical arrows in Fig. 8 indicate that the execution portion and the DMA portion of the memory space can be "swapped" as explained previously. The reconfigurable segment 280 swaps together with whichever segment it is coupled to at the time. In this block of memory, each memory word or row includes data and/or coefficients,
as indicated on the right side of the figure.
Additional "configuration control bits" are shown to the left of dashed line 267. This extended portion of the memory can be used for storing a bit configuration table that provides configuration control bits as described previously with reference to the bit 5 configuration table 140 of Figs. 5A and 6A. These selection bits can include write enable, read enable, and other control information. So, for example, when the execution segment 266 is swapped to provide access by the DMA channel, configuration control bits in 266 can be used to couple the DMA channel to the I/O port of segment 266 for data transfer. In this way, a memory access or software trap can be used to reconfigure the system o without delay.
The configuration control bits shown in Fig. 8 are one method of effecting memory reconfiguration that relates to the use of a separate address generator and bit configuration table as shown in Figs. 5A and 7A. This approach effectively drives an address configuration state machine and requires considerable overhead processing to maintain the configuration control bits in a consistent and current state.
When the MDSPC of Figs. 5B, 5C and 7B is used, the configuration control bits are unnecessary because the MDSPC modifies the configuration of memory algorithmically based upon the microcode executed by the MDSPC. Therefore, the MDSPC maintains the configuration of the memory internally rather than as part of the reconfigured memory words themselves.
FIGURE 9
Fig. 9 illustrates connection of address and data lines to a memory of the type described in Fig. 8. Referring to Fig. 9, a DMA or I/O channel address port 102 provides sufficient address lines for accessing both the rows of the DMA block of memory 262, indicated as bus 270, as well as the reconfigurable portion of the memory 280, via additional address lines indicated as bus 272. When the block 280 is configured as a part of the DMA portion of the memory, the DMA memory effectively occupies the memory space indicated by the brace 290 and the address lines 272 are controlled by the DMA
channel 102. Fig. 9 also shows an address generator 104 that addresses the execution block of memory 266 via bus 284. Address generator 104 also provides additional address lines for controlling the reconfigurable block 280 via bus 272. Thus, when the entire reconfigurable segment 280 is joined with the execution block 266, the execution block of memory has a total size indicated by brace 294, while the DMA portion is reduced to the size of block 262.
The address lines that control the reconfigurable portion of the memory are switched between the DMA address source 102 and address generator 104 via switching means 296. Illustrative switching means for addressing a single block of memory from multiple address generators was described above, for example with reference to Fig. 2. The particular arrangement depends in part on whether the memory is single-ported (see Fig. 2) or multi-ported (see Figs. 3-4). Finally, Fig. 9 indicates data access ports 110 and 120. The upper data port 110 is associated with the DMA block of memory, which, as described, is of selectable size. Similarly, port 120 accesses the execution portion of the memory. Circuitry for selection of input (write) data sources and output (read) data destinations for a block of memory was described earlier. Alternative structures and implementation of multiple reconfigurable memory segments are described below. It should be noted that the entire block need not be switched in toto to one memory block or the other. Rather, the reconfigurable block preferably is partitionable so that a selected portion (or all) of the block can be switched to join the upper or lower block. The granularity of this selection (indicated by the dashed lines in 280) is a matter of design choice, at a cost of additional hardware, e.g. sense amps, as the granularity increases, as further explained later.
FIGURE 10
Fig. 10 illustrates a system that implements a reconfigurable segment of memory 280 under bit selection table control. In Fig. 10, a reconfigurable memory segment 280 receives a source address from either the AG or DMA source address generator 274 and it receives a destination address from either the AG or DMA destination address generator
281. Write control logic 270, for example a word-wide multiplexer, selects write input data from either the DMA channel or the execution unit according to a control signal 272. The source address generator 274 includes bit table control circuitry 276. The configuration control circuitry 276, either driven by a bit table or under microcode control, generates the write select signal 272. The configuration control circuitry also determines which source and destination address lines are coupled to the memory - either the "AG" (address generator) lines when the block 280 is configured as part of an "AG" memory block for access by the execution unit, or the "DMA" address lines when the block 280 is configured as part of the DMA or I/O channel memory block. Finally, the configuration control logic provides enable and/or clock controls to the execution unit 282 and to the DMA channel
284 for controlling which destination receives read data from the memory data output port 290.
FIGURE 11 Fig. 11 is a partial block/partial schematic diagram illustrating the use of a single-ported RAM in a DSP computing system according to the present invention. In Fig. 11, a single-ported RAM 300 includes a column of memory cells 302, 304, etc. Only a few cells of the array are shown for clarity. A source address generator 310 and destination address generator 312 are arranged for addressing the memory 300. More specifically, the address generators are arranged to assert a selected one address line at a time to a logic high state. The term "address generator" in this context is not limited to a conventional DSP address generator. It could be implemented in various ways, including a microprocessor core, microcontroller, programmable sequencer, etc. Address generation can be provided by a micro-coded machine. Other implementations that provide DSP type of addressing are deemed equivalents. However, known address generators do not provide control and configuration functions such as those illustrated in Fig. 10 - configuration bits 330. For each row of the memory 300, the corresponding address lines from the source and destination blocks 310, 312, are logically "ORed" together, as illustrated by OR gate 316, with reference to the top row of the memory
comprising memory cell 302. Only one row address line is asserted at a given time. For writing to the memory, a multiplexer 320 selects data either from the DMA or from the execution unit, according to a control signal 322 responsive to the configuration bits in the source address generator 310. The selected data is applied through drivers 326 to the corresponding column of the memory array 300 (only one column, i.e. one pair of bit lines, is shown in the drawing). For each column, the bit lines also are coupled to a sense amplifier 324, which in turn provides output or write data to the execution unit 326 and to the DMA 328 via path 325. The execution unit 326 is enabled by an execution enable control signal responsive to the configuration bits 330 in the destination address block 312. Configuration bits 330 also provide a DMA control enable signal at 332.
The key here is to eliminate the need for a two-ported RAM cell by using a logical OR of the last addresses from the destination and source registers (located in the corresponding destination or source address generators). Source and destination operations are not simultaneous, but operation is still fast. A source write cycle followed by a destination read cycle would take only a total time of two memory cycles.
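A minimal sketch of this "virtual two-port" sequencing, assuming a simple array in place of the RAM core, is shown below; the point is only that the source write and the destination read occupy separate, consecutive memory cycles, so a one-port array suffices.

```c
/* Sketch of the "virtual two-port" idea above using a single-ported array:
 * source and destination operations are never simultaneous, so a write cycle
 * driven by the source address generator is simply followed by a read cycle
 * driven by the destination address generator -- two memory cycles in total. */
#include <stdint.h>

static uint32_t one_port_ram[1024];   /* stand-in for the single-ported array */

void virtual_two_port(unsigned src_row, uint32_t write_data,
                      unsigned dst_row, uint32_t *read_data)
{
    one_port_ram[src_row] = write_data;   /* cycle 1: source write      */
    *read_data = one_port_ram[dst_row];   /* cycle 2: destination read  */
}
```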
FIGURE 12
Fig. 12. The techniques and circuits described above for reconfigurable memory can be extended to multiple blocks of memory so as to form a highly flexible architecture for digital signal processing. Fig. 12 illustrates a first segment of memory 400 and a second memory segment 460. In the first segment 400, only a few rows and a few cells are shown for purposes of illustration. One row of the memory begins at cell 402, a second row of the memory begins at cell 404, etc. Only a single bit line pair, 410, is shown for illustration. At the top of the figure, a first write select circuit such as a multiplexer 406 is provided for selecting a source of write input data. For example, one input to the select circuit 406 may be coupled to a DMA channel or memory block M1. A second input to the MUX 406 may be coupled to an execution unit or another memory block M2. In this discussion, we use the designations M1, M2, etc., to refer generically, not only to other
blocks of memory, but to execution units or other functional parts of a DSP system in general. The multiplexer 406 couples a selected input source to the bit lines in the memory segment 400. The select circuit couples all of the bit lines, say 64 or 128, for example, into the memory. Preferably, the select circuit provides the same number of bits as the word size.
The bit lines, for example bit line pair 410, extend through the memory array segment to a second write select circuit 420. This circuit selects the input source to the second memory segment 460. If the select circuit 420 selects the bit lines from memory segment 400, the result is that memory segment 400 and the second memory segment 460 are effectively coupled together to form a single block of memory. Alternatively, the second select circuit 420 can select write data via path 422 from an alternative input source. A source select circuit 426, for example a similar multiplexer circuit, can be used to select this input from various other sources, indicated as M2 and M1. When the alternative input source is coupled to the second memory segment 460 via path 422, memory segment 460 is effectively isolated from the first memory segment 400. In this case, the bit lines of memory segment 400 are directed via path 430 to sense amps 440 for reading data out of the memory segment 400. When the bit lines of memory segment 400 are coupled to the second segment 460, sense amps 440 can be placed in a disabled or low-power standby state, since they need not be used.
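The following short C model is one way to visualize this chaining behavior: each segment's input select either takes an external source or chains onto the previous segment's bit lines, and the sense amps of any segment that is chained through can be put in standby. The segment count, source names, and field names are illustrative assumptions, not part of the figures.

```c
/* Illustrative sketch, not the actual circuit: a segment either takes write
 * data from an external source (M1, M2, DMA, ...) or "chains" to the
 * previous segment, in which case the two behave as one larger block and
 * the intermediate sense amps can be left in standby.                      */
#include <stdio.h>
#include <stdbool.h>

#define NSEG 4

typedef enum { SRC_CHAIN, SRC_M1, SRC_M2, SRC_DMA } InputSelect;

typedef struct {
    InputSelect input_sel;    /* configuration bits for the write select   */
    bool sense_amp_enabled;   /* only the last segment of a block reads    */
} Segment;

/* A segment's sense amp is needed only if the next segment does NOT chain
 * onto its bit lines; otherwise the data continues down the chain.         */
void resolve_sense_amps(Segment seg[NSEG])
{
    for (int i = 0; i < NSEG; i++) {
        bool next_chains = (i + 1 < NSEG) && seg[i + 1].input_sel == SRC_CHAIN;
        seg[i].sense_amp_enabled = !next_chains;
    }
}

int main(void)
{
    /* Example configuration: segment 0 is fed by M1; segments 1-3 are
     * chained into one block whose write data comes from M2.               */
    Segment seg[NSEG] = {
        { SRC_M1,    false }, { SRC_M2,    false },
        { SRC_CHAIN, false }, { SRC_CHAIN, false },
    };
    resolve_sense_amps(seg);
    for (int i = 0; i < NSEG; i++)
        printf("segment %d: sense amp %s\n", i,
               seg[i].sense_amp_enabled ? "active" : "standby");
    return 0;
}
```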
FIGURE 13
Fig. 13 shows detail of the input selection logic for interfacing multiple memory segments. In Fig. 13, the first memory segment bit line pair 410 is coupled to the next memory segment 460, or conversely isolated from it, under control of pass devices 466. When devices 466 are turned off, read data from the first memory segment 400 is nonetheless available via lines 430 to the sense amps 440. The input select logic 426 includes a first pair of pass transistors 462 for connecting bit lines from source M1 to bit line drivers 470. A second pair of pass transistors 464 controllably couples an alternative input source M2 bit lines to drivers 470. The pass devices 462, 464, and 466, are all
controllable by control bits originating, for example, in the address generator circuitry described above with reference to Fig. 9. Pass transistors, transmission gates or the like can be considered equivalents for selecting input (write data) sources.
FIGURE 14
Fig. 14 is a high-level block diagram illustrating extension of the architectures of Figs. 12 and 13 to a plurality of memory segments. Details of the selection logic and sense amps are omitted from this drawing for clarity. In general, this drawing illustrates how any available input source can be directed to any segment of the memory under control of the configuration bits.
Fig. 15 is another block diagram illustrating a plurality of configurable memory segments with selectable input sources, as in Fig. 14. In this arrangement, multiple sense amps 482, 484, 486, are coupled to a common data output latch 480. When multiple memory segments are configured together so as to form a single block, fewer than all of the sense amps will be used. For example, if memory segment 0 and memory segment 1 are configured as a single block, sense amp 484 provides read bits from that combined block, and sense amp 482 can be idle.
Figs. 16A through 16D are block diagrams illustrating various configurations of multiple, reconfigurable blocks of memory. As before, the designations M1, M2, M3, etc., refer generically to other blocks of memory, execution units, I/O channels, etc. In Fig. 16A, four segments of memory are coupled together to form a single, large block associated with input source M1. In this case, a single sense amp 500 can be used to read data from this common block of memory (to a destination associated with M1). In Fig. 16B, the first block of memory is associated with resource M1, and its output is provided through sense amp 502. The other three blocks of memory, designated M2, are configured together to form a single block of memory - three segments long - associated with resource M2. In this configuration, sense amp 508 provides output from the common block (3xM2), while sense amps 504 and 506 can be idle. Figs. 16C and 16D provide additional examples that are self-explanatory in view of the foregoing description. This illustration is not intended to imply that all memory segments are of equal size. To the contrary, they can have various sizes as explained elsewhere herein.
Fig. 17 is a high-level block diagram illustrating a DSP system according to the present invention in which multiple memory blocks are interfaced to multiple execution units so as to optimize performance of the system by reconfiguring it as necessary to execute a given task. In Fig. 17, a first block of memory M1 provides read data via path 530 to a first execution unit ("EXEC A") and via path 532 to a second execution unit ("EXEC B"). Execution unit A outputs results via path 534, which in turn are provided both to a first multiplexer or select circuit MUX-1 and to a second select circuit MUX-2. MUX-1 provides selected write data into memory M1.
Similarly, a second segment of memory M2 provides read data via path 542 to execution unit A and via path 540 to execution unit B. Output data or results from execution unit B are provided via path 544 to both MUX-1 and to MUX-2. MUX-2 provides selected write data into the memory block M2. In this way, data can be read from either memory block into either execution unit, and results can be written from either execution unit into either memory block.
A first source address generator S1 provides source addressing to memory block M1. Source address generator S1 also includes a selection table for determining read/write configurations. Thus, S1 provides control bit "Select A" to MUX-1 in order to select execution unit A as the input source for a write operation to memory M1. S1 also provides a "Select A" control bit to MUX-2 in order to select execution unit A as the data source for writing into memory M2.
A destination address generator D1 provides destination addressing to memory block M1. D1 also includes selection tables which provide a "Read 1" control signal to execution unit A and a second "Read 1" control signal to execution unit B. By asserting a selected one of these control signals, the selection bits in D1 direct a selected one of the execution units to read data from memory M1.
A second source address generator S2 provides source addressing to memory segment M2. Address generator S2 also provides a control bit "Select B" to MUX-1 via path 550 and to MUX-2 via path 552. These signals cause the corresponding multiplexer to select execution unit B as the input source for write-back data into the corresponding memory block. A second destination address generator D2 provides destination addressing to memory block M2 via path 560. Address generator D2 also provides control bits for configuring the system. D2 provides a "Read 2" control signal to execution unit A via path 562 and a "Read 2" control signal to execution unit B via path 564 for selectively causing the corresponding execution unit to read data from memory block M2.
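A hedged configuration sketch of this interconnect is given below: the destination address generators assert read enables toward the execution units, and the source address generators steer the write-back multiplexers. The C structures and field names are illustrative only; the example configuration reads M1 into EXEC A and M2 into EXEC B while cross-writing the results.

```c
/* Sketch of the Fig. 17 interconnect configuration state: either execution
 * unit can read from either memory block, and either unit's results can be
 * written back to either block.  Names are illustrative assumptions.       */
#include <stdbool.h>

typedef enum { WRITE_FROM_EXEC_A, WRITE_FROM_EXEC_B } WriteSelect;

typedef struct {
    WriteSelect mux1_sel;                    /* write-back source for M1 (S1/S2) */
    WriteSelect mux2_sel;                    /* write-back source for M2 (S1/S2) */
    bool exec_a_reads_m1, exec_b_reads_m1;   /* read enables driven by D1        */
    bool exec_a_reads_m2, exec_b_reads_m2;   /* read enables driven by D2        */
} InterconnectConfig;

/* Example task setup: EXEC A reads operands from M1 and writes results to
 * M2, while EXEC B reads from M2 and writes its results back into M1.      */
InterconnectConfig cross_write_config(void)
{
    InterconnectConfig cfg = {
        .mux1_sel = WRITE_FROM_EXEC_B,       /* "Select B" into MUX-1       */
        .mux2_sel = WRITE_FROM_EXEC_A,       /* "Select A" into MUX-2       */
        .exec_a_reads_m1 = true,  .exec_b_reads_m1 = false,
        .exec_a_reads_m2 = false, .exec_b_reads_m2 = true,
    };
    return cfg;
}
```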
Fig. 18A illustrates at a high level the parallelism of memory and execution units that becomes available utilizing the reconfigurable architecture described herein. In Fig. 18A, a memory block, comprising for example 1,000 rows, may be, say, 256 bits wide and therefore have 256 outputs from respective sense amplifiers, although the word size is not critical. 64 bits may be input to each of four parallel execution units E1 - E4. The memory block thus is configured into four segments, each segment associated with a respective one of the execution units, as illustrated in Fig. 18B. As suggested in the figure, these memory segments need not be of equal size. Fig. 18C shows a further segmentation, and reconfiguration, so that a portion of segment M2 is joined with segment M1 so as to form a block of memory associated with execution unit E1. A portion of memory segment M3, designated "M3/2", is joined together with the remainder of segment M2, designated "M2/2", to form a memory block associated with execution unit E2, and so on. Note, however, that the choice of one-half block increments for the illustration above is arbitrary. Segmentation of the memory may be designed to permit reconfigurability down to the granularity of words or bits if necessary.
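As a simple illustration of this bit-field style of segmentation, the following C tables describe one possible partitioning of a 256-bit row among four execution units, first in equal 64-bit fields and then with boundaries shifted by half a segment. The names and the half-segment granularity are assumptions for illustration only, matching the arbitrary choice noted above.

```c
/* Sketch of the Fig. 18 idea: a 256-bit-wide memory row is carved into
 * fields feeding four parallel execution units E1-E4; the field boundaries
 * are configuration state and can be moved when the memory is reconfigured. */
#include <stdint.h>

#define ROW_BITS 256
#define NUM_EXEC 4

typedef struct {
    int lsb[NUM_EXEC];    /* first bit of each execution unit's field */
    int width[NUM_EXEC];  /* field width in bits                      */
} RowPartition;

/* Fig. 18B-like case: four equal 64-bit fields. */
const RowPartition equal_split = {
    .lsb   = {  0,  64, 128, 192 },
    .width = { 64,  64,  64,  64 },
};

/* Fig. 18C-like reconfiguration: E1 takes one and a half segments (96 bits),
 * E2 takes the remaining half of M2 plus half of M3, and so on (sums to 256). */
const RowPartition shifted_split = {
    .lsb   = {  0,  96, 160, 224 },
    .width = { 96,  64,  64,  32 },
};
```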
FIGURE 19
The use of multiple sense amps for memory segment configuration was described previously with reference to Figs. 15 and 16. Fig. 19 illustrates an alternative embodiment in which the read bit lines from multiple memory segments, for example read bit lines 604, are directed to a multiplexer circuit 606, or its equivalent, which in turn has an output
coupled to a shared or common set of sense amps 610. Sense amps 610 in turn provide output to a data output latch 612, I/O bus or the like. The multiplexer or selection circuitry 606 is responsive to control signals (not shown) which select which memory segment output is "tapped" to the sense amps. This architecture reduces the number of sense amps in exchange for the addition of the selection circuitry 606.
Fig. 20 is a block diagram illustrating a memory system of multiple configurable memory segments having multiple sense amps for each segment. This alternative can be used to improve the speed of "swapping" read data paths and to reduce interconnect overhead in some applications.
FIFO MEMORY APPLICATIONS
The following material relates to a method for implementing First In First Out ("FIFO") memories. These FIFOs use embedded logic and DRAM (Dynamic Random Access Memory) on the same chip. The FIFO products that result from this technology will be far less expensive to manufacture, since they will have smaller die size than devices built with conventional SRAM (Static Random Access Memory) technology, such as those manufactured by Cypress Semiconductor, Integrated Device Technology Inc., and other semiconductor manufacturers. FIFO devices are well described in the literature and will not be described here. The basic concepts behind this invention are described in Fig. 23 through Fig. 26.
Simple Example of DRAM FIFO Technology.
Assume:
The write pointer advances from word count 0 to 1,024, and when the write pointer reaches 1,024 it stops. The read pointer, in response to a Read 64-word command (64 is an example), starts to advance from 0 to 64 and stops.
The example cited in the assumption above will be used to illustrate the flow-through architecture block diagram illustrated in Fig. 23.
In Fig. 23, the write command and write clock are received, and a write pointer is output to block A; write data is written at the address indicated by the write pointer, via the write port, into the input FIFO A shown in Fig. 23. Note that the input FIFO A stores eight words associated with the input stream. The eight-word storage is related to matching the bandwidth between the input data write stream and the DRAM bandwidth; e.g., if the DRAM cycles at 40 ns internally (25 MHz) and the input word rate is 5 ns/word, then an eight-word-deep input FIFO or buffer memory is required (note that the input FIFO or buffer memory utilizes SRAM technology), and thus 5 ns write times into this input FIFO or buffer are readily achieved. After the write pointer steps through the eight words in block A, the entire eight words are unloaded into the write interface to DRAM illustrated in B, which includes the digital synchronization logic to deal with the asynchronous write clock and internal DRAM clock. Note that in certain synchronous applications the DRAM clock will be synchronous with the write and read clocks, and digital synchronization logic will not have to be included in the interface mechanisms (write interface for DRAM B and read interface for DRAM G). Concurrently with the eight words loading into the write interface logic, the write address logic receives control and write address information from the write count and control block to enable loading of the write interface B data into the correct row in the DRAM block. Note that in the case in which write information arrives at a very slow write clock rate, or infrequently, the write count and control block H will enable refreshing of the DRAM together with the row count logic. In addition, in the case in which new data entering the FIFO DRAM system on a chip is not written into and then read out of the FIFO quickly enough to avoid the need for DRAM refresh, the following methods would be utilized.
The write input FIFO size would be doubled to hold 16 words; this implies running at a high input clock speed of 200 MHz, so incoming data would take a minimum of 80 ns to fill the input FIFO buffer. If the cycle time of the DRAM is 80 ns, then 40 ns would be used to implement a refresh operation under control of the write control block and 40 ns would be used to implement a write to a row in DRAM. If information comes in very slowly, or in bursts requiring refresh, the write count and control block H together with the read count and control block I will feed appropriate control signals to the refresh control block to enable refreshing of the DRAM. Note that in applications in which the overall system utilization of the FIFO-DRAM does not require refresh operations, e.g., information enters the FIFO DRAM and then is read out before refresh is required, the chip control logic in H, I, and J would not have to include the design for refresh requirements; also, the input FIFO buffer size and the output FIFO buffer size would not have to be doubled to accommodate continuous refresh operations occurring every 40 ns. Note that in the above examples the 5 ns word rate and the eight-word-deep FIFO input and output buffer sizes are directly related to the ratio of the DRAM cycle time to the input clock period; thus, for DRAM blocks with different access times the buffer size would accordingly decrease or increase. Continuing with the flow-through example, the 8 data words are written in parallel on the same row in memory; at this point the input FIFO buffer is ready to accept new words, since it has just transferred eight words to the DRAM. In the case of the simple flow example shown in Fig. 23, the continuous writing of 1,024 words results in filling 128 rows of the DRAM; see Fig. 24.
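The buffer sizing rule implied by these examples can be captured in a small calculation: the input (or output) buffer must hold one DRAM row's worth of words gathered at the input word rate, and it is doubled when refresh or interleaved read/write operations must share the DRAM cycle. The following C sketch uses the example numbers from the text (40 ns DRAM cycle, 5 ns per word); the function name and interface are assumptions for illustration.

```c
/* Back-of-the-envelope buffer sizing sketch for the input/output FIFOs. */
#include <stdio.h>

int fifo_depth_words(double dram_cycle_ns, double word_period_ns,
                     int needs_refresh_or_interleave)
{
    /* One DRAM access gathers this many words at the port rate.          */
    int depth = (int)(dram_cycle_ns / word_period_ns + 0.5); /* 40/5 = 8  */
    if (needs_refresh_or_interleave)
        depth *= 2;           /* e.g. 16 words when refresh shares cycles */
    return depth;
}

int main(void)
{
    printf("no refresh needed : %d words\n", fifo_depth_words(40.0, 5.0, 0));
    printf("refresh interleave: %d words\n", fifo_depth_words(40.0, 5.0, 1));
    return 0;
}
```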
Note that in DRAM FIFO read operations the opposite flow takes place as contrasted with the write operations. A read command and read clock input to the read count and control block I results in block I generating the read control to the read address block E. Note that this read control to the address block consists of address information plus control information to initiate the read of a row out of the DRAM.
The read operation into the read interface block G results not only in latching of the output data from the DRAM block but also in digital synchronization of the data to the output read clock. Again, under control of the read count and control block, data is parallel loaded from the read interface latches to the output FIFO. Then, under control of a cycling read pointer, information is read out of the read port. Note that the text above describes a flow of information through the FIFO-DRAM system. The flow described must not be confused with the fact that the output FIFO is always pre-loaded to immediately read data out of its read port F; the detailed mechanisms to accomplish this are described below. The construction of the FIFO block F for reading is described below with reference to Fig. 25.
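Before turning to Fig. 25, the overall write and read flow just described can be summarized in a simplified behavioral model: the input FIFO gathers eight words, bursts them into one DRAM row, and on the read side one DRAM row at a time is latched and drained through the output FIFO. This is an illustration only, with illustrative names and no refresh, synchronization, or overflow handling.

```c
/* Very simplified flow model (an illustration, not the claimed device) of
 * Fig. 23: 8-word input FIFO A fills from the write port and is unloaded
 * in parallel through write interface B into one DRAM row; on the read
 * side a DRAM row is latched via read interface G and drained from the
 * output FIFO F one word at a time.                                       */
#include <stdint.h>
#include <string.h>

#define WORDS_PER_ROW 8
#define DRAM_ROWS     128    /* 128 rows x 8 words = 1,024 words (Fig. 24) */

typedef struct {
    uint16_t dram[DRAM_ROWS][WORDS_PER_ROW];
    uint16_t in_fifo[WORDS_PER_ROW];
    uint16_t out_fifo[WORDS_PER_ROW];
    int in_count;            /* words accumulated in input FIFO A          */
    int write_row, read_row; /* row write / read pointers (blocks H and I) */
    int read_pos;            /* read pointer within output FIFO F          */
} FifoDram;

/* Write port: accumulate a word; when 8 words are present, burst the row.
 * (No wraparound or overflow handling in this sketch.)                    */
void fifo_write(FifoDram *f, uint16_t word)
{
    f->in_fifo[f->in_count++] = word;
    if (f->in_count == WORDS_PER_ROW) {               /* unload via B      */
        memcpy(f->dram[f->write_row++], f->in_fifo, sizeof f->in_fifo);
        f->in_count = 0;
    }
}

/* Read port: when the output FIFO is drained, fetch the next DRAM row.
 * The caller is assumed to have written data before reading.              */
uint16_t fifo_read(FifoDram *f)
{
    if (f->read_pos == 0)                             /* load via G        */
        memcpy(f->out_fifo, f->dram[f->read_row++], sizeof f->out_fifo);
    uint16_t w = f->out_fifo[f->read_pos];
    f->read_pos = (f->read_pos + 1) % WORDS_PER_ROW;
    return w;
}
```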
Explanation of Fig. 25
To accomplish real-time read operation of the output FIFO block F, the following events must take place.
1. The output FIFO reads out data directly after being loaded with a multiple-word load via the DRAM interface.
2. Immediately after the load of data into output FIFO (1), another parallel load is initiated via block I from the DRAM and output FIFO (2) is loaded. (This is the condition for maximum clock rate and continuous throughput.)
3. By alternately loading FIFO (1) and FIFO (2) and wire-ORing the output data from the two FIFOs at output point A in block F, a continuous output data stream may be maintained without time delay due to DRAM access time, since we are in effect creating a double output FIFO buffer (see the sketch following this list).
4. The cycling of the DRAM follows these criteria: at least one FIFO must be completely full via loading from the DRAM at any given instant in time; thus, when FIFO (1) starts to read out, FIFO (2) must be full. FIFO (1) does not have to receive a complete parallel load operation until it empties and read operations from FIFO (2) are initiated.
5. Note that the notion of an addressable output dual buffer could replace the use of the two output FIFO buffers.
6. The size of the output FIFO buffers is a function of the following criteria. Refresh requirements: does the application require memory refresh, or is data written into and read from the DRAM quickly enough that refresh of the DRAM is not required? No refresh implies a smaller output buffer capacity requirement. Clock speed of output reads from the FIFO: higher-speed clock rates will require more FIFO capacity than low-speed clocks. Continuous operation of the input FIFOs for write operations in conjunction with read operations occurring in parallel: this in effect increases buffer sizes due to the requirement to allow for sequential access of the DRAM, e.g., one read, one write, one read, etc., which creates a need for doubling the input and output FIFO buffer capacity. I shall specify a mathematical set of expressions that will illustrate all of the above considerations for input and output buffer size as a function of refresh, input clock rate, output clock rate, and continuous operation.
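The double output buffer of items 3 and 4 can be sketched as follows: two row-sized buffers are loaded alternately from the DRAM, and the read port drains one while the other is refilled, so the output stream never waits on DRAM access time. The structure and function names are illustrative assumptions, and the caller is assumed to keep at least one buffer full, per item 4.

```c
/* Sketch of the double (ping-pong) output buffer of Fig. 25.              */
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define WORDS_PER_ROW 8

typedef struct {
    uint16_t buf[2][WORDS_PER_ROW];
    bool     full[2];     /* at least one buffer must be full at any time  */
    int      active;      /* buffer currently being read out               */
    int      pos;         /* read pointer within the active buffer         */
} PingPongOut;

/* Called by the DRAM read interface when a row's worth of data is ready;
 * fills whichever buffer is currently empty.                              */
bool load_row(PingPongOut *p, const uint16_t row[WORDS_PER_ROW])
{
    for (int i = 0; i < 2; i++) {
        if (!p->full[i]) {
            memcpy(p->buf[i], row, sizeof p->buf[i]);
            p->full[i] = true;
            return true;
        }
    }
    return false;         /* both buffers already full; DRAM load waits    */
}

/* Read-port side: the wire-OR of the two buffers is modeled by simply
 * switching the active index when the current buffer empties; the other
 * buffer must already be full (item 4 above).                             */
uint16_t read_word(PingPongOut *p)
{
    uint16_t w = p->buf[p->active][p->pos++];
    if (p->pos == WORDS_PER_ROW) {                    /* buffer drained    */
        p->full[p->active] = false;
        p->active ^= 1;
        p->pos = 0;
    }
    return w;
}
```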
Special cases
SHORT WRITE - READ EXAMPLE
Assume we write in 10 words and then read 6 words out, in a situation that requires 16-word-deep FIFOs on input and output. In this situation, assume the DRAM block has been emptied and we are operating exclusively from the input FIFOs in this case. See Fig. 26.
The input FIFOs A would in effect have a read port tied to the read port of output FIFO (F) to accomplish this operation, since the DRAM bank is empty.
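A minimal sketch of the bypass decision in this special case is shown below: when the DRAM bank holds no data but the input FIFO does, the next read word is taken directly from the input FIFO path rather than from the DRAM. The structure and function names are assumptions for illustration.

```c
/* Illustrative bypass check for the short write-read case of Fig. 26.    */
#include <stdbool.h>

typedef struct {
    int words_in_dram;        /* words held in DRAM rows (blocks H and I)  */
    int words_in_input_fifo;  /* words not yet burst into the DRAM         */
} FillState;

/* Decide whether the next read word should bypass the DRAM entirely and
 * be served from the input FIFO's read path tied to output FIFO F.        */
bool read_should_bypass_dram(const FillState *s)
{
    return s->words_in_dram == 0 && s->words_in_input_fifo > 0;
}
```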
Having described and illustrated the principles of the invention in a preferred embodiment thereof, it should be apparent that the invention can be modified in arrangement and detail without departing from such principles. I claim all modifications and variations coming within the spirit and scope of the following claims.
Claims
1. A FIFO memory device, the device comprising:
an input FIFO having an input write port for receiving a first predetermined number of data words, an output port for outputting the predetermined number of data words, a write pointer input port for selecting a position within the input FIFO for each of the incoming data words, and a write control input for controlling the output of the predetermined number of data words from the output port;
a write interface buffer having an input port for receiving the predetermined number of data words coupled to the output port of the input FIFO, an output port for outputting the predetermined number of data words, and a control port for controlling latching of the predetermined number of data words received at the input port of the write interface buffer;
a block of memory being the predetermined number of data words wide and having an input port for receiving the predetermined number of data words, where the input port of the block of memory is coupled to the output port of the write interface buffer, a row write pointer input for receiving a row write pointer value which selects a row of the block of memory to receive the data words from the input port of the block of memory, an output port for outputting the predetermined number of data words, and a row read pointer input for receiving a row read pointer value which selects another row of the block of memory to output the data words to the output port of the block of memory;
a write address generator having an output port coupled to the row write pointer input of the block of memory and a control input for receiving a write control signal, where the write address generator generates the write address pointer value responsive to the write control signal;
a read address generator having an output port coupled to the row read pointer input of the block of memory and a control input for receiving a read control signal, where the read address generator generates the read address pointer value responsive to the read control signal;
a read interface buffer having an input port for receiving the predetermined number of data words coupled to the output port of the block of memory, an output port for outputting the predetermined number of data words, and a control port for controlling latching of the predetermined number of data words received at the input port of the read interface buffer;
an output FIFO having an input write port for receiving the first predetermined number of data words from the output port of the read interface buffer, an output port for outputting the predetermined number of data words, a read pointer input port for selecting a position within the output FIFO for each of the outgoing data words, and a read control input for controlling the output of the predetermined number of data words from the output port;
a write controller having a write command input terminal for receiving a write command, a write clock input for receiving a write clock, a write control output for generating the write control signal that is coupled to the write control input of the input FIFO, the write interface buffer and the write address generator, and a row write pointer output for generating the row write pointer value that is coupled to the row write pointer input of the block of memory, where the write controller generates the write control signal and the row write pointer value responsive to the write command and the write clock; and
a read controller having a read command input terminal for receiving a read command, a read clock input for receiving a read clock, a read control output for generating the read control signal that is coupled to the read control input of the output FIFO, the read interface buffer and the read address generator, and a row read pointer output for generating the row read pointer value that is coupled to the row read pointer input of the block of memory, where the read controller generates the read control signal and the row read pointer value responsive to the read command and the read clock.
2. The device of claim 1, wherein the block of memory further comprises DRAM and the device includes a refresh controller which monitors the write control signal and the read control signal and refreshes the contents of the block of memory responsive thereto.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US5876797P | 1997-09-12 | 1997-09-12 | |
US60/058,767 | 1997-09-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1999013397A1 true WO1999013397A1 (en) | 1999-03-18 |
Family
ID=22018811
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1998/019115 WO1999013397A1 (en) | 1997-09-12 | 1998-09-11 | Fifo memory device using shared reconfigurable memory block |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO1999013397A1 (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4710966A (en) * | 1984-01-04 | 1987-12-01 | Itek Corporation | Digital frame processor pipe line circuit |
US5400288A (en) * | 1987-12-23 | 1995-03-21 | Texas Instruments Incorporated | Semiconductor memory chip |
EP0590807A2 (en) * | 1992-10-01 | 1994-04-06 | Hudson Soft Co., Ltd. | Image and sound processing apparatus |
Non-Patent Citations (2)
Title |
---|
BAUMBAUGH A E ET AL: "A REAL TIME DATA COMPACTOR (SPARSIFIER) AND 8 MEGABYTE HIGH SPEED FIFO FOR HEP", IEEE TRANSACTIONS ON NUCLEAR SCIENCE, NEW YORK, NY, US, vol. 33, no. 1, February 1986 (1986-02-01), pages 903 - 906, XP000003981 * |
PHILLIPS B ET AL: "APPLICATIONS OF PARAMETERIZED STORAGE ARRAYS IN APPLICATION SPECIFIC INTEGRATED CIRCUITS", WESCON TECHNICAL PAPERS, vol. 30, 18 November 1986 (1986-11-18), pages 8/4-01 - 3, XP000649958 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7320037B1 (en) | 2002-05-10 | 2008-01-15 | Altera Corporation | Method and apparatus for packet segmentation, enqueuing and queue servicing for multiple network processor architecture |
US7339943B1 (en) | 2002-05-10 | 2008-03-04 | Altera Corporation | Apparatus and method for queuing flow management between input, intermediate and output queues |
US7606248B1 (en) | 2002-05-10 | 2009-10-20 | Altera Corporation | Method and apparatus for using multiple network processors to achieve higher performance networking applications |
US7336669B1 (en) | 2002-05-20 | 2008-02-26 | Altera Corporation | Mechanism for distributing statistics across multiple elements |
US7593334B1 (en) | 2002-05-20 | 2009-09-22 | Altera Corporation | Method of policing network traffic |
US7149139B1 (en) | 2004-01-28 | 2006-12-12 | Marvell Semiconductor Israel Ltd. | Circuitry and methods for efficient FIFO memory |
US7333381B2 (en) | 2004-01-28 | 2008-02-19 | Marvell Semiconductor Israel Ltd. | Circuitry and methods for efficient FIFO memory |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5933855A (en) | Shared, reconfigurable memory architectures for digital signal processing | |
US6895452B1 (en) | Tightly coupled and scalable memory and execution unit architecture | |
US10817414B2 (en) | Apparatuses and methods for memory device as a store for block program instructions | |
US10496286B2 (en) | Apparatuses and methods for parallel writing to multiple memory device structures | |
US5301340A (en) | IC chips including ALUs and identical register files whereby a number of ALUs directly and concurrently write results to every register file per cycle | |
AU2001245761B2 (en) | Enhanced memory algorithmic processor architecture for multiprocessor computer systems | |
EP0786730B1 (en) | High performance, low cost microprocessor | |
US6282627B1 (en) | Integrated processor and programmable data path chip for reconfigurable computing | |
US11474965B2 (en) | Apparatuses and methods for in-memory data switching networks | |
US20060101231A1 (en) | Semiconductor signal processing device | |
AU2001245761A1 (en) | Enhanced memory algorithmic processor architecture for multiprocessor computer systems | |
WO1999000739A1 (en) | An integrated processor and programmable data path chip for reconfigurable computing | |
WO1994022090A1 (en) | Intelligent memory architecture | |
US6446181B1 (en) | System having a configurable cache/SRAM memory | |
US6606684B1 (en) | Multi-tiered memory bank having different data buffer sizes with a programmable bank select | |
Plessis | Mixing fixed and reconfigurable logic for array processing | |
JPH08320786A (en) | Instruction device | |
WO1999013397A1 (en) | Fifo memory device using shared reconfigurable memory block | |
Lee et al. | VLSI design of a wavelet processing core | |
US20230352066A1 (en) | Reconfigurable memory module designed to implement computing operations | |
US7073034B2 (en) | System and method for encoding processing element commands in an active memory device | |
KR20080049727A (en) | Processor array with separate serial module | |
US6961280B1 (en) | Techniques for implementing address recycling in memory circuits | |
WO1998055932A2 (en) | Processor interfacing to memory mapped computing engine | |
WO1999060480A1 (en) | Shared, reconfigurable cache memory execution subsystem |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): JP KR US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
NENP | Non-entry into the national phase |
Ref country code: KR |
|
122 | Ep: pct application non-entry in european phase | ||