WO2001001242A1 - Active Dynamic Random Access Memory (Mémoire vive dynamique active) - Google Patents
- Publication number: WO2001001242A1
- PCT application: PCT/US2000/016451
- Authority: WIPO (PCT)
Classifications
- G06F15/7857—Architectures of general purpose stored program computers comprising a single central processing unit with memory on one IC chip (single chip microcontrollers), using interleaved memory
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30038—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations, using a mask
- G06F9/3879—Concurrent instruction execution using a slave processor, e.g. coprocessor, for non-native instruction execution
Description
- This invention relates to computer systems, and more specifically to DRAM memory architectures.
- In a typical system, the application instructions (also referred to as "code") and data are loaded into DRAM modules, where the instructions and data may be accessed by the processor(s) as needed.
- Processor wait states or no-ops (null instructions) must be inserted into the processor execution stream to accommodate the latency between when a processor first requests data from memory and when the requested data is returned from memory. These added wait states reduce the average number of instructions that may be executed per unit of time regardless of the processor speed.
- Memory access latency is, at least in part, a product of the physical separation between the processor and the memory device. Memory access is constrained by the bandwidth of the system bus, and is further limited by the need for extra interface circuitry on the processor and memory device, such as signal level translation circuits, to support off-chip bus communication.
- The execution unit of a computer typically comprises one or more processors (also referred to as microprocessors) coupled to one or more DRAM (dynamic random access memory) modules via address and data busses.
- A processor contains instruction decoders and logic units for carrying out operations on data in accordance with the instructions of a program. These operations include reading data from memory, writing data to memory, and processing data (e.g., with Boolean logic and arithmetic circuits).
- The DRAM modules contain addressable memory cells for storage of application data for use by the processor.
- The processors and DRAM modules comprise separate integrated circuit chips (ICs) connected directly or indirectly to a common electronic substrate, such as a circuit board (e.g., a "motherboard").
- The processor IC is commonly inserted into a socket in the motherboard.
- A DRAM module is commonly provided as a SIMM (single in-line memory module) or DIMM (dual in-line memory module), or a variation thereof, that supports one or more DRAM ICs on a small circuit board that is inserted, via an edge connector, into the motherboard.
- A SIMM provides identical I/O on both sides of the edge connector, whereas a DIMM provides separate I/O on each side of the edge connector.
- Computers typically support multiple memory modules for greater capacity.
- Each DRAM IC comprises an array of memory cells. Each memory cell stores one bit of data (i.e., a "0" or a "1").
- A DRAM IC may be one-dimensional, i.e., configured with one bit per address, or the DRAM IC may be configured with multiple bits per address.
- Typically, no more than sixteen bits of a data word are stored in a single DRAM IC.
- For example, a DRAM IC with sixteen megabits of storage may be configured as 16M x 1 (2^24 one-bit values), 8M x 2 (2^23 two-bit values), 4M x 4 (2^22 four-bit values), 2M x 8 (2^21 eight-bit values), or 1M x 16 (2^20 sixteen-bit values).
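The configuration arithmetic above can be checked with a short sketch (purely illustrative; the code is not part of the patent):

```c
#include <stdio.h>

/* Enumerate the width/depth organizations of a sixteen-megabit (2^24 bit)
 * DRAM IC: for each data width, depth = total_bits / width addressable values. */
int main(void) {
    const unsigned long total_bits = 1UL << 24;   /* 16 megabits */
    const unsigned widths[] = { 1, 2, 4, 8, 16 };
    for (unsigned i = 0; i < sizeof widths / sizeof widths[0]; i++) {
        unsigned long depth = total_bits / widths[i];
        printf("%luM x %u (2^%u %u-bit values)\n",
               depth >> 20, widths[i], 24 - i, widths[i]);
    }
    return 0;
}
```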
- A SIMM might contain eight one-bit wide DRAM ICs (plus an additional one-bit wide DRAM IC for parity in some systems). In this case, each byte of data is split among the DRAM ICs (e.g., with one bit stored in each).
- SIMM and DIMM access operations are asynchronous in nature, and a separate data address is provided for each access operation. Each access operation may take several bus cycles to complete, even when performed within the same memory page.
- Synchronous memory technologies, such as synchronous DRAM (SDRAM) and Rambus DRAM (RDRAM), have been developed to improve memory performance for sequential access.
- In these technologies, memory access is synchronized with a bus clock, and a configuration register is provided to specify how many consecutive words to access. The standard delay is incurred in accessing the first addressed word, but the specified number of consecutive data words are accessed automatically by the SDRAM module in consecutive clock cycles.
- The configuration register may be set to specify access for one data word (e.g., one or two bytes), two data words, four data words, etc., up to a full page of data at a time.
- For example, a memory access operation at location X might take five bus clock cycles, but locations X+1, X+2 and X+3 would be output on the following first, second and third bus clock cycles, assuming the access was performed within the bounds of a single page.
- This type of access is denoted as 5/1/1/1, referring to an initial latency of five bus clock cycles for the first data word, and a delay of one bus clock cycle for each of the following three words.
- By comparison, an asynchronous SIMM or DIMM might have performance on the order of 5/3/3/3 for four consecutive data words.
- For non-sequential access operations, the initial latency value (e.g., five bus clock cycles) would still apply to each data word.
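A trivial sketch makes the timing notation concrete (numbers taken from the examples above; illustrative only):

```c
#include <stdio.h>

/* Total bus cycles to fetch four consecutive words:
 * synchronous burst (5/1/1/1) versus per-word initial latency (5/5/5/5). */
int main(void) {
    const int initial = 5, burst_step = 1, words = 4;
    int burst_cycles    = initial + (words - 1) * burst_step;  /* 5+1+1+1 = 8  */
    int per_word_cycles = words * initial;                     /* 4 * 5   = 20 */
    printf("burst: %d cycles, per-word: %d cycles\n", burst_cycles, per_word_cycles);
    return 0;
}
```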
- The minimum data access time is limited by the bus speed, i.e., the period of the bus clock cycle.
- Processor clock speeds are now in the range of 0.5-1 GHz. It is therefore not uncommon for the processor to outpace memory performance by a factor of five or more.
- Prior art computer systems have employed mechanisms for hiding, to some extent, the performance limitations of a separate DRAM memory. These mechanisms include complicated caching schemes and access scheduling algorithms.
- Caching schemes place one or more levels of small, high-speed memory (referred to as cache memory or the "cache") between the processor and DRAM.
- The cache stores a subset of data from the DRAM memory, which the processor is able to access at the speed of the cache.
- When the desired data is not within the cache (referred to as a cache "miss"), a memory management system must fetch the data into the cache from DRAM, after which the processor may access the data from the cache. If the cache is already full of data, it may also be necessary to write a portion of the cached data back to DRAM before new data may be transferred from DRAM to the cache.
- Cache performance is dependent on the locality of data in a currently executing program. Dispersed data results in a greater frequency of cache misses, diminishing the performance of the cache. Further, the complexity of the system is increased by the memory management unit needed to control cache operations.
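For readers unfamiliar with the mechanism, a minimal direct-mapped cache lookup is sketched below. This is a generic illustration, not the patent's design; the sizes are arbitrary and write-back of dirty lines is omitted:

```c
#include <stdint.h>
#include <string.h>

#define LINES      256                 /* number of cache lines (hypothetical) */
#define LINE_BYTES 32                  /* bytes per line (hypothetical)        */

typedef struct {
    uint32_t tag;                      /* high-order address bits of the cached line */
    int      valid;
    uint8_t  data[LINE_BYTES];
} cache_line_t;

static cache_line_t cache[LINES];

/* Return a pointer to the cached byte, fetching the line from DRAM on a miss. */
uint8_t *cache_access(uint32_t addr, const uint8_t *dram) {
    uint32_t offset = addr % LINE_BYTES;
    uint32_t index  = (addr / LINE_BYTES) % LINES;
    uint32_t tag    = addr / (LINE_BYTES * LINES);

    cache_line_t *line = &cache[index];
    if (!line->valid || line->tag != tag) {                     /* cache "miss"    */
        memcpy(line->data, dram + (addr - offset), LINE_BYTES); /* fetch from DRAM */
        line->tag   = tag;
        line->valid = 1;
    }
    return &line->data[offset];        /* "hit" path: access at cache speed */
}
```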
- Access scheduling algorithms attempt to anticipate memory access operations and perform prefetching of data to minimize the amount of time a processor must wait for data.
- Memory access operations may be queued and scheduled out of order to optimize access.
- Access scheduling may be performed at compile time by an optimizing compiler and/or at runtime by a scheduling mechanism in the processor.
- Prefetching is effective for applications which perform memory access operations intermittently, and which have sufficient operations independent of the memory access to occupy the processor while prefetching is performed. However, if there are multiple memory access operations within a short period, each successive prefetching operation may cause delays in subsequent prefetching operations, diminishing the effectiveness of prefetching. Further, where operations are conditioned on data in memory, the processor may still experience wait states if there are insufficient independent operations to perform. Also, where data access is conditioned on certain operations, prefetching may not be feasible. Thus, as with caching schemes, the performance of access scheduling algorithms is program dependent and only partially effective. Further, the implementation of access scheduling introduces additional complexity into the system.
- In accordance with one or more embodiments of the invention, the active DRAM device is configured with a standard DRAM interface, an array of memory cells, a processor, local program memory and a high speed interconnect.
- The processor comprises, for example, a vector unit that supports chaining operations in accordance with program instructions stored in the local program memory.
- Data processing operations, such as graphics or vector processing, may be carried out within the DRAM device without the performance constraints entailed in off-chip bus communication.
- A host processor accesses data from its respective active DRAM devices in a conventional manner via the standard DRAM interface.
- Multiple active DRAM devices are coupled via the high speed interconnect to implement a parallel processing architecture using a distributed shared memory (DSM) scheme.
- The network provided by the high speed interconnect may include active DRAM devices of multiple host processors.
- Figure 1 is a block diagram of a general purpose computer system wherein an active DRAM device may be implemented in accordance with an embodiment of the invention.
- Figure 2 is a block diagram of an active DRAM device in accordance with an embodiment of the invention.
- Figure 3A is a block diagram of a high-speed interconnect used for DSM communication in an active DRAM device in accordance with an embodiment of the invention.
- Figure 3B is a block diagram of an S-connect node for use in the interconnect of Figure 3A.
- Figure 4 is a block diagram of an embodiment of a processor for use in an active DRAM device in accordance with an embodiment of the invention.
- Figure 5A is a block diagram of a vector processing apparatus in accordance with an embodiment of the invention.
- Figure 5B is a block diagram of a vector processing apparatus with chaining in accordance with an embodiment of the invention.
- Figure 6 is a block diagram of a DSM system implemented with multiple active DRAM devices in accordance with an embodiment of the invention.
- The invention provides an active dynamic random access memory (DRAM) device with an on-chip processor.
- In one embodiment, the on-chip processor comprises a vector unit for applications such as graphics processing.
- The DRAM memory is made dual-ported so that the on-chip processor may access the data via an internal high-speed port while a host computing system accesses the data from off-chip via a conventional DRAM interface.
- The resulting DRAM device appears to the host computing system as a conventional DRAM device, but provides internal processing capabilities with the speed and performance of on-chip signaling.
- A high-bandwidth serial interconnect is provided for distributed shared memory (DSM) communication with other similar active DRAM devices.
- In this manner, a parallel processing architecture is achieved that provides an inexpensive, scalable supercomputing model in a conventional computer system.
- Embodiments of the invention may also be implemented in other processing environments, such as network computers (NCs) and embedded devices (e.g., web phones, smart appliances, etc.).
- An embodiment of the invention can be implemented, for example, as a replacement for, or an addition to, main memory of a processing system, such as the general purpose host computer 100 illustrated in Figure 1.
- A keyboard 110 and mouse 111 are coupled to a system bus 118.
- The keyboard and mouse are for introducing user input to the computer system and communicating that user input to processor 113.
- Other suitable input devices may be used in addition to, or in place of, the mouse 111 and keyboard 110.
- I/O (input/output) unit 119 coupled to system bus 118 represents such I/O elements as a printer, A/V (audio/video) I/O, etc.
- Computer 100 includes a video memory 114, main memory 115 (such as one or more active DRAM devices) and mass storage 112, all coupled to system bus 118 along with keyboard 110, mouse 111 and processor 113.
- The mass storage 112 may include both fixed and removable media, such as magnetic, optical or magneto-optical storage systems, or any other available mass storage technology.
- Bus 118 may contain, for example, thirty-two address lines for addressing video memory 114 or main memory 115.
- The system bus 118 also includes, for example, a 64-bit data bus for transferring data between and among the components, such as processor 113, main memory 115, video memory 114 and mass storage 112. Alternatively, multiplexed data/address lines may be used instead of separate data and address lines.
- In one embodiment, the processor 113 is a SPARC microprocessor from Sun Microsystems, Inc., a microprocessor manufactured by Motorola, such as a 680X0 processor, or a microprocessor manufactured by Intel, such as an 80X86 or Pentium processor.
- Main memory 115 comprises one or more active DRAM devices in accordance with an embodiment of the invention.
- Video memory 114 is a dual-ported video random access memory. One port of the video memory 114 is coupled to video amplifier 116. The video amplifier 116 is used to drive the cathode ray tube (CRT) raster monitor 117.
- Video amplifier 116 is well known in the art and may be implemented by any suitable apparatus.
- This circuitry converts pixel data stored in video memory 114 to a raster signal suitable for use by monitor 117.
- Monitor 117 is a type of monitor suitable for displaying graphic images.
- Alternatively, the video memory could be used to drive a flat panel or liquid crystal display (LCD), or any other suitable data presentation device.
- Computer 100 may also include a communication interface 120 coupled to bus 118.
- Communication interface 120 provides a two-way data communication coupling via a network link 121 to a local network 122.
- For example, if communication interface 120 is an integrated services digital network (ISDN) card or a modem, communication interface 120 provides a data communication connection to the corresponding type of telephone line, which comprises part of network link 121.
- If communication interface 120 is a local area network (LAN) card, communication interface 120 provides a data communication connection via network link 121 to a compatible LAN.
- Communication interface 120 could also be a cable modem or wireless interface. In any such implementation, communication interface 120 sends and receives electrical, electromagnetic or optical signals which carry digital data streams representing various types of information.
- Network link 121 typically provides data communication through one or more networks to other data devices.
- For example, network link 121 may provide a connection through local network 122 to local server computer 123 or to data equipment operated by an Internet Service Provider (ISP) 124.
- ISP 124 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet" 125.
- Internet 125 uses electrical, electromagnetic or optical signals which carry digital data streams.
- The signals through the various networks and the signals on network link 121 and through communication interface 120, which carry the digital data to and from computer 100, are exemplary forms of carrier waves transporting the information.
- Computer 100 can send messages and receive data, including program code, through the network(s), network link 121, and communication interface 120.
- For example, remote server computer 126 might transmit requested code for an application program through Internet 125, ISP 124, local network 122 and communication interface 120.
- The received code may be executed by processor 113 as it is received, and/or stored in mass storage 112 or other non-volatile storage for later execution. In this manner, computer 100 may obtain application code in the form of a carrier wave.
- Application code may be embodied in any form of computer program product.
- A computer program product comprises a medium configured to store or transport computer readable code or data, or in which computer readable code or data may be embedded.
- Some examples of computer program products are CD-ROM disks, ROM cards, floppy disks, magnetic tapes, computer hard drives, servers on a network, and carrier waves.
- An embodiment of the invention is implemented as a single active DRAM IC device, with an internal processor and DRAM memory sharing a common semiconductor substrate.
- Prior art DRAM ICs typically store only a portion of a full processor data word (i.e., an "operand"), with the complete processor data word being interleaved between multiple DRAM ICs.
- In contrast, the DRAM memory within the active DRAM device is configured to store complete operands. For example, for a double-precision floating point data type, sixty-four bits are stored within the active DRAM device. This permits the internal processor to operate on data within the memory space of its respective active DRAM device without the bandwidth limitations of an off-chip system bus or related off-chip I/O circuitry.
- Figure 2 is a block diagram illustrating an active DRAM device in accordance with an embodiment of the invention.
- The device comprises processor 205, program memory 201, interconnect 203 and two-port DRAM memory 206, all of which are coupled to internal data bus 411 and address bus 412.
- The device may also include a control bus for the communication of bus control signals in accordance with the implemented bus communication protocol (e.g., synchronization, bus request and handshaking signals, etc.).
- In some embodiments, more than one pair of internal data and address busses may be used (e.g., interconnect 203 and processor 205 may communicate over a separate bus or other signaling mechanism).
- DRAM memory 206 comprises a block of conventional high density DRAM cells. The manner in which the DRAM cells are configured may differ for various embodiments. In one embodiment, DRAM memory 206 comprises multiple banks of DRAM cells. Banks may be used to facilitate the interleaving of access operations for faster memory access times and better collision performance (e.g., the internal processor 205 may be accessing one bank of data while a host processor accesses another bank of data via the conventional DRAM interface 207).
- In one embodiment, data bus 411 may be sixty-four bits wide to accommodate access of a full double-precision floating point value from DRAM memory 206 in a single read or write cycle.
- Alternatively, data bus 411 may be eight or sixteen bits wide, in which case multiple read or write cycles (depending on the number of bytes used to represent the data type of the requested operand) may be utilized to access a complete data value in DRAM memory 206.
- The memory address is typically (though not necessarily) divisible into a bank address, a row address and a column address.
- The bank address is used to select the desired bank of storage cells, and the row and column addresses are used to select the row and column of storage cells within the specified bank.
- The row address may be used to precharge the desired row of storage cells prior to reading of data. Data is read from or written to the subset of storage cells of the selected row specified by the column address.
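As an illustration of that address split, the sketch below decodes a linear address using hypothetical field widths (two bank bits, twelve row bits and ten column bits, giving a 2^24-location device; the layout is an assumption, not taken from the patent):

```c
#include <stdint.h>

/* Decompose a linear DRAM address into bank, row and column fields.
 * Assumed layout: [bank:2][row:12][col:10]. */
typedef struct { unsigned bank, row, col; } dram_addr_t;

dram_addr_t decode(uint32_t addr) {
    dram_addr_t a;
    a.col  =  addr        & 0x3FF;   /* low 10 bits: column within the row          */
    a.row  = (addr >> 10) & 0xFFF;   /* next 12 bits: row to select (and precharge) */
    a.bank = (addr >> 22) & 0x3;     /* top 2 bits: bank of storage cells           */
    return a;
}
```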
- A further component of the active DRAM device is a conventional DRAM interface 207 coupled to a second port of DRAM memory 206 for off-chip communication with a host processor.
- DRAM interface 207 may be, for example, a conventional SDRAM or Rambus interface as is known in the art.
- Alternatively, DRAM memory 206 may have a single port that is shared between the internal processor 205 and the DRAM interface 207.
- In this case, a collision mechanism may be employed to stall the instruction execution pipeline of one or the other of the internal processor 205 and the external host processor (not shown) in the event that both processors wish to access the single port of DRAM memory 206 at the same time (e.g., concurrent write requests).
- DRAM memory 206 may also be configured in multiple banks such that concurrent memory accesses to different banks do not result in a collision.
- Program memory 201 may be either additional DRAM memory or static RAM (SRAM) memory, for example.
- In one embodiment, program memory 201 is addressable by processor 205, but is separate from the addressable memory space accessible via conventional DRAM interface 207. In other embodiments, program memory 201 may be part of the memory space accessible to DRAM interface 207.
- Program memory 201 is used to store the program instructions embodying routines or applications executed by internal processor 205. Additionally, program memory 201 may be used as a workspace or temporary storage for data (e.g., operands, intermediate results, etc.) processed by processor 205. A portion of program memory 201 may serve as a register file for processor 205.
- In one embodiment, a non-volatile memory resource 202 is provided in tandem with program memory 201, in the form of flash SRAM, EEPROM or other substantially non-volatile memory elements.
- Memory 202 is configured to provide firmware for processor 205 for system support such as configuration parameters and routines for start-up, communications (e.g., DSM routines for interacting with interconnect 203) and memory management.
- Interconnect 203 provides a high-bandwidth connection mechanism for internal processor 205 to communicate off-chip with other devices.
- Interconnect 203 provides one or more channels for supporting a distributed shared memory (DSM) environment comprising, for example, multiple active DRAM devices.
- Processor and/or memory elements other than active DRAM devices may also be coupled into the DSM environment via connection to interconnect 203.
- Interconnect 203 and processor 205 may be configured to support communications (e.g., DSM communications) via message passing or shared memory, as is known in the art.
- One suitable interconnect mechanism in accordance with an embodiment of the invention is a conventional S-connect node used for packet-based communication.
- Figures 3A and 3B are block diagrams of an S-connect node.
- S-connect node 300 may comprise a crossbar switch (6 x 6: six packet sources and six packet drains) with four 1.3 GHz serial ports (204) and two 66 MHz 16-bit parallel ports (301). The parallel and serial ports are bi-directional and full duplex, and each has a data bandwidth of approximately 220 megabytes per second. As shown in Figure 3B, S-connect node 300 further comprises a pool of buffers (304) for incoming and outgoing packets, transceivers (XCVRs 1-4) for each serial port, a distributed phase-locked loop (PLL 305) circuit and one or more routing tables and queues (303). In Figure 3B, XCVRs 1-4, as well as parallel ports A and B, are coupled to router/crossbar switch 302.
- Router/crossbar switch 302 receives an input data packet from a serial or parallel port acting as a packet source or input, and accesses a routing table in routing tables/queues 303 to determine an appropriate serial or parallel port to act as a packet drain or destination. If the given port selected as the packet drain is idle, router/crossbar switch 302 may transfer the packet directly to the given port. Otherwise, the packet may be temporarily stored in buffer pool 304. A pointer to the given buffer is stored in routing tables/queues 303 within a queue or linked list associated with the given port.
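In rough C terms, that routing decision reduces to the following sketch (structure and names are invented for illustration; the actual node is hardware):

```c
typedef struct packet { int payload; } packet_t;

typedef struct {
    int       busy;                /* is this drain port currently transmitting? */
    packet_t *queue[64];           /* pointers into the shared buffer pool       */
    int       tail;
} port_t;

/* Stand-in for the physical transfer through the crossbar. */
static void transmit(port_t *drain, packet_t *pkt) {
    (void)pkt;
    drain->busy = 1;
}

/* Forward one packet: deliver directly if the selected drain port is idle,
 * otherwise queue a pointer to the buffered packet for that port. */
void route(port_t ports[6], const int routing_table[], int dest, packet_t *pkt) {
    port_t *drain = &ports[routing_table[dest]];   /* routing-table lookup */
    if (!drain->busy) {
        transmit(drain, pkt);                      /* direct crossbar transfer */
    } else {
        drain->queue[drain->tail] = pkt;           /* held in buffer pool */
        drain->tail = (drain->tail + 1) % 64;
    }
}
```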
- PLL 305 is coupled to each of the serial ports to receive incoming packet streams. Each such packet stream is timed in accordance with the clock of its source interconnect circuit. PLL 305 is configured with multiple phase detectors to determine the phase error between each incoming packet stream and the clock signal (CLK) generated by PLL 305. The sum of the phase errors from all of the serial inputs is used to adjust the clock signal of PLL 305 to provide synchronization with other interconnect nodes in the system.
- An embodiment of the invention configures an S-connect node with both parallel ports 301 coupled to processor 205 (e.g., via busses 411 and 412), and with the serial ports (serial links 204) providing mechanisms for I/O communication with other devices or communication nodes.
- In one embodiment, interconnect 203 is implemented with an S-connect macrocell comprising approximately fifty thousand logic gates or less. It will be obvious to one skilled in the art that other forms of interconnect circuits may be utilized in place of, or in addition to, S-connect nodes to provide a high-bandwidth I/O communication mechanism for DSM applications.
- Processor 205 provides the main mechanism for executing instructions associated with application programs, operating systems and firmware routines, for example.
- Execution is carried out by a process of fetching individual instructions from program memory 201 or Flash SRAM/EEPROM 202, decoding those instructions, and exerting control over the components of the processor to implement the desired function(s) of each instruction.
- The manner in which processor 205 carries out this execution process is dependent on the individual components of the given processor and the component interaction provided by the defined instruction set for the given processor architecture.
- Processor 205 is described below with reference to Figure 4, which illustrates an example vector processor architecture in accordance with one embodiment of the invention. It will be apparent that processor 205 may similarly implement any known scalar, vector or other form of processor architecture in accordance with further embodiments of the invention.
- In the embodiment of Figure 4, processor 205 comprises arithmetic logic units (ALUs) 400A-B, optional register file 404, program counter (PC) register 408, memory address register 407, instruction register 406, instruction decoder 405 (also referred to as the control unit) and address multiplexer (MUX) 409.
- Instruction decoder 405 issues control signals to each of the other elements of processor 205 via control bus 410.
- Data is exchanged between elements 404-408 and memory 201, 202 or 206 via data bus 411.
- Multiplexer 409 is provided to drive address bus 412 from either program counter register 408 or memory address register 407.
- Program counter register 408, memory address register 407 and instruction register 406 are special function registers used by processor 205.
- Processor 205 may also include a stack pointer register (not shown) to store the address of the top element of the stack.
- Program counter register 408 is used to store the address of the next instruction for the current process, and is updated during the execution of each instruction.
- Updating program counter register 408 typically consists of incrementing the current value of program counter register 408 to point to the next instruction. However, for branching instructions, program counter register 408 may be directly loaded with a new address as a jump destination, or an offset may be added to the current value. In the execution of a conditional branching instruction, condition codes generated by arithmetic logic units 400A-B may be used in the determination of which update scheme is applied to program counter register 408.
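A compact sketch of those three update schemes (illustrative only; the four-byte instruction width and the encoding are assumptions, not details from the patent):

```c
#include <stdint.h>

enum pc_update { SEQUENTIAL, JUMP, BRANCH_RELATIVE };

/* Compute the next program counter value: sequential increment, absolute
 * jump, or a relative branch taken according to the ALU condition codes. */
uint32_t next_pc(uint32_t pc, enum pc_update kind, uint32_t target,
                 int32_t offset, int branch_taken) {
    switch (kind) {
    case SEQUENTIAL:      return pc + 4;        /* next instruction (4-byte words assumed)  */
    case JUMP:            return target;        /* PC loaded directly with jump destination */
    case BRANCH_RELATIVE: return branch_taken ? pc + (uint32_t)offset  /* offset added */
                                              : pc + 4;                /* fall through */
    }
    return pc + 4;
}
```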
- During an instruction fetch, program counter register 408 drives the stored PC value onto address bus 412 via MUX 409.
- Control bus 410 from instruction decoder 405 controls the loading of program counter register 408 and the selection mechanism of MUX 409 to drive the PC value onto address bus 412.
- Memory address register 407 is used to hold the target memory address during execution of memory access instructions, such as "load" and "store." In variations of the load and store instructions, the memory address may be loaded into memory address register 407 from one of the registers in register file 404, from the results of an ALU operation, or from an address bit-field extracted from an instruction (e.g., via masking and shifting) based on the implemented instruction set. When the memory access is initiated, memory address register 407 drives its output signal onto address bus 412 via MUX 409.
- Incoming data (including instructions) loaded via data bus 411 from memory 201, 202 or 206, or other components external to processor 205, may be stored in one of the registers of processor 205, including register file 404.
- The address asserted on address bus 412 by program counter register 408 or memory address register 407 determines the origination of the incoming data.
- Outgoing data is output from one of the registers of processor 205 and driven onto data bus 411.
- The address asserted on address bus 412 determines the outgoing data's destination address, in DRAM memory 206 for example.
- Instruction decoder 405 enables loading of incoming data via control bus 410.
- instruction decoder 405 When an instruction is fetched from memory (e.g., from program memory 201), the instruction is loaded into instruction register 406 (enabled by instruction decoder 405 via control bus 410) where the instruction may be accessed by instruction decoder 405.
- In one embodiment, processor 205 is configured as a dual-issue processor, meaning that two instructions are fetched per fetch cycle. Both instructions are placed into instruction register 406 concurrently for decoding and execution in parallel or pipelined fashion.
- Instruction decoder 405 may further comprise a state machine for managing instruction fetch and execution cycles based on decoded instructions.
- Arithmetic logic units 400A-B provide the data processing or calculating capabilities of processor 205.
- Arithmetic logic units 400A-B comprise, for example, double-input/single-output hardware for performing functions such as integer arithmetic (add, subtract), Boolean operations (bitwise AND, OR, NOT (complement)) and bit shifts (left, right, rotate).
- Arithmetic logic units 400A-B may further comprise hardware for implementing more complex functions, such as integer multiplication and division, floating-point operations, specific mathematical functions (e.g., square root, sine, cosine, log, etc.) and vector processing or graphical functions, for example.
- Control signals from control bus 410 control multiplexers or other selection mechanisms within arithmetic logic units 400A-B for selecting the desired function.
- Arithmetic logic units 400A-B may comprise multiple ALUs in parallel for simultaneous execution of instructions.
- Alternatively, a single ALU may be implemented.
- One ALU embodiment comprises vector operations unit 401, vector addressing unit 402 and a scalar operations unit 403.
- Vector operations unit 401 comprises pipelined vector function hardware (e.g., adders, multipliers, etc.) as will be more fully described later in this specification.
- Vector addressing unit 402 may be implemented to provide automatic generation of memory addresses for elements of a vector based on a specified address for an initial vector element and the vector stride, where the vector stride represents the distance in memory address space between consecutive elements of a vector.
- Scalar operations unit 403 comprises scalar function hardware as previously described for performing single-value integer and floating point operations.
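The address generation performed by vector addressing unit 402 amounts to a base-plus-stride computation, sketched here (names are hypothetical):

```c
#include <stdint.h>

/* Address of vector element i, given the address of element 0 and the
 * vector stride (distance in address space between consecutive elements). */
uint32_t vec_elem_addr(uint32_t base, uint32_t stride_bytes, uint32_t i) {
    return base + i * stride_bytes;
}

/* Example: a dense vector of doubles uses stride_bytes = sizeof(double);
 * walking one column of a row-major 100x100 matrix of doubles uses
 * stride_bytes = 100 * sizeof(double). */
```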
- The inputs and outputs of arithmetic logic units 400A-B are accessed via optional register file 404 (or directly from local memory 201, 202 or 206).
- The specific registers of register file 404 that are used for input and output are selected by instruction decoder 405 based on operand address fields in the decoded instructions.
- Additional output signals may include condition codes such as "zero," "negative," "carry" and "overflow." These condition codes may be used in implementing conditional branching operations during updating of program counter register 408.
- Register file 404 comprises a set of fast, multi-port registers for holding data to be processed by arithmetic logic units 400A-B, or otherwise frequently used data.
- the registers within register file 404 are directly accessible to software through instructions that contain source and destination operand register address fields.
- Register file 404 may comprise multiple integer registers capable of holding 32-bit or 64-bit data words, as well as multiple floating point registers, each capable of storing a double-precision floating point data word.
- A vector processor embodiment may also contain multiple vector registers, or may be configured with a vector access mode to access multiple integer or floating point registers as a single vector register.
- Alternatively, processor 205 may operate directly on data in memories 201, 202 or 206 without the use of register file 404 as an intermediate storage resource.
- In lieu of register file 404, a region of local memory (e.g., a specified address range of program memory 201) may be treated as a register file.
- As previously described, an active DRAM device provides an inexpensive, scalable mechanism for implementing a supercomputing system using distributed shared memory (DSM).
- To enhance processing performance, embodiments of the invention may be implemented with a vector processor that supports chaining of vector functions. Vector processing and chaining are described below with reference to Figures 5A-B.
- In vector processing, data operands are accessed as data vectors (i.e., arrays of data words).
- Vectors may comprise individual elements of most data types.
- For example, vector elements may comprise double precision floating point values, single precision floating point values, integers, or pixel values (such as RGB values).
- A pipelined vector processor performs equivalent operations on all corresponding elements of the data vectors. For example, vector addition of first and second input vectors involves adding the values of the first elements of each input vector to obtain a value for the first element of an output vector, then adding the values of the second elements of the input vectors to obtain a value for the second element of the output vector, etc.
- Pipelining entails dividing an operation into sequential sections and placing registers for intermediate values between each section. The system can then operate with improved throughput at a speed dependent upon the delay of the slowest pipeline section, rather than the delay of the entire operation.
- An example of a pipelined function is a 64-bit adder with pipeline stages for adding values in eight-bit increments (e.g., from least significant byte to most significant byte). Such an adder would need eight pipeline stages to complete addition of 64-bit operands, but would have a pipeline unit delay equivalent to a single eight-bit adder.
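A behavioral model of such an adder is sketched below (illustrative only, following the eight-bit stage granularity of the example above). Each call to step() advances one in-flight addition by one stage, so eight additions can occupy the eight stages simultaneously:

```c
#include <stdint.h>

/* One addition in flight through a 64-bit adder built from eight 8-bit
 * pipeline stages. Fields sum and carry must start at zero; stage counts
 * how many bytes have been summed so far. Latency is eight steps, but a
 * new operand pair can enter the first stage every cycle. */
typedef struct {
    uint64_t a, b, sum;
    unsigned carry, stage;
    int      valid;
} add_in_flight_t;

void step(add_in_flight_t *p) {
    if (!p->valid || p->stage >= 8) return;            /* done or empty slot    */
    unsigned shift   = 8 * p->stage;
    unsigned partial = (unsigned)((p->a >> shift) & 0xFF)
                     + (unsigned)((p->b >> shift) & 0xFF)
                     + p->carry;
    p->sum  |= (uint64_t)(partial & 0xFF) << shift;    /* this stage's byte     */
    p->carry = partial >> 8;                           /* carry into next stage */
    p->stage++;
}
```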
- With pipelining, a second pair of vector elements may be input into the initial pipeline stage while the first pair of vector elements is processed by the second pipeline stage.
- Once the pipeline is full, a new output vector element is generated by the vector function each cycle.
- Figure 5A illustrates a vector operation carried out on input vectors A and B to generate output vector C.
- For example, an n-element vector multiplication operation would be represented as: C(i) = A(i) x B(i), for i = 1 to n.
- Each of vectors A-C has an associated stride value specifying the distance between consecutive vector elements in memory address space.
- Vectors A and B are provided as inputs 501A and 501B, respectively, to vector function 500.
- Output 502 of vector function 500 is directed to vector C.
- Vector function 500 comprises multiple pipeline stages or units each having a propagation delay no greater than the period of the processor clock. The latency of vector function 500 is equivalent to the number of pipeline stages multiplied by the period of the processor clock, plus the delay for constructing vectors A-C, if any.
- Vectors A-C may be stored as elements in respective vector registers, or the vectors may be accessed as needed directly from memory using the address of the first element of a respective array and an offset based on the vector's stride multiplied by the given element index.
- Vector chaining is a technique applied to more complex vector operations having multiple arithmetic operations, such as: E(i) = (A(i) x B(i)) + D(i), for i = 1 to n.
- The above relation could be calculated by first generating vector C from vectors A and B, and then, in a separate vector processing operation, generating vector E from vectors C and D. However, with chaining, it is unnecessary to wait for vector C to be complete before initiating the second vector operation. Specifically, as soon as the first element of vector C is generated by the multiplication pipeline, that element may be input into the addition pipeline with the first element of vector D. The resulting latency is equivalent to the sum of the latency of the multiplication operation and the propagation delay of the addition pipeline. The addition pipeline incurs no delay for creating vectors from memory, and throughput remains at an uninterrupted rate of one vector element per cycle. Chaining may be implemented with other vector operations as well, and is not limited to two such operations.
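Functionally, the chained computation is equivalent to the fused loop below (a behavioral sketch, not the hardware): each product streams straight from the multiplication pipeline into the addition pipeline, so no complete intermediate vector C is ever collected:

```c
/* Behavioral equivalent of chained vector multiply-add: E(i) = A(i) * B(i) + D(i). */
void vmuladd(const double *a, const double *b, const double *d,
             double *e, int n) {
    for (int i = 0; i < n; i++) {
        double c = a[i] * b[i];   /* element leaving the multiplication pipeline */
        e[i] = c + d[i];          /* immediately enters the addition pipeline    */
    }
}
```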
- Figure 5B illustrates a chained vector function with vector function 500 performing the first operation (i.e., multiplication) and vector function 505 performing the second operation (i.e., addition).
- Vector C is generated from output 502 of vector function 500 as described for Figure 5A, but rather than being collected in its entirety before proceeding, output 502 is provided as input 503B to vector function 505.
- Vector D is provided to input 503A of vector function 505, synchronized with the arrival of the corresponding element of vector C from output 502.
- Function 505 generates vector E one element per cycle via output 504.
- Pipelined vector processors have a potential for high execution rates. For example, a dual-issue processor that achieves a processor speed of 400 MHz through pipelining is theoretically capable of up to 800 MFLOPs (million floating-point operations per second). Using the high-bandwidth interconnect circuits to scale the memory system with further active DRAM devices adds further to the processing potential. For example, a four-gigabyte distributed shared memory (DSM) system implemented with sixteen-megabit active DRAM devices (i.e., sixteen megabits of DRAM memory 206) comprises 2048 active DRAM devices in parallel, for a theoretical potential of 1.6 TeraFLOPs (10^12 floating-point operations per second) and strong supercomputing performance. A DSM configuration using active DRAM devices in accordance with an embodiment of the invention is described below.
- In a DSM configuration, the processing and memory resources of multiple active DRAM devices are joined in a scalable architecture to enable parallel processing of applications over a distributed shared memory space.
- Each added active DRAM device adds to the available shared memory space, and provides another internal processor capable of executing one or more threads or processes of an application.
- The high-bandwidth interconnect 203 previously described provides the hardware mechanism by which the internal processor of one active DRAM device may communicate with another active DRAM device to access its DRAM memory resources. Data packets may be used to transmit data between devices under the direction of the router within each interconnect. Multiple devices can be coupled in a passive network via the serial links 204, with each device acting as a network node with its own router.
- The software implementation of the shared memory space may be managed at the application level, using messaging or remote method invocations under direct control of the application programmer, or it may be managed by an underlying DSM system. If implemented at the application level, the programmer manages how data sets are partitioned throughout the distributed memory, and explicitly programs the transmission of messages, remote method invocations (RMI) or remote procedure calls (RPC) between the devices to permit remote memory access where needed. While the explicit nature of the application-level design may provide certain transmission efficiencies, it may be burdensome for the application programmer to handle the necessary messaging and partitioning details of distributed memory use.
- In a DSM system, applications are written assuming a single shared virtual memory.
- The distributed nature of the shared memory is hidden from the application in the memory access routines or libraries of the DSM system, eliminating the need for the application programmer to monitor data set partitions or write message passing code.
- An application may perform a simple load or store, for example, and the DSM system will determine whether the requested data location is local or remote. In the event that the data location is local, standard memory access is performed. However, in the event that the data location is remote, the DSM system transparently initiates a memory access request of the remote device using RMI, RPC, or any other suitable messaging protocol. The DSM routines at the remote device will respond to the memory access request independent of any application processes at the remote device.
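That local/remote dispatch can be sketched as follows (entirely illustrative; the partition bounds, names and remote protocol are placeholders, not the patent's interface):

```c
#include <stdint.h>
#include <string.h>

#define LOCAL_BASE 0x00000000u
#define LOCAL_SIZE 0x00200000u              /* this device's DSM partition: 2 MB */

static uint8_t local_dram[LOCAL_SIZE];      /* stand-in for on-chip DRAM memory 206 */

/* Placeholder for the interconnect exchange (RMI, RPC or other messaging). */
static uint64_t remote_read(uint32_t addr) { (void)addr; return 0; }

/* DSM load: standard memory access when the address falls in this device's
 * partition; otherwise a transparent remote request to the owning device. */
uint64_t dsm_load(uint32_t addr) {
    if (addr - LOCAL_BASE < LOCAL_SIZE) {   /* local partition? */
        uint64_t v;
        memcpy(&v, &local_dram[addr - LOCAL_BASE], sizeof v);
        return v;
    }
    return remote_read(addr);               /* DSM routines at the remote device respond */
}
```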
- Thus, multiple active DRAM devices may be coupled together in a shared memory configuration using application-level message passing or DSM techniques.
- In one embodiment, DSM hardware support is provided (e.g., within interconnect 203) that is transparent to the instruction set architecture of processor 205.
- In this embodiment, a memory reference is translated into either a local reference or a remote reference that is handled by the DSM hardware.
- DSM routines may be implemented as firmware in non-volatile memory 202, as software in program memory 201, as hardware routines built into the instruction set architecture of the internal processor 205, or as some combination of the above.
- Figure 6 is a block diagram of a shared memory system comprising multiple active DRAM devices in accordance with an embodiment of the invention.
- The system comprises one or more host processors 600 and two or more active DRAM devices 602.
- In the example shown, host processors A and B are coupled to system busses 601A and 601B, respectively.
- Also coupled to system bus 601A via respective DRAM interfaces are multiple active DRAM devices A1-AN. Active DRAM devices A1-AN each have a high-bandwidth interconnect (e.g., serial links 604 and 605) coupled to a passive network 603 to exchange messages, such as for DSM communications or application-level messaging.
- In this manner, multiple active DRAM devices A1-AN are configured to implement a distributed shared memory system.
- Host processor A may access data from active DRAM devices A1-AN via system bus 601A and the respective conventional DRAM interfaces of the devices.
- The interconnect system described herein can support serial communications between devices up to 10 meters apart over standard serial cable, and up to 100 meters apart over optical cables.
- A DSM system can therefore extend to other active DRAM devices within a single computer system, to active DRAM devices in add-on circuit boxes or rack-mount systems, and to devices in other computer systems within the same building, for example.
- In Figure 6, the system is scaled to include a second host processor (host processor B) coupled to a further active DRAM device B1 via a second system bus 601B.
- Active DRAM device B1 is coupled into network 603 via high-bandwidth interconnect 606.
- Network 603 could be used to interconnect other DSM devices as well, including devices other than active DRAM devices.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU54902/00A AU5490200A (en) | 1999-06-30 | 2000-06-15 | Active dynamic random access memory |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US34397199A | 1999-06-30 | 1999-06-30 | |
US09/343,971 | 1999-06-30 | 1999-06-30 | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2001001242A1 | 2001-01-04 |
Family
ID=23348463
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2000/016451 | Active dynamic random access memory (Mémoire vive dynamique active) | 1999-06-30 | 2000-06-15 |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU5490200A (fr) |
WO (1) | WO2001001242A1 (fr) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5581773A (en) * | 1992-05-12 | 1996-12-03 | Glover; Michael A. | Massively parallel SIMD processor which selectively transfers individual contiguously disposed serial memory elements |
US5895487A (en) * | 1996-11-13 | 1999-04-20 | International Business Machines Corporation | Integrated processing and L2 DRAM cache |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2396442A (en) * | 2002-09-17 | 2004-06-23 | Micron Technology Inc | Host memory interface for a parallel processor |
GB2396442B (en) * | 2002-09-17 | 2006-03-01 | Micron Technology Inc | Host memory interface for a parallel processor |
GB2424503A (en) * | 2002-09-17 | 2006-09-27 | Micron Technology Inc | Active memory device including a processor array |
US7206909B2 (en) | 2002-09-17 | 2007-04-17 | Micron Technology, Inc. | Host memory interface for a parallel processor |
GB2424503B (en) * | 2002-09-17 | 2007-06-20 | Micron Technology Inc | An active memory device |
US7424581B2 (en) | 2002-09-17 | 2008-09-09 | Micron Technology, Inc. | Host memory interface for a parallel processor |
US7849276B2 (en) | 2002-09-17 | 2010-12-07 | Micron Technology, Inc. | Host memory interface for a parallel processor |
US8024533B2 (en) | 2002-09-17 | 2011-09-20 | Micron Technology, Inc. | Host memory interface for a parallel processor |
US10732866B2 (en) | 2016-10-27 | 2020-08-04 | Samsung Electronics Co., Ltd. | Scaling out architecture for DRAM-based processing unit (DPU) |
US11934669B2 (en) | 2016-10-27 | 2024-03-19 | Samsung Electronics Co., Ltd. | Scaling out architecture for DRAM-based processing unit (DPU) |
Also Published As
Publication number | Publication date |
---|---|
AU5490200A (en) | 2001-01-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
122 | Ep: pct application non-entry in european phase | ||
NENP | Non-entry into the national phase |
Ref country code: JP |