US8448010B2 - Increasing memory bandwidth in processor-based systems - Google Patents
- Publication number: US8448010B2 (application US12/570,137)
- Authority: United States (US)
- Prior art keywords: memory, clock, low, during, phase
- Prior art date: 2009-09-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
- G—PHYSICS › G11—INFORMATION STORAGE › G11C—STATIC STORES
- G11C7/00—Arrangements for writing information into, or reading information out from, a digital store
- G11C7/10—Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
- G11C7/1051—Data output circuits, e.g. read-out amplifiers, data output buffers, data output registers, data output level conversion circuits
- G11C7/1057—Data output buffers, e.g. comprising level conversion circuits, circuits for adapting load
- G11C7/106—Data output latches
- G11C7/1066—Output synchronization
Definitions
- This relates generally to processor-based systems and, particularly, to techniques for increasing the bandwidth to and from memories associated with those processor-based systems.
- A processor-based system is any system that uses a processing unit to execute instructions.
- The processing unit may be a controller, a central processing unit, a graphics processor, or a computer, as examples.
- A graphics processor may be utilized to operate on graphical data.
- A graphics core may include execution units such as a mathematics box and an arithmetic logic unit.
- It may be desirable to increase the amount of data that may be handled by a memory, called a general purpose register file, to enable the execution unit to implement new functions.
- FIG. 1 is a schematic depiction of one embodiment of the present invention.
- FIG. 2 is a schematic depiction of the read data path in accordance with one embodiment.
- FIG. 3 is a timing diagram for the read data path in accordance with one embodiment.
- FIG. 4 is a schematic depiction of the write data path for one embodiment.
- FIG. 5 is a timing diagram for the write data path in accordance with one embodiment.
- FIG. 6 is a system depiction in accordance with one embodiment.
- A memory that supplies data to a processing unit may be capable of higher bandwidth to enable the processing unit to perform more functions.
- The processing unit may be a processing unit of a graphics core of a graphics processor.
- The present invention is not necessarily limited to a particular processing unit or a particular memory.
- At least two execution engines may be utilized, including a mathematics box and an arithmetic logic unit.
- The mathematics box may be dedicated to extended mathematics functions, such as log, sine, exponent, reciprocal, square, and the like.
- The mathematics box may include basic adders, shifters, and multipliers.
- The mathematics box functionality may be expanded, in one embodiment, to perform additional functions, such as single precision floating point addition, multiplication, and fused multiply-add (FMA) operations.
- The arithmetic logic unit may perform integer, floating point, and logical operations in one embodiment.
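The fused multiply-add mentioned above computes a·b+c with a single rounding, rather than rounding once after the multiply and again after the add. A minimal Python sketch illustrates the difference; the `fma_emulated` helper is our own illustration (it emulates the single rounding with exact rational arithmetic) and is not from the patent:

```python
from fractions import Fraction

def fma_emulated(a, b, c):
    """Emulate a fused multiply-add: compute a*b + c exactly, round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = b = 1.0 + 2**-27
c = -(1.0 + 2**-26)

unfused = a * b + c          # the multiply rounds first, discarding the 2**-54 term
fused = fma_emulated(a, b, c)

print(unfused)   # -> 0.0
print(fused)     # -> 2**-54 (about 5.55e-17)
```

With an unfused multiply, the low-order term of the product is rounded away before the add cancels the leading terms; the fused operation retains it.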
- The bandwidth of a memory that supplies data to a processing unit, such as a mathematics box, and that receives data from the processing unit, may be increased.
- A graphics processor may use a memory called a general purpose register file to supply data to the mathematics box and to receive data from the mathematics box. It is desirable, in some embodiments, to increase the bandwidth between the processing unit and the memory without adding ports or doubling frequencies.
- A memory 102, such as a general purpose register file, is coupled by a bidirectional bus to swizzle logic 104.
- The swizzle logic may distribute data to and from the processing units which, in one embodiment, may be a mathematics box 106 and an arithmetic logic unit (ALU) 108.
- A graphics core 100 may be part of a graphics processor integrated circuit.
- The memory 102 may be an array of memory cells arranged in banks. Each bank may have a predetermined number of memory cells, such as 8, 16, or 32, coupled together through a local bit line. Each bank may include 8 word lines with four cells each, as one example. All the banks of the memory 102 may be tied together by a single global bit line in one embodiment.
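The bank organization described above can be sketched as a small behavioral model; the Python class and parameter names below are our own invention for illustration, not from the patent:

```python
# Hypothetical sketch of the organization described above: each bank holds
# 8 word lines of 4 cells on a local bit line, and all banks are tied to
# one shared global bit line.

WORDLINES_PER_BANK = 8
CELLS_PER_WORDLINE = 4

class Bank:
    def __init__(self):
        # 8 word lines x 4 cells, all initially storing 0
        self.cells = [[0] * CELLS_PER_WORDLINE for _ in range(WORDLINES_PER_BANK)]

    def read(self, wordline):
        """Drive the local bit line with the selected word line's cells."""
        return list(self.cells[wordline])

class RegisterFile:
    """A general purpose register file modeled as an array of banks."""
    def __init__(self, num_banks):
        self.banks = [Bank() for _ in range(num_banks)]

    def read(self, bank, wordline):
        # only one bank drives the shared global bit line per access
        return self.banks[bank].read(wordline)

rf = RegisterFile(num_banks=2)
rf.banks[0].cells[3] = [1, 0, 1, 1]
print(rf.read(0, 3))   # -> [1, 0, 1, 1]
```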
- One of the word lines of a bank of interest is turned on during the high phase of a clock signal, and data is read out from a memory cell.
- The other banks stay idle while the selected bank pre-charges the selected bit line.
- In this way, the read and write bandwidth of the memory 102 may be doubled.
- During the high phase, a first set of 32 cells of a bank may be read, in an embodiment where a bank includes 32 cells. But, of course, other sizes of banks or numbers of cells per bank may be utilized.
- During the low phase, a word line from another of the remaining banks is read out. Generally, data is not read from the bank that was just read in the previous phase.
- For example, if bank 0 was read during the high phase, bank 0 is not read again in the following phase.
- When writing to a bank, only that bank turns on while the others remain idle.
- Use of a writing technique like the above-described read technique can double the write bandwidth.
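The phase-interleaved access scheme above can be sketched behaviorally: one access in the high phase and one in the low phase of each clock cycle, with the restriction that the bank accessed in one phase is not accessed again in the next while it pre-charges. The scheduling helper below is a hypothetical illustration of ours, not from the patent:

```python
# Behavioral sketch of phase-interleaved reads: two accesses per clock cycle
# (one per phase) instead of one, so bandwidth doubles. Back-to-back phases
# must target different banks, since the bank just accessed pre-charges.

def schedule_reads(requests):
    """Assign (cycle, phase) slots to bank read requests, enforcing that
    consecutive phases never hit the same bank."""
    slots = []
    prev_bank = None
    for bank in requests:
        if bank == prev_bank:
            raise ValueError("bank %d needs a pre-charge phase before re-read" % bank)
        phase = "high" if len(slots) % 2 == 0 else "low"
        cycle = len(slots) // 2
        slots.append((cycle, phase, bank))
        prev_bank = bank
    return slots

# Alternating banks: 4 reads complete in 2 cycles, i.e. double bandwidth.
plan = schedule_reads([0, 1, 0, 1])
print(plan)   # -> [(0, 'high', 0), (0, 'low', 1), (1, 'high', 0), (1, 'low', 1)]
```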
- A read data path 10 is depicted for the memory 102, according to one embodiment.
- The read data path includes a port 0 for the high phase read and a port 1 for the low phase read.
- Port 0 is associated with output latch 32, labeled port A, for the high phase.
- Port 1, for the low phase, is associated with output latch 48, labeled port B.
- The high phase data path is shown by dashed lines and the low phase data path is shown by dotted lines.
- The port 0 and port 1 decoded signals are combined by an OR gate 12 and supplied to a transistor 14 or 36 for a selected word line.
- This supplies a read word line signal to the selected word line, which is labeled RDWL0 in this example.
- The selected word line (one of RDWL0-RDWL7) that is turned on is coupled to a local bit line LBL 34.
- Each word line also has an enable signal, such as data 0 or data 7, coupled to a transistor 16 or 38.
- Transistor 18 receives a low pre-charge signal LPCH0# (active low) for the local bit line LBL1.
- Each local bit line has 16 cells.
- The NOR gate 20 also receives the local pre-charge signal LPCH1# for the other local bit line, LBL2.
- The NAND gate 40 selects one of the two local bit lines (the local bit line LBL2 is shown only at its connection to the gate 40) to be coupled to a global bit line GBL 46, in that embodiment.
- In the high phase, the local bit line LBL1 is utilized, for example, and in the low phase, the other local bit line LBL2 is utilized.
- The output may be provided in the high phase on the port A latch 32 and in the low phase on the port B latch 48.
- The p-channel transistors 22, 24, 26, and 28 allow the global bit line (GBL) to charge to the supply voltage when reading a zero from a memory cell.
- When the local bit line equals one, transistor 42 is off, transistors 22 and 24 will be on, and the global bit line will be high. When the local bit line is zero, transistor 42 is on and the global bit line is low.
- The transistor 30 may be used to pre-charge the global bit line and the transistor 44 may be used to discharge the global bit line. Of course, other ways of reading the cell may also be used.
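The sensing behavior described above reduces to a small truth function. The Python sketch below is a hypothetical behavioral model (function and argument names are ours, not from the patent):

```python
# Behavioral model of the read sensing described above: the high phase
# selects the first local bit line and the low phase selects the second;
# the pre-charged global bit line stays high for a 1 (pull-down off) and
# is discharged for a 0 (pull-down on).

def sense(phase, lbl1, lbl2):
    """Resolve the global bit line from the phase-selected local bit line."""
    selected = lbl1 if phase == "high" else lbl2
    return 1 if selected == 1 else 0

print(sense("high", lbl1=1, lbl2=0))   # -> 1 (port A path, LBL1)
print(sense("low", lbl1=1, lbl2=0))    # -> 0 (port B path, LBL2)
```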
- The operation of the high phase is further illustrated in the timing diagram of FIG. 3, showing the first low phase of the clock, 1L, followed by the first high phase, 1H; the next cycle includes low phase 2L and high phase 2H, and so on.
- The reading of the port A signals begins with pre-decoding of the port A signals, as indicated by the arrow A.
- The port A word line is then activated, as indicated by arrow B, and port B is pre-decoded, as indicated by arrow C, triggered from the leading edge of the first high phase, 1H, of the clock (CLK).
- The port A data out is activated, as indicated by arrow D, so that the data is provided through the port A latch 32, as indicated by arrow E, beginning at the leading edge of the 2H clock phase.
- The reading of the port B signals begins with the port B pre-decoding, as indicated by the arrow F.
- The port B pre-decoding is followed by the port B word line going high during the second low phase, as indicated by arrow G, and the port B read out occurs during the second high phase, as indicated by arrow H.
- Finally, the port B data out occurs, as indicated by arrow I. This sequence continues, alternating back and forth from port A to port B.
- A write data path 50, for one embodiment, is shown in FIG. 4.
- The write word line is ORed, at OR gate 52, between the port 0 and port 1 decoded addresses.
- Port 0 data is written for bank 0 and port 1 data is written for bank 1.
- The write data path 50 for each bank includes a local write bit line 82, coupled to a latch array 72, in turn coupled to a global bit line 80.
- The local write bit line is selected through a select transistor 54 or 62, coupled through the cross-coupled inverters 56, and the selection transistor 60 or 68.
- A plurality of cells in an array 86 may be coupled to the local write bit line. Each set of 16 cells is coupled through a local NAND gate to the local write bit line.
- The local NAND gate is made up of the transistors 54 and 60 and the cross-coupled inverters 56 and 58, or 64 and 66.
- The conventional write drivers are replaced with latches 72.
- A latch is opened in the same phase as a memory cell write. This overcomes the problem caused by a selected bit line toggling and consuming dynamic power. When a bit line is not being written, the latch may be closed. As a result, the unselected bit lines do not toggle, saving power.
- Separate latches may be used for the high and low phases to overcome this problem.
- A bundle is all the cells connected to a local bit line. For example, in a system with 16 cells in a bank, that same bundle cannot be written in successive high and low cycles.
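The bundle restriction above can be expressed as a simple checker. This is a hypothetical sketch of ours (the helper names are not from the patent), assuming 16 cells per local bit line as in the example:

```python
# Checker for the bundle constraint: a bundle (all cells on one local write
# bit line) may not be written in two successive clock phases, since it needs
# the intervening phase before it can be written again.

CELLS_PER_BUNDLE = 16  # per the example in the text

def bundle_of(cell_index):
    """Map a cell index to the bundle (local bit line) it belongs to."""
    return cell_index // CELLS_PER_BUNDLE

def check_write_sequence(cell_writes):
    """cell_writes: the cell index written in each successive phase."""
    for prev, cur in zip(cell_writes, cell_writes[1:]):
        if bundle_of(prev) == bundle_of(cur):
            return False   # same bundle in back-to-back phases: illegal
    return True

print(check_write_sequence([0, 16, 1, 17]))   # -> True (bundles alternate)
print(check_write_sequence([0, 3]))           # -> False (both in bundle 0)
```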
- The port A signal is triggered from the leading edge of the 1L phase, as indicated by the arrow J.
- The port A pre-decode initiates the port A word line going high, as indicated by the arrow K, which in turn results in the port A cell flip, as indicated by the arrow L.
- In the timing diagram, the port A "cell flip" corresponds to the actual writing of the information into the cell.
- Port A is written during the first high cycle and port B is written during the second high cycle.
- The port A word line is addressed at the beginning of the first high cycle and the port B word line is activated at the beginning of the second low phase.
- The port B signals trigger off the leading edge of the 1H clock phase, as indicated by the arrow M, followed by the port B pre-decode, also indicated by the arrow M, still in the 1H phase.
- The port B word line goes high in the 2L phase, as indicated by the arrow N, followed by the port B cell flip, as indicated by the arrow O.
- The port B cell flip is completed at the end of the 3L clock phase.
- A computer system 130 may include a hard drive 134 and a removable medium 136, coupled by a bus 124 to a chipset core logic 110.
- The core logic may couple to the graphics processor 112 (via bus 105) and the main or host processor 122, in one embodiment.
- The graphics processor 112 may also be coupled by a bus 126 to a frame buffer 114.
- The frame buffer 114 may be coupled by a bus 107 to a display screen 118, in turn coupled by a bus 128 to conventional components, such as a keyboard or mouse 120.
- Code 139 may be stored in a machine readable medium, such as main memory 132, for execution by a processor, such as the processor 100 or the graphics processor 112.
- The graphics core is part of the graphics processor 112.
- Active power savings may be achieved because only the local bit lines to the bank of interest toggle. This is because the write drivers are replaced with latches. If data is being written in bank 0, then only that latch turns on and the other latches (bank 1 in FIG. 4) will be in the off state.
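The latch gating above can be modeled behaviorally. The sketch below is a hypothetical illustration of ours (names are not from the patent): only the selected bank's latch opens, so unselected banks accumulate no bit-line toggles:

```python
# Behavioral model of latch-gated write drivers: a bank's local bit line
# toggles only when that bank's latch opens with a new value; banks that
# are not written stay idle and consume no dynamic power.

def drive_writes(writes, num_banks=2):
    """writes: list of (bank, value) pairs. Returns per-bank toggle counts."""
    toggles = [0] * num_banks
    latched = [None] * num_banks
    for bank, value in writes:
        # only the selected bank's latch opens; the others stay closed
        if latched[bank] != value:
            toggles[bank] += 1
            latched[bank] = value
    return toggles

print(drive_writes([(0, 1), (0, 0), (0, 1)]))   # -> [3, 0]: bank 1 never toggles
```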
- Graphics processing techniques described herein may be implemented in various hardware architectures. For example, graphics functionality may be integrated within a chipset. Alternatively, a discrete graphics processor may be used. As still another embodiment, the graphics functions may be implemented by a general purpose processor, including a multicore processor.
- References throughout this specification to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation encompassed within the present invention. Thus, appearances of the phrase "one embodiment" or "in an embodiment" are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be instituted in suitable forms other than the particular embodiment illustrated, and all such forms may be encompassed within the claims of the present application.
Landscapes
- Static Random-Access Memory (AREA)
Abstract
Description
Claims (14)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/570,137 US8448010B2 (en) | 2009-09-30 | 2009-09-30 | Increasing memory bandwidth in processor-based systems |
Publications (2)
Publication Number | Publication Date |
---|---|
US20110078485A1 US20110078485A1 (en) | 2011-03-31 |
US8448010B2 true US8448010B2 (en) | 2013-05-21 |
Family
ID=43781630
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/570,137 Active 2031-10-13 US8448010B2 (en) | 2009-09-30 | 2009-09-30 | Increasing memory bandwidth in processor-based systems |
Country Status (1)
Country | Link |
---|---|
US (1) | US8448010B2 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI737502B (en) * | 2019-09-30 | 2021-08-21 | 台灣積體電路製造股份有限公司 | Memory device and method of latching signal |
US12367914B2 (en) | 2022-03-07 | 2025-07-22 | Intel Corporation | Circuit topology for high performance memory with secondary pre-charge transistor |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4903240A (en) * | 1988-02-16 | 1990-02-20 | Tektronix, Inc. | Readout circuit and method for multiphase memory array |
US6240038B1 (en) * | 2000-02-21 | 2001-05-29 | Hewlett Packard Company | Low area impact technique for doubling the write data bandwidth of a memory array |
US20030076909A1 (en) * | 2001-10-22 | 2003-04-24 | Greenhill David J. | Method for synchronizing clock and data signals |
US20030217244A1 (en) * | 2002-05-15 | 2003-11-20 | Kelly James Daniel | Memory controller configurable to allow bandwidth/latency tradeoff |
US20050144525A1 (en) * | 2003-12-05 | 2005-06-30 | Keerthinarayan Heragu | Method to test memories that operate at twice their nominal bandwidth |
US20050195679A1 (en) * | 2004-03-03 | 2005-09-08 | Faue Jon A. | Data sorting in memories |
US20060176751A1 (en) * | 2005-02-04 | 2006-08-10 | Torsten Partsch | Methods and apparatus for implementing a power down in a memory device |
US20080037357A1 (en) * | 2006-08-11 | 2008-02-14 | Pelley Iii Perry H | Double-rate memory |
US7623404B2 (en) * | 2006-11-20 | 2009-11-24 | Freescale Semiconductor, Inc. | Memory device having concurrent write and read cycles and method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: INTEL CORPORATION, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:DAMARAJU, SATISH K.;MAIYURAN, SUBRAMANIAM;AMBARDAR, ANUPAMA;AND OTHERS;SIGNING DATES FROM 20090928 TO 20091103;REEL/FRAME:026477/0521 |
|
FEPP | Fee payment procedure |
Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FPAY | Fee payment |
Year of fee payment: 4 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |
|
AS | Assignment |
Owner name: TAHOE RESEARCH, LTD., IRELAND Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTEL CORPORATION;REEL/FRAME:061175/0176 Effective date: 20220718 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 12 |