GB2583121A - In memory computation - Google Patents

In memory computation

Info

Publication number
GB2583121A
GB2583121A GB1905465.9A GB201905465A GB2583121A GB 2583121 A GB2583121 A GB 2583121A GB 201905465 A GB201905465 A GB 201905465A GB 2583121 A GB2583121 A GB 2583121A
Authority
GB
United Kingdom
Prior art keywords
global
bitline
memory
bitlines
data value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1905465.9A
Other versions
GB201905465D0 (en)
GB2583121B (en)
Inventor
Stansfield Anthony
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Surecore Ltd
Original Assignee
Surecore Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Surecore Ltd filed Critical Surecore Ltd
Priority to GB1905465.9A priority Critical patent/GB2583121B/en
Publication of GB201905465D0 publication Critical patent/GB201905465D0/en
Publication of GB2583121A publication Critical patent/GB2583121A/en
Application granted granted Critical
Publication of GB2583121B publication Critical patent/GB2583121B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C7/00 Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/10 Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C7/1006 Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/065 Analogue means
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C11/00 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/21 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
    • G11C11/34 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
    • G11C11/40 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
    • G11C11/41 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger
    • G11C11/413 Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction
    • G11C11/417 Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction for memory cells of the field-effect type
    • G11C11/418 Address circuits
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C11/00 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/21 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
    • G11C11/34 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
    • G11C11/40 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
    • G11C11/41 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger
    • G11C11/413 Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction
    • G11C11/417 Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction for memory cells of the field-effect type
    • G11C11/419 Read-write [R-W] circuits
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C5/00 Details of stores covered by group G11C11/00
    • G11C5/02 Disposition of storage elements, e.g. in the form of a matrix array
    • G11C5/025 Geometric lay-out considerations of storage- and peripheral-blocks in a semiconductor storage device
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C7/00 Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/06 Sense amplifiers; Associated circuits, e.g. timing or triggering circuits
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C7/00 Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/12 Bit line control circuits, e.g. drivers, boosters, pull-up circuits, pull-down circuits, precharging circuits, equalising circuits, for bit lines
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C7/00 Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/18 Bit line organisation; Bit line lay-out
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C8/00 Arrangements for selecting an address in a digital store
    • G11C8/08 Word line control circuits, e.g. drivers, boosters, pull-up circuits, pull-down circuits, precharging circuits, for word lines
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C8/00 Arrangements for selecting an address in a digital store
    • G11C8/10 Decoders
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C8/00 Arrangements for selecting an address in a digital store
    • G11C8/12 Group selection circuits, e.g. for memory block selection, chip selection, array selection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/48 Indexing scheme relating to groups G06F7/48 - G06F7/575
    • G06F2207/4802 Special implementations
    • G06F2207/4818 Threshold devices
    • G06F2207/4824 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Neurology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Dram (AREA)

Abstract

A memory unit comprising one or more global bitlines A, B and one or more global bitline drivers 20, each comprising one or more bit cells BC1/BC2 storing first, second data values (VA1, VA2). A (local) sense circuit SC1 is connected to the cell(s) whilst a charge storage and delivery device 20 stores first and second charge values (CV1, CV2) and is connected to a global bitline to transfer a stored charge value. The sense circuit transfers the first or second stored charge value to the global bitlines dependent upon the data value within the bitcells, utilising the bitcell local bitline(s). The charge storage and charge transfer unit 20 may comprise two capacitors CA, CB or a single capacitor (CA, fig 4) applied differentially across the global bitlines. The memory unit may further include two or more bitlines, and two or more associated wordlines holding a wordline data value. In this way, a bit cell along with the wordline may provide both a data value and a logical value upon which one or other of the stored charge values will be transferred to the global bitline. The unit may further include a row decoder, pre-decoder, and summing circuits for summing outputs from two or more Global Sense Amplifiers GSA (see fig 8). The memory unit may form part of a computational memory system enabling in-memory computation using SRAM technology, suitable for dot product and product-sum calculations.

Description

IN MEMORY COMPUTATION
The present invention relates to computational memory systems and relates particularly but not exclusively to a system and method of using the same which allows for improved computation processes such as in-memory computation.
In-memory computation is a concept where some basic computational tasks can be performed in logic that is located close to or within the memory arrays in a processing device, rather than in the Central Processing Unit (CPU).
The aim is to achieve a system that provides higher throughput, and/or lower power. This is achieved by reducing the amount of data that needs to be moved between the memory system and the CPU, reducing the traffic between memory system and CPU or providing higher bandwidth to the memory system. Reducing the amount of data that needs to be moved between the memory system and the CPU means that only the result of a calculation needs to be transferred, not all the input values. Reducing the traffic between memory system and CPU can save power by reducing the number of transitions on the chip-level buses or, alternatively, it can increase overall system performance by freeing up bus and CPU time for other tasks.
Providing higher bandwidth to the memory system ensures that more data can be processed at once. For instance, in-memory logic can operate on a whole cache line at once, rather than on the individual words in the cache line, as is the case when processed in the CPU.
Basic Functions for Artificial Intelligence
A lot of Artificial Intelligence (AI) applications make use of a "Dot Product" or "Inner Product" function:
Equation 1
P = a \cdot b = \sum_{k=0}^{n-1} a_k b_k = a_0 b_0 + a_1 b_1 + \dots + a_{n-1} b_{n-1}
Typically, one of the input vectors (e.g. a) is a set of input data, and the other (b) is a set of coefficients, or weights. Each set of inputs will be combined with multiple sets of weights, to give multiple outputs, and these outputs may then be the inputs to another similar calculation.
Bit-level arithmetic
The individual numbers a_k, b_k in Equation 1 are themselves vectors of bits, and so can be written as:
Equation 2
a_k = \sum_{i=0}^{l-1} 2^i a_{k,i}, \qquad b_k = \sum_{j=0}^{m-1} 2^j b_{k,j}
with each of the a_{k,i} and b_{k,j} being a single-bit value (i.e. equal to 0 or 1). The product is:
a_k b_k = \left( \sum_{i=0}^{l-1} 2^i a_{k,i} \right) \left( \sum_{j=0}^{m-1} 2^j b_{k,j} \right) = \sum_{i=0}^{l-1} \sum_{j=0}^{m-1} 2^{i+j} a_{k,i} b_{k,j}
Therefore, the overall sum is:
a \cdot b = \sum_{k=0}^{n-1} \sum_{i=0}^{l-1} \sum_{j=0}^{m-1} 2^{i+j} a_{k,i} b_{k,j}
Or, if the order of summation is changed:
Equation 3
a \cdot b = \sum_{i=0}^{l-1} \sum_{j=0}^{m-1} 2^{i+j} \sum_{k=0}^{n-1} a_{k,i} b_{k,j} = \sum_{i=0}^{l-1} 2^i \sum_{j=0}^{m-1} 2^j \sum_{k=0}^{n-1} a_{k,i} b_{k,j}
This alternative order means firstly combining bits of equal weight from different words, and then doing two weighted sums of these partial results (one over j - the bits in the coefficient, and one over i - the bits in the input data). This is a different order to the more traditional form of calculating the products of the different words and then summing them, but will give the same result, and will be important later as part of the invention.
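By way of illustration only, the reordering in Equation 3 can be checked numerically. The following Python sketch is not part of the disclosed circuit; the word widths and vector length are arbitrary assumptions:

    import random

    def dot_product_reordered(a, b, l_bits, m_bits):
        # Sum each (i, j) bit plane over all words first (the innermost sum
        # over k in Equation 3), then apply the power-of-two weight 2**(i+j).
        total = 0
        for i in range(l_bits):          # bits of the input data a
            for j in range(m_bits):      # bits of the coefficients b
                plane_sum = sum(((ak >> i) & 1) * ((bk >> j) & 1)
                                for ak, bk in zip(a, b))
                total += (1 << (i + j)) * plane_sum
        return total

    random.seed(0)
    a = [random.randrange(16) for _ in range(8)]   # 4-bit input words
    b = [random.randrange(16) for _ in range(8)]   # 4-bit coefficients
    assert dot_product_reordered(a, b, 4, 4) == sum(x * y for x, y in zip(a, b))

The reordered and direct forms agree, as expected from Equation 3.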
Serial arithmetic
Serial arithmetic is a known technique for performing arithmetic operations in multiple cycles of a system clock in order to reduce the amount of logic required. For instance, multiplication by a constant can be reduced to a shift-and-add circuit that is used over multiple cycles.
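As a purely illustrative aside (not taken from the patent itself), the serial-arithmetic idea can be sketched in Python as a shift-and-add loop that processes one bit of the constant per 'cycle'; the 8-bit width is an assumption:

    def serial_multiply(x, constant, n_bits=8):
        # One cycle per bit of the constant (LSB first): add a shifted
        # copy of x whenever the corresponding constant bit is 1.
        acc = 0
        for cycle in range(n_bits):
            if (constant >> cycle) & 1:
                acc += x << cycle
        return acc

    assert serial_multiply(13, 11) == 13 * 11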
In-Memory Computation for AI
Equation 1 can be power-intensive to compute - combining the two n-word vectors (a and b) requires:
* Reading 2n words from memory
* Moving the 2n words from memory to CPU
* Performing n multiplications
* Performing n-1 additions on the results of the multiplications
* Saving the result
If given some assistance by in-memory computation, these requirements can be reduced. For instance:
* If the multiplications are performed close to the memory, then only the multiplication results need to be moved from memory to CPU, which halves the number of memory-to-CPU transfers required.
* If the addition is also moved close to the memory, then the number of memory-to-CPU transfers can be further reduced, from n to 1.
The arithmetic functions are still required, but the data transfers can be significantly reduced in this way.
Use of Flash Memory
It is known to do computation in a modified flash memory structure. Flash memory is a non-volatile memory technology, where data is stored by varying the threshold voltage (V_T) of the storage transistors. This is done by forcing charge into the gate during a specialised write cycle. This V_T shift results in a change in the current that is drawn by the memory cell when a gate voltage is applied, and the difference in current is measured in order to determine whether a cell is storing a 1 or a 0. The V_T shift is continuously variable and is set by the characteristics of the programming cycle. Therefore, the current drawn by the transistor is also continuously variable under programmable control. If more than one bitcell is active in a cycle, then the total current being drawn from a bitline will be the sum of the currents from the individual cells, i.e.
I_{total} = \sum_i W_i I_i
with I_i being the current drawn when bitcell i is on, and W_i being 1 if bitcell i is active, and 0 if it is inactive (i.e. the W_i are the wordline states).
The above equation looks very like the equation for the dot product given previously, so that the bitline current (and by extension the bitline voltage) looks like it can be used to compute the dot product of a set of input data (W_i) and a set of stored coefficients (I_i). In view of the above, it has been suggested that a flash memory can potentially be used to implement the dot product function that is key to AI applications, and offers the possibility of a compact, power-efficient implementation. The power is significantly reduced compared to using digital multipliers and adders, being comparable to the power used by a few read accesses to a flash memory.
Unfortunately, the same technique cannot be used in a Static Random-Access Memory (SRAM), for two main reasons. Firstly, the read current from an SRAM bitcell is not programmable and is also subject to significant manufacturing variation. Secondly, it is not possible to reliably determine how many bitcells, or which bitcells, are connected to a bitline simply by measuring bitline current. In addition, in an SRAM, if a bitcell storing a 1 is actively connected to the same bitline as a cell containing a 0, when the bitline is pulled down by the 0-containing cell, there is a risk of this overwriting the contents of the 1-containing cell. In the context of using the memory for a computation, this means that the coefficient data could be destroyed during the computation.
In view of the above, there exists a need for better systems and processes to facilitate in-memory computation. Accordingly, the present invention aims to overcome the shortfalls of the known approaches whilst also providing a solution which is particularly applicable to systems using SRAM devices. Such an approach overcomes the issues listed above by using a switched-bitline memory of the type having a hierarchical bitline structure and a capacitive drive of the global bitline during a read. Such a device is as described in GB2512844 or US9406351.
Statement of Invention
The present invention provides a memory unit having one or more global bitlines A, B and one or more global bitline drivers GBLD1, GBLD2, each of said one or more global bitline drivers GBLD1, GBLD2 having one or more bit cells BC1, BC2 for storing first or second data values VA1 or VA2. A sense circuit SC1 may be connected to one or more of said one or more bit cells BC1, BC2. A charge storage and delivery device for storing first and second charge values CV1, CV2 may be connected to said global bitline A, B to deliver a stored charge value thereto. The sense circuit SC1 is configured and/or connected and operable to deliver one or other of said stored charge values CV1, CV2 to said global bitlines A, B dependent upon the data value VA1, VA2 within said one or more bitcells BC1, BC2.
In one arrangement there are two or more bitcells BC1, BC2 and there may further be included two or more wordlines WL1, WL2 for holding a wordline data value WDV1, WDV2 and being connected to one or other of said one or more bit cells BC1, BC2. Said sense circuit SC1 may be operable to deliver one or other of said stored charge values CV1, CV2 to one or more of said one or more global bitlines A, B dependent upon the data value VA1, VA2 within said one or more bitcells BC1, BC2 and the data value WDV1, WDV2 within said one or more wordlines WL1, WL2.
It may be that the arrangement includes first and second global bitlines A, B and said sense circuit SC1 may be operable to deliver a first of said stored charge values CV1 to a first of said global bitlines A and to deliver a second of said stored charge values CV2 to the second of said global bitlines B dependent upon the data value VA1, VA2 within said one or more bitcells BC1, BC2.
In an alternative arrangement having first and second global bitlines A, B, said sense circuit SC1 may be operable to deliver a first of said stored charge values CV1 to a first of said global bitlines A and to deliver the second of said stored charge values CV2 to the second of said global bitlines B dependent upon the data value VA1, VA2 within said one or more bitcells BC1, BC2 and the data value WDV1, WDV2 within said one or more wordlines WL1, WL2.
There is also disclosed a memory unit having a bitcell BC1 and a wordline WL1 for holding a wordline data value WDV1 and being connected to said bit cell BC1, wherein said sense circuit SC1 is operable to deliver one or other of said stored charge values CV1, CV2 to one or more of said one or more global bitlines A, B dependent upon the logical AND of the data value VA1, VA2 within said bitcell BC1 and the data value WDV1, WDV2 within said wordline WL1.
The arrangement may include two or more global bitline drivers GBLD1, GBLD2 each being connected to each of said first and second global bitlines A, B, and wherein the charge value on each global bitline is the summation of the charge values delivered by each of the global bitline drivers GBLD1, GBLD2 connected thereto.
In a practical embodiment there is provided a plurality of rows of memory groups LBLG0, LBLG1, LBLG2, LBLG3, each memory group comprising a plurality of memory cells arranged in two or more rows RA, RB, each being served by a respective pair of wordlines and a respective memory cell in each row being connected to a respective global bitline GBLA1-1-n, GBLB2-1-n.
The arrangement may further include a Row Decoder having an input and an output, said output being connected for activating each of said respective pairs of wordlines.
A Pre-Decoder may be provided and operably connected to the input of the Row decoder and connected to a source of Serial Data or Coefficient Address information.
The memory unit 10 may be arranged such that each pair of respective global bitlines GBLA1-1-n, GBLB2-1-n are each connected to a respective Global Sense Amplifier GSA1-n for amplification of an output therefrom.
The arrangement may include one or more Summing Circuits for summing two or more outputs from the Global Sense Amplifiers GSA1-n.
Description of the Drawings
The present invention will now be described by way of example only with reference to the accompanying drawings, in which:
Figure 1 illustrates the implementation of the local and global bitlines with only a single local bitline per bit cell;
Figure 2 illustrates the implementation with two local bitlines per bit cell, and global bitline pairs, as would be typical in an SRAM implementation;
Figure 3 is a variant of that shown in the above-mentioned figures where there is a multiplexer between local bitlines and the local sense circuit;
Figure 4 is a variant where a single capacitor is used instead of the two capacitors (CA and CB);
Figure 5 is a typical prior-art row decoder circuit;
Figure 6 is a modified row decoder circuit, as may be used with the present invention;
Figure 7 illustrates an alternative way to modify the row decoder circuit;
Figure 8 is a top-level block diagram of the memory, showing row decoder, memory array, global bitlines and global sense amplifiers (GSA), and sum circuit as used with the present invention;
Figure 9 illustrates potential summing circuits; and
Figure 10 illustrates an alternative implementation of the sum circuit.
Detailed Description
Referring to the drawings in general but particularly to figures 1, 2 and 8, a memory unit 10 has one or more global bitlines A, B and one or more global bitline drivers GBLD1, GBLD2. Each of said one or more global bitline drivers GBLD1, GBLD2 may comprise one or more bit cells BC1, BC2 for storing first or second data values VA1 or VA2. A sense circuit SC1, SC2 is connected via a local bitline LBL1, LBL2 to one or more of said one or more bit cells BC1, BC2 or BC1.1, BC1.2. A charge storage and delivery device 20 is provided for storing first and second charge values CV1, CV2 in Charge Storage Devices CSD1 and CSD2 respectively; these may comprise capacitors and may be connected to said global bitline A, B to deliver a stored charge value thereto. The capacitors can be precharged (for instance) to Vdd and Gnd.
The sense circuit SC1 is operable to deliver one or other of said stored charge values CV1, CV2 to said global bitlines A, B dependent upon the data value VA1, VA2 within said one or more bitcells BC1, BC2. Two or more wordlines WL1, WL2 are provided for holding a wordline data value WDV1, WDV2 and being connected to one or other of said one or more bit cells BC1, BC2, and wherein said sense circuit SC1 is operable to deliver one or other of said stored charge values CV1, CV2 to one or more of said one or more global bitlines A, B dependent upon the data value VA1, VA2 within said one or more bitcells BC1, BC2 and the data value WDV1, WDV2 within said one or more wordlines WL1, WL2.
The arrangement of each figure has first and second outputs 40, 42 of respective local sense circuits SC1, SC2, and the arrangement of Figure 1 is one in which each output can only be applied to a single bitline GA or GB, and can only apply the output once associated switches 50, 52 are activated. This arrangement is shown in each of the two Global Bitline Drivers GBLD1, GBLD2.
As mentioned, figure 1 shows the implementation of the local and global bitlines, with only a single local bitline per bit cell. In a "compute" cycle, the capacitors CA and CB are precharged to different voltages. A wordline optionally goes high in each local wordline group. This connects a bitcell to the local bitline, so that a signal (corresponding to a 1 or a 0) is developed on the local bitline. The local sense circuits then sense the states of the local bitlines. Each local sense circuit activates one of its outputs, depending on the result of the sense operation, so that either CA or CB is connected to the global bitline. The overall voltage on the global bitline depends on the sum of the precharge voltages of the capacitors that are connected to it, and therefore on the sum of the results of the local bitline sense operations.
Figure 2 shows an arrangement with two global bitlines A, B. In this arrangement said sense circuit SC1 is connected to each of said bitlines via connections 30, 32, 34, 36 and is thereby operable to deliver a first of said stored charge values CV1 to a first of said global bitlines A or B and to deliver the second of said stored charge values CV2 to the second of said global bitlines B or A dependent upon the data value VA1, VA2 within said one or more bitcells BC1, BC2.
A connection is made between each of said first and second switches 50, 52 by means of connection 53 and the charge storage device CSD2 connected such as to allow for the delivery of the charge value CV2 stored therein by means of an electrical connection shown at 53A. A mirror arrangement is provided between third and fourth switches 54, 56 by means of connection 55 between the switches 54, 56 and an electrical connection 55A between connection 55 and charge storage device CSD1. For convenience, the group of connections 30, 32 and 50 will herein be referred to as Delivery Line Group A (DLGA) whilst connections 34, 36 and 55 will be referred to as Delivery Line Group B (DLGB).
In an alternative arrangement said sense circuit SC1, SC2 may be operable to deliver a first of said stored charge values CV1 to a first of said global bitlines A or B and to deliver the second of said stored charge values CV2 to the second of said global bitlines B or A dependent upon the data value VA1, VA2 within said one or more bitcells BC1, BC2 and the data value WDV1, WDV2 within said one or more wordlines WL1, WL2.
In a still further alternative having a bitcell BC1 and a wordline WL1 for holding a wordline data value WDV1 and being connected to said bit cell BC1, said sense circuit SC1 may be operable to deliver one or other of said stored charge values CV1, CV2 to one or more of said one or more global bitlines A or B dependent upon the logical AND of the data value VA1, VA2 within said bitcell BC1 and the data value WDV1, WDV2 within said wordline WL1.
Cross-over connections 60A, 60B may be provided for allowing the output 40, 42 from respective local sense circuits SC1, SC2 to be applied to either of the Global Bitlines GBLA, GBLB, as shown in figure 2. The cross-over connections 60 cross-connect connections 30/36 and 32/34, which extend between outputs 40, 42 of respective local sense circuits SC1, SC2 and one or other of said Global Bitlines GBLA, GBLB, as best seen in Figure 2. Switches 50, 52, 54, 56 may be provided between respective pairs of connections 30/36 and 32/34 for this purpose. The switches may comprise transistors and may be of the NMOS or PMOS type depending upon functional requirements which are not part of this invention and so are not discussed further herein.
As highlighted in Figure 2 but also as shown in subsequent Figures, the above described arrangements may include two or more global bitline drivers (GBLD1, GBLD2) each being connected to each of said first and second global bitlines (A, B), and in such an arrangement the charge value on each global bitline will be the summation of the charge values delivered by each of the global bitline drivers (GBLD1, GBLD2) connected thereto.
Figure 2 is the implementation with two local bitlines per bit cell, and global bitline pairs, as would be typical in an SRAM implementation. Operation is similar to the arrangement of figure 1 except that the bitline sense operation is differential, and CA and CB are both connected, but which connects to which global bitline is data-dependent.
Figure 3 is a variant of that shown in figure 2 where there is a multiplexer between local bitlines and the local sense circuit. This sort of implementation is often preferred when using a very small bit cell as it makes the physical implementation of the local sense circuit easier. Figure 3 is substantially the same as the arrangement shown in figure 2 save for the addition of an optional Column Multiplexer 200. Such a multiplexer will only be necessary when multiple bit cell pairs BCP1, BCP2 are provided.
Figure 4 is a variant of the arrangement of figure 2 save for the connections within the charge storage and delivery device 20. In this arrangement a single stored charge value CV1 or CVA is deliverable via one or other of delivery line groups DLGA and DLGB to one or other of the Global Bitlines GBLA, GBLB. Connection is made via lines 130A to DLGA and 130B to DLGB. A single charge storage device CSD1 or capacitor is used instead of the two devices or capacitors (CA and CB). The circuit operation depends on reasonable matching between the CA and CB capacitors both within each local bitline group, and between groups. By sharing a single capacitor per group, rather than using separate capacitors, one source of variation between capacitors is removed. It should be noted that the arrangements of figures 3 and 4 could be combined in a single implementation (not shown).
Figure 5 illustrates a row decoder circuit as well known in the prior art. The row select circuit is basically an AND gate, so a row is selected when all the inputs to the gate are high (the chain of inverters is used to increase signal strength, as the word line usually has a high capacitance). The inputs to the AND gate are the predecoder outputs. The predecoder is a group of 2 or more (shown here as 3) circuits that each convert an n-bit binary value to a 1-of-2^n encoded bus (here shown with n=2, but other values are possible). By 1-of-2^n I mean a bus with 2^n wires, where at most 1 of the wires will have a logic 1 on it at any one time; the others will be a logic 0. It is not a requirement that all the predecoder subsections use the same value of n. In a memory with a hierarchical wordline architecture, some of the subsections of the predecode bus select a local bitline group, and the others select the row within a local bitline group.
The row driver circuit consists of a NAND gate that drives a chain of inverters (with an odd number of inverters in the chain so that the overall function is a logical AND). The inputs of the NAND gate connect to wires in the predecode bus, with each NAND gate connecting to a different combination of wires in the predecode bus. The predecode bus is subdivided into segments, and each segment is driven by the outputs of a decoder circuit that converts an n-bit binary input into a 1-of-2^n output. This output contains 2^n wires, but at most 1 of the wires will have a logic 1 value at any given time. The all-zero case is used to turn off all row decoder outputs when no access is taking place. Figure 5 shows the predecode bus driven by three 2-to-1-of-4 decoders, with one of the segments used as a "local group select" bus, and the other two segments used to identify a wordline within the selected group.
The row decoder has to be modified for use with the current invention, as:
* More than one wordline may be driven high in each cycle - a wordline in each local bitline group can be high in each cycle, and
* One or more data signals have to be combined with the row selecting logic.
Allowing more than one wordline to go high in a single cycle can be achieved by allowing more than one bit in the local group select bus to be high.
Figure 6 shows an option to allow a data signal to be combined with the row select. An additional NAND gate is added to each row driver and is connected to a data input. The data input is shared amongst all row drivers in a local bitline group. This arrangement may be used for the current invention. The row driver circuit is modified by ANDing the normal row select signal with an input data bit. The same data bit must be used for all rows within a local bitline group. The reason for the data bit being combined with the row select signal is explained elsewhere herein.
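A behavioural sketch of this gating follows (illustrative only; the signal names group_select, word_in_group and data_bits are assumptions, and this models logic levels rather than the transistor-level row driver of Figure 6):

    def decode_wordlines(group_select, word_in_group, data_bits, rows_per_group):
        # group_select / word_in_group: 1-of-n style lists of 0/1 values.
        # data_bits: one shared data bit per local bitline group.
        # A wordline goes high only if its group is selected, its row within
        # the group is selected, and the group's shared data bit is 1.
        wordlines = []
        for g, g_sel in enumerate(group_select):
            for r in range(rows_per_group):
                wordlines.append(g_sel & word_in_group[r] & data_bits[g])
        return wordlines

    # Two groups of two rows: both groups enabled, row 0 selected,
    # data bits 1 and 0 -> only group 0, row 0 goes high.
    assert decode_wordlines([1, 1], [1, 0], [1, 0], 2) == [1, 0, 0, 0]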
Figure 7 is an alternative way to modify the row decoder circuit. Since the same data bit is sent to all rows in a local bitline group, the data bit can be used in place of the group select bit.
Doing it this way means that there is no need to run extra independent data wires in the row decoder logic. Instead, a multiplexer selects whether to use the normal predecoder signals (if performing an ordinary read or write operation in the memory) or the data signals (if performing a compute operation).
Figure 8 is a top-level block diagram of the memory, showing row decoder, memory array, global bitlines and global sense amplifiers (GSA), and sum circuit. For illustration, this memory has four local memory groups 82, 84, 86, 88 but it will be appreciated that other numbers of such groups are possible, e.g. 8 or 16. Each local bitline group has two rows RA, RB and two wordlines 100, 102, 104, 106. A realistic implementation would probably have 32 or 64 rows per group.
The memory array is labelled to indicate where the data for the coefficients (i.e. the bk in Equation 1) is stored. The bk are here treated as 4-bit numbers, although the actual width could be different. The individual bits of a single coefficient are shown as being stored in adjacent global bitline columns. This is for convenience when wiring into the sum circuit but is not a requirement. The memory array is shown as storing 4 different sets of coefficients: bk, bk′, bk″, and bk‴. This allows the computing of four different dot products a · b in parallel, using the same value of the input vector a, but different values of the coefficient vector b.
As shown in figure 8, the memory unit 10 may be formed into a plurality of rows of memory groups 82, 84, 86, 88 (LBLG0, LBLG1, LBLG2, LBLG3), each memory group comprising a plurality of memory cells 90, 92, 94, 96 arranged in two rows RA, RB and each being served by a respective pair of wordlines 100, 102, 104, 106. Respective memory cells 90, 92, 94, 96 in each row are linked indirectly to a respective global bitline GBLA1-1-n, GBLB2-1-n such as to deliver an output thereto. This linking runs from bit cell to local sense amplifier to charge storage and delivery circuit to global bitline. A Row Decoder 120 having an input 122 and an output 124 is also shown, and said output is connected for activating each of said respective pairs of wordlines 100, 102, 104, 106. A Pre-Decoder 130 may also be provided and is operably connected to the input 122 of the Row Decoder 120 and connected to a source of one or both of Serial Data 140 or Coefficient Address information 150. Each pair of respective global bitlines GBLA1-1-n, GBLB2-1-n is connected to a respective Global Sense Amplifier GSA1-n for amplification of an output therefrom, whilst one or more Summing Circuits 160, 162, 166, 168 may be provided for summing two or more outputs from the Global Sense Amplifiers GSA1-n.
A switched-bitline memory has a 2-stage read access scheme:
* Bitcells are connected to local bitlines. When a wordline is activated it connects a cell to a pair of local bitlines, and selectively discharges one of the local pair based on the cell contents. A local sense circuit detects the voltage swing, in order to identify whether the cell contains a 0 or a 1.
* Based on the result of the local sense, a precharged capacitor is connected (or not connected) to a global bitline, and the capacitive coupling between this capacitor and the global bitline's parasitic capacitance determines the final voltage on the global bitline.
It is only the local bitline voltage that can potentially disturb the contents of the bitcell. The global bitline cannot affect the bitcell. Therefore, it is possible to allow more than one local bitline to affect the global bitline without risking disturbing the bitcell contents.
In-memory computation in a switched-bitline memory
In what follows, we make one modification to the normal switched-bitline access scheme: following the local bitline sense, one of the two Charge Storage Devices CSD1, CSD2 (capacitors) is connected to the global bitline GBL. The modified circuit is shown in figure 1. The Charge Storage Devices CSD1, CSD2 (the two capacitors CA and CB) are of equal capacitance, but are pre-charged to different voltages. The capacitor to be connected is chosen based on the result of the local bitline sense. This means that the total capacitance connected to the global bitline is independent of the data, which simplifies what follows.
The final voltage on the global bitline, V, is calculated using conservation of charge:
Equation 4
Q = V_g C_g + \sum_{i=0}^{n-1} V_i C_i = V \left( C_g + \sum_{i=0}^{n-1} C_i \right)
Equation 5
V = \left( V_g C_g + \sum_{i=0}^{n-1} V_i C_i \right) / \left( C_g + \sum_{i=0}^{n-1} C_i \right)
where V_g is the initial voltage of the global bitline, and C_g its capacitance. The denominator of the right-hand side of this equation is a constant, so the equation can be simplified:
Equation 6
\alpha = 1 / \left( C_g + \sum_{i=0}^{n-1} C_i \right)
Equation 7
V = \alpha \left( V_g C_g + \sum_{i=0}^{n-1} V_i C_i \right)
Since a switched-bitline SRAM usually uses a differential global bitline scheme, as shown in figures 2, 3, and 4, the voltage on the complementary bitline can be written as:
Equation 8
\bar{V} = \alpha \left( V_g C_g + \sum_{i=0}^{n-1} \bar{V}_i C_i \right)
and the global differential voltage is:
Equation 9
\Delta V = V - \bar{V} = \alpha \sum_{i=0}^{n-1} (V_i - \bar{V}_i) C_i = \alpha \sum_{i=0}^{n-1} \Delta V_i C_i, \quad \text{with } \Delta V_i = V_i - \bar{V}_i
Although this looks like the dot product equation (Equation 1), it does not have full flexibility, since the C_i are part of the chip hardware, and therefore are not under program control. However, the voltages \Delta V_i are themselves a combination of two factors - they are the precharge voltage differences of the driving capacitors (C_i) that are connected to the global bitlines. These voltages are themselves set by the results of a memory access that sets the value of the local bitlines, and therefore depend on the combination of the wordline activity and the stored bits in the memory.
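A minimal numerical sketch of the charge sharing in Equations 5 to 9 (illustrative only; the capacitance and precharge values are arbitrary assumptions in arbitrary units):

    def global_bitline_voltage(v_g, c_g, v_list, c_list):
        # Equation 5: charge sharing between the global bitline (v_g, c_g)
        # and the driving capacitors (v_list, c_list) connected to it.
        q = v_g * c_g + sum(v * c for v, c in zip(v_list, c_list))
        return q / (c_g + sum(c_list))

    c_g, v_g = 8.0, 0.5                  # global bitline capacitance and precharge
    c_i = [1.0, 1.0, 1.0, 1.0]           # equal driving capacitors
    v_true = [1.0, 0.0, 1.0, 1.0]        # precharge of the capacitor switched to the true GBL
    v_comp = [0.0, 1.0, 0.0, 0.0]        # precharge of the capacitor switched to the complement
    dv = (global_bitline_voltage(v_g, c_g, v_true, c_i)
          - global_bitline_voltage(v_g, c_g, v_comp, c_i))   # Equation 9
    print(dv)   # proportional to (number of 1s sensed) - (number of 0s sensed)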
There are three cases to consider: 1. A wordline goes high and selects a memory cell that is storing a 1. The local bitline remains high, and the complementary local bitline is pulled low.
2. A wordline goes high and selects a memory cell that is storing a 0. The local bitline is pulled low, and the complementary local bitline remains high.
3. No wordline goes high, so no memory cell is selected, and both bitline and complementary bitline remain high.
In the current invention, we choose that cases 2 and 3 will be treated in the same way (this can be achieved by introducing a skew into the local sense amplifier so that the case of equal inputs is resolved as reading a 0). This means that there are actually two distinct cases:
1. Wordline high and cell storing a 1 - local sense amplifier detects a 1
2. Wordline low or cell storing a 0 - local sense amplifier detects a 0
This means that the local sense amplifier result is a logical AND of the wordline state and the memory cell contents. The logical AND is also the same as a 1-bit multiply of the same two values.
If the local sense amplifier detects a 1, this results in a positive \Delta V_i = v_i being applied to the global bitline/complementary bitline pair, and detecting a 0 results in a negative \Delta V_i = -v_i. Then:
Equation 10
\Delta V_i = (2 W_i B_i - 1) v_i
with W_i being the wordline state, and B_i the stored bit state. Substituting this back into the global bitline voltage difference (Equation 9) gives:
Equation 11
\Delta V = \alpha \sum_{i=0}^{n-1} (2 W_i B_i - 1) v_i C_i = \alpha \left( \sum_{i=0}^{n-1} 2 W_i B_i v_i C_i \right) - \alpha \sum_{i=0}^{n-1} v_i C_i
If the v_i and C_i are constant (i.e. independent of i) then this becomes:
Equation 12
\Delta V = 2 \alpha v C \left( \sum_{i=0}^{n-1} W_i B_i \right) - n \alpha v C
The second part of this equation is a renormalisation constant, while the first part looks like the innermost summation in Equation 3 - a sum of bitwise multiplications of equal weights in the overall calculation. The individual terms in the sum (i.e. the W_i B_i) are each either 0 or 1, and therefore the sum can be one of the n+1 integers between 0 and n inclusive. Therefore, the result of this sum can be computed by using a sense amplifier able to distinguish between these n+1 different states in order to sense the global bitline differential. The n+1 different states are evenly distributed about 0, and therefore in the case of n being even one of the states will have \Delta V close to 0, and will therefore be subject to error due to sense amplifier input differential voltage. This problem can be removed as suggested previously by using a sense amplifier with a built-in skew so that values close to 0 are always resolved in the same way. The case of n being odd does not have the same issue, and there is a special case where n is both odd and 1 less than a power of 2 (i.e. n = 2^p - 1) - in this case the sense result will fit exactly into a p-bit binary number, which may simplify the peripheral logic described below.
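The n+1 discrete differential levels implied by Equation 12 can be illustrated with a short sketch (values assumed; the choice of C_g equal to the total driving capacitance is an arbitrary example, not a requirement of Equation 12):

    def delta_v(wordlines, bits, v=1.0, c=1.0, c_g=4.0):
        # Equation 12: dV = 2*alpha*v*C*sum(Wi*Bi) - n*alpha*v*C
        n = len(wordlines)
        alpha = 1.0 / (c_g + n * c)              # Equation 6
        s = sum(w & b for w, b in zip(wordlines, bits))
        return 2 * alpha * v * c * s - n * alpha * v * c

    n = 4
    levels = [round(delta_v([1] * k + [0] * (n - k), [1] * n), 6) for k in range(n + 1)]
    print(levels)   # n+1 evenly spaced levels, symmetric about 0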
Summation of the partial results
It then remains to combine these partial results to give an overall total. This will be a 2-part process, as shown in Equation 3:
1. Summation over the bits of the coefficient
2. Summation over the bits of the input data
The coefficients are the bits stored in the memory, and so the summation over the coefficient bits corresponds to adding the results of the multi-bit reads from multiple columns. This can be done with a group of adders located after the global sense amplifiers.
Figure 8 shows a possible data organisation with different bits of the coefficients (bk) stored in different columns within a memory array, and figure 9 shows a possible implementation of a sum circuit, the first part of which is the group of adders that produce a weighted sum of the outputs from the global sense amplifiers. The power-of-two weights of the different bits (i.e. the 2^j terms in the sum in Equation 3) are implemented by shifting the bits from different sense amplifiers before adding them, as shown in figure 9.
The bits of the input data correspond to the wordline values, and so a sum over these values corresponds to summing the results of multiple separate accesses, with different bits of the input(s) applied to the wordlines in different cycles. This is a serial operation, with results being collected in an accumulator. The power-of-two weights of the different bits (i.e. the 2^i terms in the sum in Equation 3) can be implemented by including a shift in the loop from accumulator output to input. Figure 9 includes an adder, register, and shifter to implement this accumulator. The register should be reset at the beginning of each multiplication.
Figure 10 shows an alternative implementation of the sum circuit. In this version the shifter is moved to the accumulator input rather than the accumulator feedback loop. The shifter is also changed to allow shifting by different numbers of bits, rather than being a fixed shift. The advantage of this implementation is that it allows the results of multiple separate multiply-add operations to be combined in the accumulator.
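A behavioural sketch of this two-part summation (column adders followed by a serial accumulation over input bits); the 4-bit word widths and two-element vectors are assumptions, and for simplicity the shift is applied at the accumulator input, broadly in the manner of Figure 10:

    def weighted_column_sum(sense_results):
        # First part: weight each global sense amplifier output by 2**j,
        # where j is the coefficient bit that its column stores.
        return sum(r << j for j, r in enumerate(sense_results))

    def accumulate_serial(cycle_sums):
        # Second part: accumulate over input-data bits, LSB first,
        # applying the 2**i weighting of Equation 3.
        total = 0
        for i, s in enumerate(cycle_sums):
            total += s << i
        return total

    a, b = [2, 7], [3, 5]                # 4-bit inputs and coefficients (assumed)
    cycle_sums = []
    for i in range(4):                   # one cycle per input-data bit
        sense = [sum(((ak >> i) & 1) * ((bk >> j) & 1) for ak, bk in zip(a, b))
                 for j in range(4)]      # one column per coefficient bit
        cycle_sums.append(weighted_column_sum(sense))
    assert accumulate_serial(cycle_sums) == sum(x * y for x, y in zip(a, b))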
Alternative summation of the partial results
The preceding discussion assumes that all accesses are treated as independent, with the global bitlines being precharged (i.e. returned to V_g) after each individual access. However, if this is not the case - if the global bitlines are not precharged between accesses, but retain the voltage from the previous access - then:
* The bitline voltage after the first access is given by Equation 7:
V = \alpha \left( V_g C_g + \sum_{i=0}^{n-1} V_i C_i \right)
* The voltage after the second access is obtained by substituting V for V_g back into the same equation:
V' = \alpha \left( V C_g + \sum_{i=0}^{n-1} V'_i C_i \right) = \alpha \left( \alpha \left( V_g C_g + \sum_{i=0}^{n-1} V_i C_i \right) C_g + \sum_{i=0}^{n-1} V'_i C_i \right)
V' = (\alpha C_g)^2 V_g + \alpha C_g \alpha \sum_{i=0}^{n-1} V_i C_i + \alpha \sum_{i=0}^{n-1} V'_i C_i
(where dashes - V' - indicate voltages in the second cycle). The same applies to the complementary bitline, and so the differential voltage is:
Equation 13
\Delta V' = \alpha C_g \alpha \sum_{i=0}^{n-1} \Delta V_i C_i + \alpha \sum_{i=0}^{n-1} \Delta V'_i C_i
Now, consider the special case \alpha C_g = 1/2, or (from Equation 6) C_g = \sum_{i=0}^{n-1} C_i; then:
Equation 14
\Delta V' = \frac{1}{2} \alpha \sum_{i=0}^{n-1} \Delta V_i C_i + \alpha \sum_{i=0}^{n-1} \Delta V'_i C_i
If we then make the same substitutions as in Equation 12, then this becomes:
\Delta V' = 2 \alpha v C \left( \frac{1}{2} \sum_{i=0}^{n-1} W_i B_i + \sum_{i=0}^{n-1} W'_i B'_i \right) - \frac{3}{2} \alpha n v C
Since we are using the special case \alpha C_g = 1/2, so that C_g = \sum_{i=0}^{n-1} C_i = n C and \alpha n C = 1/2, then:
Equation 15
\Delta V' = \frac{v}{2n} \left( \sum_{i=0}^{n-1} W_i B_i + 2 \sum_{i=0}^{n-1} W'_i B'_i \right) - \frac{3v}{4}
Compared to Equation 12:
* The renormalisation constant has changed to reflect the fact that the bracketed term has become more complex, and can take on more values (it can now take any of the 3n+1 values between 0 and 3n inclusive)
* The bracketed term has become a power-of-2 weighted sum of the accesses in successive cycles. This means that the capacitive add on the global bitline can replace some or all of the accumulator described in section 0, simply as a result of retaining the charge on the global bitlines between accesses rather than pre-charging to V_g.
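The power-of-2 weighting across successive cycles in Equation 15 can be checked numerically; this Python sketch is illustrative only and assumes the special case alpha*Cg = 1/2 (i.e. Cg = n*C) with equal unit capacitors:

    def compute_cycle(dv_prev, products, v=1.0, c=1.0):
        # One access without an intervening precharge: the previous differential
        # voltage is halved (alpha*Cg = 1/2) and the new bitwise products added.
        n = len(products)
        alpha = 1.0 / (2 * n * c)                # Cg = n*C, so alpha = 1/(2*n*C)
        dv_new = sum(2 * alpha * v * c * p - alpha * v * c for p in products)
        return 0.5 * dv_prev + dv_new

    cycle1 = [1, 0, 1, 1]                        # Wi*Bi products, first cycle
    cycle2 = [0, 1, 1, 0]                        # W'i*B'i products, second cycle
    dv = compute_cycle(compute_cycle(0.0, cycle1), cycle2)

    # Equation 15: dV' = (v/(2*n)) * (sum(cycle1) + 2*sum(cycle2)) - 3*v/4
    expected = (1.0 / 8) * (sum(cycle1) + 2 * sum(cycle2)) - 0.75
    assert abs(dv - expected) < 1e-12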
Figure 7 shows an alternative implementation. This makes use of the fact that the local group select signal is also common to all rows in a local bitline group. The data input is therefore introduced in place of the local group select signal. Figure 7 shows a multiplexer between the predecoder logic that normally drives the local group select bus and the data inputs. The predecoder is used as normal when performing a read or write access to the memory, but the data inputs are used instead when performing a compute operation. The advantage of figure 7 is that the necessary changes can be localised to the predecoder, and there is no need to modify the row driver circuit.
The other bits of the predecode bus (those that select a row within a local group) are still used to select the row that is combined with the input data. Selecting the row determines which coefficients stored within the memory array are used in the multiplication. Therefore, the section of the input address bus that maps to the word-in-group bus(es) is labelled as the coefficient address input.
Operation of the circuit is as follows:
* The set of stored coefficients to be used is selected via the "coefficient address" input to the row decoder. E.g. when set to 0 this selects b0, b1, b2, and b3, in local bitline groups 0, 1, 2, and 3 respectively.
* Serial input data arrives at the 'serial input'. Four bits arrive in each cycle - one from each of four input words - and each bit is sent to one of the local bitline groups. If using the approach of dividing by 2 on the global bitlines between cycles (Section 0), then the data must be sent LSB-first.
* In each cycle, using the operation as described in section 0:
  o Data is selectively read onto the local bitlines
  o Charge is summed on the global bitlines
  o The global bitlines are sensed, and the result passed to the sum circuits
  o Each sum circuit adds together its inputs from the global sense amplifiers (with appropriate shifting to make sure the inputs are correctly aligned) and updates the total to date with the new value
  o Local and global bitlines are precharged ready for the next cycle
* If using the operation as described in section 0:
  o In each cycle:
    * Data is selectively read onto the local bitlines
    * Charge is summed on the global bitlines
    * Local bitlines are precharged ready for the next cycle
  o In some cycles:
    * The global bitlines are sensed, and the result passed to the sum circuits
    * Each sum circuit adds together its inputs from the global sense amplifiers (with appropriate shifting to make sure the inputs are correctly aligned) and updates the total to date with the new value
    * Global bitlines are precharged ready for the next cycle
  o i.e. global bitline precharge follows global bitline sensing, and local bitline precharge follows local bitline sensing.
* Once all input bits from the first set of input words have been processed, the coefficient address can be updated and a new set of input data can be applied.

Claims (11)

  1. A memory unit having one or more global bitlines (A, B); and one or more global bitline drivers (GBLD1, GBLD2); each of said one or more global bitline drivers (GBLD1, GBLD2) comprising: a) one or more bit cells (BC1, BC2) for storing first or second data values (VA1 or VA2); b) a sense circuit (SC1) connected to one or more of said one or more bit cells (BC1, BC2); c) a charge storage and delivery device (XX) for storing first and second charge values (CV1, CV2) and connected to said global bitline (A, B) to deliver a stored charge value thereto; wherein said sense circuit (SC1) is operable to deliver one or other of said stored charge values (CV1, CV2) to said global bitlines (A, B) dependent upon the data value (VA1, VA2) within said one or more bitcells (BC1, BC2).
  2. A memory unit as claimed in claim 1 having two or more bitcells (BC1, BC2) and further including two or more wordlines (WL1, WL2) for holding a wordline data value (WDV1, WDV2) and being connected to one or other of said one or more bit cells (BC1, BC2) and wherein said sense circuit (SC1) is operable to deliver one or other of said stored charge values (CV1, CV2) to one or more of said one or more global bitlines (A, B) dependent upon the data value (VA1, VA2) within said one or more bitcells (BC1, BC2) and the data value (WDV1, WDV2) within said one or more wordlines (WL1, WL2).
  3. A memory unit as claimed in claim 1 and having first and second global bitlines (A, B) and wherein said sense circuit (SC1) is operable to deliver a first of said stored charge values (CV1) to a first of said global bitlines (A) and to deliver the second of said stored charge values (CV2) to the second of said global bitlines (B) dependent upon the data value (VA1, VA2) within said one or more bitcells (BC1, BC2).
  4. A memory unit as claimed in claim 2 and having first and second global bitlines (A, B) and wherein said sense circuit (SC1) is operable to deliver a first of said stored charge values (CV1) to a first of said global bitlines (A) and to deliver the second of said stored charge values (CV2) to the second of said global bitlines (B) dependent upon the data value (VA1, VA2) within said one or more bitcells (BC1, BC2) and the data value (WDV1, WDV2) within said one or more wordlines (WL1, WL2).
  5. A memory unit as claimed in claim 1 having a bitcell (BC1) and a wordline (WL1) for holding a wordline data value (WDV1) and being connected to said bit cell (BC1) and wherein said sense circuit (SC1) is operable to deliver one or other of said stored charge values (CV1, CV2) to one or more of said one or more global bitlines (A, B) dependent upon the logical AND of the data value (VA1, VA2) within said bitcell (BC1) and the data value (WDV1, WDV2) within said wordline (WL1).
  6. A memory unit as claimed in any one of claims 3 to 5 and including two or more global bitline drivers (GBLD1, GBLD2) each being connected to each of said first and second global bitlines (A, B) and wherein the charge value on each global bitline is the summation of the charge values delivered by each of the global bitline drivers (GBLD1, GBLD2) connected thereto.
  7. A memory unit (10) as claimed in any one of claims 1 to 6 and having a plurality of rows of memory groups (82, 84, 86, 88) (LBLG0, LBLG1, LBLG2, LBLG3), each memory group comprising a plurality of memory cells (90, 92, 94, 96) arranged in two or more rows (RA, RB) and each being served by a respective pair of wordlines (100, 102, 104, 106) and a respective memory cell (90, 92, 94, 96) in each row being connected to a respective global bitline (GBLA1-1-n, GBLB2-1-n).
  8. A memory unit (10) as claimed in claim 7 and further including a Row Decoder (120) having an input (122) and an output (124), said output being connected for activating each of said respective pairs of wordlines (100, 102, 104, 106).
  9. A memory unit (10) as claimed in claim 8 and further including a Pre-Decoder (130) operably connected to the input (122) of the Row Decoder (120) and connected to a source of Serial Data (140) or Coefficient Address information (150).
  10. A memory unit (10) as claimed in any one of claims 7 to 9 and wherein each pair of respective global bitlines (GBLA1-1-n, GBLB2-1-n) are each connected to a respective Global Sense Amplifier (GSA1-n) for amplification of an output therefrom.
  11. A memory unit (10) as claimed in claim 10 and including one or more Summing Circuits (160, 162, 166, 168) for summing two or more outputs from the Global Sense Amplifiers (GSA1-n).
GB1905465.9A 2019-04-17 2019-04-17 In memory computation Active GB2583121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1905465.9A GB2583121B (en) 2019-04-17 2019-04-17 In memory computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1905465.9A GB2583121B (en) 2019-04-17 2019-04-17 In memory computation

Publications (3)

Publication Number Publication Date
GB201905465D0 GB201905465D0 (en) 2019-05-29
GB2583121A true GB2583121A (en) 2020-10-21
GB2583121B GB2583121B (en) 2021-09-08

Family

ID=66809826

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1905465.9A Active GB2583121B (en) 2019-04-17 2019-04-17 In memory computation

Country Status (1)

Country Link
GB (1) GB2583121B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11538509B2 (en) 2021-03-17 2022-12-27 Qualcomm Incorporated Compute-in-memory with ternary activation
US11631455B2 (en) 2021-01-19 2023-04-18 Qualcomm Incorporated Compute-in-memory bitcell with capacitively-coupled write operation

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486857B (en) * 2023-05-17 2024-04-02 北京大学 In-memory computing circuit based on charge redistribution

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130148415A1 (en) * 2011-12-09 2013-06-13 LeeLean Shu Systems and methods of sectioned bit line memory arrays, including hierarchical and/or other features
GB2512844A (en) * 2013-04-08 2014-10-15 Surecore Ltd Reduced power memory unit
US20190080231A1 (en) * 2017-09-08 2019-03-14 Analog Devices, Inc. Analog switched-capacitor neural network
US20190102359A1 (en) * 2018-09-28 2019-04-04 Intel Corporation Binary, ternary and bit serial compute-in-memory circuits


Also Published As

Publication number Publication date
GB201905465D0 (en) 2019-05-29
GB2583121B (en) 2021-09-08

Similar Documents

Publication Publication Date Title
EP3252774B1 (en) Memory circuit suitable for performing computing operations
KR101954542B1 (en) Performing logical operations using sensing circuitry
CN110574108B (en) Accessing data in a memory
CN106471582B (en) The device and method of logical operation are executed for using sensing circuit
US7505352B2 (en) Parallel operational processing device
US6216205B1 (en) Methods of controlling memory buffers having tri-port cache arrays therein
CN112581996A (en) Time domain memory computing array structure based on magnetic random access memory
CN107430874A (en) device and method for data movement
GB2583121A (en) In memory computation
US4930104A (en) Content-addressed memory
JP2836596B2 (en) Associative memory
KR20210059815A (en) Neuromorphic device based on memory
US10825512B1 (en) Memory reads of weight values
Choi et al. Content addressable memory based binarized neural network accelerator using time-domain signal processing
US20220391128A1 (en) Techniques to repurpose static random access memory rows to store a look-up-table for processor-in-memory operations
KR20220069966A (en) Multiply-Accumulate Units
US20110013467A1 (en) System and Method for Reading Memory
TWI733282B (en) High-density high-bandwidth static random access memory (sram) with phase shifted sequential read
TWI483250B (en) Hierarchical dram sensing
EP0440191B1 (en) Multiport RAM and information processing unit
Koo et al. Area-efficient transposable 6T SRAM for fast online learning in neuromorphic processors
CN116670763A (en) In-memory computation bit cell with capacitively coupled write operation
US7330934B2 (en) Cache memory with reduced power and increased memory bandwidth
JPS6196591A (en) Semiconductor memory device
US5524226A (en) Register file system for microcomputer including a decoding system for concurrently activating source and destination word lines