CN117409830A - In-memory computing memory device and solid state drive module - Google Patents


Publication number
CN117409830A
Authority
CN
China
Prior art keywords
memory
memory cells
bit line
coupled
global bit
Prior art date
Legal status
Pending
Application number
CN202310115295.6A
Other languages
Chinese (zh)
Inventor
吕函庭
徐子轩
叶腾豪
谢志昌
洪俊雄
李永骏
Current Assignee
Macronix International Co Ltd
Original Assignee
Macronix International Co Ltd
Priority date
Filing date
Publication date
Priority claimed from US 18/161,900 (US20240028211A1)
Application filed by Macronix International Co Ltd filed Critical Macronix International Co Ltd
Publication of CN117409830A

Classifications

    • G: PHYSICS
    • G11: INFORMATION STORAGE
    • G11C: STATIC STORES
    • G11C 8/00: Arrangements for selecting an address in a digital store
    • G11C 8/14: Word line organisation; word line lay-out
    • G: PHYSICS
    • G11: INFORMATION STORAGE
    • G11C: STATIC STORES
    • G11C 8/00: Arrangements for selecting an address in a digital store
    • G11C 8/08: Word line control circuits, e.g. drivers, boosters, pull-up circuits, pull-down circuits, precharging circuits, for word lines

Landscapes

  • Engineering & Computer Science (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Read Only Memory (AREA)

Abstract

The present disclosure provides a memory device for in-memory computing and a solid state drive module, applicable to a 3D AND flash memory, the memory device including a memory array, a plurality of input word line pairs, and a signal processing circuit. The memory array has a plurality of first pairs of memory cells and a plurality of second pairs of memory cells, each first pair of memory cells including a first group of memory cells coupled to a first global bit line and a second group of memory cells coupled to a second global bit line, each second pair of memory cells including a third group of memory cells coupled to the first global bit line and a fourth group of memory cells coupled to the second global bit line. Each input word line pair includes first and second input word lines, the first input word line being coupled to the first and second groups of memory cells, the second input word line being coupled to the third and fourth groups of memory cells. The signal processing circuit is coupled to the first and second global bit lines.

Description

In-memory computing memory device and solid state drive module
Technical Field
The present disclosure relates to a memory device, and more particularly, to an in-memory computing memory device and a solid state drive module.
Background
Vector matrix multiplication (VMM) is well suited to "memory-centric computation" in deep neural networks (DNN), cosine similarity, and simulated annealing. High-density, high-bandwidth VMM accelerators are well suited to complementing von Neumann digital approaches.
There are several problems with performing vector matrix multiplication using in-memory operations. First, a VMM typically involves both positive (+) and negative (-) inputs and weight values, so implementing an analog circuit that handles positive/negative polarity is a challenging topic. In addition, the inputs and weight values tend to require multi-bit resolution (32-bit floating point in software, though this can be reduced to 4 bits in edge DNNs, and to even lower resolution, e.g., 2 to 3 bits, in similarity search).
Therefore, developing a VMM accelerator is a major topic in the art.
Disclosure of Invention
Based on the above, the present disclosure proposes a VMM accelerator architecture using a 3D AND-type NOR flash memory.
According to one embodiment of the present disclosure, a memory device for in-memory computing is provided, including a memory array, a plurality of input word line pairs, and a signal processing circuit. The memory array has a plurality of first pairs of memory cells and a plurality of second pairs of memory cells, wherein each of the plurality of first pairs of memory cells includes a first group of memory cells coupled to a first global bit line and a second group of memory cells coupled to a second global bit line, and each of the plurality of second pairs of memory cells includes a third group of memory cells coupled to the first global bit line and a fourth group of memory cells coupled to the second global bit line. Each of the plurality of input word line pairs includes a first input word line coupled to the first set of memory cells and the second set of memory cells and a second input word line coupled to the third set of memory cells and the fourth set of memory cells. The signal processing circuit is coupled to the first global bit line and the second global bit line.
Based on the above, according to embodiments of the present disclosure, the 3D AND-type NOR flash memory is used to construct an in-memory computing architecture for the memory device. Data therefore need not be read out of the memory and processed by an external ALU, nor repeatedly transferred to an external storage device for updating. At the same time, the architecture of the present disclosure achieves high-capacity, high-speed, and efficient in-memory computing. Thus, VMM operations, IMS operations, and the like, which are common in big-data and AI applications such as image processing, face recognition, and deep neural networks, may be implemented by the architecture of the present disclosure.
Drawings
FIG. 1 is a schematic diagram illustrating a 3D AND-type NOR flash memory device according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an equivalent circuit of a 3D AND-type NOR flash memory device according to an embodiment of the disclosure;
FIG. 3A is an illustration of a 3D AND-type NOR flash memory device performing vector matrix multiplication operations, according to an embodiment of the present disclosure;
FIG. 3B is an illustration of a 3D AND-type NOR flash memory device performing vector matrix multiplication operations, according to another embodiment of the present disclosure;
FIG. 3C shows the gate voltage versus read current Icell (left), the cell read current after trimming versus standard deviation σ (middle), and the RTN distribution versus bit count (right);
FIG. 3D is a schematic diagram showing a read current Icell distribution of the memory cell;
FIG. 4 is a schematic diagram of a digital domain architecture for generating 4-input 4-weight (4I4W) operations;
FIG. 5 is a schematic diagram of a solid state drive module according to an embodiment of the disclosure;
FIG. 6A illustrates an architecture and operation of a 3D AND-type NOR flash memory for cosine similarity calculation according to an embodiment of the present disclosure; and
FIG. 6B is a schematic diagram showing the distribution of the read current Icell of the memory cell under the architecture of FIG. 6A.
Reference numerals illustrate:
10: a laminated structure;
12. 14: a conductive post;
16: an isolation structure;
18: a hollow channel column;
20: a gate layer;
100. 200: a memory device;
110. 111: laminating;
150: a sense amplifier comparator;
211 to 218: first to fourth groups of memory cells;
220: inputting a word line pair;
250: a differential analog-to-digital converter;
300: a memory device;
301a, 301b, 301c, 301d: a memory array;
302a, 302b, 302c, 302d: an X decoder;
303a, 303b, 303c, 303d: an AD converter;
350: a solid state drive module;
352: a controller chip;
354: a general matrix multiplication chip;
356: an interface;
400: a memory device;
411 to 414: first to fourth memory cells;
420: inputting a word line pair;
450: a differential sense amplifier;
452: a comparator;
454: a reference current generator;
460: a control circuit;
BL1, BL8, BL9, BL16: a bit line;
SL1, SL8, SL9, SL16: a source line;
LBL1, LBL8, LBL9, LBL16: a local bit line;
LSL1, LSL8, LSL9, LSL16: a local source line;
CSL: sharing a source line;
SLT1, SLT8, SLT9, SLT16: a source line transistor;
BLT1, BLT8, BLT9, BLT16: a bit line transistor;
WL: a word line;
input_1: a first input word line;
input_1b: a second input word line;
C: a memory cell;
GBL (N): a first global bit line;
GBLB (N): a second global bit line;
Iref: a reference current.
Detailed Description
The present disclosure relates to an architecture for in-memory computing. With this architecture, data stored in the memory need not be read out and transferred to an external arithmetic logic unit (ALU) for computation. The read current (Icell) is obtained directly from the weight value stored in a memory cell and the voltage input on its word line. By accumulating the read currents, operations such as vector matrix multiplication (VMM), cosine similarity, or in-memory search (IMS) can be performed directly. The 3D AND-type NOR flash memory is an architecture well suited to such computing in memory (CIM).
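The accumulation described above can be sketched as follows. This is an illustrative software model, not the patent's circuit: `inputs` stands in for word line voltages and `weights` for stored cell conductances, and both names are assumptions for illustration.

```python
# Minimal sketch of VMM as accumulation of per-cell read currents.
def vmm_row(inputs, weights):
    # Each product models one cell's read current Icell; the sum models
    # the total current collected on one global bit line.
    return sum(v * g for v, g in zip(inputs, weights))

def vmm(inputs, weight_matrix):
    # One accumulated current per output column of y = x * W.
    return [vmm_row(inputs, column) for column in zip(*weight_matrix)]
```

In this model, one analog current summation replaces the multiply-accumulate loop that an external ALU would otherwise execute.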
Fig. 1 is a schematic diagram illustrating the structure of a 3D AND-type NOR flash memory device according to an embodiment of the present disclosure. The 3D AND-type NOR flash memory device may include a plurality of stacked structures 10 as shown in fig. 1. Each stacked structure 10 extends in the vertical direction (Z direction) and contains multiple gate layers 20, and each gate layer 20 may further be coupled to a conductor layer serving as a word line (not shown). The stacked structure 10 includes hollow channel pillars 18 extending along the vertical direction Z, and an ONO layer 22 is formed between the gate layers 20 and the hollow channel pillars 18. Formed within each hollow channel pillar 18 are two conductive pillars 12, 14 extending in the vertical direction Z, which serve as the source and drain of the memory cells. An isolation structure 16 extending along the vertical direction Z isolates the two conductive pillars 12, 14 from each other.
The stacked structure 10 may be, for example, a 32-layer structure, which can easily provide billions of memory cells in a small die size and can therefore support a large number of CIM operations. In other embodiments, the stacked structure 10 may have 64 or more layers.
Fig. 2 is an equivalent circuit schematic diagram of a 3D AND-type NOR flash memory device according to an embodiment of the present disclosure. As shown in fig. 2, the 3D AND-type NOR flash memory device 100 is arranged in stacked structures, such as a stack 110, a stack 111, and so on. Each stack comprises a plurality of memory cells C stacked together. For example, the stack 110 includes a plurality of local bit lines LBL1 to LBL16 and a plurality of local source lines LSL1 to LSL16. Each local bit line extends vertically and is connected to a first end (source/drain end) of each memory cell, and each local bit line of each stack (e.g., 110, 111) is coupled to a corresponding bit line BL1 to BL16, as illustrated by the bit lines BL1, BL8, BL9, BL16, etc. in fig. 2. Likewise, each local source line LSL1 to LSL16 extends vertically and is connected to a second end (the other source/drain end) of each memory cell, and each local source line of each stack (e.g., 110, 111) is coupled to a corresponding source line SL1 to SL16, such as the source lines SL1, SL8, SL9, SL16 illustrated in fig. 2.
In addition, one set of bit lines BL1, BL8, etc. is coupled to the first global bit line GBL (N) via bit line transistors BLT1, BLT8, etc., respectively; that is, two first drain side conductive strings (BL1, BL8) couple the memory cells to the first global bit line GBL (N), as illustrated in fig. 2. The other set of bit lines BL9, BL16, etc. is likewise coupled to the second global bit line GBLB (N) via bit line transistors BLT9, BLT16, etc., respectively; that is, two second drain side conductive strings (BL9, BL16) couple the memory cells to the second global bit line GBLB (N). In addition, the source lines SL1, SL8, SL9, SL16, etc. are coupled to the common source line CSL via source line transistors SLT1, SLT8, SLT9, SLT16, etc., respectively.
In addition, the control terminals (gates) of the memory cells C in the same layer of each stack are coupled to the same word line WL. As an example, there may be 4K word lines WL, organized into 128 sectors. The first global bit line GBL (N) and the second global bit line GBLB (N) are coupled to the sense amplifier comparator 150. In the normal read mode, the sense amplifier comparator 150 serves to sense the read current Icell flowing through the selected memory cell C.
In the normal read mode, assuming that the memory cell C circled in fig. 2 is to be read, a read voltage Vread, for example Vread = 7V, is applied to the word line WL corresponding to the memory cell C (the selected word line), while a non-select voltage, e.g., 0V, is applied to the unselected word lines corresponding to the unselected memory cells C. In addition, the bit line transistor BLT1 is turned on, and the other bit line transistors BLT8, BLT9, BLT16, etc. are turned off. At the same time, the source line transistor SLT1 is turned on, so that the source line SL1 is coupled to the common source line CSL (to which, e.g., 0V is applied), and the other source line transistors SLT8, SLT9, SLT16 are turned off. The first global bit line GBL (N) and the second global bit line GBLB (N) are applied with, for example, 1.2V. In this manner, the read current Icell of the selected memory cell is transferred to the sense amplifier comparator 150 and can be sensed via the first global bit line GBL (N), which now acts as the read path, while the second global bit line GBLB (N) acts as a capacitance matching path.
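The normal-read bias conditions above can be collected into a small table. The voltage values come from the text; the dictionary keys and function name are illustrative, not the patent's terminology.

```python
# Hypothetical bias table for the normal read of the circled cell in fig. 2.
READ_BIAS = {
    "selected_WL": 7.0,    # Vread applied to the selected word line
    "unselected_WL": 0.0,  # non-select voltage on unselected word lines
    "GBL": 1.2,            # first global bit line: the read path
    "GBLB": 1.2,           # second global bit line: capacitance-matching path
    "CSL": 0.0,            # common source line, via the on source line transistor
}

def word_line_voltage(selected):
    # Voltage to drive on a word line during a normal read.
    return READ_BIAS["selected_WL"] if selected else READ_BIAS["unselected_WL"]
```

Note that GBL (N) and GBLB (N) are biased identically so that the unused path matches the capacitance of the read path.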
Fig. 3A is an explanatory diagram of a 3D AND-type NOR flash memory device performing a vector matrix multiplication operation according to an embodiment of the present disclosure. Next, how to apply the above 3D AND-type NOR flash memory to vector matrix multiplication (VMM), also called computing in memory (CIM), will be described. This embodiment illustrates an example of single-level weight CIM.
When applied to a VMM, the memory device 100 of fig. 2 is reconfigured as the memory device 200, and the same or similar symbols will continue to be used, with only the differences being identified. As shown in fig. 3A, a memory array (e.g., formed of stacks 110, 111, etc. shown in fig. 2) has a plurality of first pairs of memory cells and a plurality of second pairs of memory cells. Here, for simplicity of explanation, only one first pair of memory cells and one second pair of memory cells are illustrated. The first pair of memory cells includes a first group of memory cells (or first memory cells) 215 coupled to a first global bit line GBL (N) and a second group of memory cells (or second memory cells) 216 coupled to a second global bit line GBLB (N), and the second pair of memory cells includes a third group of memory cells (or third memory cells) 217 coupled to the first global bit line GBL (N) and a fourth group of memory cells (or fourth memory cells) 218 coupled to the second global bit line GBLB (N). In this embodiment, each set of memory cells 215, 216, 217, 218 includes one memory cell.
The memory device 200 also includes a plurality of input word line pairs 220, one of which is illustrated herein as an illustrative example. Each of the input word line pairs 220 includes a first input word line input_1 and a second input word line input_1b, the first input word line input_1 being coupled to the first group of memory cells 215 and the second group of memory cells 216, and the second input word line input_1b being coupled to the third group of memory cells 217 and the fourth group of memory cells 218. The memory device 200 also includes a signal processing circuit 250 coupled to the first global bit line GBL (N) and the second global bit line GBLB (N). In this embodiment, the signal processing circuit 250 may be implemented with a differential analog-to-digital converter (differential ADC) 250. The input word line pair 220 may provide a binary input signal or a ternary input signal. Further, the input to the input word line pair 220 is herein a single-level (SLC) input.
In addition, each bit line (e.g., BL1) of the first group 215 and the third group 217 may be coupled to the first global bit line GBL (N) via the bit line transistor BLT1, and each bit line (e.g., BL9) of the second group 216 and the fourth group 218 may be coupled to the second global bit line GBLB (N) via the bit line transistor BLT9. The first global bit line GBL (N) and the second global bit line GBLB (N) are coupled as inputs to the differential analog-to-digital converter 250. Here, the first global bit line GBL (N) may be used to collect read currents representing VMM products greater than 0, while the second global bit line GBLB (N) may be used to collect read currents representing VMM products less than 0.
The differential analog-to-digital converter 250 detects which of the two paths, the first global bit line GBL (N) or the second global bit line GBLB (N), carries the larger current. In one embodiment, after sensing the first global bit line GBL (N) and the second global bit line GBLB (N), the differential analog-to-digital converter 250 subtracts the currents of the two paths from each other to obtain the ADC value.
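The cancellation performed by the differential ADC can be modeled as a simple signed subtraction. This is a behavioral sketch, not the converter's circuit; the function name and current arguments are illustrative.

```python
def differential_adc(i_gbl, i_gblb):
    # Behavioral model of the differential ADC: the currents collected on
    # the two global bit lines cancel, and the signed difference is the
    # digitized VMM result (positive if GBL dominates, negative if GBLB does).
    return i_gbl - i_gblb
```

The sign of the result immediately tells which path carried the larger current, and the magnitude is the net accumulated product.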
In performing the VMM operation using the memory array of fig. 3A, the source line transistors SLT1, SLT9 are turned on and the source line transistors SLT8, SLT16 are turned off, so that the source lines SL1, SL9 are coupled to the common source line CSL, to which, for example, a voltage of 0V is applied. In addition, the bit line transistors BLT1, BLT9 are turned on and the bit line transistors BLT8, BLT16 are turned off, so that the bit line BL1 is coupled to the first global bit line GBL (N) and the bit line BL9 is coupled to the second global bit line GBLB (N); for example, the first global bit line GBL (N) and the second global bit line GBLB (N) are both applied with a voltage of 0.2V.
The data stored in the first group of memory cells 215, the second group of memory cells 216, the third group of memory cells 217, and the fourth group of memory cells 218 are, for example, single-level weight values.
When performing the VMM multiplication, the result of the operation may be positive or negative. As described above, the first global bit line GBL (N) may be used to collect read currents representing VMM products greater than 0, while the second global bit line GBLB (N) may be used to collect read currents Icell representing VMM products less than 0. Thus, the circuit must be able to operate with positive and negative inputs (word line voltages) and positive and negative weight values. However, in practice there is no physical negative input and no physical negative weight value in the VMM computing application, so an operation rule must be designed.
As described above, according to the embodiment of the present disclosure, for the input voltage (the voltage applied to the word line), an input word line pair 220 is employed, in which the first input word line input_1 may input 1 or 0 and the second input word line input_1b may also input 1 or 0. Here, 1 and 0 represent logic levels: when 1 is input, a voltage of about 3V may be applied to the word line, and when 0 is input, a voltage of about 0V may be applied. Thus, a ternary input signal may be generated by the input combination of the first input word line input_1 and the second input word line input_1b of the input word line pair 220. For example, inputting 1 on the first input word line input_1 and 0 on the second input word line input_1b produces a positive input (+1); inputting 0 on both produces a zero input (0); and inputting 0 on the first input word line input_1 and 1 on the second input word line input_1b produces a negative input (-1). In this way, the present disclosure may generate a ternary input signal (+1, 0, -1) without physically providing a negative input. A binary input signal can also be generated in the same way.
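The ternary input encoding above can be sketched as a small lookup. The ~3V and ~0V levels come from the text; the constant and function names are illustrative assumptions.

```python
V_HIGH, V_LOW = 3.0, 0.0  # ~3V for logic 1, ~0V for logic 0 (from the text)

def encode_input(x):
    # Map a ternary input (+1, 0, -1) to the (input_1, input_1b)
    # word line pair voltages described in the text.
    if x == +1:
        return (V_HIGH, V_LOW)   # input_1 = 1, input_1b = 0
    if x == 0:
        return (V_LOW, V_LOW)    # input_1 = 0, input_1b = 0
    if x == -1:
        return (V_LOW, V_HIGH)   # input_1 = 0, input_1b = 1
    raise ValueError("ternary input must be +1, 0, or -1")
```

A binary input simply restricts the encoder to the +1 and 0 cases.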
Regarding positive and negative weight values, according to the embodiment of the present disclosure, when the first group of memory cells 215 and the fourth group of memory cells 218 produce a read current Icell while the read current Icell of the second group of memory cells 216 and the third group of memory cells 217 is 0, a positive weight value (+1) is represented. When the second group of memory cells 216 and the third group of memory cells 217 produce a read current Icell while the read current Icell of the first group of memory cells 215 and the fourth group of memory cells 218 is 0, a negative weight value (-1) is represented. In addition, if the read current Icell of all of the first to fourth groups of memory cells 215 to 218 is 0, a zero weight value is represented.
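The weight encoding above can likewise be sketched as a lookup over the four cell groups. The 1-unit read current is an illustrative normalization, and the function name is an assumption.

```python
def encode_weight(w):
    # Map a ternary weight to the unit read currents of the four cell
    # groups, ordered (group 215, group 216, group 217, group 218).
    if w == +1:
        return (1, 0, 0, 1)   # groups 215 and 218 conduct
    if w == -1:
        return (0, 1, 1, 0)   # groups 216 and 217 conduct
    if w == 0:
        return (0, 0, 0, 0)   # no group conducts
    raise ValueError("ternary weight must be +1, 0, or -1")
```

Because groups 215/217 feed GBL (N) and groups 216/218 feed GBLB (N), this placement routes each signed product to the correct global bit line.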
When operating the memory device of fig. 3A, if a positive input is applied, the input voltage is applied to the first input word line input_1; if the first group of memory cells 215 holds a positive weight, the product of the two corresponds to a positive read current Icell. The read current Icell then flows to the differential analog-to-digital converter 250 via the first global bit line GBL (N), representing a positive product. Similarly, when the input voltage is applied to the first input word line input_1 and the second group of memory cells 216 holds a negative weight, the product corresponds to a negative read current Icell, which flows to the differential analog-to-digital converter 250 via the second global bit line GBLB (N), representing a negative product. Likewise, when the input voltage is applied to the second input word line input_1b (representing a negative input) and the third group of memory cells 217 holds a negative weight, the product corresponds to a positive read current Icell, which flows to the differential analog-to-digital converter 250 via the first global bit line GBL (N), representing a positive product. Finally, when the input voltage is applied to the second input word line input_1b (representing a negative input) and the fourth group of memory cells 218 holds a positive weight, the product corresponds to a negative read current Icell, which flows to the differential analog-to-digital converter 250 via the second global bit line GBLB (N), representing a negative product.
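The four cases above can be combined into one self-contained model of a single ternary multiply. This is an illustrative sketch of the routing rule, with unit currents and illustrative names, not the patent's circuit.

```python
def cim_product(x, w):
    # One ternary multiply: the input drives the (input_1, input_1b) pair,
    # the weight selects which cell groups conduct, and the signed product
    # is recovered as I(GBL) - I(GBLB).
    a, ab = (1, 0) if x == +1 else (0, 0) if x == 0 else (0, 1)
    # Group currents ordered (215, 216, 217, 218): input_1 drives 215/216,
    # input_1b drives 217/218; 215/217 feed GBL, 216/218 feed GBLB.
    g215, g216, g217, g218 = ((1, 0, 0, 1) if w == +1 else
                              (0, 1, 1, 0) if w == -1 else
                              (0, 0, 0, 0))
    i_gbl = a * g215 + ab * g217
    i_gblb = a * g216 + ab * g218
    return i_gbl - i_gblb
```

All nine (input, weight) combinations reproduce the ordinary signed product x * w, even though neither line nor cell ever carries a physically negative quantity.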
Fig. 3B is an explanatory diagram of a 3D AND-type NOR flash memory device performing a vector matrix multiplication operation according to another embodiment of the present disclosure. Next, how to apply the above 3D AND-type NOR flash memory to vector matrix multiplication (VMM), also called computing in memory (CIM), with multi-level weights will be described. This embodiment illustrates an example of multi-level weight CIM.
When applied to a VMM, the memory device 100 of fig. 2 is reconfigured as the memory device 200, and the same or similar symbols will continue to be used, with only the differences being identified. As shown in fig. 3B, the memory array (e.g., formed by the stacks 110, 111, etc. shown in fig. 2) has a plurality of first pairs of memory cells and a plurality of second pairs of memory cells. Here, for simplicity of explanation, only one first pair of memory cells and one second pair of memory cells are illustrated. The first pair of memory cells includes a first group of memory cells 211 coupled to a first global bit line GBL (N) and a second group of memory cells 212 coupled to a second global bit line GBLB (N), and the second pair of memory cells includes a third group of memory cells 213 coupled to the first global bit line GBL (N) and a fourth group of memory cells 214 coupled to the second global bit line GBLB (N). Each group of memory cells 211, 212, 213, 214 is illustrated herein as including two memory cells, but this is not intended to limit embodiments of the present disclosure. The memory device 200 also includes a plurality of input word line pairs 220, one of which is illustrated herein as an illustrative example. Each of the input word line pairs 220 includes a first input word line input_1 and a second input word line input_1b, the first input word line input_1 being coupled to the first group of memory cells 211 and the second group of memory cells 212, and the second input word line input_1b being coupled to the third group of memory cells 213 and the fourth group of memory cells 214. The memory device 200 also includes a signal processing circuit 250 coupled to the first global bit line GBL (N) and the second global bit line GBLB (N). In this embodiment, the signal processing circuit 250 may be implemented using a differential analog-to-digital converter (differential ADC) 250.
The input word line pair 220 may provide a binary input signal or a ternary input signal. Further, the input to the input word line pair 220 is herein a single-level (SLC) input.
Furthermore, in cooperation with the 3D AND-type NOR flash memory structure shown in fig. 1, the memory device 200 includes two first drain side conductive strings and two second drain side conductive strings, which correspond to the local bit lines LBL1, LBL8, LBL9, LBL16, respectively. The two first drain side conductive strings are coupled to the first group of memory cells 211 and the third group of memory cells 213, respectively, and to the first global bit line GBL (N). The two second drain side conductive strings are coupled to the second group of memory cells 212 and the fourth group of memory cells 214, respectively, and to the second global bit line GBLB (N). In addition, the memory device 200 includes two first source side conductive strings and two second source side conductive strings. The two first source side conductive strings are coupled to the first group of memory cells 211 and the third group of memory cells 213, respectively, and to the common source line CSL. The two second source side conductive strings are coupled to the second group of memory cells 212 and the fourth group of memory cells 214, respectively, and to the common source line CSL.
In addition, each bit line (e.g., BL1, BL8) of the first group of memory cells 211 and the third group of memory cells 213 can be coupled to the first global bit line GBL (N) via the bit line transistors BLT1, BLT8, respectively, and each bit line (e.g., BL9, BL16) of the second group of memory cells 212 and the fourth group of memory cells 214 can be coupled to the second global bit line GBLB (N) via the bit line transistors BLT9, BLT16, respectively. The first global bit line GBL (N) and the second global bit line GBLB (N) are coupled as inputs to the differential analog-to-digital converter 250. Here, the first global bit line GBL (N) may be used to collect read currents representing VMM products greater than 0, while the second global bit line GBLB (N) may be used to collect read currents representing VMM products less than 0.
The differential analog-to-digital converter 250 detects which of the two paths, the first global bit line GBL (N) or the second global bit line GBLB (N), carries the larger current. In one embodiment, after sensing the first global bit line GBL (N) and the second global bit line GBLB (N), the differential analog-to-digital converter 250 subtracts the currents of the two paths from each other to obtain the ADC value.
In performing VMM operation using the memory array of fig. 3B, the source line transistors SLT1, SLT8, SLT9, SLT16 are turned on, so that the source lines SL1, SL8, SL9, SL16 are coupled to the common source line CSL, for example, the common source line CSL is applied with a voltage of 0V. In addition, the bit line transistors BLT1, BLT8, BLT9, BLT16 are turned on, such that the bit lines BL1, BL8 are coupled to the first global bit line GBL (N) and the bit lines BL9, BL16 are coupled to the second global bit line GBLB (N), e.g., the first global bit line GBL (N) and the second global bit line GBLB (N) are both applied with a voltage of 0.2V.
The data stored in the first group of memory cells 211, the second group of memory cells 212, the third group of memory cells 213, and the fourth group of memory cells 214 are, for example, four-level (4-level) weight values. In this example, each group of memory cells consists of two memory cells, so that an 8-level weight value can be generated. Of course, if more weight levels are required, more memory cells can be connected in parallel in each group to generate more levels.
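The parallel-cell construction above can be sketched as a summation of per-cell currents on a shared local bit line. The unit current and the 0 to 3 level range per 4-level cell are illustrative assumptions for this model.

```python
def group_current(cell_levels, unit_icell=1.0):
    # Parallel cells in one group share a local bit line, so their read
    # currents add. With two 4-level cells (each programmed to a level
    # 0..3), the summed current forms a multi-level weight; adding more
    # cells in parallel widens the range further.
    assert all(0 <= level <= 3 for level in cell_levels)
    return unit_icell * sum(cell_levels)
```

For example, two cells at levels 3 and 3 give the maximum group current, while 0 and 0 give a zero weight contribution.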
When performing the VMM multiplication, the result of the operation may be positive or negative. As described above, the first global bit line GBL (N) may be used to collect read currents Icell representing VMM products greater than 0, while the second global bit line GBLB (N) may be used to collect read currents Icell representing VMM products less than 0. Thus, the circuit must be able to operate with positive and negative inputs (word line voltages) and positive and negative weight values. In this embodiment, too, there is no physically negative input and no negative weight value in the VMM calculation, so an operation rule is designed.
As described above, according to the embodiment of the present disclosure, for the input voltage (the voltage applied to the word line), an input word line pair 220 is employed, in which the first input word line input_1 may input 1 or 0 and the second input word line input_1b may also input 1 or 0. Here, 1 and 0 represent logic levels: when 1 is input, a voltage of about 3V may be applied to the word line, and when 0 is input, a voltage of about 0V may be applied. Thus, a ternary input signal may be generated by the input combination of the first input word line input_1 and the second input word line input_1b of the input word line pair 220. For example, inputting 1 on the first input word line input_1 and 0 on the second input word line input_1b produces a positive input (+1); inputting 0 on both produces a zero input (0); and inputting 0 on the first input word line input_1 and 1 on the second input word line input_1b produces a negative input (-1). In this way, the present disclosure may generate a ternary input signal (+1, 0, -1) without physically providing a negative input. A binary input signal can also be generated in the same way.
Regarding the positive and negative weight values, according to embodiments of the present disclosure: when the first group of memory cells 211 and the fourth group of memory cells 214 produce a read current Icell, and the read current Icell of the second group of memory cells 212 and the third group of memory cells 213 is 0, a positive weight value (+1) may be formed. When the read current Icell is produced by the second group of memory cells 212 and the third group of memory cells 213, and the read current Icell of the first group of memory cells 211 and the fourth group of memory cells 214 is 0, a negative weight value (-1) may be formed. In addition, if the read currents Icell of the first to fourth groups of memory cells 211 to 214 are all 0, a zero weight value may be formed.
When operating the memory device of FIG. 3B, if a positive input is applied, the input voltage is applied to the first input word line input_1. If the first group of memory cells 211 holds a positive weight, the product of the two is a positive read current Icell; this read current Icell flows to the differential analog-to-digital converter 250 via the first global bit line GBL (N), representing a positive product. Similarly, when the input voltage is applied to the first input word line input_1 and the second group of memory cells 212 holds a negative weight, the product is negative; the read current Icell flows to the differential analog-to-digital converter 250 via the second global bit line GBLB (N), representing a negative product. When the input voltage is applied to the second input word line input_1b (representing a negative input) and the third group of memory cells 213 holds a negative weight, the product of the two is positive, and the read current Icell flows to the differential analog-to-digital converter 250 via the first global bit line GBL (N). Likewise, when the input voltage is applied to the second input word line input_1b (representing a negative input) and the fourth group of memory cells 214 holds a positive weight, the product is negative, and the read current Icell flows to the differential analog-to-digital converter 250 via the second global bit line GBLB (N).
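The sign-routing rule above can be sketched in software. The following is a minimal illustrative model, not the patent's circuit: ICELL is a normalized read current, and the encoding tables follow the input and weight conventions described for FIG. 3B (positive weights in groups 211/214, negative weights in groups 212/213).

```python
ICELL = 1.0  # normalized read current of one conducting memory cell

def encode_input(x):
    """Ternary input -> (input_1, input_1b) word line logic levels."""
    return {+1: (1, 0), 0: (0, 0), -1: (0, 1)}[x]

def encode_weight(w):
    """Ternary weight -> read currents of cell groups (211, 212, 213, 214).
    Positive weight: groups 211/214 conduct; negative: groups 212/213."""
    return {+1: (ICELL, 0, 0, ICELL),
             0: (0, 0, 0, 0),
            -1: (0, ICELL, ICELL, 0)}[w]

def multiply(x, w):
    """Return (I_GBL, I_GBLB): currents collected on the two global bit lines."""
    in1, in1b = encode_input(x)
    g211, g212, g213, g214 = encode_weight(w)
    i_gbl  = in1 * g211 + in1b * g213   # positive products land on GBL (N)
    i_gblb = in1 * g212 + in1b * g214   # negative products land on GBLB (N)
    return i_gbl, i_gblb

# The signed product is recovered as the differential current:
for x in (+1, 0, -1):
    for w in (+1, 0, -1):
        i_gbl, i_gblb = multiply(x, w)
        assert i_gbl - i_gblb == x * w * ICELL
```

Running the loop confirms that for all nine input/weight combinations the difference of the two bit line currents equals the signed product.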
In summary, the following Table I lists the relationships between the outputs of the first global bit line GBL (N) and the second global bit line GBLB (N) and the first input word line input_1, the second input word line input_1b (positive, zero and negative inputs) and the weight values (positive, zero and negative weight values).
TABLE I
  Input (input_1, input_1b)    Weight value    Output
  +1 (1, 0)                    +1              read current on GBL (N) (positive product)
  +1 (1, 0)                    -1              read current on GBLB (N) (negative product)
  -1 (0, 1)                    -1              read current on GBL (N) (positive product)
  -1 (0, 1)                    +1              read current on GBLB (N) (negative product)
  0 (0, 0), or zero weight     any / 0         no read current (zero product)
In this way, the positive read currents Icell on all word lines and bit lines are summed to produce a positive VMM product, the negative read currents are likewise summed to produce a negative VMM product, and both sums are passed to the differential analog-to-digital converter 250 for comparison to produce a digital value.
To summarize, with the architecture and operation rules shown in FIG. 3B, the sum of the read currents Icell through the first global bit line GBL (N) may represent a positive VMM product value VMM(positive), and the sum of the read currents Icell through the second global bit line GBLB (N) may represent a negative VMM product value VMM(negative). The two may be calculated as follows:

VMM(positive) = Σ_i Σ_k g_m(i, k) × V_WL(i)   (summed over the memory cells coupled to the first global bit line GBL (N))

VMM(negative) = Σ_i Σ_k g_m(i, k) × V_WL(i)   (summed over the memory cells coupled to the second global bit line GBLB (N))

where g_m(i, k) is the transconductance of a memory cell, V_WL(i) is the voltage applied to a word line, i is the word line index, k is the bit line index, and j is the global bit line index. The voltage V_WL(i) applied to the word line multiplied by the transconductance g_m(i, k) of a memory cell corresponds to the read current Icell of that memory cell, and the transconductance g_m(i, k) corresponds to the weight described above. Thus, by summing the read currents of the memory cells of the memory array, both p_i × q_i > 0 (the VMM products greater than 0) and p_i × q_i < 0 (the VMM products less than 0) can be calculated, where p_i and q_i are arbitrary values, i.e., values computed from the word line voltage V_WL(i) and the weight g_m(i, k).
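The two sums can be sketched numerically. The sketch below assumes a unit transconductance per weight level and ternary weights; it only illustrates that the signed VMM is recovered as the difference of the two non-negative bit line sums.

```python
def vmm_differential(V_WL, weights):
    """Compute (VMM(positive), VMM(negative)).
    V_WL[i]: non-negative word line voltage (the input); weights[i][k] in
    {-1, 0, +1}. A cell's read current is V_WL[i] * |g_m(i, k)|, with the
    transconductance magnitude normalized to 1 (assumption)."""
    vmm_pos = sum(V_WL[i] for i, row in enumerate(weights) for w in row if w > 0)
    vmm_neg = sum(V_WL[i] for i, row in enumerate(weights) for w in row if w < 0)
    return vmm_pos, vmm_neg

V_WL = [3.0, 0.0, 3.0]                  # inputs +1, 0, +1 encoded as ~3V / 0V
weights = [[+1, -1], [+1, +1], [0, -1]]
pos, neg = vmm_differential(V_WL, weights)
# The signed VMM is the difference of the two global bit line sums:
signed = sum(V_WL[i] * w for i, row in enumerate(weights) for w in row)
assert pos - neg == signed              # here 3.0 - 6.0 == -3.0
```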
FIG. 3C shows gate voltage versus read current Icell (left), cell read current after trimming versus standard deviation σ (middle), and random telegraph noise (RTN) versus bit count (right). The left side of FIG. 3C is a plot of drain current (Id) versus gate voltage (Vg) under ISPP (incremental step pulse programming). The horizontal axis is the gate voltage Vg, i.e., the voltage applied to the word line; the vertical axis is the read current Icell at a bit line voltage V_BL of 0.2V. Here, it is desirable to control the read current Icell at a low bit line voltage of V_BL = 0.2V (the bit line voltage at a normal read is V_BL = 1.2V). In the example described above, the input voltage (word line voltage V_WL) is about 2V to 3V, so the corresponding current between Vg = 2V and 3V in FIG. 3C can be trimmed to different read current Icell ranges, such as from sub-100nA to sub-1μA. The middle graph of FIG. 3C shows that the read current Icell behaves better in the sub-1μA range, and the right graph of FIG. 3C shows that there is less RTN in the sub-1μA range.
FIG. 3D is a schematic diagram showing the distribution of the read current Icell of the memory cells. As described above, in order to achieve good results in the in-memory operation described above, it is desirable to generate compact and properly spaced read current Icell distributions, with small RTN and good retention. Therefore, when the input voltage (word line voltage) is in the range of about 2 to 3V, it is preferable to trim the read current Icell distribution to the sub-1μA range, such as 200nA, 400nA, 600nA and 800nA, as shown in FIG. 3D. In this way, a 4-level weight value can be obtained.
Taking the first group of memory cells 211 (storing a positive weight value) and the second group of memory cells 212 (storing a negative weight value) as an example, each group of memory cells includes two memory cells, so the first pair of memory cells has four memory cells in total, and each memory cell has a 4-level read current Icell. When the four bit line transistors BLT1, BLT8, BLT9 and BLT16 are all on, a total of 16 weight levels may be generated (e.g., negative weights of -8 to -1 and positive weights of 0 to +7), i.e., a resolution of 4 bits.
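The 16-level claim can be sanity-checked with a short enumeration. The per-cell mapping below (two 4-level cells read as a base-4 digit pair, minus an offset of 8) is an assumption chosen for illustration; the patent states only that the four 4-level cells of a pair yield 16 levels from -8 to +7.

```python
# Hypothetical mapping: two 4-level cells (levels 0..3) interpreted as a
# base-4 digit pair, offset by 8. The patent specifies only the resulting range.
levels = sorted({4 * hi + lo - 8 for hi in range(4) for lo in range(4)})
assert levels == list(range(-8, 8))   # 16 distinct levels -> 4-bit resolution
assert len(levels) == 16
```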
In the above architecture, the input signal is single-level. If multi-level inputs are to be generated, multiple instances of the FIG. 3B-based architecture described above may be employed. FIG. 4 is a schematic diagram of an architecture for generating 4-bit inputs and 4-bit weights (4I4W) in the digital domain.
As shown in FIG. 4, the memory device 300 includes 4 memory arrays 301a, 301b, 301c, 301d (4 blocks). Each memory array 301a, 301b, 301c, 301d has a respective X decoder 302a, 302b, 302c, 302d and AD converter 303a, 303b, 303c, 303d. Each memory array and its corresponding X decoder and AD converter may use the architecture shown in FIG. 3B. Each memory array 301a, 301b, 301c, 301d has a 4-bit weight value, i.e., a 4-level read current Icell combined with 4 bit line transistors BLT. Thus, the memory cells here are multi-level cells (MLC), 4-level in this example.
In addition, the word lines of each memory array 301a, 301b, 301c, 301d each receive a single-level (SLC) input, but the input voltages differ: the input of memory array 301a is a_0, the input of memory array 301b is a_1, the input of memory array 301c is a_2, and the input of memory array 301d is a_3.
In addition, the four memory arrays 301a, 301b, 301c, 301d output results by repeating the operation in a cyclic manner, and finally the outputs of the four AD converters are summed. This can be achieved using a shifter and an adder, where the output of memory array 301a corresponds to the least significant bit (LSB) and the output of memory array 301d corresponds to the most significant bit (MSB). Thus, the outputs of the four memory arrays 301a, 301b, 301c, 301d are multiplied by the corresponding weighting coefficients 1 (=2^0), 2 (=2^1), 4 (=2^2) and 8 (=2^3), respectively.
With the architecture described above, a 4-input 4-weight (4I4W) architecture with positive and negative polarities can be produced. To summarize, this architecture requires:
[1] 4 memory cells in two blocks (tiles) to generate positive and negative polarities;
[2] multi-level memory cells (4 levels in this example) to generate 4 read currents Icell, corresponding to 4 weight values (W0, W1, W2, W3);
[3] 4 bit line transistors BLT connected to each bit cell; and
[4] 4 block elements to produce a 4-bit input (a_0, a_1, a_2, a_3).
Finally, the VMM output of the memory device 300 may be expressed by the following equation:
VMM = (W3W2W1W0)×1×a_0 + (W3W2W1W0)×2×a_1 + (W3W2W1W0)×4×a_2 + (W3W2W1W0)×8×a_3
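This equation maps directly to a shift-and-add, as the following sketch shows (the function name and scalar encoding of the weight are illustrative, not from the patent):

```python
def vmm_4i4w(weight_w, input_bits):
    """weight_w: the 4-bit weight value W3W2W1W0 as an integer (0..15).
    input_bits: [a_0, a_1, a_2, a_3], the single-level inputs of the 4 blocks.
    Block n contributes (W3W2W1W0) x a_n, scaled by 2^n by the shifter/adder."""
    return sum((weight_w * a_n) << n for n, a_n in enumerate(input_bits))

# Example: weight 11 (0b1011), input bits [1, 0, 1, 0] (i.e., the value 5):
assert vmm_4i4w(11, [1, 0, 1, 0]) == 55   # 11 x (1 + 4)
```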
FIG. 5 is a schematic diagram of a solid state drive module according to an embodiment of the disclosure. The solid state drive (SSD) module 350 shown in FIG. 5 may be applied, for example, to an AI inference system that requires operations on large amounts of data, particularly matrix multiplication operations. As shown in FIG. 5, the solid state drive module 350 includes a controller chip 352 and a general matrix multiplication (GEMM) chip 354, which communicate data via an interface 356. This interface may be, for example, an interface equivalent or similar to DDR4/5. In addition, the controller chip 352 may be coupled to a plurality of general matrix multiplication chips 354. In other embodiments, the general matrix multiplication chip 354 is a standalone chip.
The general matrix multiplication chip 354, constructed from the architecture of FIG. 3B described above (i.e., using 3D NOR flash memory), may have, for example, 512 inputs (4 bits) and 1024 outputs (4 bits). Each GEMM chip 354 may support multiple gigabytes of memory cells to directly compute the billions of parameters in a large neural network. The GEMM chip 354 is connected to the controller chip 352 via an interface 356 such as DDR5 (4.8 Gbps, 16 I/O). Besides the control circuitry for controlling the AI data flow, the controller chip 352 requires only an appropriately sized SRAM to store metadata, without a large number of ALUs and cores (e.g., an SoC ASIC architecture requires more than 100 cores to achieve equivalent operation) to support vector matrix multiplication (VMM). Under this architecture, all VMM computations are performed in the GEMM chip 354. Under the 4I4W architecture described above, the internal maximum VMM compute throughput is 3.7 TOPS, which is much greater than the DDR5 I/O bandwidth. Furthermore, the power consumption per chip is less than 1W. Therefore, the GEMM chip 354 is both fast and low-power.
In this architecture, because all vector matrix multiplication operations are performed within the GEMM chip 354, the controller chip 352 need only provide inputs to the GEMM chip 354; the GEMM chip 354 performs the vector matrix multiplication and outputs the result to the controller chip 352. Therefore, vector matrix multiplications over large amounts of data can be computed efficiently and quickly under this architecture, without reading the data out of the memory and then computing through an ALU.
FIG. 6A illustrates an architecture and operation applying a 3D AND-type NOR flash memory to cosine similarity calculation according to an embodiment of the present disclosure. As shown in FIG. 6A, this architecture is substantially similar to that of FIG. 3B; only the differences are described below, and the remainder is the same as FIG. 3B. The cosine similarity calculation may be applied to in-memory search (IMS).
The memory array of the memory device 400 includes a plurality of first pairs of memory cells and a plurality of second pairs of memory cells. For simplicity of explanation, only one first pair of memory cells and one second pair of memory cells are illustrated here. The first pair of memory cells includes a first group of memory cells (or first memory cell) 411 coupled to a first global bit line GBL (N) and a second group of memory cells (or second memory cell) 412 coupled to a second global bit line GBLB (N), and the second pair of memory cells includes a third group of memory cells (or third memory cell) 413 coupled to the first global bit line GBL (N) and a fourth group of memory cells (or fourth memory cell) 414 coupled to the second global bit line GBLB (N). In this embodiment, each of the first to fourth groups of memory cells 411 to 414 includes one memory cell.
The memory device 400 further includes a plurality of input word line pairs 420, wherein each of the plurality of input word line pairs 420 (e.g., the WL1 pair) includes a first input word line input_1 and a second input word line input_1b, wherein the first input word line input_1 is coupled to the first memory cell 411 and the second memory cell 412, and the second input word line input_1b is coupled to the third memory cell 413 and the fourth memory cell 414. Each of the plurality of input word line pairs provides a ternary input signal, i.e., a ternary input (+1, 0, -1) as described above; for details, refer to the description of FIG. 3A or 3B.
Here, the positive input signal (+1) is generated by turning on the first input word line input_1 of the input word line pair 420 (taking the WL1 pair as an example) and turning off the second input word line input_1b; the zero input signal (0) is generated by turning off both the first input word line input_1 and the second input word line input_1b of the input word line pair 420; and the negative input signal (-1) is generated by turning off the first input word line input_1 and turning on the second input word line input_1b. Likewise, the input to the input word line pair 420 here is a single-level (SLC) input.
The memory device 400 also includes a signal processing circuit 450 coupled to the first global bit line GBL (N) and the second global bit line GBLB (N). In one embodiment, the signal processing circuit 450 may be implemented using a differential sense amplifier 450. When the architecture is used for cosine similarity calculation, it mainly compares the input signal with the data stored in the memory, so the differential analog-to-digital converter 250 shown in FIG. 3A or FIG. 3B is not required.
Further, as with the VMM calculation of FIG. 3A or 3B, the memory array stores weight value information used for the IMS calculation, in which a positive IMS weight value is stored in the first memory cell 411 and the fourth memory cell 414, and a negative IMS weight value is stored in the second memory cell 412 and the third memory cell 413.
In addition, the memory device 400 may further include a control circuit 460 coupled to the memory array and the plurality of input word line pairs for controlling the memory array to perform cosine similarity calculation. For example, the control circuit 460 may include a decoder that inputs an input signal to a corresponding input word line pair. The memory device 400 may also include a comparator 452 and a reference current generator 454. The comparator 452 is coupled to the differential sense amplifier 450 and the reference current generator 454. The reference current generator 454 generates a reference signal Iref, and the comparator 452 compares the output of the differential sense amplifier 450 with the reference signal Iref. In one embodiment, the reference signal Iref is adjustable in response to a cosine similarity calculation threshold.
In addition, as in the operation described with respect to FIG. 3A or 3B, the first global bit line GBL (N) collects the positive read current Icell, and the second global bit line GBLB (N) collects the negative read current Icell. The sums of the positive read currents Icell and of the negative read currents Icell are transmitted to the differential sense amplifier 450, which outputs the difference between the two.
The cosine similarity calculation is shown in the following equation:

similarity = cos(θ) = (Σ_i p_i × q_i) / (√(Σ_i p_i²) × √(Σ_i q_i²))

The cosine similarity calculation is thus also an application of vector matrix multiplication, where p_i is the input vector (query), i.e., the input signal (e.g., a ternary signal of +1, 0, -1) input from the word line pair 420, and q_i is the data stored in the memory, i.e., the weight value information.
In the cosine similarity calculation, the memory cells use a single-level read current distribution as shown in FIG. 6B, preferably a read current Icell distribution around 200nA. With this distribution, the standard deviation σ is 4%.
In addition, under this architecture there may be 512 word lines WL and 1024 outputs, i.e., 1024 differential sense amplifiers. Further, the read time (tREAD) is about 100ns, so the bandwidth of the similarity search is 512×1024/100ns, i.e., 5TB/s. Therefore, high-capacity and high-speed operation can be achieved.
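The stated search bandwidth can be checked with simple arithmetic, assuming one word line × output comparison per 100ns cycle:

```python
word_lines, outputs, t_read = 512, 1024, 100e-9   # 100 ns per read cycle
comparisons_per_second = word_lines * outputs / t_read
# ~5.24e12 comparisons/s, the order of magnitude behind the "5TB/s" figure:
assert round(comparisons_per_second / 1e12, 2) == 5.24
```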
Thus, when the comparator 452 compares the output of the differential sense amplifier 450 with the reference signal Iref, it can detect whether the input signal matches (pass) or does not match (fail) the data stored inside the memory. Therefore, when the in-memory computation is applied to cosine similarity calculation, it can be used for face recognition applications. Under this architecture, there is no need to read out the data inside the memory device for searching; an input signal (e.g., face data to be confirmed) is simply input into the memory device for the IMS calculation, and the memory device provides the search result to the external system. In addition, the memory device of the present disclosure has a large capacity and a high execution speed, so it can quickly output a search result without occupying system resources.
According to embodiments of the present disclosure, a 3D AND-type NOR flash memory is utilized to construct the operating architecture of an in-memory computing memory device. Thus, embodiments of the disclosure save system resources: data in the memory need not be read out and computed by a separate ALU, and constant data updates are not required because the data is not read out to an external storage device. At the same time, the architecture of the present disclosure achieves high-capacity, high-speed and efficient in-memory computing. Thus, VMM calculations, IMS calculations and the like, commonly used in big data or AI applications such as image processing, face recognition and deep neural networks, may be implemented by the architecture of the present disclosure.

Claims (20)

1. A memory device for in-memory computing, comprising:
a memory array having a plurality of first pairs of memory cells each including a first group of memory cells coupled to a first global bit line and a second group of memory cells coupled to a second global bit line, and a plurality of second pairs of memory cells each including a third group of memory cells coupled to the first global bit line and a fourth group of memory cells coupled to the second global bit line;
A plurality of input word line pairs, each of the plurality of input word line pairs comprising a first input word line coupled to the first set of memory cells and the second set of memory cells and a second input word line coupled to the third set of memory cells and the fourth set of memory cells; and
a signal processing circuit is coupled to the first global bit line and the second global bit line.
2. The memory device of claim 1, wherein the plurality of input word line pairs provide a binary input signal or a ternary input signal.
3. The memory device of claim 2, wherein the memory array stores weight value information used as in-memory calculations, wherein a first VMM weight value is stored in the first set of memory cells and the fourth set of memory cells, and a second VMM weight value is stored in the second set of memory cells and the third set of memory cells.
4. The memory device of claim 1, wherein the signal processing circuit is a differential analog-to-digital converter, the first set of memory cells to the fourth set of memory cells each comprising one memory cell.
5. The memory device of claim 1, wherein the signal processing circuit is a differential analog-to-digital converter, the first set of memory cells to the fourth set of memory cells each comprising two memory cells, the memory device further comprising:
two first drain side conductive strings coupled to the first set of memory cells, the third set of memory cells, and the first global bit line; and
two second drain side conductive strings coupled to the fourth set of memory cells, the second set of memory cells, and the second global bit line.
6. The memory device of claim 5, further comprising:
a plurality of bit line transistors coupled between the two first drain side conductive strings and the first global bit line, and coupled between the two second drain side conductive strings and the second global bit line.
7. The memory device of claim 5, further comprising:
two first source side conductive strings coupled to the first group of memory cells and the third group of memory cells, respectively, and to a common source line; and
two second source side conductive strings are coupled to the second group of memory cells and the fourth group of memory cells, respectively, and to the common source line.
8. The memory device of claim 4, wherein the weight information stored in the memory array comprises a 4-order weight value.
9. The memory device of claim 1, wherein the first global bit line and the second global bit line are used to sum memory cell currents from the memory array and the memory cell current for one memory cell of the memory array is greater than 100nA and less than 1 μA.
10. The memory device of claim 1, wherein a sense voltage is applied to the first global bit line and the second global bit line to sum a memory cell current from the memory array, and the sense voltage is less than 0.2V.
11. The memory device of claim 1, wherein each of the plurality of input word line pairs is to provide a 1-bit input signal.
12. The memory device of claim 1, wherein the memory array is a 3D NOR flash memory.
13. The memory device of claim 5, wherein the two first drain side conductive strings and the two second drain side conductive strings are doped polysilicon plugs.
14. The memory device of claim 1, wherein the first through fourth sets of memory cells each comprise one memory cell,
the memory device is configured to perform an in-memory search (IMS), and
the signal processing circuit is a differential sense amplifier coupled to the first global bit line and the second global bit line.
15. The memory device of claim 14, wherein each of the plurality of input word line pairs provides a binary input signal or a ternary input signal.
16. The memory device of claim 14, wherein the memory array stores weight value information used as the in-memory search, wherein a first IMS weight value is stored in the first set of memory cells and the fourth set of memory cells, and a second IMS weight value is stored in the second set of memory cells and the third set of memory cells.
17. The memory device of claim 14, further comprising:
control circuitry coupled to the memory array and the plurality of input word line pairs, the control circuitry controlling the memory array to perform the in-memory search using cosine similarity calculations; and
and a comparator coupled to the differential sense amplifier and a reference signal generator, wherein the reference signal generator generates a reference signal, and the comparator compares the output of the differential sense amplifier with the reference signal.
18. The memory device of claim 17, wherein the reference signal is adjustable corresponding to a cosine similarity calculation threshold.
19. A solid state drive module comprising:
a controller chip;
a memory chip, which is the in-memory computing memory device according to claim 1, coupled to the controller chip; and
an interface coupled to the controller chip and the memory chip.
20. The solid state drive module of claim 19, wherein the interface is DDR4 or DDR5.
CN202310115295.6A 2022-07-13 2023-02-14 In-memory computing memory device and solid state drive module Pending CN117409830A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US63/388,647 2022-07-13
US18/161,900 US20240028211A1 (en) 2022-07-13 2023-01-31 Memory device for computing in-memory
US18/161,900 2023-01-31

Publications (1)

Publication Number Publication Date
CN117409830A 2024-01-16

Family

ID=89489601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310115295.6A Pending CN117409830A (en) 2022-07-13 2023-02-14 In-memory computing memory device and solid state drive module

Country Status (1)

Country Link
CN (1) CN117409830A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination