CN113674786A - In-memory computing unit, module and system - Google Patents

In-memory computing unit, module and system

Info

Publication number
CN113674786A
CN113674786A (application CN202110960405.XA)
Authority
CN
China
Prior art keywords: memory, equal, bit, data, bit lines
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110960405.XA
Other languages
Chinese (zh)
Inventor
杨展悌
苏炳熏
叶甜春
罗军
赵杰
Current Assignee
Guangdong Greater Bay Area Institute of Integrated Circuit and System
Ruili Flat Core Microelectronics Guangzhou Co Ltd
Original Assignee
Aoxin Integrated Circuit Technology Guangdong Co ltd
Guangdong Greater Bay Area Institute of Integrated Circuit and System
Priority date
Filing date
Publication date
Application filed by Aoxin Integrated Circuit Technology Guangdong Co ltd and Guangdong Greater Bay Area Institute of Integrated Circuit and System
Priority to CN202110960405.XA
Publication of CN113674786A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G11 — INFORMATION STORAGE
    • G11C — STATIC STORES
    • G11C16/00 — Erasable programmable read-only memories
    • G11C16/02 — Erasable programmable read-only memories electrically programmable
    • G11C16/04 — Erasable programmable read-only memories electrically programmable using variable threshold transistors, e.g. FAMOS
    • G11C16/0483 — comprising cells having several storage transistors connected in series
    • G11C16/06 — Auxiliary circuits, e.g. for writing into memory
    • G11C16/10 — Programming or data input circuits
    • G11C7/00 — Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/18 — Bit line organisation; Bit line lay-out
    • G11C8/00 — Arrangements for selecting an address in a digital store
    • G11C8/14 — Word line organisation; Word line lay-out

Landscapes

  • Engineering & Computer Science (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Read Only Memory (AREA)

Abstract

The invention relates to an in-memory computing unit, comprising: a memory array comprising a plurality of memory cells arranged in N rows and N columns, the cell in the i-th row and j-th column being denoted S_{i,j}; the memory cells in the same column store the same data value; the memory array stores N-bit first data. N word lines input N-bit second data; the control terminals of the memory cells in the same row are connected in series by the same word line. M bit line groups, the k-th group being denoted BL_k, with M equal to 2N-1: when 1 ≤ k ≤ N, the k-th group has k bit lines, connected respectively to the output terminals of the memory cells lying on the same straight line as cells S_{1,k} and S_{k,1}; when N < k ≤ M, the k-th group has 2N-k bit lines, connected respectively to the output terminals of the memory cells lying on the same straight line as cells S_{k-N+1,N} and S_{N,k-N+1}.

Description

In-memory computing unit, module and system
Technical Field
The invention relates to computing and storage integration, in particular to an in-memory computing unit, a module and a system.
Background
Integrating computing with storage is a computing technique that has emerged in recent years. Its aim is to complete data computation inside the memory, avoiding or reducing the movement of data between the memory and the CPU and improving computing efficiency. With the development of Artificial Intelligence (AI) in particular, data volumes and computation loads are growing rapidly, and the traditional von Neumann computer architecture is increasingly challenged. Taking a Convolutional Neural Network (CNN) as an example, after each multiplication the product must first be stored, then fetched to the CPU and added, and this is repeated. This constant shuttling of data between the memory and the CPU consumes a great deal of energy and is very inefficient.
To further improve calculation efficiency, the idea of in-memory computing has been proposed: data computation is completed inside the memory module, without transferring the data to the CPU for operation. In the conventional in-memory computing structure, however, the saturation currents output by the individual memory cells must be merged onto the same output line and then converted into a digital signal to obtain the sum of products. Since the saturation currents output by different memory cells cannot be perfectly matched, a certain error inevitably exists in each, so the current-merging process risks accumulating error, and the cumulative error grows as the number of merged saturation currents increases.
Disclosure of Invention
In view of the foregoing, there is a need to provide an in-memory computing unit, module and system.
An in-memory computing unit comprises: a memory array comprising a plurality of memory cells arranged in N rows and N columns, the cell in the i-th row and j-th column being denoted S_{i,j}; the memory cells in the same column store the same data value; the memory array stores N-bit first data, where N ≥ 1, 1 ≤ i ≤ N and 1 ≤ j ≤ N. N word lines input N-bit second data; the control terminals of the memory cells in the same row are connected in series by the same word line. M bit line groups, the k-th group being denoted BL_k, where M = 2N-1 and 1 ≤ k ≤ M: when 1 ≤ k ≤ N, the k-th group has k bit lines, connected respectively to the output terminals of the memory cells lying on the same straight line as cells S_{1,k} and S_{k,1}; when N < k ≤ M, the k-th group has 2N-k bit lines, connected respectively to the output terminals of the memory cells lying on the same straight line as cells S_{k-N+1,N} and S_{N,k-N+1}.
This in-memory computing unit applies the second data directly to the control terminals of the memory cells through the word lines, stores the first data in the array of memory cells according to a fixed rule, and can complete an N-bit by N-bit binary multiplication in one clock cycle. The operation is completed directly inside the memory module without moving the stored data to the CPU, which reduces data movement, greatly improves operation speed when the workload is large, and lowers power consumption. Moreover, the output terminal of each memory cell is connected to its own bit line; compared with conventional schemes, the currents output by different memory cells need not be merged onto one bit line, which avoids the error-accumulation problem caused by current merging.
In one embodiment, the memory unit includes a nonvolatile memory.
In one embodiment, the non-volatile memory comprises NOR flash memory cells.
In one embodiment, the control terminal of the memory cell comprises a gate of a non-volatile memory; the output of the memory cell includes a drain of the non-volatile memory.
In one embodiment, the first data is binary data, and the nonvolatile memory is used for storing a bit value of 0 or 1; the second data is binary data, and when the voltage on the word line is greater than or equal to a preset voltage, the bit value on the word line is 1; and when the voltage on the word line is smaller than a preset voltage, the bit value on the word line is 0.
In one embodiment, the memory computing unit further includes M-2 bit encoders, where the M-2 bit encoders are connected to the 2 nd to M-1 st bit line groups in a one-to-one correspondence, and the bit encoders are configured to encode output signals of the bit line groups to obtain digital signals.
This in-memory computing unit connects each bit line group to a corresponding bit encoder and encodes the current/voltage signals on the bit lines of the group directly, so no analog-to-digital conversion module is needed for the bit line currents: this saves both the analog-to-digital conversion time and the area of the conversion module. Although extra bit lines and bit encoders are added, the overall area of the in-memory computing unit is reduced and the computing speed is improved.
In one embodiment, the non-volatile memory comprises: a substrate structure comprising a substrate, a substrate dielectric layer and a fully depleted channel layer, wherein a well region is formed in the substrate, the substrate dielectric layer is located on the substrate and covers the well region, and the fully depleted channel layer is located on the substrate dielectric layer; a gate structure located on the upper surface of the fully depleted channel layer; a source located on the upper surface of the fully depleted channel layer on one side of the gate structure; and a drain located on the upper surface of the fully depleted channel layer on the side of the gate structure away from the source; wherein the source and the drain are formed on the upper surface of the fully depleted channel layer by an epitaxial process.
In this in-memory computing unit, each non-volatile memory adopts a semiconductor structure with a fully depleted channel layer and a substrate dielectric layer, which reduces leakage and allows the unit to be applied in AI devices for edge computing. In addition, forming the source and the drain by an epitaxial process increases the saturation current, speeds up reading, and improves computing efficiency.
In one embodiment, the gate structure comprises a gate stack located on the upper surface of the fully depleted channel layer, the stack comprising, from bottom to top, a tunneling dielectric layer, a floating gate, a control dielectric layer and a control gate; and gate sidewall spacers located on two opposite sides of the gate stack.
An in-memory computing module comprising one or more of the in-memory computing units described in the above embodiments.
An in-memory computing system comprising one or more of the in-memory computing modules described in the above embodiments.
The in-memory computing module and system complete data operations directly in the memory array without relying on a CPU, reducing the time and energy spent moving data and improving operation efficiency. Meanwhile, bit encoders replace the conventional analog-to-digital conversion module or sense amplifier; because the path is fully digital, the time spent converting analog signals to digital signals is saved, digital signals are acquired faster, the error accumulation caused by merging currents during analog-to-digital conversion is avoided, and the overall structure is smaller.
Drawings
Fig. 1 is a schematic structural diagram of a memory computing unit according to an embodiment of the present application.
Fig. 2 is an enlarged schematic structural diagram of a partial memory computing unit in a dashed box a of fig. 1.
FIG. 3 is a schematic diagram of the process of multiplying two 8-bit binary numbers.
Fig. 4 is a schematic structural diagram of a memory computing unit according to another embodiment of the present application.
Fig. 5 is a diagram illustrating summing of digital signals on bit line groups of a memory computing unit according to an embodiment of the present application.
Fig. 6 is a schematic cross-sectional view of a nonvolatile memory according to an embodiment of the present application.
Fig. 7 is a schematic diagram illustrating a dot product operation performed between an input data matrix and a Filter according to an embodiment of the present application.
FIG. 8 is a schematic matrix expansion diagram of the operation process shown in FIG. 7.
Fig. 9 is a schematic diagram illustrating a computing process of the memory computing system according to an embodiment of the present application.
Reference numerals: 1. substrate structure; 11. substrate; 12. substrate dielectric layer; 13. fully depleted channel layer; 21. tunneling dielectric layer; 22. floating gate; 23. control dielectric layer; 24. control gate; 25. gate sidewall spacer; 3. source; 4. drain.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In describing positional relationships, unless otherwise specified, when an element such as a layer, film or substrate is referred to as being "on" another layer, it can be directly on the other layer or intervening layers may also be present. Further, when a layer is referred to as being "under" another layer, it can be directly under, or one or more intervening layers may also be present. It will also be understood that when a layer is referred to as being "between" two layers, it can be the only layer between the two layers, or one or more intervening layers may also be present.
Where the terms "comprising," "having," and "including" are used herein, another element may be added unless an explicit limitation is used, such as "only," "consisting of," etc. Unless mentioned to the contrary, terms in the singular may include the plural and are not to be construed as being one in number.
With the development of computers, computing power has continuously improved, but bottlenecks are gradually being reached. In the field of Artificial Intelligence (AI) in particular, the amount of computation has increased sharply, and it is difficult to further increase computing speed with the conventional von Neumann architecture. As a result, many research groups and companies have begun to improve on the traditional computer architecture. One idea is to imitate the human brain: complete both computation and storage within the memory cells, without moving the data in the memory into the CPU for computation and then moving the result back to the memory.
As shown in fig. 1, one embodiment of the present application provides an in-memory computing unit, including: a memory array comprising a plurality of memory cells arranged in N rows and N columns, the cell in the i-th row and j-th column being denoted S_{i,j}; the memory cells in the same column store the same data value; the memory array stores N-bit first data, where N ≥ 1, 1 ≤ i ≤ N and 1 ≤ j ≤ N. N word lines input N-bit second data; the control terminals of the memory cells in the same row are connected in series by the same word line. M bit line groups, where M = 2N-1, the k-th group being denoted BL_k, with 1 ≤ k ≤ M: when 1 ≤ k ≤ N, the k-th group has k bit lines, connected respectively to the output terminals of the memory cells lying on the same straight line as cells S_{1,k} and S_{k,1}; when N < k ≤ M, the k-th group has 2N-k bit lines, connected respectively to the output terminals of the memory cells lying on the same straight line as cells S_{k-N+1,N} and S_{N,k-N+1}.
As an example, for the arrangement of rows in the memory array, the uppermost row may be the first row and the lowermost row the N-th row, counting from top to bottom. For the arrangement of columns, the rightmost column may be the first column and the leftmost column the N-th column, counting from right to left. In other embodiments, the rows and columns may be defined in other ways; this application does not limit the choice. N may be any positive integer, for example 3, 5, 8 or 10; in this embodiment, N is 8.
In FIG. 1, N is 8. The memory cell in the upper right corner of the memory array is denoted S_{1,1}, and the cell in the lower left corner is S_{8,8}. The memory cells in each column store the same bit value: the cells in the first column store data W0, the cells in the second column store data W1, ..., and the cells in the eighth column store data W7. The first data stored in the memory array is therefore W = [W7, W6, W5, W4, W3, W2, W1, W0]. Each word line inputs one bit value: the control terminals of the cells in the first row are all connected to the first word line, which carries input data D0; the control terminals of the cells in the second row are all connected to the second word line, which carries D1; ...; the control terminals of the cells in the eighth row are all connected to the eighth word line, which carries D7. The second data input to the memory array by the 8 word lines is D = [D7, D6, D5, D4, D3, D2, D1, D0].
As an example, when the voltage on the word line is greater than or equal to the preset voltage, the bit value on the word line is 1; when the voltage on the word line is less than the preset voltage, the bit value on the word line is 0. For example, when the voltage on the first word line is greater than or equal to the preset voltage, D0 is 1; when the voltage on the first word line is less than the predetermined voltage, D0 is 0. The preset voltage may be a threshold voltage of the memory cell.
The connection relationship of the bit lines is described in two parts. In the first part, when k is greater than or equal to 1 and less than or equal to 8, the output terminals of the memory cells lying on the same straight line as cells S_{1,k} and S_{k,1} are connected one-to-one to the k bit lines of bit line group BL_k. For example, when k equals 1, bit line group BL_1 has only one bit line, connected to the output terminal of cell S_{1,1}; when k equals 2, bit line group BL_2 has two bit lines, connected respectively to the output terminals of cells S_{2,1} and S_{1,2}; when k equals 3, bit line group BL_3 has three bit lines, connected respectively to the output terminals of cells S_{3,1}, S_{2,2} and S_{1,3}. In the second part, when k is greater than 8 and less than or equal to 15, the output terminals of the memory cells lying on the same straight line as cells S_{k-7,8} and S_{8,k-7} are connected one-to-one to the 2N-k bit lines of bit line group BL_k. For example, when k equals 15, bit line group BL_15 has only one bit line, connected to the output terminal of cell S_{8,8}; when k equals 14, bit line group BL_14 has two bit lines, connected respectively to the output terminals of cells S_{7,8} and S_{8,7}; when k equals 13, bit line group BL_13 has three bit lines, connected respectively to the output terminals of cells S_{6,8}, S_{7,7} and S_{8,6}. An enlarged schematic of the connection between bit line group BL_13 and the output terminals of cells S_{6,8}, S_{7,7} and S_{8,6} is shown in fig. 2.
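The grouping rule above assigns each cell to the bit line group whose index equals i + j - 1, i.e. the cells of one group lie on one diagonal line of the array. A small sketch (an illustrative helper, not part of the patent) enumerates which cells feed each group:

```python
def bit_line_groups(n):
    """For an n x n memory array, return a dict mapping each bit line
    group index k (1..2n-1) to the (row, col) cells whose output
    terminals connect to that group's bit lines."""
    groups = {}
    for k in range(1, 2 * n):
        # Cells S(i, j) with i + j - 1 == k lie on one straight line,
        # bounded by S(1, k)/S(k, 1) when k <= n and by
        # S(k-n+1, n)/S(n, k-n+1) when k > n.
        groups[k] = [(i, k + 1 - i)
                     for i in range(max(1, k + 1 - n), min(n, k) + 1)]
    return groups

g = bit_line_groups(8)
print(g[3])   # [(1, 3), (2, 2), (3, 1)] -- the cells of BL_3 in the text
print(g[13])  # [(6, 8), (7, 7), (8, 6)] -- the cells of BL_13 in the text
```

For k ≤ n the group holds k cells and for k > n it holds 2n - k cells, matching the bit line counts in the embodiment.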
This in-memory computing unit can complete a binary multiplication of two 8-bit numbers. The input data is the second data D = [D7, D6, D5, D4, D3, D2, D1, D0], and the stored data is the first data W = [W7, W6, W5, W4, W3, W2, W1, W0]. The binary multiplication of D and W is shown in fig. 3. Multiplying two 8-bit numbers yields a 15-bit product P = [P14, P13, P12, P11, P10, P9, P8, P7, P6, P5, P4, P3, P2, P1, P0]. Each bit of the product P corresponds to one of the 15 bit line groups [BL_15, BL_14, BL_13, BL_12, BL_11, BL_10, BL_9, BL_8, BL_7, BL_6, BL_5, BL_4, BL_3, BL_2, BL_1] in fig. 1. In this embodiment, the maximum value of both D and W is 255, and the maximum value of P is 65025.
Wherein, the calculation logic of the single storage unit is as follows:
when a memory cell stores data 1 and the data on the word line connected to the gate of the memory cell is also 1, the memory cell is turned on, generating a saturation current. The saturation current represents that the product result is 1, i.e., 1 × 1 ═ 1.
When a memory cell stores data 0 and data on a word line connected to a gate of the memory cell is 1, the memory cell is not turned on and cannot generate a saturation current. The result of the multiplication is 0, i.e., 0 × 1 ═ 0.
When a memory cell stores data 1 and data on a word line connected to a gate of the memory cell is 0, the memory cell is not turned on and cannot generate a saturation current. The result of the multiplication is 0, i.e., 1 × 0 ═ 0.
Based on the above operation logic, when the binary data W and the binary data D are binary multiplied by the memory computing unit, the number of bit lines having saturation current in each bit line group is the value of the digital signal that can be output by the bit line group.
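The cell logic above is a bitwise AND, and the count of conducting bit lines in group k is exactly the k-th column sum of the partial-product array in a long multiplication. A functional sketch (hypothetical model, using a 4 × 4 array for brevity):

```python
def group_counts(d, w, n):
    """Count the conducting bit lines in each group when the array stores
    the bits of w (one bit per column) and the word lines carry the bits
    of d (one bit per row): cell S(i, j) conducts iff D[i-1] and W[j-1]
    are both 1."""
    counts = {}
    for k in range(1, 2 * n):
        cells = [(i, k + 1 - i)
                 for i in range(max(1, k + 1 - n), min(n, k) + 1)]
        counts[k] = sum(((d >> (i - 1)) & 1) & ((w >> (j - 1)) & 1)
                        for i, j in cells)
    return counts

# Small hypothetical example: D = 0b1011, W = 0b0110 on a 4x4 array
c = group_counts(0b1011, 0b0110, 4)
print(c)  # {1: 0, 2: 1, 3: 2, 4: 1, 5: 1, 6: 1, 7: 0}
```

Weighting each count by 2^(k-1) and summing recovers D × W, which is exactly what the shift-and-add stage described later performs.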
This in-memory computing unit applies the second data directly to the control terminals of the memory cells through the word lines, stores the first data in the array of memory cells according to a fixed rule, and can complete an N-bit by N-bit binary multiplication in one clock cycle. The operation is completed directly inside the memory module without moving the stored data to the CPU, which reduces data movement, greatly improves operation speed when the workload is large, and lowers power consumption. Moreover, the output terminal of each memory cell is connected to its own bit line; compared with conventional schemes, the currents output by different memory cells need not be merged onto one bit line, which avoids the error-accumulation problem caused by current merging.
In one embodiment, the memory unit may be a non-volatile memory that can hold data without connecting to an external power source. As an example, the control terminal of the memory cell may be a gate of the non-volatile memory, and the output terminal of the memory cell may be a drain of the non-volatile memory. Alternatively, the memory cells in the array may also be charge storing memory cells, such as floating gate cells or dielectric charge trapping cells, having drains coupled to corresponding bit lines, and sources coupled to ground. Other types of memory cells may be used in other embodiments, including but not limited to many types of programmable resistive memory cells, such as phase change based memory cells, magnetoresistive based memory cells, metal oxide based memory cells, or other cells.
In one embodiment, the memory cells may be NOR flash memory cells. Such as bulk silicon technology floating gate NOR flash memory cells, fully depleted silicon-on-insulator (FDSOI) technology floating gate NOR flash memory cells. The NOR flash memory cell has a gate connected to a word line, a drain connected to a bit line, and a source and a back electrode grounded.
In one embodiment, the memory computing unit further includes M-2 bit encoders, the M-2 bit encoders are connected to the 2 nd to M-1 st bit line groups in a one-to-one correspondence, and the bit encoders are configured to encode output signals of the bit line groups to obtain digital signals.
As an example, referring to fig. 4, the in-memory computing unit includes 13 bit encoders connected one-to-one to bit line groups BL_2 through BL_14. When a memory cell turns on, a saturation current flows through the bit line connected to it, and the voltage on that bit line changes from low to high. In this embodiment, bit line group BL_2 has two bit lines connected between the 2to2 bit encoder and two memory cells; BL_2 can deliver up to 2 high-level signals to the encoder, and the 2to2 encoder encodes two high-level signals as the BCD code 10. Bit line group BL_3 has three bit lines connected between the 3to2 bit encoder and three memory cells; BL_3 can deliver up to 3 high-level signals, and the 3to2 encoder encodes three high-level signals as the BCD code 11. Bit line group BL_4 can deliver up to 4 high-level signals, and the 4to3 encoder encodes four high-level signals as the BCD code 100. In summary, each bit encoder encodes the high-level signals carried by the bit lines connected to it into a BCD code. For bit line groups BL_1 and BL_15, each has only one bit line, so the signal delivered is already digital, either 0 (low) or 1 (high), and no bit encoder is needed.
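Functionally, each bit encoder is a counter: it outputs the number of high lines in its group as a short binary word. A minimal sketch (illustrative only; for these small counts the codes the patent calls BCD coincide with plain binary):

```python
def bit_encoder(lines):
    """Encode how many bit lines in a group are high as a binary word.
    A '4to3' encoder maps up to 4 high lines onto a 3-bit code, etc."""
    count = sum(lines)
    # Output width: enough bits to represent len(lines) high lines.
    width = max(1, len(lines).bit_length())
    return format(count, "0{}b".format(width))

print(bit_encoder([1, 1]))        # '10'  (2to2 encoder, two lines high)
print(bit_encoder([1, 1, 1]))     # '11'  (3to2 encoder, three lines high)
print(bit_encoder([1, 0, 1, 1]))  # '011' (4to3 encoder, three of four high)
```

In hardware this is combinational logic (a population-count circuit), which is why the whole read-out path can stay digital.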
In this embodiment, the bit line signals in each bit line group are converted into digital signals by the bit encoders, and multiple saturation currents need not be fed into the same bit line, which avoids the error accumulation caused by current merging. Furthermore, because the bit encoders encode the bit line signals of each group directly into digital signals, the N-bit by N-bit binary multiplication can be completed in one cycle using only digital and combinational logic circuits.
After the bit encoder outputs the digital signals, the digital signals are subjected to shift addition to obtain the final product sum, as shown in fig. 5. As an example, each digital signal may be added using an adder.
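Since bit line group BL_k carries the partial-product column of weight 2^(k-1), the shift-addition of fig. 5 reduces to a weighted sum of the encoder outputs. A hypothetical sketch of that final step:

```python
def shift_add(counts):
    """Combine the per-group digital signals into the final product: the
    count from group k is shifted left by k-1 bits and all terms added."""
    return sum(cnt << (k - 1) for k, cnt in counts.items())

# Counts for D = 0b101, W = 0b011 on a hypothetical 3x3 array:
counts = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0}
print(shift_add(counts))  # 15, i.e. 0b101 * 0b011 = 5 * 3
```

In hardware each term feeds an adder tree, as the document notes.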
In one embodiment, as shown in FIG. 6, the non-volatile memory comprises: a substrate structure comprising a substrate, a substrate dielectric layer and a fully depleted channel layer, wherein a well region is formed in the substrate, the substrate dielectric layer is located on the substrate and covers the well region, and the fully depleted channel layer is located on the substrate dielectric layer; a gate structure located on the upper surface of the fully depleted channel layer; a source located on the upper surface of the fully depleted channel layer on one side of the gate structure; and a drain located on the upper surface of the fully depleted channel layer on the side of the gate structure away from the source; wherein the source and the drain are formed on the upper surface of the fully depleted channel layer by an epitaxial process.
By arranging the substrate dielectric layer 12 between the substrate 11 and the fully depleted channel layer 13, the electron channel between the source 3 and the drain 4 is confined to the fully depleted channel layer 13, preventing electron transfer between the source 3 and the drain 4 through the well region and thus greatly reducing leakage current. Combining the fully depleted channel layer 13 with the substrate dielectric layer 12 confines the saturation-current channel to the fully depleted channel layer 13 when the semiconductor structure conducts, greatly improving the uniformity of the semiconductor structure and reducing the variability between different semiconductor structures. In addition, the source 3 and the drain 4 may be formed on the upper surface of the fully depleted channel layer 13 by an epitaxial process to obtain an epitaxial source and drain, which greatly increases the saturation current in the channel when the transistor is on and increases the switching speed of the transistor.
In one embodiment, with continued reference to fig. 6, the gate structure comprises a gate stack located on the upper surface of the fully depleted channel layer 13, the stack comprising, from bottom to top, a tunneling dielectric layer 21, a floating gate 22, a control dielectric layer 23 and a control gate 24; and gate sidewall spacers 25 located on two opposite sides of the gate stack. As an example, the non-volatile memory shown in fig. 6 may be a floating-gate NOR flash memory cell in an FDSOI process.
In one embodiment, the present application further discloses an in-memory computing module, which includes one or more in-memory computing units in the above embodiments.
Each in-memory computing unit can complete one N-bit by N-bit binary operation, so each in-memory computing module can complete one or more N-bit by N-bit binary operations simultaneously. The in-memory computing module may serve as a filter for generating a feature map in a convolutional neural network (CNN) computation; that is, the values written into the in-memory computing module in advance are the values of the elements of the filter. Taking a CNN architecture for image recognition as an example, in the feature-map calculation of the first layer, each entry of the input data matrix may represent a black-and-white pixel of an image, and the value of each pixel has L bits, where L may be any positive integer, such as 5, 8, 12, or 16. In this embodiment, L is 8 and the input data matrix is a 5 × 5 matrix. The filter is also a 5 × 5 matrix, and each element of the filter is also an 8-bit binary number.
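As an illustrative sketch only (Python and the function name are ours; the patent describes hardware, not software), the feature-map value produced at one filter position is the dot product of the input window and the filter:

```python
def feature_map_element(D, W):
    """Dot product of an input window D and a filter W of the same shape.

    In the module described above, each product D[i][j] * W[i][j] is
    computed by one in-memory computing unit, and all 25 products (for a
    5 x 5 filter) are produced in the same clock cycle.
    """
    return sum(d * w
               for d_row, w_row in zip(D, W)
               for d, w in zip(d_row, w_row))

# A 2 x 2 window, for brevity (the text uses 5 x 5 windows of 8-bit values).
print(feature_map_element([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # → 70
```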
A schematic diagram of the dot product of the filter and the input data matrix is shown in FIG. 7, where Wij is a filter value and Dij is an input value, with i = 0, 1, 2, 3, 4 and j = 0, 1, 2, 3, 4. As can be seen from the foregoing, each product Dij × Wij requires one in-memory computing unit. For example, D00 × W00, D01 × W01, D02 × W02, D03 × W03, and D04 × W04 are each an 8-bit by 8-bit multiplication, where
D00=[D00[0],D00[1],D00[2],D00[3],D00[4],D00[5],D00[6],D00[7]]
W00=[W00[0],W00[1],W00[2],W00[3],W00[4],W00[5],W00[6],W00[7]]
To compute D00 × W00, W00 may first be written into the first memory array, and D00 then applied to the first memory array through the input lines; the computation of D00 × W00 is completed in one clock cycle. A total of 25 such 8-bit by 8-bit computations are required, namely D00 × W00, D01 × W01, D02 × W02, D03 × W03, ..., D43 × W43, D44 × W44. Therefore, 25 in-memory computing units can be arranged to operate simultaneously, each completing one 8-bit by 8-bit computation, so that one dot product operation is completed in one clock cycle.
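The single-cycle 8-bit by 8-bit multiply can be understood as summing single-bit partial products grouped by significance. The sketch below is our reading, with hypothetical names; the grouping by p + q mirrors the M = 2N − 1 bit line groups of the unit:

```python
def bitwise_multiply(d, w, n=8):
    """Multiply two n-bit values as the unit does conceptually.

    Each memory cell ANDs one input bit d[p] with one stored bit w[q];
    crossings of equal significance p + q share a bit line group, so
    there are 2n - 1 groups. Each group's count is weighted by 2^(p+q)
    and the weighted counts are summed.
    """
    d_bits = [(d >> p) & 1 for p in range(n)]  # d_bits[0] is the LSB
    w_bits = [(w >> q) & 1 for q in range(n)]
    groups = [0] * (2 * n - 1)                 # M = 2N - 1 groups
    for p in range(n):
        for q in range(n):
            groups[p + q] += d_bits[p] & w_bits[q]
    return sum(count << s for s, count in enumerate(groups))

print(bitwise_multiply(200, 123))  # → 24600, equal to 200 * 123
```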
For ease of understanding, the equation in FIG. 7 may be expanded in vector and matrix form, as shown in FIG. 8, where the left column matrix represents the data D00 to D44 input through the word lines to the in-memory computing units. The column matrix has 200 rows and 1 column, and every 8 rows represent one 8-bit input value; for example, D00[0] to D00[7] represent the data D00.
The data matrix on the right side of FIG. 8 may represent the memory arrays W00 to W44. The data matrix has 200 rows and 8 columns, and every 8 rows represent one 8-bit stored value, the 8-bit binary data W00 to W44 being arranged in sequence from top to bottom. As an example, the first 8 rows represent the stored data W00. Specifically, within the first 8 rows the elements of each column share the same value: the first-column elements are all W00[7], the second-column elements are all W00[6], ..., and the eighth-column elements are all W00[0].
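Read this way, each 8-row block of the grid pairs every input bit with every stored bit of one value; weighting each crossing by 2^(p+q) and summing recovers the dot product. A hedged sketch (names and the column ordering are our own interpretation of FIG. 8):

```python
def to_bits(x, n=8):
    """Bits of x, least significant first: to_bits(x)[p] is bit p of x."""
    return [(x >> p) & 1 for p in range(n)]

def module_dot_product(d_values, w_values, n=8):
    """Sum of products as implied by the FIG. 8 expansion.

    Each 8-row block pairs input bits D[p] (rows) with stored bits W[q]
    (columns, W[7] in the first column down to W[0] in the last); each
    crossing contributes (D[p] AND W[q]) with weight 2^(p+q).
    """
    total = 0
    for d, w in zip(d_values, w_values):
        d_bits, w_bits = to_bits(d, n), to_bits(w, n)
        for p in range(n):
            for q in range(n):
                total += (d_bits[p] & w_bits[q]) << (p + q)
    return total

# Two values instead of 25, for brevity: 3*7 + 5*9 = 66.
print(module_dot_product([3, 5], [7, 9]))  # → 66
```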
With its 25 in-memory computing units, the in-memory computing module can complete one dot product operation in one clock cycle.
An embodiment of the present application further discloses an in-memory computing system, which includes one or more in-memory computing modules described in the above embodiments.
If an in-memory computing module is used as a filter, the in-memory computing system comprises one or more filters. In a CNN network architecture, the feature-map calculation of each layer may use multiple filters; taking K filters as an example, the in-memory computing system includes K in-memory computing modules.
As an example, each filter is an N × N rectangular data array, each datum is an L-bit binary value, and there are K filters in total. A schematic diagram of the calculation of one feature-map layer in the CNN network architecture is shown in FIG. 9.
Taking N = 5, K = 32, and L = 8 as an example, the structure of FIG. 9 can perform 800 (i.e., 5 × 5 × 32) multiplications of 8-bit data, and almost the same number of additions, in one clock cycle, so the computing power of the in-memory computing system in this embodiment is 1600 operations (ops) per clock cycle. Because the time taken for a signal to pass through the NOR cells, bit encoders, and adders is extremely short, the clock of the in-memory computing system architecture described above can reach GHz levels. That is, a chip with an area of about 51200 (i.e., 800 × 64) NOR cells can provide 1.6 TOPS (tera operations per second) of computing power, which is a very advanced structure.
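The throughput figures quoted above can be checked arithmetically (a sketch; the GHz clock is the embodiment's own assumption, and the variable names are ours):

```python
# Check of the embodiment's throughput arithmetic (N = 5, K = 32, L = 8).
filters = 32
products_per_filter = 5 * 5                    # 25 units, one 8x8 multiply each
multiplies = filters * products_per_filter     # 800 multiplications per cycle
additions = multiplies                         # "almost the same amount"
ops_per_cycle = multiplies + additions         # 1600 ops per cycle
nor_cells = multiplies * 64                    # each 8x8 unit holds 64 cells
clock_hz = 1e9                                 # GHz-class clock (assumed)
tops = ops_per_cycle * clock_hz / 1e12         # tera operations per second
print(multiplies, ops_per_cycle, nor_cells, tops)  # → 800 1600 51200 1.6
```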
Compared with conventional in-memory computing schemes, although laying out an independent bit line for each memory cell and connecting the bit lines to the encoders increases the area of the memory array to some extent, the in-memory computing unit as a whole omits the comparatively large analog-to-digital conversion module or sense amplifier, and likewise omits the analog-to-digital conversion time. The in-memory computing unit of the present application therefore reduces area, improves speed, and reduces power consumption. An in-memory computing module or in-memory computing system composed of such in-memory computing units does not need to move data back and forth frequently, which greatly improves data processing speed, greatly reduces power consumption, and enables edge computing on edge devices.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An in-memory computing unit, comprising:
the storage array comprises a plurality of storage units arranged in N rows and N columns, the storage unit in the ith row and jth column being denoted Si,j; the data values stored in storage units in the same column are the same; the storage array is used for storing first data of N bits; wherein N is greater than or equal to 1, i is greater than or equal to 1 and less than or equal to N, and j is greater than or equal to 1 and less than or equal to N;
N word lines for inputting second data of N bits; the control terminals of the storage units in the same row are connected in sequence through the same word line;
M bit line groups, the kth group of bit lines being denoted bit line group BLk, where M is equal to 2N-1 and k is greater than or equal to 1 and less than or equal to M;
when k is greater than or equal to 1 and less than or equal to N, the kth group of bit lines has k bit lines, the k bit lines being respectively connected to the output terminals of the storage units lying on the same straight line through the storage units S1,k and Sk,1;
when k is greater than N and less than or equal to M, the kth group of bit lines has 2N-k bit lines, the 2N-k bit lines being respectively connected to the output terminals of the storage units lying on the same straight line through the storage units Sk-N+1,N and SN,k-N+1.
2. The in-memory computing unit of claim 1, wherein the storage unit comprises a non-volatile memory.
3. The in-memory computing unit of claim 2, wherein the non-volatile memory comprises NOR flash memory cells.
4. The in-memory computing unit of claim 2, wherein the control terminal of the storage unit comprises the gate of the non-volatile memory, and the output terminal of the storage unit comprises the drain of the non-volatile memory.
5. The in-memory computing unit of claim 2, wherein the first data is binary data, and the non-volatile memory is configured to store a bit value of 0 or 1;
the second data is binary data, and when the voltage on the word line is greater than or equal to a preset voltage, the bit value on the word line is 1; and when the voltage on the word line is smaller than the preset voltage, the bit value on the word line is 0.
6. The in-memory computing unit of any of claims 1-5, further comprising:
M-2 bit encoders connected in one-to-one correspondence with the 2nd to (M-1)th bit line groups, the bit encoders being used to encode the output signals of the bit line groups to obtain digital signals.
7. The in-memory computing unit of claim 2, wherein the non-volatile memory comprises:
a substrate assembly comprising a substrate, a substrate dielectric layer, and a fully depleted channel layer; wherein a well region is formed in the substrate; the substrate dielectric layer is located on the substrate and covers the well region; and the fully depleted channel layer is located on the substrate dielectric layer;
a gate structure located on the upper surface of the fully depleted channel layer;
a source electrode located on the upper surface of the fully depleted channel layer on one side of the gate structure;
a drain electrode located on the upper surface of the fully depleted channel layer on the side of the gate structure away from the source electrode;
wherein the source electrode and the drain electrode are formed on the upper surface of the fully depleted channel layer by an epitaxial process.
8. The in-memory computing unit of claim 7, wherein the gate structure comprises:
a gate stack structure located on the upper surface of the fully depleted channel layer, the gate stack structure comprising a tunneling dielectric layer, a floating gate, a control dielectric layer, and a control gate stacked in sequence from bottom to top;
and gate sidewalls located on two opposite sides of the gate stack structure.
9. An in-memory computing module comprising one or more in-memory computing units as claimed in any one of claims 1 to 8.
10. An in-memory computing system comprising one or more in-memory computing modules of claim 9.
CN202110960405.XA 2021-08-20 2021-08-20 In-memory computing unit, module and system Pending CN113674786A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110960405.XA CN113674786A (en) 2021-08-20 2021-08-20 In-memory computing unit, module and system


Publications (1)

Publication Number Publication Date
CN113674786A true CN113674786A (en) 2021-11-19


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08236650A (en) * 1994-12-26 1996-09-13 Nippon Steel Corp Nonvolatile semiconductor memory and writing method therefor
US20130229868A1 (en) * 2012-03-02 2013-09-05 Pao-Ling Koh Saving of Data in Cases of Word-Line to Word-Line Short in Memory Arrays
CN111128279A (en) * 2020-02-25 2020-05-08 杭州知存智能科技有限公司 Memory computing chip based on NAND Flash and control method thereof
CN211016545U (en) * 2020-02-25 2020-07-14 杭州知存智能科技有限公司 Memory computing chip based on NAND Flash, memory device and terminal
CN111816233A (en) * 2020-07-30 2020-10-23 中科院微电子研究所南京智能技术研究院 In-memory computing unit and array


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Daichi Fujiki et al., "In-Memory Data Parallel Processor", ACM, https://doi.org/10.1145/3173162.3173171 *
Yang Zhanti et al., "In-Memory Computing Architecture Based on Advanced FDSOI SRAM for Fast and Low-Power CNN Processing", Micro-Nanoelectronics and Intelligent Manufacturing, vol. 3, no. 1, pages 0-4 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220913

Address after: 510000 building a, 136 Kaiyuan Avenue, Guangzhou Development Zone, Guangzhou City, Guangdong Province

Applicant after: Guangdong Dawan District integrated circuit and System Application Research Institute

Applicant after: Ruili flat core Microelectronics (Guangzhou) Co.,Ltd.

Address before: 510000 building a, 136 Kaiyuan Avenue, Guangzhou Development Zone, Guangzhou City, Guangdong Province

Applicant before: Guangdong Dawan District integrated circuit and System Application Research Institute

Applicant before: AoXin integrated circuit technology (Guangdong) Co.,Ltd.