Disclosure of Invention
In view of the foregoing, there is a need to provide an in-memory computing unit, an in-memory computing module, and an in-memory computing system.
An in-memory computing unit comprises: a memory array comprising a plurality of memory cells arranged in N rows and N columns, the memory cell in the i-th row and j-th column being denoted S(i,j), wherein the memory cells in the same column store the same data value, the memory array is configured to store N-bit first data, N ≥ 1, 1 ≤ i ≤ N, and 1 ≤ j ≤ N; N word lines for inputting N-bit second data, wherein the control terminals of the memory cells in the same row are connected in series through the same word line; and M bit line groups, wherein the k-th group is denoted bit line group BL(k), M = 2N − 1, and 1 ≤ k ≤ M. When 1 ≤ k ≤ N, the k-th bit line group has k bit lines, connected respectively to the output terminals of the memory cells lying on the straight line from memory cell S(1,k) to memory cell S(k,1); when N < k ≤ M, the k-th bit line group has 2N − k bit lines, connected respectively to the output terminals of the memory cells lying on the straight line from memory cell S(k−N+1,N) to memory cell S(N,k−N+1).
The in-memory computing unit applies the second data directly to the control terminals of the memory cells through the word lines, stores the first data in the array-arranged memory cells according to a fixed rule, and can complete an N-bit by N-bit binary multiplication in one clock cycle. The operation is completed directly in the memory module without moving the stored data into a CPU, which reduces data movement, greatly increases operation speed under heavy workloads, and reduces power consumption. Moreover, the output terminal of each memory cell is connected to its own independent bit line; compared with conventional schemes, the currents output by different memory cells need not converge on one bit line, which avoids the error accumulation caused by current summation.
In one embodiment, the memory unit includes a nonvolatile memory.
In one embodiment, the non-volatile memory comprises NOR flash memory cells.
In one embodiment, the control terminal of the memory cell comprises a gate of a non-volatile memory; the output of the memory cell includes a drain of the non-volatile memory.
In one embodiment, the first data is binary data, and the nonvolatile memory is used for storing a bit value of 0 or 1; the second data is binary data, and when the voltage on the word line is greater than or equal to a preset voltage, the bit value on the word line is 1; and when the voltage on the word line is smaller than a preset voltage, the bit value on the word line is 0.
In one embodiment, the in-memory computing unit further includes M − 2 bit encoders, connected in one-to-one correspondence to the 2nd through (M−1)-th bit line groups, the bit encoders being configured to encode the output signals of the bit line groups into digital signals.
The in-memory computing unit connects each bit line group to a corresponding bit encoder, which encodes the current and voltage signals on the bit lines of that group. No analog-to-digital conversion module is needed to convert the bit line currents, which saves both the conversion time and the chip area such a module would occupy. Although the number of bit lines and bit encoders increases, the overall area of the in-memory computing unit is reduced and the computing speed is improved.
In one embodiment, the non-volatile memory comprises: a substrate, a substrate dielectric layer, and a fully depleted channel layer, wherein a well region is formed in the substrate, the substrate dielectric layer is located on the substrate and covers the well region, and the fully depleted channel layer is located on the substrate dielectric layer; a gate structure located on the upper surface of the fully depleted channel layer; a source located on the upper surface of the fully depleted channel layer on one side of the gate structure; and a drain located on the upper surface of the fully depleted channel layer on the side of the gate structure away from the source; wherein the source and the drain are formed on the upper surface of the fully depleted channel layer by an epitaxial process.
In this in-memory computing unit, each nonvolatile memory adopts a semiconductor structure with a fully depleted channel layer and a substrate dielectric layer, which reduces leakage, so the unit can be applied to AI devices for edge computing. In addition, forming the source and the drain by an epitaxial process increases the saturation current, which speeds up reading and improves computing efficiency.
In one embodiment, the gate structure comprises a gate stack located on the upper surface of the fully depleted channel layer, the gate stack comprising a tunneling dielectric layer, a floating gate, a control dielectric layer, and a control gate stacked in sequence from bottom to top; and gate sidewalls located on two opposite sides of the gate stack.
An in-memory computing module comprising one or more of the in-memory computing units described in the above embodiments.
An in-memory computing system comprising one or more of the in-memory computing modules described in the above embodiments.
The in-memory computing module and in-memory computing system can complete data operations directly in the memory array without relying on a CPU, which reduces the time and energy spent moving data and improves operational efficiency. Meanwhile, bit encoders replace the traditional analog-to-digital conversion module or sense amplifier, so the circuit is entirely digital: the time consumed converting analog signals to digital signals is saved, the acquisition of digital signals is accelerated, the error accumulation caused by current summation during analog-to-digital conversion is avoided, and the overall size of the structure is reduced.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In describing positional relationships, unless otherwise specified, when an element such as a layer, film or substrate is referred to as being "on" another layer, it can be directly on the other layer or intervening layers may also be present. Further, when a layer is referred to as being "under" another layer, it can be directly under, or one or more intervening layers may also be present. It will also be understood that when a layer is referred to as being "between" two layers, it can be the only layer between the two layers, or one or more intervening layers may also be present.
Where the terms "comprising," "having," and "including" are used herein, another element may be added unless an explicit limitation such as "only" or "consisting of" is used. Unless mentioned to the contrary, terms in the singular may include the plural and are not to be construed as limited to one in number.
With the development of computers, computing power has continuously improved, but bottlenecks have gradually emerged. In particular, in the field of Artificial Intelligence (AI), the amount of computation has increased sharply, and it is difficult to further increase computing speed within the conventional von Neumann architecture. As a result, many research groups and companies have begun to improve upon traditional computer architectures. One idea is to imitate the human brain: complete both the computing function and the storage function in the memory itself, without moving the stored data into a CPU for computation and then moving the result back to memory.
As shown in FIG. 1, one embodiment of the present application provides an in-memory computing unit, comprising: a memory array comprising a plurality of memory cells arranged in N rows and N columns, the memory cell in the i-th row and j-th column being denoted S(i,j), wherein the memory cells in the same column store the same data value, the memory array is configured to store N-bit first data, N ≥ 1, 1 ≤ i ≤ N, and 1 ≤ j ≤ N; N word lines for inputting N-bit second data, wherein the control terminals of the memory cells in the same row are connected in series through the same word line; and M bit line groups, wherein M = 2N − 1, the k-th group is denoted bit line group BL(k), and 1 ≤ k ≤ M. When 1 ≤ k ≤ N, the k-th bit line group has k bit lines, connected respectively to the output terminals of the memory cells lying on the straight line from memory cell S(1,k) to memory cell S(k,1); when N < k ≤ M, the k-th bit line group has 2N − k bit lines, connected respectively to the output terminals of the memory cells lying on the straight line from memory cell S(k−N+1,N) to memory cell S(N,k−N+1).
As an example, for the arrangement of rows in the memory array, the uppermost row may be the first row and the lowermost row the N-th row, counting from top to bottom. For the arrangement of columns, the rightmost column may be the first column and the leftmost column the N-th column, counting from right to left. In other embodiments, the rows and columns may be defined in other ways, which is not limited in this application. N may be any positive integer, for example 3, 5, 8, or 10; in this embodiment, N is 8.
In FIG. 1, N is 8; the memory cell in the upper right corner of the memory array is denoted S(1,1), and the memory cell in the lower left corner is S(8,8). The memory cells in each column store the same bit value: as an example, the memory cells in the first column store data W0, those in the second column store data W1, ..., and those in the eighth column store data W7. Therefore, the first data stored in the memory array is W = [W7, W6, W5, W4, W3, W2, W1, W0]. Each word line inputs one bit value: the control terminals of the memory cells in the first row are all connected to the first word line, which carries input data D0; the control terminals of the memory cells in the second row are all connected to the second word line, which carries input data D1; ...; the control terminals of the memory cells in the eighth row are all connected to the eighth word line, which carries input data D7. The second data input to the memory array by the 8 word lines is D = [D7, D6, D5, D4, D3, D2, D1, D0].
As an example, when the voltage on the word line is greater than or equal to the preset voltage, the bit value on the word line is 1; when the voltage on the word line is less than the preset voltage, the bit value on the word line is 0. For example, when the voltage on the first word line is greater than or equal to the preset voltage, D0 is 1; when the voltage on the first word line is less than the predetermined voltage, D0 is 0. The preset voltage may be a threshold voltage of the memory cell.
The connection relationships of the bit lines are described in two parts. In the first part, when k is greater than or equal to 1 and less than or equal to 8, the output terminals of the memory cells lying on the straight line from S(1,k) to S(k,1) are connected one-to-one to the k bit lines in bit line group BL(k). For example, when k = 1, bit line group BL(1) has only one bit line, connected to the output terminal of memory cell S(1,1); when k = 2, bit line group BL(2) has two bit lines, connected respectively to the output terminals of memory cells S(2,1) and S(1,2); when k = 3, bit line group BL(3) has three bit lines, connected respectively to the output terminals of memory cells S(3,1), S(2,2), and S(1,3). In the second part, when k is greater than 8 and less than or equal to 15, the output terminals of the memory cells lying on the straight line from S(k−7,8) to S(8,k−7) are connected one-to-one to the 2N − k bit lines in bit line group BL(k). For example, when k = 15, bit line group BL(15) has only one bit line, connected to the output terminal of memory cell S(8,8); when k = 14, bit line group BL(14) has two bit lines, connected respectively to the output terminals of memory cells S(7,8) and S(8,7); when k = 13, bit line group BL(13) has three bit lines, connected respectively to the output terminals of memory cells S(6,8), S(7,7), and S(8,6). An enlarged schematic diagram of the connection between bit line group BL(13) and the output terminals of memory cells S(6,8), S(7,7), and S(8,6) is shown in FIG. 2.
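The grouping rule described above can be stated compactly: memory cell S(i,j) lies on the anti-diagonal whose bit line group index is k = i + j − 1. A minimal sketch of this mapping (the function names are illustrative, not part of the disclosed design):

```python
def bit_line_group(i, j):
    """Bit line group index k for memory cell S(i, j), 1-indexed."""
    return i + j - 1

def group_size(k, n=8):
    """Number of bit lines in group BL(k) for an n x n array."""
    return k if k <= n else 2 * n - k

# Every cell on the same anti-diagonal maps to the same group,
# and the group sizes match the two-part description above.
n = 8
cells = {k: [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)
             if bit_line_group(i, j) == k] for k in range(1, 2 * n)}
for k, members in cells.items():
    assert len(members) == group_size(k, n)
```

Summed over all 15 groups, the bit line count equals the 64 cells of the array, confirming that each cell output reaches exactly one independent bit line.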
This in-memory computing unit can complete a binary multiplication of two 8-bit values. The input data is the second data D = [D7, D6, D5, D4, D3, D2, D1, D0], and the stored data is the first data W = [W7, W6, W5, W4, W3, W2, W1, W0]. The binary multiplication of D and W is shown in FIG. 3. Multiplying two 8-bit values produces 15 partial-product columns, P = [P14, P13, P12, P11, P10, P9, P8, P7, P6, P5, P4, P3, P2, P1, P0], and each element of P corresponds to one of the 15 bit line groups in FIG. 1, [BL(15), BL(14), BL(13), BL(12), BL(11), BL(10), BL(9), BL(8), BL(7), BL(6), BL(5), BL(4), BL(3), BL(2), BL(1)]. In this embodiment, the maximum value of D and of W is 255, and the maximum value of the product P is 65025.
The computation logic of a single memory cell is as follows:
When a memory cell stores data 1 and the data on the word line connected to its gate is also 1, the memory cell is turned on and generates a saturation current. The saturation current represents a product of 1, i.e., 1 × 1 = 1.
When a memory cell stores data 0 and the data on the word line connected to its gate is 1, the memory cell is not turned on and generates no saturation current. The product is 0, i.e., 0 × 1 = 0.
When a memory cell stores data 1 and the data on the word line connected to its gate is 0, the memory cell is not turned on and generates no saturation current. The product is 0, i.e., 1 × 0 = 0.
Based on the above logic, when binary data W and binary data D are multiplied by the in-memory computing unit, the number of bit lines carrying saturation current in each bit line group is the value of the digital signal output by that bit line group.
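Under this logic, each conducting cell performs a one-bit AND, and the number of conducting cells in bit line group BL(k) equals the sum of the k-th partial-product column. A behavioral sketch of the array, assuming least-significant-bit-first indexing (all names are illustrative, not the disclosed circuit):

```python
def group_counts(D, W, n=8):
    """Count conducting cells per bit line group.

    D, W: n-bit integer values. Word line (row) i carries input bit
    D[i-1] and column j stores bit W[j-1], both LSB first.
    """
    d = [(D >> b) & 1 for b in range(n)]
    w = [(W >> b) & 1 for b in range(n)]
    counts = [0] * (2 * n - 1)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # Cell S(i, j) conducts only when both its stored bit and
            # its word line bit are 1 (a one-bit AND); it feeds one
            # bit line of group BL(i + j - 1).
            counts[i + j - 2] += d[i - 1] & w[j - 1]
    return counts

# The weighted sum of the group counts is the binary product.
assert sum(c << k for k, c in enumerate(group_counts(255, 255))) == 65025
```

The final assertion mirrors the 255 × 255 = 65025 maximum stated for this embodiment.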
The in-memory computing unit applies the second data directly to the control terminals of the memory cells through the word lines, stores the first data in the array-arranged memory cells according to a fixed rule, and can complete an N-bit by N-bit binary multiplication in one clock cycle. The operation is completed directly in the memory module without moving the stored data into a CPU, which reduces data movement, greatly increases operation speed under heavy workloads, and reduces power consumption. Moreover, the output terminal of each memory cell is connected to its own independent bit line; compared with conventional schemes, the currents output by different memory cells need not converge on one bit line, which avoids the error accumulation caused by current summation.
In one embodiment, the memory unit may be a non-volatile memory that can hold data without connecting to an external power source. As an example, the control terminal of the memory cell may be a gate of the non-volatile memory, and the output terminal of the memory cell may be a drain of the non-volatile memory. Alternatively, the memory cells in the array may also be charge storing memory cells, such as floating gate cells or dielectric charge trapping cells, having drains coupled to corresponding bit lines, and sources coupled to ground. Other types of memory cells may be used in other embodiments, including but not limited to many types of programmable resistive memory cells, such as phase change based memory cells, magnetoresistive based memory cells, metal oxide based memory cells, or other cells.
In one embodiment, the memory cells may be NOR flash memory cells. Such as bulk silicon technology floating gate NOR flash memory cells, fully depleted silicon-on-insulator (FDSOI) technology floating gate NOR flash memory cells. The NOR flash memory cell has a gate connected to a word line, a drain connected to a bit line, and a source and a back electrode grounded.
In one embodiment, the in-memory computing unit further includes M − 2 bit encoders, connected in one-to-one correspondence to the 2nd through (M−1)-th bit line groups, the bit encoders being configured to encode the output signals of the bit line groups into digital signals.
As an example, referring to FIG. 4, the in-memory computing unit includes 13 bit encoders, connected in one-to-one correspondence to bit line groups BL(2) through BL(14). When a memory cell is turned on, a saturation current is output through the bit line connected to that cell, and the voltage on the bit line changes from low level to high level. In this embodiment, bit line group BL(2) has two bit lines connected between a 2-to-2 bit encoder and two memory cells; BL(2) may send up to 2 high-level signals to its bit encoder, and the 2-to-2 bit encoder encodes two high-level signals into the BCD code 10. Bit line group BL(3) has three bit lines connected between a 3-to-2 bit encoder and three memory cells; BL(3) may send up to 3 high-level signals, which the 3-to-2 bit encoder encodes into the BCD code 11. Bit line group BL(4) may send up to 4 high-level signals, which the 4-to-3 bit encoder encodes into the BCD code 100. In summary, each bit encoder encodes the high-level signals carried on its bit lines into a BCD code. For bit line group BL(1) and bit line group BL(15), since each has only one bit line, the digital signal conveyed is either 0 (low level) or 1 (high level), and no bit encoder is needed.
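Functionally, each bit encoder is a ones-counter: it outputs the binary encoding of how many of its input bit lines are high. A minimal model consistent with the examples above (the function name and interface are assumptions for illustration only):

```python
def bit_encoder(lines):
    """Encode a tuple of bit line levels (0/1) as a binary string.

    Models an m-to-w encoder, where w bits suffice to represent the
    count of high lines: a 2-to-2 encoder maps (1, 1) -> '10', a
    3-to-2 encoder maps (1, 1, 1) -> '11', and a 4-to-3 encoder
    maps (1, 1, 1, 1) -> '100'.
    """
    m = len(lines)
    width = m.bit_length()          # bits needed to represent counts 0..m
    return format(sum(lines), f'0{width}b')

assert bit_encoder((1, 1)) == '10'         # 2-to-2 encoder, two highs
assert bit_encoder((1, 1, 1)) == '11'      # 3-to-2 encoder
assert bit_encoder((1, 1, 1, 1)) == '100'  # 4-to-3 encoder
```

Because the encoder only counts high lines, it is pure combinational logic and introduces no analog-to-digital conversion step.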
In this embodiment, the bit line signals in each bit line group are converted into digital signals by the bit encoders, and multiple saturation currents need not be fed into the same bit line, which avoids the error accumulation caused by current summation. In addition, because the bit encoders encode the bit line signals of each group into digital signals, the N-bit binary multiplication can be completed in one cycle using only digital circuits and combinational logic.
After the bit encoders output the digital signals, the digital signals are combined by shift-and-add to obtain the final product, as shown in FIG. 5. As an example, the digital signals may be summed using adders.
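The shift-and-add stage weights the digital value of bit line group BL(k) by 2^(k−1) and accumulates. A sketch, assuming the encoder outputs are already available as plain integers (names are illustrative):

```python
def shift_add(group_values):
    """Combine per-group digital values into the final product.

    group_values[k-1] is the integer output for bit line group BL(k);
    its weight in the product is 2**(k-1).
    """
    total = 0
    for k, value in enumerate(group_values, start=1):
        total += value << (k - 1)   # shift by the group's bit weight
    return total

# Example: multiplying D = 3 (binary 11) by W = 3 (binary 11) on an
# 8 x 8 array gives 1, 2, 1 conducting cells in BL(1)..BL(3).
assert shift_add([1, 2, 1] + [0] * 12) == 9   # 3 * 3
```

Note that a group value greater than 1 is how carries between partial-product columns are handled: the shift-and-add resolves them in the adders rather than on the bit lines.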
In one embodiment, as shown in FIG. 6, the non-volatile memory includes: a substrate, a substrate dielectric layer, and a fully depleted channel layer, wherein a well region is formed in the substrate, the substrate dielectric layer is located on the substrate and covers the well region, and the fully depleted channel layer is located on the substrate dielectric layer; a gate structure located on the upper surface of the fully depleted channel layer; a source located on the upper surface of the fully depleted channel layer on one side of the gate structure; and a drain located on the upper surface of the fully depleted channel layer on the side of the gate structure away from the source; wherein the source and the drain are formed on the upper surface of the fully depleted channel layer by an epitaxial process.
By arranging the substrate dielectric layer 12 between the substrate 11 and the fully depleted channel layer 13, the electron channel between the source 3 and the drain 4 is confined within the fully depleted channel layer 13, preventing electron transfer between the source 3 and the drain 4 through the well region and thereby greatly reducing leakage current. With the fully depleted channel layer 13 combined with the substrate dielectric layer 12, the saturation-current channel is confined within the fully depleted channel layer 13 when the device conducts, which greatly improves the uniformity of the semiconductor structure and reduces variability between devices. In addition, the source 3 and the drain 4 may be formed on the upper surface of the fully depleted channel layer 13 by an epitaxial process to obtain an epitaxial source and drain, which greatly increases the saturation current in the channel when the transistor is on and increases the switching speed of the transistor.
In one embodiment, with continued reference to FIG. 6, the gate structure includes a gate stack on the upper surface of the fully depleted channel layer 13, the gate stack comprising a tunneling dielectric layer 21, a floating gate 22, a control dielectric layer 23, and a control gate 24 stacked in sequence from bottom to top; and gate sidewalls 25 located on two opposite sides of the gate stack. As an example, the non-volatile memory shown in FIG. 6 may be a floating-gate NOR flash memory cell fabricated in an FDSOI process.
In one embodiment, the present application further discloses an in-memory computing module, which includes one or more in-memory computing units in the above embodiments.
Each in-memory computing unit can complete one N-bit × N-bit binary multiplication, so each in-memory computing module can simultaneously complete one or more N-bit × N-bit binary multiplications. The in-memory computing module may serve as a filter for generating a feature map in convolutional neural network computation; that is, stored values are written into the in-memory computing module in advance as the values of the elements of the filter. Taking a CNN architecture for image recognition as an example, in the feature-map computation of the first layer, each entry of the input data matrix may represent a black-and-white pixel of an image, and each pixel value has L bits, where L may be any positive integer, such as 5, 8, 12, or 16. In this embodiment, L is 8 and the input data matrix is a 5 × 5 matrix. The filter is also a 5 × 5 matrix, and each element of the filter is likewise an 8-bit binary number.
A schematic diagram of the dot product of the filter and the input data matrix is shown in FIG. 7, where W(i,j) is a filter value, D(i,j) is an input value, i = 0, 1, 2, 3, 4, and j = 0, 1, 2, 3, 4. As explained above, each product D(i,j) × W(i,j) requires one in-memory computing unit. For example, D00×W00, D01×W01, D02×W02, D03×W03, and D04×W04 are each 8-bit × 8-bit multiplications, where
D00 = [D00[0], D00[1], D00[2], D00[3], D00[4], D00[5], D00[6], D00[7]]
W00 = [W00[0], W00[1], W00[2], W00[3], W00[4], W00[5], W00[6], W00[7]]
For D00×W00, W00 may first be written into a first memory array, and D00 then input into the first memory array through the word lines, completing the D00×W00 computation in one clock cycle. A total of 25 computations of 8 bits × 8 bits are required, namely D00×W00, D01×W01, D02×W02, D03×W03, ..., D43×W43, D44×W44. Therefore, 25 in-memory computing units may operate simultaneously, each completing one 8-bit × 8-bit computation, so that one dot product operation can be completed in one clock cycle.
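The 25-unit arrangement evaluates all element-wise products in parallel and sums the results. A behavioral sketch in which an ordinary multiplication stands in for one in-memory computing unit (function name and data are illustrative):

```python
def dot_product_5x5(D, W):
    """Dot product of two 5 x 5 matrices of 8-bit values.

    Each element-wise product models one in-memory computing unit;
    in hardware all 25 units run in the same clock cycle.
    """
    assert len(D) == len(W) == 5
    return sum(D[i][j] * W[i][j] for i in range(5) for j in range(5))

# All-255 inputs give the maximum accumulator value: 25 units, each
# producing at most 255 * 255 = 65025.
D = [[255] * 5 for _ in range(5)]
W = [[255] * 5 for _ in range(5)]
assert dot_product_5x5(D, W) == 25 * 65025
```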
For ease of understanding, the equation in FIG. 7 may be expanded in vector and matrix form, as shown in FIG. 8. The left column matrix represents the data D00 to D44 input to the in-memory computing units through the word lines; it has 200 rows and 1 column, and every 8 rows represent one 8-bit input datum. For example, D00[0] to D00[7] represent data D00.
The data matrix on the right side of FIG. 8 represents the stored data W00 to W44; it has 200 rows and 8 columns, and every 8 rows represent one 8-bit stored datum, with the 8-bit binary data W00 to W44 arranged in sequence from top to bottom. As an example, the first 8 rows represent the stored data W00. Specifically, within the first 8 rows, all elements of a given column have the same value: the first-column elements are all W00[7], the second-column elements are all W00[6], ..., and the eighth-column elements are all W00[0].
The memory computing module is provided with 25 memory computing units and can complete one dot product operation in one clock cycle.
An embodiment of the present application further discloses an in-memory computing system, which includes one or more in-memory computing modules described in the above embodiments.
If an in-memory computing module is used as a filter, the in-memory computing system comprises one or more filters. In a CNN network architecture, the feature-map computation of each layer may involve multiple filters; taking K filters as an example, the in-memory computing system includes K in-memory computing modules.
As an example, each filter is an N × N rectangular data array, each datum is L-bit binary data, and there are K filters in total. A schematic diagram of the computation of one feature-map layer in a CNN network architecture is shown in FIG. 9.
Taking N = 5, K = 32, and L = 8 as an example, the structure of FIG. 9 can complete 800 (i.e., 5 × 5 × 32) multiplications of 8-bit data, and almost as many additions, in one clock cycle; the computing power of the in-memory computing system in this embodiment is therefore 1600 operations (OPs) per clock cycle. Because the time for a signal to pass through a NOR cell, a bit encoder, and an adder is extremely short, the architecture described above can reach GHz-level clock rates. That is, on a chip with an area of about 51200 (i.e., 800 × 64) NOR cells, a computing power of 1.6 TOPS (tera operations per second) can be provided, which is a highly advanced structure.
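The throughput figure follows from simple counting; the sketch below reproduces the arithmetic, with the 1 GHz clock taken as an assumption consistent with the GHz-level claim above:

```python
n_filters = 32
multiplies_per_cycle = 5 * 5 * n_filters   # 800 8-bit multiplications
ops_per_cycle = 2 * multiplies_per_cycle   # multiplies plus additions
clock_hz = 1e9                             # assumed GHz-class clock
tops = ops_per_cycle * clock_hz / 1e12     # tera operations per second

assert multiplies_per_cycle == 800
assert ops_per_cycle == 1600
assert multiplies_per_cycle * 64 == 51200  # NOR cells: 64 per unit
assert abs(tops - 1.6) < 1e-9
```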
Compared with traditional in-memory computing schemes, although providing an independent bit line for each memory cell and connecting it to an encoder increases the area of the memory array to some extent, the in-memory computing unit as a whole omits the larger analog-to-digital conversion module or sense amplifier, and with it the analog-to-digital conversion time. The in-memory computing unit of the present application therefore reduces area, improves speed, and lowers power consumption. An in-memory computing module or system composed of such units does not need to move data frequently, greatly improves data-processing speed, greatly reduces power consumption, and enables edge computing on edge devices.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, all such combinations should be considered within the scope of this specification as long as they contain no contradiction.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.