Disclosure of Invention
In view of the foregoing, there is a need to provide an in-memory computing unit, an in-memory computing module, and an in-memory computing system.
An in-memory computing unit comprises: a memory array comprising a plurality of memory cells arranged in N rows and N columns, the memory cell in the i-th row and j-th column being denoted S(i,j), wherein the memory cells in the same column store the same data value, the memory array is configured to store N-bit first data, N ≥ 1, 1 ≤ i ≤ N, and 1 ≤ j ≤ N; N word lines for inputting N-bit second data, wherein the control terminals of the memory cells in the same row are connected in series through the same word line; and M bit line groups, wherein the k-th group is denoted bit line group BL(k), M = 2N − 1, and 1 ≤ k ≤ M. When 1 ≤ k ≤ N, the k-th bit line group has k bit lines, connected respectively to the output terminals of the memory cells lying on the straight line from memory cell S(1,k) to memory cell S(k,1); when N < k ≤ M, the k-th bit line group has 2N − k bit lines, connected respectively to the output terminals of the memory cells lying on the straight line from memory cell S(k−N+1,N) to memory cell S(N,k−N+1).
The in-memory computing unit applies the second data directly to the control terminals of the memory cells through the word lines, stores the first data in the array-arranged memory cells according to a fixed rule, and can complete an N-bit by N-bit binary multiplication in one clock cycle. The operation is completed directly in the memory module without moving the stored data into a CPU, which reduces data movement, greatly increases operation speed under heavy workloads, and reduces power consumption. Moreover, the output terminal of each memory cell is connected to its own independent bit line; compared with conventional schemes, the currents output by different memory cells need not converge on one bit line, which avoids the error accumulation caused by current summation.
In one embodiment, the memory unit includes a nonvolatile memory.
In one embodiment, the non-volatile memory comprises NOR flash memory cells.
In one embodiment, the control terminal of the memory cell comprises a gate of a non-volatile memory; the output of the memory cell includes a drain of the non-volatile memory.
In one embodiment, the first data is binary data, and the nonvolatile memory is used for storing a bit value of 0 or 1; the second data is binary data, and when the voltage on the word line is greater than or equal to a preset voltage, the bit value on the word line is 1; and when the voltage on the word line is smaller than a preset voltage, the bit value on the word line is 0.
In one embodiment, the in-memory computing unit further includes M − 2 bit encoders, connected in one-to-one correspondence to the 2nd through (M−1)-th bit line groups, the bit encoders being configured to encode the output signals of the bit line groups into digital signals.
The in-memory computing unit connects each bit line group to a corresponding bit encoder, which encodes the current and voltage signals on the bit lines of that group. No analog-to-digital conversion module is needed to convert the bit line currents, which saves both the conversion time and the chip area such a module would occupy. Although the number of bit lines and bit encoders increases, the overall area of the in-memory computing unit is reduced and the computing speed is improved.
In one embodiment, the non-volatile memory comprises: a substrate, a substrate dielectric layer, and a fully depleted channel layer, wherein a well region is formed in the substrate, the substrate dielectric layer is located on the substrate and covers the well region, and the fully depleted channel layer is located on the substrate dielectric layer; a gate structure located on the upper surface of the fully depleted channel layer; a source located on the upper surface of the fully depleted channel layer on one side of the gate structure; and a drain located on the upper surface of the fully depleted channel layer on the side of the gate structure away from the source; wherein the source and the drain are formed on the upper surface of the fully depleted channel layer by an epitaxial process.
In this in-memory computing unit, each nonvolatile memory adopts a semiconductor structure with a fully depleted channel layer and a substrate dielectric layer, which reduces leakage, so the unit can be applied to AI devices for edge computing. In addition, forming the source and the drain by an epitaxial process increases the saturation current, which speeds up reading and improves computing efficiency.
In one embodiment, the gate structure comprises a gate stack located on the upper surface of the fully depleted channel layer, the gate stack comprising a tunneling dielectric layer, a floating gate, a control dielectric layer, and a control gate stacked in sequence from bottom to top; and gate sidewalls located on two opposite sides of the gate stack.
An in-memory computing module comprising one or more of the in-memory computing units described in the above embodiments.
An in-memory computing system comprising one or more of the in-memory computing modules described in the above embodiments.
The in-memory computing module and in-memory computing system can complete data operations directly in the memory array without relying on a CPU, which reduces the time and energy spent moving data and improves operational efficiency. Meanwhile, bit encoders replace the traditional analog-to-digital conversion module or sense amplifier, so the circuit is entirely digital: the time consumed converting analog signals to digital signals is saved, the acquisition of digital signals is accelerated, the error accumulation caused by current summation during analog-to-digital conversion is avoided, and the overall size of the structure is reduced.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In describing positional relationships, unless otherwise specified, when an element such as a layer, film or substrate is referred to as being "on" another layer, it can be directly on the other layer or intervening layers may also be present. Further, when a layer is referred to as being "under" another layer, it can be directly under, or one or more intervening layers may also be present. It will also be understood that when a layer is referred to as being "between" two layers, it can be the only layer between the two layers, or one or more intervening layers may also be present.
Where the terms "comprising," "having," and "including" are used herein, another element may be added unless an explicit limitation such as "only" or "consisting of" is used. Unless mentioned to the contrary, terms in the singular may include the plural and are not to be construed as limited to one in number.
With the development of computers, computing power has continuously improved, but bottlenecks have gradually emerged. In particular, in the field of Artificial Intelligence (AI), the amount of computation has increased sharply, and it is difficult to further increase computing speed within the conventional von Neumann architecture. As a result, many research groups and companies have begun to improve upon traditional computer architectures. One idea is to imitate the human brain: complete both the computing function and the storage function in the memory itself, without moving the stored data into a CPU for computation and then moving the result back to memory.
As shown in FIG. 1, one embodiment of the present application provides an in-memory computing unit, comprising: a memory array comprising a plurality of memory cells arranged in N rows and N columns, the memory cell in the i-th row and j-th column being denoted S(i,j), wherein the memory cells in the same column store the same data value, the memory array is configured to store N-bit first data, N ≥ 1, 1 ≤ i ≤ N, and 1 ≤ j ≤ N; N word lines for inputting N-bit second data, wherein the control terminals of the memory cells in the same row are connected in series through the same word line; and M bit line groups, wherein M = 2N − 1, the k-th group is denoted bit line group BL(k), and 1 ≤ k ≤ M. When 1 ≤ k ≤ N, the k-th bit line group has k bit lines, connected respectively to the output terminals of the memory cells lying on the straight line from memory cell S(1,k) to memory cell S(k,1); when N < k ≤ M, the k-th bit line group has 2N − k bit lines, connected respectively to the output terminals of the memory cells lying on the straight line from memory cell S(k−N+1,N) to memory cell S(N,k−N+1).
As an example, for the arrangement of rows in the memory array, the uppermost row may be the first row and the lowermost row the N-th row, counting from top to bottom. For the arrangement of columns, the rightmost column may be the first column and the leftmost column the N-th column, counting from right to left. In other embodiments, the rows and columns may be defined in other ways, which is not limited in this application. N may be any positive integer, for example 3, 5, 8, or 10; in this embodiment, N is 8.
In FIG. 1, N is 8; the memory cell in the upper right corner of the memory array is denoted S(1,1), and the memory cell in the lower left corner is S(8,8). The memory cells in each column store the same bit value: as an example, the memory cells in the first column store data W0, those in the second column store data W1, ..., and those in the eighth column store data W7. Therefore, the first data stored in the memory array is W = [W7, W6, W5, W4, W3, W2, W1, W0]. Each word line inputs one bit value: the control terminals of the memory cells in the first row are all connected to the first word line, which carries input data D0; the control terminals of the memory cells in the second row are all connected to the second word line, which carries input data D1; ...; the control terminals of the memory cells in the eighth row are all connected to the eighth word line, which carries input data D7. The second data input to the memory array by the 8 word lines is D = [D7, D6, D5, D4, D3, D2, D1, D0].
As an example, when the voltage on the word line is greater than or equal to the preset voltage, the bit value on the word line is 1; when the voltage on the word line is less than the preset voltage, the bit value on the word line is 0. For example, when the voltage on the first word line is greater than or equal to the preset voltage, D0 is 1; when the voltage on the first word line is less than the predetermined voltage, D0 is 0. The preset voltage may be a threshold voltage of the memory cell.
The connection relationships of the bit lines are described in two parts. In the first part, when k is greater than or equal to 1 and less than or equal to 8, the output terminals of the memory cells lying on the straight line from S(1,k) to S(k,1) are connected one-to-one to the k bit lines in bit line group BL(k). For example, when k = 1, bit line group BL(1) has only one bit line, connected to the output terminal of memory cell S(1,1); when k = 2, bit line group BL(2) has two bit lines, connected respectively to the output terminals of memory cells S(2,1) and S(1,2); when k = 3, bit line group BL(3) has three bit lines, connected respectively to the output terminals of memory cells S(3,1), S(2,2), and S(1,3). In the second part, when k is greater than 8 and less than or equal to 15, the output terminals of the memory cells lying on the straight line from S(k−7,8) to S(8,k−7) are connected one-to-one to the 2N − k bit lines in bit line group BL(k). For example, when k = 15, bit line group BL(15) has only one bit line, connected to the output terminal of memory cell S(8,8); when k = 14, bit line group BL(14) has two bit lines, connected respectively to the output terminals of memory cells S(7,8) and S(8,7); when k = 13, bit line group BL(13) has three bit lines, connected respectively to the output terminals of memory cells S(6,8), S(7,7), and S(8,6). An enlarged schematic diagram of the connection between bit line group BL(13) and the output terminals of memory cells S(6,8), S(7,7), and S(8,6) is shown in FIG. 2.
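The grouping rule described above can be stated compactly: memory cell S(i,j) lies on the anti-diagonal whose bit line group index is k = i + j − 1. A minimal sketch of this mapping (the function names are illustrative, not part of the disclosed design):

```python
def bit_line_group(i, j):
    """Bit line group index k for memory cell S(i, j), 1-indexed."""
    return i + j - 1

def group_size(k, n=8):
    """Number of bit lines in group BL(k) for an n x n array."""
    return k if k <= n else 2 * n - k

# Every cell on the same anti-diagonal maps to the same group,
# and the group sizes match the two-part description above.
n = 8
cells = {k: [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)
             if bit_line_group(i, j) == k] for k in range(1, 2 * n)}
for k, members in cells.items():
    assert len(members) == group_size(k, n)
```

Summed over all 15 groups, the bit line count equals the 64 cells of the array, confirming that each cell output reaches exactly one independent bit line.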
This in-memory computing unit can complete a binary multiplication of two 8-bit values. The input data is the second data D = [D7, D6, D5, D4, D3, D2, D1, D0], and the stored data is the first data W = [W7, W6, W5, W4, W3, W2, W1, W0]. The binary multiplication of D and W is shown in FIG. 3. Multiplying two 8-bit values produces 15 partial-product columns, P = [P14, P13, P12, P11, P10, P9, P8, P7, P6, P5, P4, P3, P2, P1, P0], and each element of P corresponds to one of the 15 bit line groups in FIG. 1, [BL(15), BL(14), BL(13), BL(12), BL(11), BL(10), BL(9), BL(8), BL(7), BL(6), BL(5), BL(4), BL(3), BL(2), BL(1)]. In this embodiment, the maximum value of D and of W is 255, and the maximum value of the product P is 65025.
The computation logic of a single memory cell is as follows:
When a memory cell stores data 1 and the data on the word line connected to its gate is also 1, the memory cell is turned on and generates a saturation current. The saturation current represents a product of 1, i.e., 1 × 1 = 1.
When a memory cell stores data 0 and the data on the word line connected to its gate is 1, the memory cell is not turned on and generates no saturation current. The product is 0, i.e., 0 × 1 = 0.
When a memory cell stores data 1 and the data on the word line connected to its gate is 0, the memory cell is not turned on and generates no saturation current. The product is 0, i.e., 1 × 0 = 0.
Based on the above logic, when binary data W and binary data D are multiplied by the in-memory computing unit, the number of bit lines carrying saturation current in each bit line group is the value of the digital signal output by that bit line group.
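Under this logic, each conducting cell performs a one-bit AND, and the number of conducting cells in bit line group BL(k) equals the sum of the k-th partial-product column. A behavioral sketch of the array, assuming least-significant-bit-first indexing (all names are illustrative, not the disclosed circuit):

```python
def group_counts(D, W, n=8):
    """Count conducting cells per bit line group.

    D, W: n-bit integer values. Word line (row) i carries input bit
    D[i-1] and column j stores bit W[j-1], both LSB first.
    """
    d = [(D >> b) & 1 for b in range(n)]
    w = [(W >> b) & 1 for b in range(n)]
    counts = [0] * (2 * n - 1)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # Cell S(i, j) conducts only when both its stored bit and
            # its word line bit are 1 (a one-bit AND); it feeds one
            # bit line of group BL(i + j - 1).
            counts[i + j - 2] += d[i - 1] & w[j - 1]
    return counts

# The weighted sum of the group counts is the binary product.
assert sum(c << k for k, c in enumerate(group_counts(255, 255))) == 65025
```

The final assertion mirrors the 255 × 255 = 65025 maximum stated for this embodiment.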
The in-memory computing unit applies the second data directly to the control terminals of the memory cells through the word lines, stores the first data in the array-arranged memory cells according to a fixed rule, and can complete an N-bit by N-bit binary multiplication in one clock cycle. The operation is completed directly in the memory module without moving the stored data into a CPU, which reduces data movement, greatly increases operation speed under heavy workloads, and reduces power consumption. Moreover, the output terminal of each memory cell is connected to its own independent bit line; compared with conventional schemes, the currents output by different memory cells need not converge on one bit line, which avoids the error accumulation caused by current summation.
In one embodiment, the memory unit may be a non-volatile memory that can hold data without connecting to an external power source. As an example, the control terminal of the memory cell may be a gate of the non-volatile memory, and the output terminal of the memory cell may be a drain of the non-volatile memory. Alternatively, the memory cells in the array may also be charge storing memory cells, such as floating gate cells or dielectric charge trapping cells, having drains coupled to corresponding bit lines, and sources coupled to ground. Other types of memory cells may be used in other embodiments, including but not limited to many types of programmable resistive memory cells, such as phase change based memory cells, magnetoresistive based memory cells, metal oxide based memory cells, or other cells.
In one embodiment, the memory cells may be NOR flash memory cells. Such as bulk silicon technology floating gate NOR flash memory cells, fully depleted silicon-on-insulator (FDSOI) technology floating gate NOR flash memory cells. The NOR flash memory cell has a gate connected to a word line, a drain connected to a bit line, and a source and a back electrode grounded.
In one embodiment, the in-memory computing unit further includes M − 2 bit encoders, connected in one-to-one correspondence to the 2nd through (M−1)-th bit line groups, the bit encoders being configured to encode the output signals of the bit line groups into digital signals.
As an example, referring to FIG. 4, the in-memory computing unit includes 13 bit encoders, connected in one-to-one correspondence to bit line groups BL(2) through BL(14). When a memory cell is turned on, a saturation current is output through the bit line connected to that cell, and the voltage on the bit line changes from low level to high level. In this embodiment, bit line group BL(2) has two bit lines connected between a 2-to-2 bit encoder and two memory cells; BL(2) may send up to 2 high-level signals to its bit encoder, and the 2-to-2 bit encoder encodes two high-level signals into the BCD code 10. Bit line group BL(3) has three bit lines connected between a 3-to-2 bit encoder and three memory cells; BL(3) may send up to 3 high-level signals, which the 3-to-2 bit encoder encodes into the BCD code 11. Bit line group BL(4) may send up to 4 high-level signals, which the 4-to-3 bit encoder encodes into the BCD code 100. In summary, each bit encoder encodes the high-level signals carried on its bit lines into a BCD code. For bit line group BL(1) and bit line group BL(15), since each has only one bit line, the digital signal conveyed is either 0 (low level) or 1 (high level), and no bit encoder is needed.
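Functionally, each bit encoder is a ones-counter: it outputs the binary encoding of how many of its input bit lines are high. A minimal model consistent with the examples above (the function name and interface are assumptions for illustration only):

```python
def bit_encoder(lines):
    """Encode a tuple of bit line levels (0/1) as a binary string.

    Models an m-to-w encoder, where w bits suffice to represent the
    count of high lines: a 2-to-2 encoder maps (1, 1) -> '10', a
    3-to-2 encoder maps (1, 1, 1) -> '11', and a 4-to-3 encoder
    maps (1, 1, 1, 1) -> '100'.
    """
    m = len(lines)
    width = m.bit_length()          # bits needed to represent counts 0..m
    return format(sum(lines), f'0{width}b')

assert bit_encoder((1, 1)) == '10'         # 2-to-2 encoder, two highs
assert bit_encoder((1, 1, 1)) == '11'      # 3-to-2 encoder
assert bit_encoder((1, 1, 1, 1)) == '100'  # 4-to-3 encoder
```

Because the encoder only counts high lines, it is pure combinational logic and introduces no analog-to-digital conversion step.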
In this embodiment, the bit line signals in each bit line group are converted into digital signals by the bit encoders, and multiple saturation currents need not be fed into the same bit line, which avoids the error accumulation caused by current summation. In addition, because the bit encoders encode the bit line signals of each group into digital signals, the N-bit binary multiplication can be completed in one cycle using only digital circuits and combinational logic.
After the bit encoders output the digital signals, the digital signals are combined by shift-and-add to obtain the final product, as shown in FIG. 5. As an example, the digital signals may be summed using adders.
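The shift-and-add stage weights the digital value of bit line group BL(k) by 2^(k−1) and accumulates. A sketch, assuming the encoder outputs are already available as plain integers (names are illustrative):

```python
def shift_add(group_values):
    """Combine per-group digital values into the final product.

    group_values[k-1] is the integer output for bit line group BL(k);
    its weight in the product is 2**(k-1).
    """
    total = 0
    for k, value in enumerate(group_values, start=1):
        total += value << (k - 1)   # shift by the group's bit weight
    return total

# Example: multiplying D = 3 (binary 11) by W = 3 (binary 11) on an
# 8 x 8 array gives 1, 2, 1 conducting cells in BL(1)..BL(3).
assert shift_add([1, 2, 1] + [0] * 12) == 9   # 3 * 3
```

Note that a group value greater than 1 is how carries between partial-product columns are handled: the shift-and-add resolves them in the adders rather than on the bit lines.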
In one embodiment, as shown in FIG. 6, the non-volatile memory includes: a substrate, a substrate dielectric layer, and a fully depleted channel layer, wherein a well region is formed in the substrate, the substrate dielectric layer is located on the substrate and covers the well region, and the fully depleted channel layer is located on the substrate dielectric layer; a gate structure located on the upper surface of the fully depleted channel layer; a source located on the upper surface of the fully depleted channel layer on one side of the gate structure; and a drain located on the upper surface of the fully depleted channel layer on the side of the gate structure away from the source; wherein the source and the drain are formed on the upper surface of the fully depleted channel layer by an epitaxial process.
By arranging the substrate dielectric layer 12 between the substrate 11 and the fully depleted channel layer 13, the electron channel between the source 3 and the drain 4 is confined within the fully depleted channel layer 13, preventing electron transfer between the source 3 and the drain 4 through the well region and thereby greatly reducing leakage current. With the fully depleted channel layer 13 combined with the substrate dielectric layer 12, the saturation-current channel is confined within the fully depleted channel layer 13 when the device conducts, which greatly improves the uniformity of the semiconductor structure and reduces variability between devices. In addition, the source 3 and the drain 4 may be formed on the upper surface of the fully depleted channel layer 13 by an epitaxial process to obtain an epitaxial source and drain, which greatly increases the saturation current in the channel when the transistor is on and increases the switching speed of the transistor.
In one embodiment, with continued reference to FIG. 6, the gate structure includes a gate stack on the upper surface of the fully depleted channel layer 13, the gate stack comprising a tunneling dielectric layer 21, a floating gate 22, a control dielectric layer 23, and a control gate 24 stacked in sequence from bottom to top; and gate sidewalls 25 located on two opposite sides of the gate stack. As an example, the non-volatile memory shown in FIG. 6 may be a floating-gate NOR flash memory cell fabricated in an FDSOI process.
In one embodiment, the present application further discloses an in-memory computing module, which includes one or more in-memory computing units in the above embodiments.
Each in-memory computing unit can complete one N-bit × N-bit binary multiplication, so each in-memory computing module can simultaneously complete one or more N-bit × N-bit binary multiplications. The in-memory computing module may serve as a filter for generating a feature map in convolutional neural network computation; that is, stored values are written into the in-memory computing module in advance as the values of the elements of the filter. Taking a CNN architecture for image recognition as an example, in the feature-map computation of the first layer, each entry of the input data matrix may represent a black-and-white pixel of an image, and each pixel value has L bits, where L may be any positive integer, such as 5, 8, 12, or 16. In this embodiment, L is 8 and the input data matrix is a 5 × 5 matrix. The filter is also a 5 × 5 matrix, and each element of the filter is likewise an 8-bit binary number.
A schematic diagram of the dot product of the filter and the input data matrix is shown in FIG. 7, where W(i,j) is a filter value, D(i,j) is an input value, i = 0, 1, 2, 3, 4, and j = 0, 1, 2, 3, 4. As explained above, each product D(i,j) × W(i,j) requires one in-memory computing unit. For example, D00×W00, D01×W01, D02×W02, D03×W03, and D04×W04 are each 8-bit × 8-bit multiplications, where
D00 = [D00[0], D00[1], D00[2], D00[3], D00[4], D00[5], D00[6], D00[7]]
W00 = [W00[0], W00[1], W00[2], W00[3], W00[4], W00[5], W00[6], W00[7]]
For D00×W00, W00 may first be written into a first memory array, and D00 then input into the first memory array through the word lines, completing the D00×W00 computation in one clock cycle. A total of 25 computations of 8 bits × 8 bits are required, namely D00×W00, D01×W01, D02×W02, D03×W03, ..., D43×W43, D44×W44. Therefore, 25 in-memory computing units may operate simultaneously, each completing one 8-bit × 8-bit computation, so that one dot product operation can be completed in one clock cycle.
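The 25-unit arrangement evaluates all element-wise products in parallel and sums the results. A behavioral sketch in which an ordinary multiplication stands in for one in-memory computing unit (function name and data are illustrative):

```python
def dot_product_5x5(D, W):
    """Dot product of two 5 x 5 matrices of 8-bit values.

    Each element-wise product models one in-memory computing unit;
    in hardware all 25 units run in the same clock cycle.
    """
    assert len(D) == len(W) == 5
    return sum(D[i][j] * W[i][j] for i in range(5) for j in range(5))

# All-255 inputs give the maximum accumulator value: 25 units, each
# producing at most 255 * 255 = 65025.
D = [[255] * 5 for _ in range(5)]
W = [[255] * 5 for _ in range(5)]
assert dot_product_5x5(D, W) == 25 * 65025
```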
For ease of understanding, the equation in FIG. 7 may be expanded in vector and matrix form, as shown in FIG. 8. The left column matrix represents the data D00 to D44 input to the in-memory computing units through the word lines; it has 200 rows and 1 column, and every 8 rows represent one 8-bit input datum. For example, D00[0] to D00[7] represent data D00.
The data matrix on the right side of FIG. 8 represents the stored data W00 to W44; it has 200 rows and 8 columns, and every 8 rows represent one 8-bit stored datum, with the 8-bit binary data W00 to W44 arranged in sequence from top to bottom. As an example, the first 8 rows represent the stored data W00. Specifically, within the first 8 rows, all elements of a given column have the same value: the first-column elements are all W00[7], the second-column elements are all W00[6], ..., and the eighth-column elements are all W00[0].
The memory computing module is provided with 25 memory computing units and can complete one dot product operation in one clock cycle.
An embodiment of the present application further discloses an in-memory computing system, which includes one or more in-memory computing modules described in the above embodiments.
If an in-memory computing module is used as a filter, the in-memory computing system comprises one or more filters. In a CNN network architecture, the feature-map computation of each layer may involve multiple filters; taking K filters as an example, the in-memory computing system includes K in-memory computing modules.
As an example, each filter is an N × N rectangular data array, each datum is L-bit binary data, and there are K filters in total. A schematic diagram of the computation of one feature-map layer in a CNN network architecture is shown in FIG. 9.
Taking N = 5, K = 32, and L = 8 as an example, the structure of FIG. 9 can complete 800 (i.e., 5 × 5 × 32) multiplications of 8-bit data, and almost as many additions, in one clock cycle; the computing power of the in-memory computing system in this embodiment is therefore 1600 operations (OPs) per clock cycle. Because the time for a signal to pass through a NOR cell, a bit encoder, and an adder is extremely short, the architecture described above can reach GHz-level clock rates. That is, on a chip with an area of about 51200 (i.e., 800 × 64) NOR cells, a computing power of 1.6 TOPS (tera operations per second) can be provided, which is a highly advanced structure.
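The throughput figure follows from simple counting; the sketch below reproduces the arithmetic, with the 1 GHz clock taken as an assumption consistent with the GHz-level claim above:

```python
n_filters = 32
multiplies_per_cycle = 5 * 5 * n_filters   # 800 8-bit multiplications
ops_per_cycle = 2 * multiplies_per_cycle   # multiplies plus additions
clock_hz = 1e9                             # assumed GHz-class clock
tops = ops_per_cycle * clock_hz / 1e12     # tera operations per second

assert multiplies_per_cycle == 800
assert ops_per_cycle == 1600
assert multiplies_per_cycle * 64 == 51200  # NOR cells: 64 per unit
assert abs(tops - 1.6) < 1e-9
```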
Compared with traditional in-memory computing schemes, although providing an independent bit line for each memory cell and connecting it to an encoder increases the area of the memory array to some extent, the in-memory computing unit as a whole omits the larger analog-to-digital conversion module or sense amplifier, and with it the analog-to-digital conversion time. The in-memory computing unit of the present application therefore reduces area, improves speed, and lowers power consumption. An in-memory computing module or system composed of such units does not need to move data frequently, greatly improves data-processing speed, greatly reduces power consumption, and enables edge computing on edge devices.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, all such combinations should be considered within the scope of this specification as long as they contain no contradiction.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.