CN114254743B - Circuit for parallel multiply-accumulate operation in binary neural network based on RRAM array


Info

Publication number
CN114254743B
Authority
CN
China
Prior art keywords
input
circuit
data
rram
output
Prior art date
Legal status
Active
Application number
CN202111395976.XA
Other languages
Chinese (zh)
Other versions
CN114254743A (en)
Inventor
蔺智挺
朱陈宇
吴秀龙
朱志国
彭春雨
卢文娟
赵强
陈军宁
Current Assignee
Hefei Microelectronics Research Institute Co ltd
Anhui University
Original Assignee
Hefei Microelectronics Research Institute Co ltd
Anhui University
Priority date
Filing date
Publication date
Application filed by Hefei Microelectronics Research Institute Co ltd, Anhui University filed Critical Hefei Microelectronics Research Institute Co ltd
Publication of CN114254743A publication Critical patent/CN114254743A/en
Application granted granted Critical
Publication of CN114254743B publication Critical patent/CN114254743B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Static Random-Access Memory (AREA)

Abstract

The invention discloses a circuit for parallel multiply-accumulate operation in a binary neural network built from an RRAM array. The memory array is a 64×64 RRAM array of 1T1R cells in a pseudo-crossbar arrangement, each 1T1R cell consisting of one NMOSFET and one resistive random access memory device. The word line WL of each row of the memory array is driven by a parallel input circuit, so that up to 64 values of an 8×8 weight matrix and the 64 values stored in the memory array complete a binary neural network (BNN) convolution operation. The bit line BL of each column of the memory array is connected to the current input of a cascade current mirror circuit, whose output is connected to the top plate of an output capacitor. The circuit avoids the crosstalk between different nodes and the risk of corrupting stored data that a conventional SRAM suffers when multiple rows are read, improves system reliability, and reduces leakage power consumption between cells.

Description

Circuit for parallel multiply-accumulate operation in binary neural network based on RRAM array
Technical Field
The invention relates to the field of integrated circuit design, and in particular to a circuit for parallel multiply-accumulate (Multiply and Accumulate, MAC) operations in a binary neural network built from a resistive random access memory (Resistive Random Access Memory, RRAM) array.
Background
At present, artificial intelligence devices place stringent demands on the size and power consumption of electronic hardware. Existing neural networks generally use floating-point arithmetic, which requires large storage space and heavy computation; the resulting high power consumption severely hinders deployment on mobile devices.
To reduce the long latency and high power consumption caused by complex multiply-accumulate operations, the prior art proposed binarized neural networks (Binarized Neural Networks, BNN), in which weights, inputs, hidden-layer outputs and other signals are converted to binary values encoded as {0, +1} or {-1, +1}. This greatly reduces the storage footprint and the power consumed by frequent data accesses during inference. However, conventional SRAM storage cells suffer from easily corrupted stored data, low reliability and high power consumption, and the prior art lacks a corresponding solution.
Disclosure of Invention
The invention aims to provide a circuit for parallel multiply-accumulate operation in a binary neural network based on an RRAM array, which avoids the crosstalk between different nodes and the risk of corrupting stored data that a conventional SRAM storage cell suffers when multiple rows are read, improves system reliability, and reduces leakage power consumption between cells.
This aim is achieved by the following technical scheme:
A circuit for parallel multiply-accumulate operation in a binary neural network based on an RRAM array, the circuit comprising a parallel input circuit, a mode selection circuit, a memory array of 1T1R cells, a cascade current mirror circuit and an analog voltage output circuit, wherein:
the output terminals Out of the parallel input circuit are connected to input port 1 of the 2-to-1 multiplexers MUX of the mode selection circuit, and through the multiplexers to the word lines WL of the memory array;
input port 0 of the mode selection circuit receives the read/write address input signal R/W addr, and the select port receives the mode selection control signal MSEL;
the input CCM-IN of the cascade current mirror circuit is connected to the bit lines BL of the memory array, and its output CCM-OUT is connected to the top plate of the capacitor of the analog voltage output circuit;
the memory array of 1T1R cells is a 64×64 RRAM array with a pseudo-crossbar structure, each 1T1R cell consisting of one NMOSFET and one RRAM; the bottom electrode (BE) of the RRAM is connected to the drain of the NMOSFET, forming a memory cell in which the RRAM is gated by the NMOSFET; the gate of the NMOSFET is the control port of the cell, and its source and the top electrode (TE) of the RRAM are the data read/write ports;
in the memory array, the top electrodes TE of the RRAMs of the 1T1R cells in the same column are connected to the bit line BL of that column, and the gate and source of each NMOSFET are connected to the word line WL and source line SL of the memory array, respectively;
the word line WL of each row of the memory array is driven by the parallel input circuit, which accepts 64-bit parallel input data, so that up to 64 values of an 8×8 weight matrix and the 64 values stored in the memory array complete a binary neural network (BNN) convolution operation within one period of calClk;
the bit line BL of each column of the memory array is connected to the current input of the cascade current mirror circuit, whose output is connected to the top plate of the output capacitor;
with this structure, the parallel input circuit converts input data of value 0 into activation signals on the corresponding word lines WL during the low level of the calculation clock, and converts input data of value 1 into activation signals on the corresponding word lines WL during the high level of the calculation clock;
while a word line WL is activated, the cascade current mirror circuit mirrors the current on the bit line BL to its output and charges the output capacitor, yielding an analog voltage that represents the BNN convolution result; the final BNN convolution result is obtained from a preset lookup table mapping voltages to actual values.
With the technical scheme provided by the invention, the crosstalk between different nodes and the data-corruption risk of conventional SRAM storage cells during multi-row reads are avoided, system reliability is improved, leakage power consumption between cells is reduced, the amount of data stored in storage mode is doubled, and the read/write power consumption is roughly halved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a circuit for parallel multiply-accumulate operation in a binary neural network based on RRAM arrays according to an embodiment of the present invention;
FIG. 2 is a schematic circuit diagram of a memory array formed by 1T1R cells according to an embodiment of the present invention;
FIG. 3 is a schematic diagram showing the structure of a 1T1R cell and the current-voltage (I-V) characteristic of the RRAM according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a circuit of a parallel input circuit according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a cascaded current mirror circuit according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an embodiment of the present invention for implementing writing low resistance and writing high resistance into a 1T1R cell;
FIG. 7 is a schematic diagram of a data encoding and calculation process according to an example of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments of the present invention, and this is not limiting to the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
As shown in fig. 1, the overall schematic diagram of a circuit for parallel multiply-accumulate operation in a binary neural network formed based on an RRAM array according to an embodiment of the present invention mainly includes a parallel input circuit, a mode selection circuit, a memory array formed based on 1T1R cells, a cascaded current mirror circuit (Cascade Current Mirror, abbreviated as CCM), and an analog voltage output circuit, where:
fig. (a) is a parallel input circuit, fig. (b) is a mode selection circuit, fig. (c) is a memory array based on 1T1R cells, fig. (d) is a cascade current mirror circuit, and fig. (e) is an analog voltage output circuit;
the output terminals Out<0..63> of the parallel input circuit in FIG. (a) are connected to input ports 1 of the 2-to-1 multiplexers MUX<0..63> of the mode selection circuit in FIG. (b), and through the multiplexers to the word lines (WL) WL<0..63> of the memory array in FIG. (c);
input ports 0 of the mode selection circuit receive the read/write address input signals R/W addr<0..63>, and the select ports of the multiplexers receive the mode selection control signal (MSEL);
the input CCM-IN of the cascade current mirror circuit in FIG. (d) is connected to the bit lines (BL) of the memory array in FIG. (c), and its output CCM-OUT is connected to the top plate of the capacitor of the analog voltage output circuit in FIG. (e);
As shown in FIG. 2, which is a circuit schematic of the memory array of 1T1R cells according to an embodiment of the present invention, the memory array is a 64×64 RRAM array with a pseudo-crossbar structure, each 1T1R cell consisting of one NMOSFET and one resistive random access memory (RRAM); the pseudo-crossbar structure avoids the severe sneak-path leakage of memory arrays built from bare RRAM devices alone;
in the memory array, the top electrodes TE of the RRAMs of the 1T1R cells in the same column are connected to the bit line BL of that column, and the gate and source of each NMOSFET are connected to the word line WL and source line (SL) of the memory array, respectively; the substrate connection of the NMOSFETs is omitted from FIG. 2 and is tied to ground by default;
As shown in FIG. 3, which illustrates the structure of the 1T1R cell according to an embodiment of the present invention, the bottom electrode (BE) of the RRAM is connected to the drain of the NMOSFET to form a memory cell gated by the NMOSFET; the gate and source of the NMOSFET and the top electrode TE of the RRAM are the control port and data read/write ports of the cell. The gray solid line in the figure is the current-voltage (I-V) characteristic of the RRAM alone, and the black solid line is the I-V characteristic of the 1T1R cell. In this embodiment the RRAM is a bipolar device: the RRAM is a novel nonvolatile resistive-switching device whose two electrodes are called the Top Electrode (TE) and Bottom Electrode (BE), with a special oxide layer between them that can be ionized under an applied electric field;
the word line WL of each row of the memory array is driven by the parallel input circuit, which accepts 64-bit parallel input data, so that up to 64 values of an 8×8 weight matrix and the 64 values stored in the memory array complete a binary neural network (BNN) convolution operation (multiply-accumulate operation) within one period of calClk;
the bit line BL of each column of the memory array is connected to the current input of the cascade current mirror circuit, whose output is connected to the top plate of the output capacitor;
with this structure, the parallel input circuit converts input data of value 0 into activation signals on the corresponding word lines WL during the low level of the calculation clock, and converts input data of value 1 into activation signals on the corresponding word lines WL during the high level of the calculation clock;
while a word line WL is activated, the cascade current mirror circuit mirrors the current on the bit line BL to its output and charges the output capacitor, yielding an analog voltage that represents the BNN convolution result; the final BNN convolution result is obtained from a preset lookup table mapping voltages to actual values.
FIG. 4 is a schematic diagram of one bit-slice of the parallel input circuit according to an embodiment of the present invention, comprising a D flip-flop DFF; a NOT gate INV; two two-input OR gates, OR2-0 and OR2-1; and two two-input AND gates, AND2-0 and AND2-1, wherein:
the data input D of the D flip-flop DFF is connected to the data input port Input; its non-inverting output is connected to one input of the two-input OR gate OR2-0, and its inverting output to one input of the two-input OR gate OR2-1;
the calculation clock signal calClk is connected to the clock input CP of the D flip-flop DFF, to the input of the NOT gate INV, and to the other input of the two-input OR gate OR2-0;
the output of the NOT gate INV is connected to the other input of the two-input OR gate OR2-1;
the outputs of the two-input OR gates OR2-0 and OR2-1 are connected to the two inputs of the two-input AND gate AND2-0;
the two inputs of the two-input AND gate AND2-1 are connected to the output of the two-input AND gate AND2-0 and to the word line pulse width modulation signal (Word Line Pulse Width Modulation, WLPWM), respectively;
the output of the two-input AND gate AND2-1 serves as the output Out of one bit of data in the parallel input circuit; the word line pulse width modulation signal WLPWM controls how long the word line is activated during calculation;
the parallel input circuit uses the calculation clock signal calClk to convert the 0s and 1s in the input data into word line WL activation signals during the low and high levels of calClk, respectively.
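The phase-splitting behavior described above can be sketched as a small behavioral model (a hypothetical Python function for illustration, not the gate-level netlist of FIG. 4):

```python
def word_line_active(data_bit, calclk_high, wlpwm):
    """Behavioral model of one bit-slice of the parallel input circuit.

    Per the description: a data bit of 0 activates its word line while
    calClk is low, a data bit of 1 activates it while calClk is high,
    and the WLPWM signal gates how long the word line stays asserted.
    """
    phase_match = (data_bit == 1) == calclk_high
    return wlpwm and phase_match
```

This captures only the input/output behavior stated in the text; the actual timing is set by the DFF latching the data bit on the calClk edge and by the WLPWM pulse width.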
As shown in FIG. 5, which is a schematic structural diagram of the cascade current mirror circuit CCM according to an embodiment of the present invention, the CCM is used to improve the stability and linearity of the circuit and comprises four low-threshold P-type metal-oxide-semiconductor field-effect transistors (PMOSFETs), denoted M0, M1, M2 and M3;
the analog voltage output circuit converts the output current of the cascade current mirror circuit into an output voltage and comprises an N-type metal-oxide-semiconductor field-effect transistor (NMOSFET) and an output capacitor, denoted M4 and C respectively; wherein:
the gates of M0 and M1 are connected to the drain of M2, and are connected through transmission gates TG0 and TG1 to the high level VDD and to the bit line BL of each column of the memory array, respectively; the transmission gates TG0 and TG1 are controlled by the control signal PRE;
the gates of M2 and M3 are connected together and serve as the input for the control signal VCM; the source of M2 is connected to the drain of M0;
the drain of M1 is connected to the source of M3, and the drains of M3 and M4 are connected to the top plate of the output capacitor C;
the bottom plate of the output capacitor C and the source of M4 are connected to ground, and the gate of M4 is the input for the output-clear control signal clearC;
M1 and M3 have the same size, and M0 and M2 have the same size; the width ratio between the two pairs is 1:10;
the cascade current mirror circuit and the analog voltage output circuit mirror the current on the bit line BL onto the output branch at a fixed ratio; scaling the current down reduces the amplitude of current variation on the bit line and improves current stability, while the cascade current mirror also clamps the bit-line voltage so that the output current remains stable.
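As a rough first-order sketch (assuming the mirror scales the bit-line current down by the stated 1:10 width ratio, and ignoring device non-idealities and the capacitor's voltage dependence), the analog output voltage is the mirrored current integrated on C over the WLPWM-controlled activation time:

```python
def output_voltage(i_bl, t_active, c_out, mirror_ratio=10.0):
    """First-order estimate of the analog output voltage:

        V = (I_BL / mirror_ratio) * t_active / C_out

    i_bl: bit-line current in amperes
    t_active: word-line activation time in seconds (set by WLPWM)
    c_out: output capacitance in farads
    mirror_ratio: assumed current scale-down of the cascade mirror
    """
    return (i_bl / mirror_ratio) * t_active / c_out
```

For example, a 100 uA bit-line current mirrored down 10:1 and integrated for 5 ns on a 1 pF capacitor yields 0.05 V. The lookup table mentioned in the description would then map such voltages back to MAC values.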
Based on the above circuit structure, FIG. 6 shows how a low-resistance state and a high-resistance state are written into a 1T1R cell according to an embodiment of the present invention; the activation-function data participating in the BNN convolution operation is written into the 1T1R cells of the memory array, wherein:
FIG. (a) shows the operating state of the write circuit when writing the high-resistance state (High Resistance State, HRS): the source line SL of the memory array is driven high and the bit line BL low, so the RRAM in the 1T1R cell is set to the high-resistance state;
FIG. (b) shows the operating state of the write circuit when writing the low-resistance state (Low Resistance State, LRS): the source line SL of the memory array is driven low and the bit line BL high, so the RRAM in the 1T1R cell is set to the low-resistance state;
To enhance switch conductivity, transmission gates are used in the circuit as the write-circuit switches connected to BL and SL. To demonstrate the technical scheme provided by the invention and its technical effects more clearly, the following details are given taking an 8x8 array as an example:
In this embodiment the accumulation is performed in the analog domain. Splitting the parallel XNOR operation into two steps reduces the probability of a large current flowing through a bit line, while performing the accumulation and the XNOR simultaneously improves computational efficiency. Preparation before the accumulation operation:
Let the numbers of 0s and 1s in the N-bit parallel input data be m and n, respectively. The first step counts the number of cells whose working state is ON, i.e. whose XNOR result is 1; this count is denoted α, and the remaining m − α rows give 0. The second step also counts the cells whose working state is ON, but there the ON state corresponds to an XNOR result of 0; this count is denoted β, and the remaining n − β rows give 1. Among the m rows participating in the first step, α rows encode "+1", so the column sum of the first step is α − (m − α); similarly, among the n rows of the second step, β rows encode "−1", so its column sum is (n − β) − β. The convolution of the input column and the activation column therefore reduces to a bitwise accumulation of the XNOR results:
MAC = α − (m − α) + (n − β) − β = 2α − m + n − 2β
Thus, only the values of α and β need to be obtained in the analog domain; the parallel multiply-accumulate result can then be computed in the digital domain or on a general-purpose computer.
In this example, the parallel activation input example and the weight example are subjected to a single-period, two-phase XNOR operation, as shown in FIG. 7, which illustrates the data encoding and calculation process of this example; the calculation is detailed in Table 1 below:
TABLE 1
Row | Weight | Activation | XNOR result | Encoded value
 1  |   1    |     0      |      0      |      -1
 2  |   1    |     0      |      0      |      -1
 3  |   1    |     1      |      1      |      +1
 4  |   0    |     0      |      1      |      +1
 5  |   0    |     0      |      1      |      +1
 6  |   1    |     1      |      1      |      +1
 7  |   1    |     1      |      1      |      +1
 8  |   0    |     1      |      0      |      -1
 9  |   1    |     0      |      0      |      -1
In the first step, the rows whose Weight column is "0" are XNORed with the corresponding data in the Activation column: in Table 1 the weight is "0" in rows 4, 5 and 8; the XNOR result of rows 4 and 5 is "1", and that of row 8 is "0".
In the second step, the rows whose weight column is "1" are XNORed with the corresponding data in the activation column: in Table 1 the "1" entries in rows 1, 2, 3, 6, 7 and 9 are XNORed with the data of the corresponding rows in the activation column; the XNOR result of rows 1, 2 and 9 is "0", and that of rows 3, 6 and 7 is "1".
For the 9-bit parallel input data, m = 3 and n = 6. The first step performs XNOR on 3 rows, of which 2 rows have an encoded-column result of "+1", i.e. α = 2; the second step performs XNOR on 6 rows, of which 3 rows have an encoded-column result of "−1", i.e. β = 3. Substituting into the formula gives:
2 − 3 + 2 − 3 + 6 − 3 = +1, so the MAC result is "+1".
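The worked example above can be reproduced in a few lines (a hypothetical sketch; the Weight column is taken from the example, and the Activation bits are inferred from the stated XNOR results):

```python
# Rows 1..9 of the example (index 0 corresponds to row 1).
weight     = [1, 1, 1, 0, 0, 1, 1, 0, 1]
activation = [0, 0, 1, 0, 0, 1, 1, 1, 0]  # inferred from the stated XNOR results

m = weight.count(0)  # number of 0s in the input data: 3
n = weight.count(1)  # number of 1s in the input data: 6
# Step 1: among weight-0 rows, count XNOR results of 1.
alpha = sum(1 for w, a in zip(weight, activation) if w == 0 and w == a)
# Step 2: among weight-1 rows, count XNOR results of 0.
beta = sum(1 for w, a in zip(weight, activation) if w == 1 and w != a)
mac = 2 * alpha - m + n - 2 * beta
print(alpha, beta, mac)  # prints: 2 3 1
```

This matches the text: α = 2, β = 3, and the MAC result is +1.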
Therefore, when performing a parallel multiply-accumulate operation with this circuit structure, the activation matrix is written into the memory array according to the data correspondence of its convolution with the weight matrix, and the 0s and 1s of the weight data are split into two phases of word-line enable signals;
the convolution operation of the binary neural network, i.e. the multiply-accumulate of the weight matrix and the activation-function matrix, is realized on the memory array built from resistive random access memory; the weight matrix is allowed to be at most 8x8, and the operation result ranges from -64 to +64.
It is noted that what is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art.
In summary, the circuit according to the embodiment of the invention has the following advantages:
1. single-sided read/write effectively improves the utilization of the array cells;
2. the multi-row read scheme effectively increases the speed of in-memory computing; at the same time, using 1T1R cells as nonvolatile memory avoids the crosstalk between different nodes and the data-corruption risk that conventional SRAM memory cells suffer during multi-row reads, improving system stability;
3. to reduce the nonlinearity of the bit-line voltage integration in calculation mode, the scheme introduces a cascade current mirror circuit that clamps the bit-line voltage while mirroring the bit-line current at a fixed ratio, improving the stability and linearity of the output.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims. The information disclosed in the background section herein is only for enhancement of understanding of the general background of the invention and is not to be taken as an admission or any form of suggestion that this information forms the prior art already known to those of ordinary skill in the art.

Claims (5)

1. A circuit for parallel multiply-accumulate operation in a binary neural network based on an RRAM array, characterized in that the circuit comprises a parallel input circuit, a mode selection circuit, a memory array of 1T1R cells, a cascade current mirror circuit and an analog voltage output circuit, wherein:
the output terminals Out of the parallel input circuit are connected to input port 1 of the 2-to-1 multiplexers MUX of the mode selection circuit, and through the multiplexers to the word lines WL of the memory array;
input port 0 of the mode selection circuit receives the read/write address input signal R/W addr, and the select port receives the mode selection control signal MSEL;
the input CCM-IN of the cascade current mirror circuit is connected to the bit lines BL of the memory array, and the output CCM-OUT is connected to the top plate of the capacitor of the analog voltage output circuit;
the memory array of 1T1R cells is a 64×64 RRAM array with a pseudo-crossbar structure, each 1T1R cell consisting of one NMOSFET and one RRAM; the bottom electrode (BE) of the RRAM is connected to the drain of the NMOSFET, forming a memory cell in which the RRAM is gated by the NMOSFET; the gate of the NMOSFET is the control port of the cell, and its source and the top electrode (TE) of the RRAM are the data read/write ports;
in the memory array, the top electrodes TE of the RRAMs of the 1T1R cells in the same column are connected to the bit line BL of that column, and the gate and source of each NMOSFET are connected to the word line WL and source line SL of the memory array, respectively;
the word line WL of each row of the memory array is driven by the parallel input circuit, which accepts 64-bit parallel input data, so that up to 64 values of an 8×8 weight matrix and the 64 values stored in the memory array complete a binary neural network (BNN) convolution operation within one period of calClk;
the bit line BL of each column of the memory array is connected to the current input of the cascade current mirror circuit, whose output is connected to the top plate of the output capacitor;
with this structure, the parallel input circuit converts input data of value 0 into activation signals on the corresponding word lines WL during the low level of the calculation clock, and converts input data of value 1 into activation signals on the corresponding word lines WL during the high level of the calculation clock;
while a word line WL is activated, the cascade current mirror circuit mirrors the current on the bit line BL to its output and charges the output capacitor, yielding an analog voltage that represents the BNN convolution result; the final BNN convolution result is obtained from a preset lookup table mapping voltages to actual values.
2. The circuit for parallel multiply-accumulate operation in a binary neural network based on an RRAM array of claim 1, wherein each of the parallel input circuits comprises: a D flip-flop DFF; a NOT gate INV; two-input OR gates, OR2-0 and OR2-1, respectively; two-input AND gates, AND2-0 AND AND2-1, respectively, wherein:
the data input terminal D of the D flip-flop DFF is connected to the data input port Input; its non-inverting output is connected to one input of the two-input OR gate OR2-0, and its inverting output is connected to one input of the two-input OR gate OR2-1;
the calculation clock signal calClk is connected to the clock input CP of the D flip-flop DFF, to the input of the NOT gate INV, and to the other input of the two-input OR gate OR2-0;
the output of the NOT gate INV is connected to the other input of the two-input OR gate OR2-1;
the outputs of the two-input OR gates OR2-0 and OR2-1 are connected to the two inputs of the two-input AND gate AND2-0;
the two inputs of the two-input AND gate AND2-1 are connected to the output of the two-input AND gate AND2-0 and to the word line pulse width modulation signal WLPWM, respectively;
the output of the two-input AND gate AND2-1 serves as the one-bit data output Out of the parallel input circuit; the word line pulse width modulation signal WLPWM controls the time for which the word line is activated during calculation;
the parallel input circuit uses the calculation clock signal calClk to convert the 0s and 1s in the input data into word line WL activation signals when calClk is at a high level and at a low level, respectively.
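The gate network of claim 2 reduces to an XOR of the latched data bit with the clock, gated by WLPWM: OR2-0, OR2-1, and AND2-0 together compute Q XOR calClk. A truth-table sketch, using the signal names from the claim (Q denotes the DFF's non-inverting output):

```python
def parallel_input_out(Q, calClk, WLPWM):
    """One-bit parallel input circuit of claim 2 as combinational logic."""
    or2_0 = Q or calClk                # OR2-0: Q OR calClk
    or2_1 = (not Q) or (not calClk)    # OR2-1: /Q OR INV(calClk)
    and2_0 = or2_0 and or2_1           # AND2-0: equivalent to Q XOR calClk
    return bool(and2_0 and WLPWM)      # AND2-1: gated by the WL pulse width

# The word line is driven active when the latched data bit differs from
# the current clock level, for as long as WLPWM stays high.
```

This shows why each data bit activates its word line in exactly one of the two clock phases, never both.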
3. The circuit for parallel multiply-accumulate operation in a binary neural network based on an RRAM array according to claim 1, wherein the cascaded current mirror circuit CCM improves the stability and linearity of the circuit and comprises four low-threshold P-type metal-oxide-semiconductor field-effect transistors, denoted M0, M1, M2 and M3;
the analog voltage output circuit converts the output current of the cascaded current mirror circuit into an output voltage and comprises an N-type metal-oxide-semiconductor field-effect transistor and an output capacitor, denoted M4 and C, respectively; wherein:
the gates of M0 and M1 are connected to the drain of M2 and are connected through transmission gates TG0 and TG1 to the high level VDD and to the bit line BL of each column of the memory array, respectively; the transmission gates TG0 and TG1 are controlled by a control signal PRE;
the gates of M2 and M3 are connected together, and the source of M2 is connected to the drain of M0, serving as the input terminal of a control signal VCM;
the drain of M1 is connected to the source of M3, and the drains of M3 and M4 are connected to the upper plate of the output capacitor C;
the lower plate of the output capacitor C and the source of M4 are connected to ground, and the gate of M4 is the input terminal of the output clear control signal clearC;
wherein M1 and M3 have the same size, M0 and M2 have the same size, and the width ratio between the two pairs is 1:10;
the cascaded current mirror circuit and the analog voltage output circuit mirror the current on the bit line BL onto the output circuit in proportion; the proportionally reduced current lowers the amplitude of the current variation on the bit line and improves the current stability, while the cascode structure clamps the bit line voltage so that the output current remains stable.
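Taking the claim's statement that the mirrored current is reduced in proportion to the 1:10 width ratio, the output voltage is simply the scaled bit-line current integrated onto C. A numeric sketch (the example current, time, and capacitance are assumed values, not from the patent):

```python
MIRROR_RATIO = 1.0 / 10.0   # output/bit-line current ratio implied by the 1:10 widths

def output_voltage(i_bl, t_wl, c_out):
    """Voltage on capacitor C after mirroring the bit-line current i_bl
    for the word-line activation time t_wl (ideal mirror, no leakage)."""
    i_out = i_bl * MIRROR_RATIO       # proportionally reduced current
    return i_out * t_wl / c_out       # V = Q / C
```

For example, 100 uA on the bit line mirrored for 10 ns into a 1 pF capacitor yields 0.1 V, comfortably within a typical supply range.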
4. The circuit for parallel multiply-accumulate operations in a binary neural network based on an RRAM array of claim 1, wherein, based on the circuit structure, the activation function data participating in the BNN convolution operation is written into the 1T1R cells of the memory array, wherein:
when a high resistance state needs to be written, the source line SL of the memory array is set to a high level and the bit line BL is set to a low level, so that the RRAM in the 1T1R cell is set to the high resistance state;
when a low resistance state needs to be written, the source line SL of the memory array is set to a low level and the bit line BL is set to a high level, so that the RRAM in the 1T1R cell is set to the low resistance state.
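The two write conditions of claim 4 amount to a small truth table for the SL/BL drive levels; a sketch (the string level names are illustrative, and no claim is made here about current direction or pulse timing):

```python
def write_levels(resistance_state):
    """Return the (SL, BL) drive levels for writing a 1T1R RRAM cell,
    per the two conditions of claim 4."""
    if resistance_state == "HRS":       # high resistance state
        return ("high", "low")          # SL high, BL low
    if resistance_state == "LRS":       # low resistance state
        return ("low", "high")          # SL low, BL high
    raise ValueError("resistance_state must be 'HRS' or 'LRS'")
```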
5. The circuit for parallel multiply-accumulate operation in a binary neural network based on an RRAM array according to claim 1, wherein, when the parallel multiply-accumulate operation is performed based on the circuit structure, the activation matrix is written into the memory array according to its data correspondence with the weight matrix in the convolution operation, and the '0's and '1's in the weight data are separated into two phases of word line enable signals;
the parallel multiply-accumulate results realized in the memory array range from -64 to +64, and the maximum supported weight matrix is an 8×8 array.
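The -64 to +64 range is consistent with an 8×8 (64-element) binary convolution in which each stored 0/1 bit is interpreted in the usual BNN bipolar encoding as -1/+1 (an interpretation assumed here, since the claim states only the endpoints). A sketch of that arithmetic:

```python
def bnn_mac(weights, activations):
    """Bipolar multiply-accumulate for a binary convolution: bit 0 -> -1, bit 1 -> +1.

    With 8x8 = 64 element pairs the result spans -64 .. +64."""
    assert len(weights) == len(activations) <= 64
    to_bipolar = lambda b: 2 * b - 1          # 0 -> -1, 1 -> +1
    return sum(to_bipolar(w) * to_bipolar(a)  # each product is +1 iff w == a
               for w, a in zip(weights, activations))
```

Each elementwise product is +1 when the bits match and -1 when they differ (an XNOR), so 64 all-matching pairs give +64 and 64 all-differing pairs give -64.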
CN202111395976.XA 2021-09-14 2021-11-23 Circuit for parallel multiply-accumulate operation in binary neural network based on RRAM array Active CN114254743B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111075385 2021-09-14
CN2021110753854 2021-09-14

Publications (2)

Publication Number Publication Date
CN114254743A CN114254743A (en) 2022-03-29
CN114254743B true CN114254743B (en) 2024-03-15

Family

ID=80793033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111395976.XA Active CN114254743B (en) 2021-09-14 2021-11-23 Circuit for parallel multiply-accumulate operation in binary neural network based on RRAM array

Country Status (1)

Country Link
CN (1) CN114254743B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115756388B (en) * 2023-01-06 2023-04-18 上海后摩智能科技有限公司 Multi-mode storage and calculation integrated circuit, chip and calculation device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581996A (en) * 2020-12-21 2021-03-30 东南大学 Time domain memory computing array structure based on magnetic random access memory

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11269629B2 (en) * 2018-11-29 2022-03-08 The Regents Of The University Of Michigan SRAM-based process in memory system




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant