CN114937470B - Fixed point full-precision memory computing circuit based on multi-bit SRAM unit - Google Patents
- Publication number
- CN114937470B (application CN202210549764.0A)
- Authority
- CN
- China
- Prior art keywords
- bit
- output
- adder
- input
- nmos
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C11/00—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
- G11C11/54—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements simulating biological cells, e.g. neuron
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F5/00—Methods or arrangements for data conversion without changing the order or content of the data handled
- G06F5/01—Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/50—Adding; Subtracting
- G06F7/501—Half or full adders, i.e. basic adder cells for one denomination
- G06F7/502—Half adders; Full adders consisting of two cascaded half adders
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/065—Analogue means
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C11/00—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
- G11C11/21—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
- G11C11/34—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
- G11C11/40—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
- G11C11/41—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger
- G11C11/413—Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction
- G11C11/417—Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction for memory cells of the field-effect type
- G11C11/419—Read-write [R-W] circuits
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the technical field of integrated circuits, and particularly relates to a fixed-point full-precision in-memory computing circuit based on multi-bit SRAM cells. On the basis of a conventional SRAM storage array, two transistors are added to each cell to form a transmission gate that realizes multiplication, an adder tree is added for partial-sum accumulation, and a bit-serial input scheme together with a shift accumulator completes the multi-bit operation, so that matrix-vector multiplication is carried out inside the SRAM array without loss of precision. The invention thus realizes multi-bit SRAM in-memory computing without precision loss, features small area and high parallelism, and is suitable for convolutional neural network systems that require large-scale multiply-accumulate computation.
Description
Technical Field
The invention belongs to the technical field of integrated circuits, and particularly relates to a fixed point full-precision memory computing circuit based on a multi-bit SRAM unit.
Background
In recent years, computing power has grown with the development of integrated circuits, and the field of artificial intelligence has advanced rapidly. Its application scenarios typically involve images, audio and video, which are data-intensive workloads, in contrast to traditional computation-intensive and control-intensive tasks. Convolutional neural networks (CNNs) have become widely used, especially for processing images and video.
However, because both the convolutional layers and the fully connected layers require large numbers of weights and convolution operations, they not only challenge the computing capability of the conventional von Neumann architecture, but also make the massive data movement a bottleneck for the power consumption and speed of the whole system. In the embedded field in particular, more and more Internet-of-Things devices need AI capability, yet they are limited by battery life and by the computing power of the MCU, so AI tasks can only be completed by sending data to the cloud and returning the results after processing. The resulting latency is high and in some cases unacceptable, and personal privacy is not well protected.
An SRAM in-memory computing array is a solution for such data-intensive applications: multiply-accumulate operations are completed inside the memory, and multi-bit data are multiplied and accumulated in parallel, a computation pattern that matches CNN workloads and meets real-time requirements. Moreover, because the weights are stored in the array, the power consumed by moving weights back and forth is avoided, reducing the power consumption of the whole system.
Disclosure of Invention
To address the problem that a conventional SRAM array cannot perform computation internally, the invention provides a fixed-point full-precision in-memory computing circuit based on multi-bit SRAM cells, which realizes multi-bit in-memory computing while significantly improving the energy-efficiency ratio through an innovative structural design.
The technical scheme of the invention is as follows:
The fixed-point full-precision in-memory computing circuit based on multi-bit SRAM cells is characterized by comprising a storage array of 64 rows and 4 columns of memory cells, 1 adder tree, 4 sense amplifiers and 1 accumulator.
In the memory array, the memory cells in each column are connected to two signal lines, BL and BLB, which are the read-write bit lines used to carry data during read and write operations. The memory cells in each row are connected to three signal lines, WL, input and output: WL is the read-write word line used to select a row during read and write operations, input carries the input data in the in-memory computing mode, and output delivers the product of the input and the stored value in the in-memory computing mode.
The memory computing circuit has an SRAM mode and an in-memory computing mode. The inputs of the sense amplifiers are connected to the BL and BLB signal lines; the SRAM mode uses the sense-amplifier outputs, while the in-memory computing mode combines the outputs of the 4 columns in each row into out1[3:0] to out64[3:0] and feeds them into the adder tree for accumulation.
Specifically, the adder tree has 64 4-bit input ports in1[3:0] to in64[3:0], each corresponding to the output of one row of the storage array, and a single 10-bit output port sum[9:0]. The adder tree does not work in the SRAM mode; in the in-memory computing mode it accumulates in parallel the multiplication results of the 4 cells in each row.
Specifically, the accumulator is responsible for accumulating the results of the adder tree, and has a 10-bit input port iat[9:0] and a 14-bit output port result[13:0]; in the in-memory computing mode, the output of the accumulator is the output of the memory computing circuit.
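The chosen port widths follow from a quick worked check (an editorial sketch based on the figures above, not text from the patent): with 64 rows and unsigned 4-bit weights, each row output is at most 15, the adder-tree sum is at most 64 × 15 = 960 (fits in 10 bits), and after weighting the four bit-serial cycles the accumulator result is at most 15 × 960 = 14 400 (fits in 14 bits).

```python
# Editorial sanity check of the 10-bit adder-tree and 14-bit accumulator widths.
ROWS, WEIGHT_BITS, INPUT_BITS = 64, 4, 4

max_row_out = (1 << WEIGHT_BITS) - 1                  # 15: 4-bit weight ANDed with input bit 1
max_tree_sum = ROWS * max_row_out                     # 960 < 2**10, so sum[9:0] suffices
max_result = ((1 << INPUT_BITS) - 1) * max_tree_sum   # 14400 < 2**14, so result[13:0] suffices

assert max_tree_sum < (1 << 10) and max_result < (1 << 14)
```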
Specifically, each memory cell in the memory array is an 8-transistor cell comprising a first PMOS transistor, a second PMOS transistor, and first through sixth NMOS transistors. The sources of the first and second PMOS transistors are connected to the power supply. The drain of the first PMOS transistor is connected to the gate of the second PMOS transistor, the drain of the first NMOS transistor, the gate of the second NMOS transistor, the drain of the third NMOS transistor and the gate of the fifth NMOS transistor; the drain of the second PMOS transistor is connected to the gate of the first PMOS transistor, the gate of the first NMOS transistor, the drain of the second NMOS transistor, the drain of the fourth NMOS transistor and the gate of the sixth NMOS transistor. The sources of the first and second NMOS transistors are grounded. The gate of the third NMOS transistor is connected to the row read-write signal WL and its source to the column bit line BL; the gate of the fourth NMOS transistor is connected to WL and its source to the column bit line BLB. The source of the fifth NMOS transistor is connected to the input signal input, and its drain is connected to the drain of the sixth NMOS transistor to form the output signal output; the source of the sixth NMOS transistor is grounded.
Specifically, the adder tree comprises 6 levels that accumulate with alternating 10T and 28T full adders. The 1st level is built from 32 4-bit 10T full adders and generates 32 5-bit partial sums; the combination order runs from input 0 to input 63, combining adjacent inputs in sequence. The 2nd level is built from 16 5-bit 28T full adders and generates 16 6-bit partial sums, with the same combination pattern as the 1st level. The 3rd level is built from 8 6-bit 10T full adders and generates 8 7-bit partial sums. The 4th level is built from 4 7-bit 28T full adders and generates 4 8-bit partial sums. The 5th level is built from 2 8-bit 10T full adders and generates 2 9-bit partial sums. The 6th level is built from one 9-bit 28T full adder and generates the final 10-bit accumulated sum, which is output to the accumulator.
Specifically, the accumulator comprises a first D flip-flop, a second D flip-flop, a third D flip-flop, a 14-bit adder and a shift circuit. The input of the first D flip-flop is connected to the output port of the adder tree, and its output is connected to one input of the 14-bit adder; the other input of the 14-bit adder is connected to the output of the shift circuit, whose input is connected to the output of the second D flip-flop; the input of the second D flip-flop is connected to the output of the 14-bit adder, and the shift circuit shifts the output of the second D flip-flop left by one bit before feeding it back to the 14-bit adder. The input of the third D flip-flop is connected to the output of the 14-bit adder, and its output is the output of the accumulator. In each clock cycle, the accumulator shifts the previous accumulated result left by one bit and adds the partial sum produced in the current cycle through the 14-bit adder. Specifically, the input bits are applied clock cycle by clock cycle from MSB to LSB, so that in each cycle the adder tree produces a 10-bit partial sum, which is shifted left and accumulated in sequence. This achieves multi-bit multiply-accumulation; for a 4-bit input, the operation takes 5 clock cycles.
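A minimal behavioral sketch of this accumulator is given below (Python; the class and method names are illustrative, and the three D flip-flops are collapsed into a single result register so only the shift-and-add recurrence is modeled):

```python
class ShiftAccumulator:
    """Behavioral model: each clock, result <- ((result << 1) + partial_sum) within 14 bits."""

    MASK14 = (1 << 14) - 1          # width of the 14-bit adder / result register

    def __init__(self):
        self.result = 0             # stands in for the D flip-flops holding the running result

    def clock(self, partial_sum):
        # one-bit left shift of the previous result, then addition of the 10-bit adder-tree sum
        self.result = ((self.result << 1) + (partial_sum & 0x3FF)) & self.MASK14
        return self.result

# usage: feed the four partial sums MSB-bit first; the result is 8*s0 + 4*s1 + 2*s2 + s3
acc = ShiftAccumulator()
for s in (3, 1, 4, 1):
    acc.clock(s)
assert acc.result == 8 * 3 + 4 * 1 + 2 * 4 + 1   # 37
```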
Specifically, the supported operation data type is an unsigned number.
The invention has the following beneficial effects: by modifying the basic SRAM memory cell and adding an adder tree and a shift-accumulator circuit, it realizes multi-bit SRAM in-memory computing without precision loss; it has small area and high parallelism, and is suitable for convolutional neural network systems that require large-scale multiply-accumulate computation.
Drawings
FIG. 1 is a fixed point full precision memory computing circuit based on a multi-bit SRAM cell according to the present invention.
FIG. 2 is a schematic diagram of an 8T SRAM memory operation unit.
FIG. 3 is a diagram of an adder tree architecture.
FIG. 4 is a diagram of 10T and 28T full adders used in an adder tree.
Fig. 5 is a schematic diagram of an accumulator and a corresponding timing diagram.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of the fixed-point full-precision in-memory computing circuit based on multi-bit SRAM cells according to the present invention. The compute-in-memory SRAM array is composed of 64 rows and 4 columns; each row shares two word lines, WL and input, each column shares two bit lines, BL and BLB, and the output of each row is connected to the adder tree. The array contains 64 × 4 8-transistor memory cells in total; each 8-transistor cell comprises first through sixth NMOS transistors and first and second PMOS transistors.
FIG. 2 shows the structure of the 8T SRAM cell. The sources of the first and second PMOS transistors are connected to the supply voltage. The sources of the first, second and sixth NMOS transistors are grounded. The drain of the first PMOS (node Q) is connected to the gate of the second PMOS, the drain of the first NMOS, the gate of the second NMOS, the drain of the third NMOS and the gate of the fifth NMOS. The drain of the second PMOS (node QB) is connected to the gate of the first PMOS, the gate of the first NMOS, the drain of the second NMOS, the drain of the fourth NMOS and the gate of the sixth NMOS. The gates of the third and fourth NMOS transistors are connected to the word line WL. The source of the third NMOS is connected to BL and the source of the fourth NMOS to BLB. The source of the fifth NMOS is connected to input, and its drain is connected to the drain of the sixth NMOS to form the output node.
In the fixed-point full-precision in-memory computing circuit based on multi-bit SRAM cells, the body (bulk) terminals of all NMOS transistors are connected to ground (GND), and the body terminals of all PMOS transistors are connected to the supply voltage VDD.
To realize matrix-vector multiplication inside the storage array, the invention adds two transistors forming a transmission gate: the input terminal is connected to the input line, and the two gates are connected to Q and QB respectively, implementing AND logic. In this way the product of the stored data and the input data appears on the output port output.
To realize multi-bit accumulation inside the memory array, the invention uses an adder tree and a shift accumulator. The adder tree accumulates in parallel the 64 row outputs, each of which is the product of the 4 memory cells in that row and the corresponding input. The shift accumulator shifts left and accumulates the partial sums generated by the adder tree over 4 successive cycles, as described in detail later.
The operation of the memory array circuit of the present invention is described in detail below with reference to fig. 1, 2, 3, 4 and 5:
1. SRAM mode:
(1) And (3) keeping operation:
during the period in which the memory cell holds data, the word line WL is kept at a low level. At this time, the third NMOS transistor MN3 and the fourth NMOS transistor MN4 are both turned off, and the read bit lines BL and BLB do not affect the storage node Q or QB. The latch structure formed by the first PMOS transistor MP1, the second PMOS transistor MP2, the first NMOS transistor MN1, and the second NMOS transistor MN2 latches data of the storage nodes Q and QB.
(2) And (3) writing operation:
suppose that 8-pipe memory cell storage node Q is high and QB is low before a write operation, i.e. storing data as '1'. When writing data '0', the write operation word line is pulled high to high level to select the cell, and simultaneously, data '0' to be written is loaded on the write bit line, namely BL is low level and BLB is high level. The BL pulls down the node Q through the third NMOS transistor MN3, the BLB pulls up the node QB through the fourth NMOS transistor MN4, the latch structure feedback loop is broken, and the data '0' is written in the storage unit. Writing data '1' is the same as the above process.
(3) Read operation
Suppose that before a read operation the storage node Q of the memory cell is high and QB is low, i.e., the stored data is '1'. When the read operation starts, the bit lines BL and BLB are precharged to a high level and the word line WL is pulled high, turning on the third NMOS transistor MN3 and the fourth NMOS transistor MN4; because node Q is high, the second NMOS transistor is also on. BLB is therefore pulled low through the second NMOS transistor MN2 and the fourth NMOS transistor, while BL remains unchanged, completing the read of '1'. Reading data '0' proceeds in the same way.
2. Memory computing mode:
In the in-memory computing mode, stored data '0' represents the value 0 and stored data '1' represents the value 1.
If the cell stores 1, i.e., node Q is '1' and QB is '0', the fifth NMOS transistor MN5 is on and the sixth NMOS transistor MN6 is off. The output is then connected to the input through MN5: if input is '0', output is '0'; if input is '1', output is '1'.
If the cell stores 0, i.e., node Q is '0' and QB is '1', MN5 is off and MN6 is on, and the output is '0' regardless of whether the input is '0' or '1'.
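Both cases reduce to a logical AND between the stored bit Q and the input bit, which can be captured in a one-line behavioral model (an editorial sketch, not a transistor-level description):

```python
def cell_output(q, input_bit):
    """In-memory compute output of one 8T cell: MN5 passes input when Q = 1,
    MN6 pulls the output to ground when Q = 0, i.e. output = Q AND input."""
    return q & input_bit

# full truth table: only q = 1, input = 1 gives output 1
assert [cell_output(q, x) for q in (0, 1) for x in (0, 1)] == [0, 0, 0, 1]
```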
Since the four memory cells in each row share one input line, the output of the n-th row at this time is

out_n[i] = x_n · W_i, i = 0, 1, 2, 3

where x_n is the input bit applied to row n and W_i denotes the data '0' or '1' stored in the corresponding column of FIG. 1. The adder tree then accumulates the 4-bit multiplication outputs generated by the 64 rows in parallel.
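Equivalently, the whole 4-bit row output is the stored weight word gated by the row's input bit; a short sketch follows (the function and variable names are illustrative, not from the patent):

```python
def row_output(weight, input_bit):
    """Output of one row in compute mode: the 4-bit stored weight ANDed bitwise with
    the shared input bit, which as an unsigned value equals input_bit * weight."""
    out_bits = [(weight >> i) & input_bit for i in (3, 2, 1, 0)]   # out_n[3..0]
    value = input_bit * weight
    return out_bits, value

assert row_output(0b1010, 1) == ([1, 0, 1, 0], 10)
assert row_output(0b1010, 0) == ([0, 0, 0, 0], 0)
```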
The input in_i of the adder tree in FIG. 3 is connected to the corresponding output out_i generated by each row of the array. The first stage passes the 64 inputs through 10T 4-bit adders and produces 32 5-bit partial sums; the 10T adders reduce area but suffer a threshold-voltage loss. The second stage therefore uses 28T 5-bit mirror adders to prevent successive threshold drops from accumulating into errors. The third stage again uses 10T adders, the fourth stage 28T adders, the fifth stage 10T adders, and the sixth stage a 28T adder, completing the whole accumulation and producing the accumulated sum output sum[9:0].
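A behavioral sketch of the six-stage reduction is shown below (Python; only the arithmetic is modeled, since the 10T/28T alternation affects the circuit implementation but not the computed value):

```python
def adder_tree(row_outputs):
    """Pairwise reduction of 64 4-bit row values into one 10-bit sum, mirroring the
    six stages 64 -> 32 -> 16 -> 8 -> 4 -> 2 -> 1 with widths 5, 6, 7, 8, 9, 10 bits."""
    assert len(row_outputs) == 64
    level = list(row_outputs)
    for _ in range(6):
        # combine adjacent operands in order, as stage 1 does for inputs 0..63
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]   # sum[9:0], at most 64 * 15 = 960

assert adder_tree([15] * 64) == 960
```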
Fig. 5 shows the timing diagram of the bit-serial input mode and the circuit diagram of the corresponding accumulator. For a 4-bit input X multiplied by a 4-bit weight W, the input bits are applied sequentially from MSB to LSB, one per clock cycle: the 1st cycle applies X_3, the 2nd cycle X_2, the 3rd cycle X_1, and the 4th cycle X_0. In each clock cycle the adder tree generates an accumulated sum, denoted sum_i.
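The bit-serial schedule can be written as a small helper (an illustrative sketch; X is an unsigned 4-bit input):

```python
def msb_first_bits(x, width=4):
    """Bits of a 4-bit input in the order they are applied: X_3 in cycle 1, ..., X_0 in cycle 4."""
    return [(x >> (width - 1 - j)) & 1 for j in range(width)]

assert msb_first_bits(0b1011) == [1, 0, 1, 1]
```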
In the accumulator, the output sum of the adder tree is connected to the input iat[9:0]. In each clock cycle the previous accumulation result is shifted left by one bit and added to the accumulated sum generated in the current cycle by the 14-bit adder. The specific process is as follows:
Cycle 1: result = 0, S = (result << 1) + sum_0;
Cycle 2: result = sum_0, S = (result << 1) + sum_1;
Cycle 3: result = (sum_0 << 1) + sum_1, S = (result << 1) + sum_2;
Cycle 4: result = (sum_0 << 2) + (sum_1 << 1) + sum_2, S = (result << 1) + sum_3;
Cycle 5: result = (sum_0 << 3) + (sum_1 << 2) + (sum_2 << 1) + sum_3.
Since a one-bit left shift in binary is equivalent to multiplying by 2, substituting and simplifying the expressions above gives the final accumulator output:

result = 8·sum_0 + 4·sum_1 + 2·sum_2 + sum_3 = Σ_{n=1}^{64} X_n · W_n

That is, the multiply-accumulate operation of 64 4-bit inputs with their 4-bit weights is completed. A full vector-matrix multiplication can be carried out by combining multiple such banks.
In summary, the fixed-point full-precision in-memory computing circuit based on multi-bit SRAM cells realizes matrix-vector multiplication through structural improvements. Compared with the conventional structure, the array adds two transistors forming a transmission gate to implement AND logic, and the product of the input and the stored cell value is obtained at the output port. The multiplication results of the array are added in parallel by the adder tree, the alternating use of 10T and 28T full adders greatly reduces area, and multi-bit input is realized through a bit-serial input scheme and a shift accumulator, thereby realizing multi-bit matrix-vector multiplication inside the SRAM array.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention, and it is to be understood that the scope of the invention is not limited to such specific statements and embodiments (e.g., number of rows 64 and number of columns 4). Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
Claims (1)
1. A fixed-point full-precision memory computing circuit based on multi-bit SRAM cells, characterized in that the memory computing circuit has an SRAM mode and an in-memory computing mode and comprises a storage array, an adder tree, sense amplifiers and an accumulator;
the memory array is composed of 64 rows and 4 columns of memory cells; the memory cells in each column are connected to a column read operation signal BL and a column write operation signal BLB; the memory cells in each row are connected to a row read-write operation signal WL, an input signal input and an output signal output, wherein WL is used for selecting a row during read-write operations, the input signal is used for inputting data in the in-memory computing mode, and the output signal is used for outputting the multiplication result of the input data and the stored value in the in-memory computing mode;
the number of sense amplifiers is 4; the input of each sense amplifier corresponds to the column read operation signal BL and the column write operation signal BLB of one column of memory cells, and the output of the sense amplifiers is the output of the memory computing circuit in the SRAM mode;
the adder tree has 64 4-bit input ports in1[3 ] to in64[3 ], each port corresponds to the output of each row of the storage array, and the output port of the adder tree is a 10-bit port sum [9 ], which represents the accumulation result of all inputs; the adder tree does not work in an SRAM mode, and multiplication results of 4 units in each row are accumulated in parallel in an in-memory calculation mode;
the accumulator is responsible for accumulating the result of the adder tree, and has a 10-bit input port iat [9 ]:0 ] and a 14-bit output port result [13 ]:0, and in the memory calculation mode, the output of the accumulator is the output of the memory calculation circuit;
the storage unit in the storage array is an 8-transistor storage unit and comprises a first PMOS (P-channel metal oxide semiconductor) transistor, a second PMOS transistor, a first NMOS (N-channel metal oxide semiconductor) transistor, a second NMOS transistor, a third NMOS transistor, a fourth NMOS transistor, a fifth NMOS transistor and a sixth NMOS transistor, wherein the source electrode of the first PMOS transistor and the source electrode of the second PMOS transistor are connected with a power supply, and the drain electrode of the first PMOS transistor is connected with the grid electrode of the second PMOS transistor, the drain electrode of the first NMOS transistor, the grid electrode of the second NMOS transistor, the drain electrode of the third NMOS transistor and the grid electrode of the fifth NMOS transistor; the drain electrode of the second PMOS tube is connected with the grid electrode of the first PMOS tube, the grid electrode of the first NMOS tube, the drain electrode of the second NMOS tube, the drain electrode of the fourth NMOS tube and the grid electrode of the sixth NMOS tube; the source electrode of the first NMOS tube and the source electrode of the second NMOS tube are grounded; the grid electrode of the third NMOS tube is connected with a row read-write operation signal WL, and the source electrode of the third NMOS tube is connected with a column read-write operation signal BL; the grid electrode of the fourth NMOS tube is connected with a row read-write operation signal WL, and the source electrode of the fourth NMOS tube is connected with a column write operation signal BLB; the source electrode of the fifth NMOS tube is connected with an input signal input, and the drain electrode of the fifth NMOS tube is connected with the drain electrode of the sixth NMOS tube to be used as an output signal output; the source electrode of the sixth NMOS tube is grounded;
the adder tree comprises 6 levels of addition trees, and the addition trees are alternately arranged by using 10T full adders and 28T full adders respectively for accumulation, wherein the 1 st level of addition tree is formed by using 32 4-bit 10T full adders to generate 32 5-bit accumulation sums, the accumulation combination mode is from 0 to 63, and two adjacent inputs are combined in sequence; the 2-level addition tree is formed by 16 5bit 28T full adders, 16 6-bit accumulation sums are generated, and the accumulation combination mode is the same as that of the 1 st level; the 3 rd stage addition tree is formed by 8 6-bit 10T full adders, 8 accumulated sums of 7 bits are generated, and the accumulation combination mode is the same as that of the 1 st stage; the 4 th-level addition tree is formed by 4 7bit 28T full adders to generate 4 accumulated sums of 8 bits, and the accumulated combination mode is the same as that of the 1 st level; the 5 th-stage addition tree is formed by 2 8-bit 10T full adders, 2 9-bit accumulation sums are generated, and the accumulation combination mode is the same as that of the 1 st stage; the 6 th-stage addition tree is formed by 1 9bit 28T full adder, generates 1 accumulated sum of 10 bits as output and inputs the output to the accumulator;
the accumulator comprises a first D trigger, a second D trigger, a third D trigger, a 14bit adder and a shift circuit; the input end of the first D trigger is connected with the output port of the adder tree, the output end of the first D trigger is connected with one input end of the 14-bit adder, the other input end of the 14-bit adder is connected with the output end of the shifting circuit, the input end of the shifting circuit is connected with the output end of the second D trigger, the input end of the second D trigger is connected with the output end of the 14-bit adder, and the shifting circuit is used for shifting the output of the second D trigger by one bit to the left and then inputting the output of the second D trigger into the 14-bit adder; the input end of the third D trigger is connected with the output end of the 14bit adder, and the output end of the third D trigger is the output end of the accumulator; the accumulator shifts the last accumulation result to the left by one bit each clock cycle and adds the accumulated sum generated by the current clock cycle through a 14-bit adder.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210549764.0A CN114937470B (en) | 2022-05-20 | 2022-05-20 | Fixed point full-precision memory computing circuit based on multi-bit SRAM unit |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114937470A CN114937470A (en) | 2022-08-23 |
CN114937470B (en) | 2023-04-07
Family
ID=82864195
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115658013B (en) * | 2022-09-30 | 2023-11-07 | 杭州智芯科微电子科技有限公司 | ROM in-memory computing device of vector multiply adder and electronic equipment |
CN116913342B (en) * | 2023-09-13 | 2023-12-01 | 安徽大学 | Memory circuit with in-memory Boolean logic operation function, and module and chip thereof |
CN118503203B (en) * | 2024-07-10 | 2024-09-24 | 中国人民解放军国防科技大学 | Configurable in-memory computing architecture based on standard cells and compiler therefor |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6054918A (en) * | 1996-09-30 | 2000-04-25 | Advanced Micro Devices, Inc. | Self-timed differential comparator |
CN113419705A (en) * | 2021-07-05 | 2021-09-21 | 南京后摩智能科技有限公司 | Memory multiply-add calculation circuit, chip and calculation device |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9183922B2 (en) * | 2013-05-24 | 2015-11-10 | Nvidia Corporation | Eight transistor (8T) write assist static random access memory (SRAM) cell |
CN105512724B (en) * | 2015-12-01 | 2017-05-10 | 中国科学院计算技术研究所 | Adder device, data accumulation method, and data processing device |
CN110515454B (en) * | 2019-07-24 | 2021-07-06 | 电子科技大学 | Neural network architecture electronic skin based on memory calculation |
US11500629B2 (en) * | 2020-01-07 | 2022-11-15 | SK Hynix Inc. | Processing-in-memory (PIM) system including multiplying-and-accumulating (MAC) circuit |
CN113035251B (en) * | 2021-05-21 | 2021-08-17 | 中科院微电子研究所南京智能技术研究院 | Digital memory computing array device |
CN113345484A (en) * | 2021-06-24 | 2021-09-03 | 苏州兆芯半导体科技有限公司 | Data operation circuit and storage and calculation integrated chip |
CN113593618B (en) * | 2021-07-30 | 2023-04-28 | 电子科技大学 | Memory-calculation integrated memory array structure suitable for differential SRAM memory cell |
CN113741858B (en) * | 2021-09-06 | 2024-04-05 | 南京后摩智能科技有限公司 | Memory multiply-add computing method, memory multiply-add computing device, chip and computing equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |