CN117153218A

CN117153218A - Single bit weight generation unit, multi-bit weight generation unit, array group and calculation macro

Info

Publication number: CN117153218A
Application number: CN202310968651.9A
Authority: CN
Inventors: 卢文娟; 张宇龙; 刘玉; 彭春雨; 戴成虎; 郝礼才; 李鑫; 蔺智挺; 吴秀龙
Original assignee: Anhui University
Current assignee: Anhui University
Priority date: 2023-08-02
Filing date: 2023-08-02
Publication date: 2023-12-01

Abstract

The invention relates to the technical field of dynamic random access storage, and in particular to a single bit weight generating unit, a multi-bit weight generating unit, an array group and a calculation macro. The single bit weight generating unit comprises n standard 6T-SRAM cells and 1 transpose XNOR accumulation unit, wherein the transpose XNOR accumulation unit serves as a calculation unit and is externally connected to the standard 6T-SRAM cells, so that multi-bit XNOR accumulation operations for inference and training are realized. The multi-bit weight generating unit consists of 4 single bit weight generating units, the array group consists of multi-bit weight generating units distributed in an array, and the in-memory calculation macro is constructed based on the array group. According to the invention, different quantization schemes are formulated according to the characteristics of the inference and training operations, the two are integrated, chip resources are used effectively, and the problems of reduced speed and reduced backward propagation accuracy that occur in conventional inference-training chips during the inference operation are solved.

Description

Single bit weight generation unit, multi-bit weight generation unit, array group and calculation macro

Technical Field

The invention relates to the technical field of dynamic random access storage, and in particular to a single bit weight generating unit, a multi-bit weight generating unit consisting of 4 single bit weight generating units, an array group consisting of multi-bit weight generating units distributed in an array, and an in-memory computing macro.

Background

With the advent of the "computing age", large volumes of data must travel between memory and processor. In the conventional von Neumann architecture, however, the computing units are separated from the memory units, and frequent data access consumes considerable power and time. Computing-in-memory (CIM) technology breaks through the von Neumann bottleneck and the memory wall of the traditional computing architecture, and is therefore of revolutionary significance for the computing age. Because SRAM reads data quickly and is well compatible with advanced logic processes, SRAM-based in-memory computing has great development prospects.

Machine learning has driven the widespread use of artificial intelligence (AI), from image classification to speech recognition. While cloud computing provides powerful computational support for AI training and computation, it relies on the personal data of users, many of whom are reluctant to send personal data to the cloud to retrain a model. Meanwhile, edge applications of this kind need a real-time network connection, and an offline edge device cannot perform retraining in real time to cope with new situations encountered on site. Based on these factors, learning on the edge device (on-chip training) is a preferred approach. That is, the training chip and the inference chip are integrated: after performing one stage of inference, the chip can learn from samples incrementally, which facilitates training of the existing model by updating the knowledge base, learns user-specific features to personalize the model, and alleviates privacy-related concerns.

However, most existing chip circuit designs remain at the inference stage and do not integrate training. Existing inference-training chips use the same bit width in the inference and training processes, so that speed is reduced during the inference operation and backward propagation accuracy is reduced.

Disclosure of Invention

Based on this, it is necessary to provide a single bit weight generation unit, a multi-bit weight generation unit, an array group and a calculation macro for the problems of slow speed and reduced backward propagation accuracy occurring in the conventional inference-training chip at the time of the inference operation.

The invention is realized by adopting the following technical scheme:

in a first aspect, the invention discloses a single bit weight generation unit comprising n standard 6T-SRAM units, 1 transpose XNOR accumulation unit.

n standard 6T-SRAM cells are used as storage units; wherein the reading of any standard 6T-SRAM cell is controlled by its word line WL and reflects its stored weight values on its BL, BLB. The bit lines BL of the n standard 6T-SRAM cells are commonly connected to the local bit lines LBL, and the bit lines BLB of the n standard 6T-SRAM cells are commonly connected to the local bit lines LBLB; n is more than or equal to 1.

1 transpose XNOR accumulation unit acts as the computation unit. The transpose XNOR accumulation unit includes: NMOS transistors N1 to N6 and PMOS transistors P1 to P6. Wherein the drain terminal of N1 is connected to the signal line C-XACN, and the gate is connected to the control signal RWLAN. The drain terminal of N2 is connected to C-XACN and the gate is connected to control signal RWLBN. The drain terminal of N3 is connected to the source terminal of N1, the gate terminal is connected to the control signal CWLAN, and the source terminal is connected to C-XACN. The drain terminal of N4 is connected to the source terminal of N2, the gate is connected to the control signal CWLBN, and the source terminal is connected to C-XACN. The drain terminal of N5 is connected with the source terminal of N1 and the drain terminal of N3, the grid electrode is connected with LBL, and the source terminal is connected with signal line R-XACN. The drain terminal of N6 is connected with the source terminal of N2 and the drain terminal of N4, the grid electrode is connected with LBLB, and the source terminal is connected with R-XACN. P1 has a gate connected to the control signal CWLAP and a source connected to the signal line C-XACP. P2 has a gate connected to the control signal CWLBP and a source connected to the C-XACP. The drain terminal of P3 is connected to C-XACN, the gate is connected to control signal RWLAP, and the source terminal is connected to the drain terminal of P1. The drain terminal of P4 is connected to C-XACN, the gate is connected to control signal RWLBP, and the source terminal is connected to the drain terminal of P2. The drain terminal of P5 is connected to the drain terminal of P1, the source terminal of P3, the gate is connected to LBL, and the source is connected to signal line R-XACP. The drain terminal of P6 is connected to the drain terminal of P2, the source terminal of P4, the gate is connected to LBLB, and the source is connected to R-XACP.

Implementation of such a single bit weight generation unit is in accordance with methods or processes of embodiments of the present disclosure.

In a second aspect, the present invention discloses a multi-bit weight generation unit comprising 4 single bit weight generation units as disclosed in the first aspect.

The 4 single bit weight generating units are in the same row and share the same RWLAN, the same RWLBN, the same RWLAP, the same RWLBP, the same CWLAP, the same CWLBP, the same CWLAN and the same CWLBN. Among the 4 single bit weight generating units in the same row, the m-th standard 6T-SRAM cells of the single bit weight generating units share the same WL, where m ∈ [1, n].

Implementation of such a multi-bit weight generation unit is in accordance with methods or processes of embodiments of the present disclosure.

In a third aspect, the present invention discloses an array group, including N×N multi-bit weight generating units as disclosed in the second aspect, distributed in an array; N = 2ⁱ, i > 0.

The multi-bit weight generating units located in the same column share the same CWLAP, the same CWLBP, the same CWLAN and the same CWLBN. Among the N multi-bit weight generating units of the same column, the q-th single bit weight generating units of the multi-bit weight generating units share the same C-XACN and the same C-XACP, where q ∈ [1, 4]. The multi-bit weight generating units located in the same row share the same RWLAN, the same RWLBN, the same RWLAP and the same RWLBP. Among the N multi-bit weight generating units of the same row, the q-th single bit weight generating units of the multi-bit weight generating units share the same R-XACN and the same R-XACP.

Implementation of such array sets is in accordance with methods or processes of embodiments of the present disclosure.

In a fourth aspect, the present invention discloses an in-memory computation macro, including the array group disclosed in the third aspect, a word line driver, a backward channel input driver, a forward bit line input device, a forward channel input driver, a backward bit line input device, an in-memory computation controller, a flash analog-to-digital converter, a successive approximation analog-to-digital converter, and a timing controller.

The array group is used for forward propagation or backward propagation. The word line driver is used to control the switching of WL. The backward channel input driver is used to control the switching of CWLAN, CWLAP, CWLBN and CWLBP. The forward bit line input device is used for precharging C-XACN to VDD/2 during forward propagation, and for connecting C-XACN to VSS and C-XACP to VDD during backward propagation. The forward channel input driver is used to control the switching of RWLAN, RWLAP, RWLBN and RWLBP. The backward bit line input device is used to precharge R-XACN to VDD and R-XACP to VSS during backward propagation. The in-memory computation controller is used for switching the functions of the array group. The flash analog-to-digital converter is used for obtaining a 4-bit output during forward propagation. The successive approximation analog-to-digital converter is used to obtain an 8-bit output during backward propagation. The timing controller is used for controlling the clock pulses of the signals.
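The mode-dependent line levels and converter resolutions listed above can be summarized as a lookup table. The following Python sketch only restates those values; the names `MODES` and `describe` are hypothetical and are not part of the disclosure.

```python
# Hypothetical summary table; every value restates the description above.
MODES = {
    "forward": {
        "precharge": {"C-XACN": "VDD/2"},
        "tied": {},
        "input_lines": ("RWLAN", "RWLAP", "RWLBN", "RWLBP"),
        "adc": ("flash", 4),
    },
    "backward": {
        "precharge": {"R-XACN": "VDD", "R-XACP": "VSS"},
        "tied": {"C-XACN": "VSS", "C-XACP": "VDD"},
        "input_lines": ("CWLAN", "CWLAP", "CWLBN", "CWLBP"),
        "adc": ("successive approximation", 8),
    },
}


def describe(mode):
    """Return a one-line summary of the input lines and ADC used in a mode."""
    cfg = MODES[mode]
    kind, bits = cfg["adc"]
    lines = ", ".join(cfg["input_lines"])
    return f"{mode}: inputs on {lines}; {bits}-bit {kind} ADC"
```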

Implementation of such in-memory computing macros is in accordance with methods or processes of embodiments of the present disclosure.

Compared with the prior art, the invention has the following beneficial effects:

1. According to the invention, different quantization schemes are formulated according to the respective characteristics of the inference and training operations, so that the two are integrated and chip resources are used effectively.

2. Compared with the existing 4-bit input, 4-bit output circuit structure, the accuracy of the invention is significantly improved and reaches a level similar to that of the existing 8-bit input, 8-bit output circuit structure, while the energy consumption of the invention is improved compared with the latter.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the description below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a circuit diagram of a single bit weight generation unit according to embodiment 1 of the present invention;

FIG. 2 is a block diagram of 6 cases of forward propagation by the single bit weight generation unit of FIG. 1;

FIG. 3 is a block diagram of 6 cases of backward propagation by the single bit weight generation unit of FIG. 1;

FIG. 4 is a block diagram of a multi-bit weight generation unit according to embodiment 2 of the present invention;

FIG. 5 is a block diagram of an array set according to embodiment 2 of the present invention;

FIG. 6 is a block diagram of an in-memory computing macro according to embodiment 3 of the present invention;

FIG. 7 is a schematic diagram of the in-memory computing macro of FIG. 6 as it propagates in the forward direction;

FIG. 8 is a waveform diagram of the signals of FIG. 7;

FIG. 9 is a schematic diagram of the in-memory computing macro of FIG. 6 as it propagates backward;

FIG. 10 is a waveform diagram of the signals of FIG. 9;

FIG. 11 is a diagram showing the comparison of the accuracy of the in-memory computing macro and the conventional computing circuit in embodiment 3 of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It is noted that when an element is referred to as being "mounted to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "disposed on" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "secured to" another element, it can be directly secured to the other element or intervening elements may also be present.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "or/and" as used herein includes any and all combinations of one or more of the associated listed items.

Example 1

Referring to fig. 1, a circuit configuration diagram of a single bit weight generation unit provided in this embodiment 1 is shown. The single bit weight generation unit (may be abbreviated as BWPU) includes: n standard 6T-SRAM cells (abbreviated as W), 1 transpose XNOR accumulation unit (abbreviated as TXAC).

Standard 6T-SRAM cells are used as memory cells, with n ≥ 1. In this embodiment 1, n is selected by simulation, weighing metrics such as area, energy consumption, delay and throughput, so as to achieve the best overall effect.

The structure of a single standard 6T-SRAM cell is well known. The description is briefly made here: the standard 6T-SRAM comprises 2 PMOS tubes (PM 1-PM 2) and 4 NMOS tubes (NM 1-NM 4). Wherein PM1, PM2 are used as pull-up tubes, NM1, NM2 are used as pull-down tubes, NM3, NM4 are used as transmission tubes, word lines WL control NM3, NM4, bit lines BL, BLB are respectively connected with NM3, NM4, PM1, NM3 are connected with a storage node Q, and PM2, NM4 are connected with a storage node QB.

In general, the reading of any standard 6T-SRAM cell is controlled by its word line WL, and the stored weight value is reflected on its BL and BLB. For a standard 6T-SRAM cell, if the stored weight value is 1, i.e. Q is "1" and QB is "0", then when WL is high, Q is connected to BL and QB is connected to BLB; Q discharges to BL and BLB discharges to QB, so that BL is 1 and BLB is 0. If the stored weight value is 0, i.e. Q is "0" and QB is "1", then when WL is high, Q is connected to BL and QB is connected to BLB; BL discharges to Q and QB discharges to BLB, so that BL is 0 and BLB is 1.

The bit lines BL of the n standard 6T-SRAM cells are commonly connected to the local bit lines LBL. The bit lines BLB of the n standard 6T-SRAM cells are commonly connected to the local bit lines LBLB.

Thus, each time 1 standard 6T-SRAM cell is turned on, LBL and LBLB synchronously reflect BL and BLB of the standard 6T-SRAM cell, and thus the weight value stored in the standard 6T-SRAM cell can be input into a transpose XNOR accumulation unit.
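The read path described above can be illustrated with a small behavioural model. This is a minimal sketch, assuming ideal logic levels; the function names `read_local_bitlines` and `weight_from_bitlines` are hypothetical, and the mapping of LBL high / LBLB low to a weight of +1 follows the forward propagation cases of this embodiment.

```python
def read_local_bitlines(stored_bits, word_lines):
    """Return (LBL, LBLB) for n cells that share one local bit-line pair.

    stored_bits[i] is the value of Q in cell i; word_lines[i] is its WL level.
    Exactly one word line is expected to be high during a read.
    """
    active = [i for i, wl in enumerate(word_lines) if wl]
    if len(active) != 1:
        raise ValueError("exactly one word line must be high")
    q = stored_bits[active[0]]
    return q, 1 - q


def weight_from_bitlines(lbl, lblb):
    """Interpret the local bit-line pair as a signed 1-bit weight."""
    if (lbl, lblb) == (1, 0):
        return +1
    if (lbl, lblb) == (0, 1):
        return -1
    raise ValueError("bit lines must be complementary")
```

For example, with stored bits `[1, 0, 1]` and only the second word line high, the pair `(0, 1)` appears on LBL and LBLB, which corresponds to a weight of -1.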

The transpose XNOR accumulation unit acts as a calculation unit. The transpose XNOR accumulation unit includes 6 NMOS transistors (N1 to N6) and 6 PMOS transistors (P1 to P6). The drain terminal of N1 is connected to the signal line C-XACN, and the gate is connected to the control signal RWLAN. The drain terminal of N2 is connected to C-XACN, and the gate is connected to the control signal RWLBN. The drain terminal of N3 is connected to the source terminal of N1, the gate is connected to the control signal CWLAN, and the source terminal is connected to C-XACN. The drain terminal of N4 is connected to the source terminal of N2, the gate is connected to the control signal CWLBN, and the source terminal is connected to C-XACN. The drain terminal of N5 is connected to the source terminal of N1 and the drain terminal of N3, the gate is connected to the local bit line LBL, and the source terminal is connected to the signal line R-XACN. The drain terminal of N6 is connected to the source terminal of N2 and the drain terminal of N4, the gate is connected to the local bit line LBLB, and the source terminal is connected to R-XACN. P1 has its gate connected to the control signal CWLAP and its source terminal connected to the signal line C-XACP. P2 has its gate connected to the control signal CWLBP and its source terminal connected to C-XACP. The drain terminal of P3 is connected to C-XACN, the gate is connected to the control signal RWLAP, and the source terminal is connected to the drain terminal of P1. The drain terminal of P4 is connected to C-XACN, the gate is connected to the control signal RWLBP, and the source terminal is connected to the drain terminal of P2. The drain terminal of P5 is connected to the drain terminal of P1 and the source terminal of P3, the gate is connected to LBL, and the source terminal is connected to the signal line R-XACP. The drain terminal of P6 is connected to the drain terminal of P2 and the source terminal of P4, the gate is connected to LBLB, and the source terminal is connected to R-XACP.

In other words, the drain of N1, the drain of N2, the source of N3, and the source of N4 are commonly connected to C-XACN; the source end of N1 and the drain end of N3 are commonly connected to the drain end of N5; the gate of N1 is connected to RWLAN.

The drain terminal of N2, the drain terminal of N1, the source terminal of N3 and the source terminal of N4 are commonly connected to C-XACN; the source terminal of N2 and the drain terminal of N4 are commonly connected to the drain terminal of N6; the gate of N2 is connected to RWLBN.

The source end of N3, the drain end of N1, the drain end of N2 and the source end of N4 are commonly connected to the C-XACN; the drain terminal of N3 and the source terminal of N1 are commonly connected to the drain terminal of N5; the gate of N3 is connected to the CWLAN.

The source end of N4, the drain end of N1, the drain end of N2 and the source end of N3 are commonly connected to the C-XACN; the drain terminal of N4 and the source terminal of N2 are commonly connected to the drain terminal of N6; the gate of N4 is connected to CWLBN.

The grid electrode of N5 is connected with LBL, the source end is connected with R-XACN, and the drain end is connected with the source end of N1 and the drain end of N3.

The gate of N6 is connected to LBLB, the source terminal is connected to R-XACN, and the drain terminal is connected to the source terminal of N2 and the drain terminal of N4.

The source terminal of P1 and the source terminal of P2 are commonly connected to C-XACP; the drain terminal of P1 and the source terminal of P3 are commonly connected to the drain terminal of P5; the gate of P1 is connected to CWLAP.

The source end of P2 and the source end of P1 are commonly connected to the C-XACP; the drain end of P2 and the source end of P4 are commonly connected to the drain end of P6; the gate of P2 is connected to CWLBP.

The drain terminal of P3 and the drain terminal of P4 are commonly connected to C-XACN; the source terminal of P3 and the drain terminal of P1 are commonly connected to the drain terminal of P5; the gate of P3 is connected to RWLAP.

The drain terminal of P4 and the drain terminal of P3 are commonly connected to C-XACN; the source terminal of P4 and the drain terminal of P2 are commonly connected to the drain terminal of P6; the gate of P4 is connected to RWLBP.

The grid electrode of P5 is connected with LBL; the source end of P5 is connected with R-XACP; the drain terminal of P5 is connected to the drain terminal of P1, the source terminal of P3.

The grid electrode of P6 is connected with LBLB; the source end of P6 is connected with R-XACP; the drain terminal of P6 is connected to the drain terminal of P2, the source terminal of P4.

For the sake of full description, there is partially repeated content in the above connection relationship.
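The connection relationships of the 12 transistors can also be written down as a netlist-like table, which makes it easy to check which devices share a net. This is only an illustrative sketch: the internal node names `na`, `nb`, `pa` and `pb` and the helper `devices_on` are hypothetical labels introduced here, not names used in the disclosure.

```python
# Each entry is name: (drain, gate, source), following the description above.
# "na", "nb", "pa", "pb" are hypothetical names for the four internal nodes.
TXAC = {
    "N1": ("C-XACN", "RWLAN", "na"),
    "N2": ("C-XACN", "RWLBN", "nb"),
    "N3": ("na", "CWLAN", "C-XACN"),
    "N4": ("nb", "CWLBN", "C-XACN"),
    "N5": ("na", "LBL", "R-XACN"),
    "N6": ("nb", "LBLB", "R-XACN"),
    "P1": ("pa", "CWLAP", "C-XACP"),
    "P2": ("pb", "CWLBP", "C-XACP"),
    "P3": ("C-XACN", "RWLAP", "pa"),
    "P4": ("C-XACN", "RWLBP", "pb"),
    "P5": ("pa", "LBL", "R-XACP"),
    "P6": ("pb", "LBLB", "R-XACP"),
}


def devices_on(net):
    """List the transistors whose drain or source touches the given net."""
    return sorted(name for name, (d, _g, s) in TXAC.items() if net in (d, s))
```

Under this table, `devices_on("C-XACN")` lists N1 to N4 together with P3 and P4, and each internal node joins exactly three devices.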

Based on the structure of the single bit weight generating unit, the working modes of forward propagation and backward propagation are respectively described.

(1) On forward propagation, the C-XACN is precharged to VDD/2.

Referring to fig. 2, 6 cases are shown:

If LBL is high and LBLB is low (weight = +1), N5 is on, P5 is off, N6 is off and P6 is on. With RWLAN at VDD, RWLBN at VSS, RWLAP at VSS and RWLBP at VDD (input = +1), N1 is on, N2 is off, P3 is on and P4 is off. R-XACN is connected to VSS and R-XACP to VDD. C-XACN discharges to R-XACN through N1 and N5, and the voltage of C-XACN falls, i.e. XNOR = +1.

If LBL is high and LBLB is low (weight = +1), N5 is on, P5 is off, N6 is off and P6 is on. With RWLAN at VSS, RWLBN at VDD, RWLAP at VDD and RWLBP at VSS (input = -1), N2 is on, N1 is off, P4 is on and P3 is off. R-XACN is connected to VSS and R-XACP to VDD. R-XACP charges C-XACN through P6 and P4, and the voltage of C-XACN rises, i.e. XNOR = -1.

If LBL is high and LBLB is low (weight = +1), N5 is on, P5 is off, N6 is off and P6 is on. With RWLAN at VSS, RWLBN at VSS, RWLAP at VDD and RWLBP at VDD (input = 0), N1, N2, P3 and P4 are all off. R-XACN is connected to VSS and R-XACP to VDD. C-XACN stays at VDD/2 with no voltage change, i.e. XNOR = 0.

If LBL is low and LBLB is high (weight = -1), P5 is on, N5 is off, P6 is off and N6 is on. With RWLAN at VDD, RWLBN at VSS, RWLAP at VSS and RWLBP at VDD (input = +1), N1 is on, N2 is off, P3 is on and P4 is off. R-XACN is connected to VSS and R-XACP to VDD. R-XACP charges C-XACN through P5 and P3, and the voltage of C-XACN rises, i.e. XNOR = -1.

If LBL is low and LBLB is high (weight = -1), P5 is on, N5 is off, P6 is off and N6 is on. With RWLAN at VSS, RWLBN at VDD, RWLAP at VDD and RWLBP at VSS (input = -1), N2 is on, N1 is off, P4 is on and P3 is off. R-XACN is connected to VSS and R-XACP to VDD. C-XACN discharges to R-XACN through N2 and N6, and the voltage of C-XACN falls, i.e. XNOR = +1.

If LBL is low and LBLB is high (weight = -1), P5 is on, N5 is off, P6 is off and N6 is on. With RWLAN at VSS, RWLBN at VSS, RWLAP at VDD and RWLBP at VDD (input = 0), N1, N2, P3 and P4 are all off. R-XACN is connected to VSS and R-XACP to VDD. C-XACN stays at VDD/2 with no voltage change, i.e. XNOR = 0.

In general, during forward propagation, RWLAN, RWLBN, RWLAP and RWLBP serve as the input, which is computed with the 1-bit weight of the storage cell, and the XNOR accumulation result is reflected on C-XACN.
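The six forward propagation cases reduce to a small truth table: the XNOR result is the product of the signed weight and the ternary input, and it determines whether C-XACN falls, rises or stays at VDD/2. The following Python sketch is a behavioural model only, with hypothetical function names; it does not model voltages or timing.

```python
def encode_forward_input(inp):
    """Map an input of +1, -1 or 0 to (RWLAN, RWLBN, RWLAP, RWLBP) levels."""
    levels = {
        +1: ("VDD", "VSS", "VSS", "VDD"),
        -1: ("VSS", "VDD", "VDD", "VSS"),
        0: ("VSS", "VSS", "VDD", "VDD"),
    }
    return levels[inp]


def forward_xnor(weight, inp):
    """Return (xnor, effect on C-XACN) for one transpose XNOR accumulation unit.

    weight is +1 (LBL high) or -1 (LBL low); inp is +1, -1 or 0.
    """
    if weight not in (+1, -1) or inp not in (+1, 0, -1):
        raise ValueError("weight must be +1/-1 and input must be +1/0/-1")
    xnor = weight * inp
    # XNOR = +1 discharges C-XACN, XNOR = -1 charges it, 0 leaves it at VDD/2.
    effect = {+1: "falls", -1: "rises", 0: "unchanged"}[xnor]
    return xnor, effect
```

Calling `forward_xnor` for each of the six (weight, input) pairs reproduces the outcomes of the six cases above.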

(2) During backward propagation, R-XACN is precharged to VDD, and R-XACP is precharged to VSS; C-XACN is connected with VSS, and C-XACP is connected with VDD.

Referring to fig. 3, 6 cases are shown:

If LBL is high and LBLB is low (weight = +1), N5 is on, P5 is off, N6 is off and P6 is on. With CWLAP at VSS, CWLBP at VDD, CWLAN at VDD and CWLBN at VSS (input = +1), P1 is on, P2 is off, N3 is on and N4 is off. R-XACN discharges to C-XACN through N5 and N3; the R-XACN voltage falls and the R-XACP voltage is unchanged, i.e. XNOR = +1.

If LBL is high and LBLB is low (weight = +1), N5 is on, P5 is off, N6 is off and P6 is on. With CWLAP at VDD, CWLBP at VSS, CWLAN at VSS and CWLBN at VDD (input = -1), P2 is on, P1 is off, N4 is on and N3 is off. C-XACP charges R-XACP through P2 and P6; the R-XACN voltage is unchanged and the R-XACP voltage rises, i.e. XNOR = -1.

If LBL is high and LBLB is low (weight = +1), N5 is on, P5 is off, N6 is off and P6 is on. With CWLAP at VDD, CWLBP at VDD, CWLAN at VSS and CWLBN at VSS (input = 0), P1, P2, N3 and N4 are all off. R-XACN stays at VDD and R-XACP stays at VSS with no voltage change, i.e. XNOR = 0.

If LBL is low and LBLB is high (weight = -1), P5 is on, N5 is off, P6 is off and N6 is on. With CWLAP at VSS, CWLBP at VDD, CWLAN at VDD and CWLBN at VSS (input = +1), P1 is on, P2 is off, N3 is on and N4 is off. C-XACP charges R-XACP through P1 and P5; the R-XACN voltage is unchanged and the R-XACP voltage rises, i.e. XNOR = -1.

If LBL is low and LBLB is high (weight = -1), P5 is on, N5 is off, P6 is off and N6 is on. With CWLAP at VDD, CWLBP at VSS, CWLAN at VSS and CWLBN at VDD (input = -1), P2 is on, P1 is off, N4 is on and N3 is off. R-XACN discharges to C-XACN through N6 and N4; the R-XACN voltage falls and the R-XACP voltage is unchanged, i.e. XNOR = +1.

If LBL is low and LBLB is high (weight = -1), P5 is on, N5 is off, P6 is off and N6 is on. With CWLAP at VDD, CWLBP at VDD, CWLAN at VSS and CWLBN at VSS (input = 0), P1, P2, N3 and N4 are all off. R-XACN stays at VDD and R-XACP stays at VSS with no voltage change, i.e. XNOR = 0.

In general, during backward propagation, CWLAN, CWLBN, CWLAP and CWLBP serve as the input, which is computed with the 1-bit weight of the storage cell, and the XNOR accumulation result is reflected on R-XACN and R-XACP.
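The backward propagation cases follow the same product rule, but the result is split across two lines: a result of +1 lowers R-XACN, while a result of -1 raises R-XACP. A minimal behavioural sketch in Python, with hypothetical function names, is given below.

```python
def encode_backward_input(inp):
    """Map an input of +1, -1 or 0 to (CWLAN, CWLBN, CWLAP, CWLBP) levels."""
    levels = {
        +1: ("VDD", "VSS", "VSS", "VDD"),
        -1: ("VSS", "VDD", "VDD", "VSS"),
        0: ("VSS", "VSS", "VDD", "VDD"),
    }
    return levels[inp]


def backward_xnor(weight, inp):
    """Return (xnor, R-XACN effect, R-XACP effect) for one unit.

    R-XACN starts at VDD and R-XACP starts at VSS, as in the precharge step.
    """
    if weight not in (+1, -1) or inp not in (+1, 0, -1):
        raise ValueError("weight must be +1/-1 and input must be +1/0/-1")
    xnor = weight * inp
    if xnor == +1:
        return xnor, "falls", "unchanged"
    if xnor == -1:
        return xnor, "unchanged", "rises"
    return xnor, "unchanged", "unchanged"
```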

Example 2

Referring to fig. 4, a block diagram of a multi-bit weight generating unit according to embodiment 2 is shown. The multi-bit weight generation unit (abbreviated MWPU) includes 4 single bit weight generation units as in embodiment 1.

The 4 single bit weight generating units are in the same row and share the same RWLAN, the same RWLBN, the same RWLAP, the same RWLBP, the same CWLAP, the same CWLBP, the same CWLAN and the same CWLBN.

Among the 4 single bit weight generating units of the same row, the m-th standard 6T-SRAM cells of the single bit weight generating units share the same WL, where m ∈ [1, n].

In combination with the operating principle of the single bit weight generation unit of embodiment 1, the 4 single bit weight generation units in a single multi-bit weight generation unit operate in parallel. Although WL turns on 4 standard 6T-SRAM cells at a time and thereby feeds four 1-bit weights to the 4 calculation units respectively, the final output can be controlled through the number of calculation units that actually participate in the calculation. For example, if only 1 calculation unit participates, the output is only 1 group of 4 bits, which can be regarded as a 1-bit weight input; if only 2 calculation units participate, the output is 2 groups of 4 bits, which can be regarded as a 2-bit weight input; if all 4 calculation units participate, the output is 4 groups of 4 bits, which can be regarded as a 4-bit weight input.
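The effect of the number of participating calculation units can be sketched as follows. The text does not say which of the 4 units are enabled or how, so this illustration simply assumes that the first `active_units` units participate; the function name `mwpu_xnor` is hypothetical.

```python
def mwpu_xnor(weights, inp, active_units):
    """Return one XNOR result per participating single bit unit.

    weights: four signed 1-bit weights (+1 or -1), one per single bit unit.
    inp: shared ternary input (+1, -1 or 0), since the 4 units share the
         same input lines.
    active_units: how many units participate (assumed here: the first ones).
    """
    if len(weights) != 4:
        raise ValueError("a multi-bit unit holds exactly 4 single bit units")
    if not 1 <= active_units <= 4:
        raise ValueError("active_units must be between 1 and 4")
    return [w * inp for w in weights[:active_units]]
```

With 2 participating units the function returns 2 results, corresponding to the 2 output groups of the 2-bit weight case.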

Referring to fig. 5 again, the array group provided in embodiment 2 is shown. The array group comprises N×N multi-bit weight generating units as described above, distributed in an array; N = 2ⁱ, i > 0. In this embodiment 2, i is 4, i.e., the array group consists of multi-bit weight generating units distributed as 16×16.

Wherein the multi-bit weight generating units located in the same column share the same CWLAP, the same CWLBP, the same CWLAN and the same CWLBN.

Among the N multi-bit weight generating units in the same column, the q-th single bit weight generating units of the multi-bit weight generating units share the same C-XACN and the same C-XACP, where q ∈ [1, 4].

That is, for any column of multi-bit weight generating units, there are 4 C-XACN lines and 4 C-XACP lines.

The multi-bit weight generating units located in the same row share the same RWLAN, the same RWLBN, the same RWLAP and the same RWLBP.

And in N multi-bit weight generating units in the same row, the q-th single-bit weight generating unit of each multi-bit weight generating unit shares the same R-XACN and the same R-XACP.

That is, for any row of multi-bit weight generating units, there are 4 R-XACN lines and 4 R-XACP lines.
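The unit and line counts implied by this sharing scheme follow directly from N = 2ⁱ and from the 4 single bit units per multi-bit unit. The short Python sketch below works them out; the function name `array_group_counts` and the dictionary keys are hypothetical.

```python
def array_group_counts(i, units_per_mwpu=4):
    """Count units and shared lines in an N x N array group with N = 2**i."""
    if i <= 0:
        raise ValueError("i must be greater than 0")
    n = 2 ** i
    return {
        "N": n,
        "multi_bit_units": n * n,
        "single_bit_units": n * n * units_per_mwpu,
        # One C-XACN/C-XACP pair per single bit position per column.
        "c_xacn_lines": n * units_per_mwpu,
        "c_xacp_lines": n * units_per_mwpu,
        # One R-XACN/R-XACP pair per single bit position per row.
        "r_xacn_lines": n * units_per_mwpu,
        "r_xacp_lines": n * units_per_mwpu,
    }
```

For i = 4 this gives a 16×16 array with 256 multi-bit units, 1024 single bit units and 64 lines of each kind.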

In combination with the operating principle of the single bit weight generating unit of embodiment 1, during forward propagation, among the N multi-bit weight generating units in the same column, the q-th single bit weight generating unit of each multi-bit weight generating unit accumulates its XNOR result on the q-th C-XACN.

During backward propagation, among the N multi-bit weight generating units in the same row, the q-th single bit weight generating unit of each multi-bit weight generating unit accumulates its XNOR result on the q-th R-XACN and the q-th R-XACP.
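The two accumulation directions amount to multiplying by the weight matrix and by its transpose. The following Python sketch illustrates this for a single bit position q, assuming ideal, equal-sized contributions per unit; the function names are hypothetical, and `W[r][c]` denotes the signed weight stored at row r, column c.

```python
def forward_accumulate(W, x):
    """Column-wise accumulation: x[r] is the input shared by row r.

    Returns, for each column c, the sum of XNOR results that appear on the
    C-XACN line of that column.
    """
    rows, cols = len(W), len(W[0])
    return [sum(x[r] * W[r][c] for r in range(rows)) for c in range(cols)]


def backward_accumulate(W, g):
    """Row-wise accumulation: g[c] is the input shared by column c.

    Returns, for each row, a pair (count of +1 results, count of -1 results).
    The +1 results discharge R-XACN and the -1 results charge R-XACP.
    """
    out = []
    for row in W:
        products = [gc * w for gc, w in zip(g, row)]
        pos = sum(1 for p in products if p == +1)
        neg = sum(1 for p in products if p == -1)
        out.append((pos, neg))
    return out
```

For `W = [[1, -1], [1, 1]]`, a forward input `[1, -1]` accumulates to `[0, -2]` on the two columns, and a backward input `[1, 1]` gives `(1, 1)` for the first row and `(2, 0)` for the second.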

Example 3

Referring to fig. 6, an in-memory computing macro is provided in embodiment 3. The in-memory computing macro includes: the array group disclosed in embodiment 2, a word line driver, a backward channel input driver, a forward bit line input device, a forward channel input driver, a backward bit line input device, an in-memory computation controller, a flash analog-to-digital converter, a successive approximation analog-to-digital converter, and a timing controller.

The word line driving controller, the backward channel input driving controller and the forward channel input driving controller are equivalent to a switch: only when turned on, the corresponding signal can be input from the timing controller into the array group.

The forward bit line input controller and the backward bit line input controller are equivalent to switches for precharging or connecting the relevant signal lines to the corresponding levels for forward propagation or backward propagation.

The in-memory computation controller can switch the array group between a storage mode and a computing mode: in the storage mode, weight values can be written into the storage cells; in the computing mode, the array group performs forward propagation or backward propagation as described above.

In addition, referring to fig. 7, there are 4 C-XACN lines for any column of multi-bit weight generating units. Four flash analog-to-digital converters are provided and are connected one-to-one with the 4 C-XACN lines.

In this way, during forward propagation, for any column of multi-bit weight generating units, RWLAN, RWLBN, RWLAP and RWLBP are each driven with 4-bit pulse width signals, forming a 4-bit input, which undergoes XNOR accumulation with the weights; the result is reflected on the 4 C-XACN lines, and a 4-bit output is then obtained through quantization by the flash analog-to-digital converters.

Referring to fig. 8, a waveform diagram of a column of multi-bit weight generating units is shown for some cases of forward propagation:

Assume a column of multi-bit weight generating units includes 16 multi-bit weight generating units. The j-th multi-bit weight generating unit is identified by <j-1>, so that it is associated with WL<j-1>, LBL<j-1>, LBLB<j-1>, RWLAN<j-1>, RWLBN<j-1>, RWLAP<j-1>, RWLBP<j-1>, C-XACN<j-1>, etc.

The 1st standard 6T-SRAM cell of each single-bit weight generating unit is selected by the corresponding WL, and the weight values stored in the 16 standard 6T-SRAM cells are set to 1. Fig. 8 shows only the 1st multi-bit weight generating unit: WL<0> is set high, LBL<0> is high, LBLB<0> is low, and Weight = +1.

RWLAN<0> to RWLAN<15>, RWLBN<0> to RWLBN<15>, RWLAP<0> to RWLAP<15>, and RWLBP<0> to RWLBP<15> are driven with 4-bit pulse-width signals.

If RWLAN<0> is 0, RWLBN<0> is 0, RWLAP<0> is 1 and RWLBP<0> is 1, then XNOR = 0, C-XACN<0> is unchanged, and the XAC value is 0.

Accordingly, when RWLAN<0> to RWLAN<15> are all 0, RWLBN<0> to RWLBN<15> are all 0, RWLAP<0> to RWLAP<15> are all 1, and RWLBP<0> to RWLBP<15> are all 1, the level accumulated on C-XACN<0> is unchanged, and the XAC value is 0.

If RWLAN<0> is 1, RWLBN<0> is 0, RWLAP<0> is 0 and RWLBP<0> is 1, then XNOR = +1, C-XACN<0> drops, and the XAC value after binary conversion is 15.

Accordingly, when RWLAN<0> to RWLAN<15> are all 1, RWLBN<0> to RWLBN<15> are all 0, RWLAP<0> to RWLAP<15> are all 0, and RWLBP<0> to RWLBP<15> are all 1, 16 drops accumulate on C-XACN<0>, and the XAC value is 240.
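The four cases above are consistent with a simple additive model in which every unit whose XNOR result is +1 contributes one full-scale 4-bit step of 15 to the shared line. The following sketch encodes that model; the additive rule, the constant `FULL_SCALE_4BIT` and the function names are assumptions inferred from the listed values, and only Weight = +1 with the two listed input patterns is modelled.

```python
FULL_SCALE_4BIT = 15  # largest 4-bit code; assumed contribution of one matching unit


def xnor_forward(rwlan, rwlbn, rwlap, rwlbp):
    """XNOR result for Weight = +1, covering only the two patterns listed in the text."""
    pattern = (rwlan, rwlbn, rwlap, rwlbp)
    if pattern == (0, 0, 1, 1):
        return 0   # no read path is enabled, so C-XACN is unchanged
    if pattern == (1, 0, 0, 1):
        return 1   # C-XACN drops
    raise NotImplementedError("pattern not described in this passage")


def column_xac(patterns):
    """Accumulate the contributions of all units sharing one C-XACN line."""
    return sum(FULL_SCALE_4BIT * xnor_forward(*p) for p in patterns)
```

With this model, `column_xac([(1, 0, 0, 1)] * 16)` evaluates to 240 and `column_xac([(0, 0, 1, 1)] * 16)` evaluates to 0, matching the values stated above.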

Referring to fig. 9, the multi-bit weight generating units of any row have 4 R-XACNs and 4 R-XACPs. Eight successive approximation analog-to-digital converters are provided, 4 of which are connected one-to-one with the 4 R-XACNs and the other 4 of which are connected one-to-one with the 4 R-XACPs.

In this way, in backward propagation, for the multi-bit weight generating units of any row, CWLAN, CWLBN, CWLAP and CWLBP are each driven with a 4-bit pulse-width signal to form a 4-bit input; the input is XNOR-accumulated with the weights, the result appears on the 4 R-XACNs and the 4 R-XACPs, and 8 groups of 8-bit outputs are obtained after quantization by the successive approximation analog-to-digital converters.
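The 8-bit quantization can be illustrated with a behavioural sketch of the generic successive approximation algorithm, which resolves one bit per step from the most significant bit down. The function name `sar_adc_8bit` and the reference `v_ref` are assumptions for illustration; the embodiment does not describe the converter's internals.

```python
def sar_adc_8bit(v_in, v_ref):
    """Behavioural 8-bit SAR conversion: binary search from MSB to LSB.

    v_ref is a hypothetical full-scale reference, not a value from the patent.
    """
    if v_ref <= 0:
        raise ValueError("v_ref must be positive")
    code = 0
    for bit in range(7, -1, -1):
        trial = code | (1 << bit)        # tentatively set the current bit
        dac_voltage = trial * v_ref / 256
        if v_in >= dac_voltage:          # comparator decision
            code = trial                 # keep the bit, otherwise it stays cleared
    return code
```

Eight comparison steps yield one 8-bit code, in contrast to the single parallel comparison of a flash converter.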

Referring to fig. 10, a waveform diagram of a row of multi-bit weight generating units is shown for some cases of backward propagation:

Assume a row of multi-bit weight generating units includes 16 multi-bit weight generating units. The j-th multi-bit weight generating unit is identified by <j-1>, so that it is associated with WL<j-1>, LBL<j-1>, LBLB<j-1>, CWLAN<j-1>, CWLBN<j-1>, CWLAP<j-1>, CWLBP<j-1>, R-XACN<j-1> and R-XACP<j-1>.

The 1st standard 6T-SRAM cell of each single-bit weight generating unit is selected by the corresponding WL, and the weight values stored in the 16 standard 6T-SRAM cells are set to 1. Fig. 10 shows only the 1st multi-bit weight generating unit: WL<0> is set high, LBL<0> is high, LBLB<0> is low, and Weight = +1.

CWLAN<0> to CWLAN<15>, CWLBN<0> to CWLBN<15>, CWLAP<0> to CWLAP<15>, and CWLBP<0> to CWLBP<15> are driven with 4-bit pulse-width signals.

If CWLAN<0> is 0, CWLBN<0> is 0, CWLAP<0> is 1 and CWLBP<0> is 1, then XNOR = 0, and R-XACN<0> and R-XACP<0> are unchanged; XAC1, corresponding to R-XACN<0>, is 0, and XAC2, corresponding to R-XACP<0>, is 0, so the overall XAC = XAC1 + XAC2 is also 0.

Accordingly, when CWLAN<0> to CWLAN<15> are all 0, CWLBN<0> to CWLBN<15> are all 0, CWLAP<0> to CWLAP<15> are all 1, and CWLBP<0> to CWLBP<15> are all 1, the levels accumulated on R-XACN<0> and R-XACP<0> are unchanged, so the overall XAC is also 0.

If CWLAN<0> is 1, CWLBN<0> is 0, CWLAP<0> is 0 and CWLBP<0> is 1, then R-XACN<0> drops and R-XACP<0> is unchanged; XAC1, corresponding to R-XACN<0>, is 15, and XAC2, corresponding to R-XACP<0>, is 0, so the overall XAC = XAC1 + XAC2 is 15.

Accordingly, when CWLAN<0> to CWLAN<15> are all 1, CWLBN<0> to CWLBN<15> are all 0, CWLAP<0> to CWLAP<15> are all 0, and CWLBP<0> to CWLBP<15> are all 1, 16 drops accumulate on R-XACN<0> and R-XACP<0> is unchanged; XAC1, corresponding to R-XACN<0>, is 240, and XAC2, corresponding to R-XACP<0>, is 0, so the overall XAC is 240.
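The relation XAC = XAC1 + XAC2 in the cases above can be checked with a small lookup-table sketch. The table `CONTRIBUTION`, the additive accumulation rule and the function name `row_xac` are assumptions inferred from the listed values; only Weight = +1 and the two listed input patterns are covered.

```python
# (CWLAN, CWLBN, CWLAP, CWLBP) -> (contribution to XAC1 on R-XACN, to XAC2 on R-XACP)
CONTRIBUTION = {
    (0, 0, 1, 1): (0, 0),    # both lines unchanged
    (1, 0, 0, 1): (15, 0),   # R-XACN drops, R-XACP unchanged
}


def row_xac(patterns):
    """Return (XAC1, XAC2, XAC) for the units sharing one R-XACN / R-XACP pair."""
    xac1 = 0
    xac2 = 0
    for pattern in patterns:
        if pattern not in CONTRIBUTION:
            raise NotImplementedError("pattern not described in this passage")
        d1, d2 = CONTRIBUTION[pattern]
        xac1 += d1
        xac2 += d2
    return xac1, xac2, xac1 + xac2
```

For 16 units that all receive the pattern (1, 0, 0, 1), `row_xac` returns `(240, 0, 240)`, matching the last case above.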

It should be noted that when the in-memory computing macro performs an inference operation, only forward propagation is performed; when it performs a training operation, forward propagation is performed first, followed by backward propagation. This structural design and working mode realize the integration of inference and training.

Finally, the inventors also simulated the invention against two existing arithmetic circuits and compared the accuracy (Accuracy) with a 1-bit weight input and with a 2-bit weight input. The first existing arithmetic circuit has 4-bit input and 4-bit output; the second has 8-bit input and 8-bit output.

As can be seen from fig. 11, the accuracy of the invention is significantly higher than that of the existing 4-bit-input, 4-bit-output circuit structure and is close to that of the existing 8-bit-input, 8-bit-output circuit structure.

The inventors also carried out simulation experiments on the energy efficiency of the invention: the energy efficiency is 106.85 TOPS/W during forward propagation and 18.25 TOPS/W during backward propagation. The higher figure for forward propagation arises because the flash analog-to-digital converter is used for quantization in that direction, which improves energy efficiency.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, any combination that involves no contradiction should be considered within the scope of this description.

The above examples illustrate only a few embodiments of the invention; they are described in detail but are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the invention, all of which fall within the scope of protection of the invention. Accordingly, the scope of protection of the invention is determined by the appended claims.

Claims

1. A single bit weight generation unit, comprising:

n standard 6T-SRAM cells as memory cells; wherein, the reading of any standard 6T-SRAM cell is controlled by the word line WL thereof, and the stored weight value is reflected on BL, BLB thereof;

the bit lines BL of the n standard 6T-SRAM cells are commonly connected to the local bit line LBL, and the bit lines BLB of the n standard 6T-SRAM cells are commonly connected to the local bit line LBLB; n ≥ 1;

and

1 transpose XNOR accumulation unit as a calculation unit; the transpose XNOR accumulation unit includes: NMOS transistors N1 to N6 and PMOS transistors P1 to P6; wherein,

the drain of N1 is connected to the signal line C-XACN, and the gate is connected to the control signal RWLAN;

the drain of N2 is connected to C-XACN, and the gate is connected to the control signal RWLBN;

the drain of N3 is connected to the source of N1, the gate is connected to the control signal CWLAN, and the source is connected to C-XACN;

the drain of N4 is connected to the source of N2, the gate is connected to the control signal CWLBN, and the source is connected to C-XACN;

the drain of N5 is connected to the source of N1 and the drain of N3, the gate is connected to LBL, and the source is connected to the signal line R-XACN;

the drain of N6 is connected to the source of N2 and the drain of N4, the gate is connected to LBLB, and the source is connected to R-XACN;

the gate of P1 is connected to the control signal CWLAP, and the source is connected to the signal line C-XACP;

the gate of P2 is connected to the control signal CWLBP, and the source is connected to C-XACP;

the drain of P3 is connected to C-XACN, the gate is connected to the control signal RWLAP, and the source is connected to the drain of P1;

the drain of P4 is connected to C-XACN, the gate is connected to the control signal RWLBP, and the source is connected to the drain of P2;

the drain of P5 is connected to the drain of P1 and the source of P3, the gate is connected to LBL, and the source is connected to the signal line R-XACP;

the drain of P6 is connected to the drain of P2 and the source of P4, the gate is connected to LBLB, and the source is connected to R-XACP.

2. The single bit weight generation unit of claim 1, wherein n = 8.

3. The single bit weight generation unit of claim 1, wherein,

during forward propagation, the C-XACN is precharged to VDD/2;

if LBL is high and LBLB is low, N5 is on, P5 is off, N6 is off and P6 is on; RWLAN is connected to VDD and RWLBN to VSS, so N1 is on and N2 is off; RWLAP is connected to VSS and RWLBP to VDD, so P3 is on and P4 is off; R-XACN is connected to VSS and R-XACP to VDD; C-XACN discharges to R-XACN through N1 and N5;

if LBL is high and LBLB is low, N5 is on, P5 is off, N6 is off and P6 is on; RWLAN is connected to VSS and RWLBN to VDD, so N2 is on and N1 is off; RWLAP is connected to VDD and RWLBP to VSS, so P4 is on and P3 is off; R-XACN is connected to VSS and R-XACP to VDD; R-XACP discharges to C-XACN through P4 and P6;

if LBL is high and LBLB is low, N5 is on, P5 is off, N6 is off and P6 is on; RWLAN is connected to VSS and RWLBN to VSS, so N1 and N2 are off; RWLAP is connected to VDD and RWLBP to VDD, so P3 and P4 are off; R-XACN is connected to VSS and R-XACP to VDD; C-XACN remains at VDD/2.

4. The single bit weight generation unit of claim 1, wherein,

during forward propagation, the C-XACN is precharged to VDD/2;

if LBL is low and LBLB is high, P5 is on, N5 is off, P6 is off and N6 is on; RWLAN is connected to VDD and RWLBN to VSS, so N1 is on and N2 is off; RWLAP is connected to VSS and RWLBP to VDD, so P3 is on and P4 is off; R-XACN is connected to VSS and R-XACP to VDD; R-XACP discharges to C-XACN through P5 and P3;

if LBL is low and LBLB is high, P5 is on, N5 is off, P6 is off and N6 is on; RWLAN is connected to VSS and RWLBN to VDD, so N2 is on and N1 is off; RWLAP is connected to VDD and RWLBP to VSS, so P4 is on and P3 is off; R-XACN is connected to VSS and R-XACP to VDD; C-XACN discharges to R-XACN through N2 and N6;

if LBL is low and LBLB is high, P5 is on, N5 is off, P6 is off and N6 is on; RWLAN is connected to VSS and RWLBN to VSS, so N1 and N2 are off; RWLAP is connected to VDD and RWLBP to VDD, so P3 and P4 are off; R-XACN is connected to VSS and R-XACP to VDD; C-XACN remains at VDD/2.

5. The single bit weight generation unit of claim 1, wherein,

during backward propagation, R-XACN is precharged to VDD, and R-XACP is precharged to VSS; C-XACN is connected with VSS, C-XACP is connected with VDD;

if LBL is high and LBLB is low, N5 is on, P5 is off, N6 is off and P6 is on; CWLAP is connected to VSS and CWLBP to VDD, so P1 is on and P2 is off; CWLAN is connected to VDD and CWLBN to VSS, so N3 is on and N4 is off; R-XACN discharges to C-XACN through N5 and N3;

if LBL is high and LBLB is low, N5 is on, P5 is off, N6 is off and P6 is on; CWLAP is connected to VDD and CWLBP to VSS, so P2 is on and P1 is off; CWLAN is connected to VSS and CWLBN to VDD, so N4 is on and N3 is off; C-XACP discharges to R-XACP through P2 and P6;

if LBL is high and LBLB is low, N5 is on, P5 is off, N6 is off and P6 is on; CWLAP is connected to VDD and CWLBP to VDD, so P1 and P2 are off; CWLAN is connected to VSS and CWLBN to VSS, so N3 and N4 are off; R-XACN remains at VDD and R-XACP remains at VSS.

6. The single bit weight generation unit of claim 1, wherein,

if LBL is low and LBLB is high, P5 is on, N5 is off, P6 is off and N6 is on; CWLAP is connected to VSS and CWLBP to VDD, so P1 is on and P2 is off; CWLAN is connected to VDD and CWLBN to VSS, so N3 is on and N4 is off; C-XACP discharges to R-XACP through P1 and P5;

if LBL is low and LBLB is high, P5 is on, N5 is off, P6 is off and N6 is on; CWLAP is connected to VDD and CWLBP to VSS, so P2 is on and P1 is off; CWLAN is connected to VSS and CWLBN to VDD, so N4 is on and N3 is off; R-XACN discharges to C-XACN through N6 and N4;

if LBL is low and LBLB is high, P5 is on, N5 is off, P6 is off and N6 is on; CWLAP is connected to VDD and CWLBP to VDD, so P1 and P2 are off; CWLAN is connected to VSS and CWLBN to VSS, so N3 and N4 are off; R-XACN remains at VDD and R-XACP remains at VSS.

7. A multi-bit weight generation unit, characterized in that the multi-bit weight generation unit comprises 4 single bit weight generation units according to any of claims 1-6;

the 4 single bit weight generating units are positioned in the same row and share the same RWLAN, the same RWLBN, the same RWLAP, the same RWLBP, the same CWLAP, the same CWLBP, the same CWLAN and the same CWLBN;

in the 4 single bit weight generating units in the same row, the m-th standard 6T-SRAM cells of the single bit weight generating units share the same WL, m ∈ [1, n].

8. An array group, comprising N×N multi-bit weight generating units according to claim 7 distributed in an array; N = 2ⁱ, i > 0;

wherein the multi-bit weight generating units positioned in the same column share the same CWLAP, the same CWLBP, the same CWLAN and the same CWLBN;

the q-th single-bit weight generating unit of each multi-bit weight generating unit shares the same C-XACN and the same C-XACP in N multi-bit weight generating units of the same column; q ∈ [1, 4];

the multi-bit weight generating units positioned in the same row share the same RWLAN, the same RWLBN, the same RWLAP and the same RWLBP;

9. The array group of claim 8, wherein,

in forward propagation, the q-th single-bit weight generating unit of each multi-bit weight generating unit accumulates its XNOR result on the q-th C-XACN shared by the N multi-bit weight generating units of the same column;

10. An in-memory computing macro, comprising:

the array group of claim 8 or 9 for forward propagation or backward propagation;

a word line drive controller for controlling the switching of WL;

a backward channel input drive controller for controlling the switching of CWLAN, CWLAP, CWLBN and CWLBP;

a forward bit line input controller for precharging the C-XACN to VDD/2 during forward propagation and connecting the C-XACN to VSS and the C-XACP to VDD during backward propagation;

a forward channel input drive controller for controlling the switching of RWLAN, RWLAP, RWLBN and RWLBP;

a backward bit line input controller for precharging R-XACN to VDD and R-XACP to VSS upon backward propagation;

an in-memory computing controller for switching the function of the array group;

a flash analog-to-digital converter for obtaining a 4-bit output during forward propagation;

a successive approximation analog-to-digital converter for obtaining an 8-bit output during backward propagation;

and

and a timing controller for controlling clock pulses of the respective signals.
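As a non-limiting illustration, the conduction conditions recited in claims 3 to 6 can be sketched as a switch-level model in which an NMOS transistor conducts when its gate is high and a PMOS transistor conducts when its gate is low. The function names, the 0/1 encoding of VSS/VDD and the treatment of LBLB as the complement of LBL are assumptions of this sketch, not part of the claims.

```python
def nmos_on(gate):
    """An NMOS transistor conducts when its gate is high (1)."""
    return gate == 1


def pmos_on(gate):
    """A PMOS transistor conducts when its gate is low (0)."""
    return gate == 0


def forward_action(lbl, rwlan, rwlbn, rwlap, rwlbp):
    """Effect on C-XACN (precharged to VDD/2) with R-XACN at VSS and R-XACP at VDD."""
    lblb = 1 - lbl  # LBLB is assumed to carry the complement of LBL
    # Pull-down paths from C-XACN to R-XACN: N1-N5 or N2-N6.
    down = (nmos_on(rwlan) and nmos_on(lbl)) or (nmos_on(rwlbn) and nmos_on(lblb))
    # Pull-up paths from R-XACP to C-XACN: P3-P5 or P4-P6.
    up = (pmos_on(rwlap) and pmos_on(lbl)) or (pmos_on(rwlbp) and pmos_on(lblb))
    if down and up:
        raise ValueError("input pattern enables both paths; not covered by the claims")
    return "discharge" if down else "charge" if up else "hold"


def backward_action(lbl, cwlan, cwlbn, cwlap, cwlbp):
    """Return (R-XACN discharges, R-XACP charges) with C-XACN at VSS and C-XACP at VDD."""
    lblb = 1 - lbl
    # R-XACN (precharged to VDD) is pulled down through N5-N3 or N6-N4.
    xacn_drops = (nmos_on(lbl) and nmos_on(cwlan)) or (nmos_on(lblb) and nmos_on(cwlbn))
    # R-XACP (precharged to VSS) is pulled up through P1-P5 or P2-P6.
    xacp_rises = (pmos_on(cwlap) and pmos_on(lbl)) or (pmos_on(cwlbp) and pmos_on(lblb))
    return xacn_drops, xacp_rises
```

For example, `forward_action(1, 1, 0, 0, 1)` returns `"discharge"` (path N1, N5), and `backward_action(1, 0, 1, 1, 0)` returns `(False, True)` (path P2, P6).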