WO2021232949A1 - Sub-unit, MAC array, and bit-width-reconfigurable analog-digital hybrid in-memory computing module - Google Patents
- Publication number
- WO2021232949A1 (PCT/CN2021/084022)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- calculation
- capacitor
- mac
- differential
- input
- Prior art date
Classifications
- H03M1/468 — SAR analogue/digital converter using switched capacitors in which the input S/H circuit is merged with the feedback DAC array
- G11C7/1069 — I/O lines read-out arrangements
- G06F7/5443 — Sum of products
- G06N3/04 — Neural networks; architecture, e.g. interconnection topology
- G06N3/065 — Physical realisation of neural networks using analogue means
- G11C11/41 — Static cells with positive feedback, i.e. cells not needing refreshing, e.g. bistable multivibrator or Schmitt trigger
- G11C11/54 — Digital stores using elements simulating biological cells, e.g. neurons
- G11C7/109 — Control signal input circuits
- G11C7/16 — Storage of analogue signals in digital stores using A/D converters, digital memories and D/A converters
- H03M1/462 — Details of the control circuitry, e.g. of the successive approximation register
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present invention relates to the field of analog-digital hybrid in-memory computing, and more specifically to a sub-unit, a MAC array, and a bit-width-reconfigurable analog-digital hybrid in-memory computing module.
- digital circuits occupy a large chip area and consume considerable power, making it difficult to realize large-scale neural networks with high energy efficiency.
- the data-exchange bottleneck between memory and the central processing unit, caused by the von Neumann architecture used in traditional digital circuits, severely limits computing energy efficiency and speed under the large-scale data movement of DNN applications.
- the analog circuit implementation of MAC has the advantages of simple structure and low power consumption, so analog and analog-digital mixed-signal calculations have the potential to achieve high energy efficiency.
- in-memory computing, which has become a research hotspot in recent years, cannot in essence be realized purely with digital circuits and requires the assistance of analog circuits.
- DNN application-specific integrated circuits (ASICs)
- the addition stage uses charge sharing.
- Each 1-bit calculation unit of the above 1-bit MAC calculation has 10 transistors.
- the problems in the prior art of Papers 1 and 2 are: (1) for each addition operation, the transmission gate in every computing unit is driven unconditionally, so the sparsity of the input data cannot be exploited to save energy; (2) each arithmetic unit performing a 1-bit multiplication is equipped with an independent capacitor, and the metal-oxide-metal (MOM) capacitor of the successive-approximation-register (SAR) analog-to-digital converter (ADC) is located outside the static random-access memory (SRAM) computing array because there is no space inside the array, which reduces area efficiency; (3) the addition stage using charge sharing must be connected to the top plate of the capacitor that stores the XNOR operation result.
- MOM: metal-oxide-metal
- this circuit topology makes the addition susceptible to non-ideal effects such as charge injection, clock feedthrough, non-linear parasitic capacitance at the drain/source of the pass-gate transistor, and leakage of the transistor connected to the capacitor top plate, all of which cause calculation errors.
- the mismatch between the arithmetic capacitor and the capacitor of the digital-to-analog converter inside the ADC, caused by physical-layout mismatch, also introduces calculation errors.
- Paper 3 proposes an operation module that supports only binary neural networks (BNNs), with binarized weights and activation values.
- BNN: binary neural network
- the shortcomings of the computing module in Paper 3 are: (1) the architecture supports only BNNs and cannot serve large-scale DNN models for vision applications such as object detection, so its scope of application is small; (2) the multiplication stage of the 1-bit MAC calculation needs at least one OR gate, two XNOR gates, two NOR gates, and a latch, so the transistor count and area are large.
- Paper 4 proposes an energy-saving SRAM with an embedded convolution computing function.
- the shortcomings of the SRAM in Paper 4 are: (1) each 1-bit computing SRAM cell has 10 transistors, and the more transistors per cell, the lower the storage density; (2) the parasitic capacitance on the bit line is used to store charge for the subsequent averaging operation.
- in addition, the article's solution uses a ramp-based ADC that takes up to 2^N − 1 steps to converge (N is the ADC resolution), which slows the analog-to-digital conversion and lowers computational throughput; the array's input also requires an additional DAC circuit to convert the input data X_in (usually a feature map) from a digital to an analog representation, and the non-ideal characteristics of this DAC cause further accuracy loss as well as area and energy costs.
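The conversion-speed gap called out above can be made concrete with a small sketch (illustrative only; `conversion_steps` is a hypothetical helper, not from the patent):

```python
def conversion_steps(n_bits: int, adc: str = "sar") -> int:
    """Worst-case number of comparison steps to produce an N-bit code.

    A ramp-based ADC sweeps up to 2^N - 1 reference levels before it
    converges, while a SAR ADC binary-searches in only N steps.
    """
    if adc == "sar":
        return n_bits
    return (1 << n_bits) - 1        # ramp ADC

# An 8-bit ramp conversion needs up to 255 steps; SAR needs only 8.
assert conversion_steps(8, "ramp") == 255
assert conversion_steps(8, "sar") == 8
```

This exponential-versus-linear step count is what motivates the SAR ADC adopted later in this disclosure.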
- the calculation unit for 1-bit multiplication in the prior-art MAC array uses many transistors; the capacitors that store the multiplication results and are used for accumulation correspond one-to-one with the storage cells, i.e. the number of storage cells equals the number of capacitors, and a capacitor is generally much larger than an SRAM cell, especially in advanced process nodes, so the MAC array occupies a large area; at the same time, the transistors are driven unconditionally during the multiply-accumulate operation, resulting in low efficiency; in addition, the high calculation error rate limits the applicable scenarios.
- the present invention provides a sub-unit, a MAC array, and a bit-width-reconfigurable analog-digital hybrid in-memory computing module.
- an implementation of the MAC array under a differential scheme is also provided.
- the present invention adopts the following technical solutions:
- an in-memory computing sub-unit for 1-bit multiplication, including: a traditional 6T SRAM cell, a complementary transmission gate, a first N-type MOS transistor, and a calculation capacitor;
- the traditional 6T SRAM cell is composed of MOS transistors M1, M2, M3, M4, M5, and M6; the CMOS inverter formed by M1 and M2 and the CMOS inverter formed by M3 and M4 are cross-coupled, the two cross-coupled CMOS inverters store a 1-bit filter parameter, and MOS transistors M5 and M6 are the control switches for the bit lines used to read and write the filter parameter;
- the output terminal of the CMOS inverter composed of MOS transistors M 1 and M 2 in the traditional 6T SRAM cell is connected to the input terminal of the complementary transmission gate, and the output terminal of the complementary transmission gate is connected to the drain of the first N-type MOS transistor;
- the source of the first N-type MOS transistor is grounded, and the drain is connected to the bottom plate of the calculation capacitor;
- the gate of the N-type MOS transistor of the sub-unit's complementary transmission gate is connected to the input signal, and during operation the gate of the P-type MOS transistor and the gate of the first N-type MOS transistor are driven at the same level;
- the multiplication result of the input signal and the filter parameter is stored as the voltage of the bottom plate of the calculation capacitor; multiple sub-units form a calculation unit, and the sub-units in the same calculation unit share the same first N-type MOS transistor and calculation capacitor.
- the filter parameter/weight w is written into and stored in the SRAM cell, the input signal A is connected to the gate of the N-type MOS transistor of the complementary transmission gate, the gate of the P-type MOS transistor of the complementary transmission gate is connected to the complementary signal nA, and the gate of the first N-type MOS transistor is connected to signal B.
- during calculation, the level of the complementary input signal nA is the same as that of signal B.
- a plurality of the sub-units constitute a calculation unit, and the sub-units in the same calculation unit share the same first N-type MOS transistor and calculation capacitor; the sub-units may be arranged in 2×2, 4×2, or other feasible configurations.
- this solution reduces the number of first N-type MOS transistors and calculation capacitors; taking a calculation unit composed of 2×2 sub-units as an example, it saves 3 first N-type MOS transistors and 3 calculation capacitors.
- the device count and area required are thus amortised over the sub-units, and the number of transistors needed per sub-unit approaches 8.
- the area of the required capacitor is likewise divided equally.
- the sub-units in a calculation unit are activated in a time-division-multiplexed manner: when one sub-unit is activated, the other sub-units in the same calculation unit are deactivated. After one sub-unit has participated in a calculation, the filter parameters stored in the SRAM cells of the other sub-units of the same calculation unit can be used for in-memory computation immediately, without first moving data in from outside and storing it in the SRAM. This greatly increases calculation speed, improves throughput, and reduces energy and area consumption.
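The time-division multiplexing described above can be sketched behaviourally (a hypothetical model with VDD normalised to 1; the function and variable names are illustrative, not from the patent):

```python
def shared_vbtm(unit_weights, active, a, vdd=1.0):
    """One calculation unit: n sub-units share a single first NMOS and
    a single calculation capacitor, so only the sub-unit selected by
    `active` may drive the shared bottom-plate voltage at any moment.
    The other sub-units' transmission gates stay off (deactivated).
    """
    assert 0 <= active < len(unit_weights)
    return vdd * unit_weights[active] * a   # V_btm = VDD * w * A

# A 2x2 unit holds four weights; cycling `active` reuses all of them
# without reloading the SRAM from outside the array.
unit = [1, 0, 1, 1]
results = [shared_vbtm(unit, k, a=1) for k in range(4)]
```

Each cycle produces the 1-bit product of the input with a different stored weight, which is why the stored parameters can be consumed back-to-back without external data movement.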
- a MAC array comprising the sub-units of the first aspect and its possible implementations, which performs multiply-accumulate operations, including: multiple calculation units, with the top plates of the calculation capacitors of all calculation units in the same column connected to the same accumulation bus.
- by using calculation units in the capacitor- and transistor-sharing mode, the MAC array can store more neural network parameters or activation values for the next network layer; specifically, each calculation unit completes a 1-bit multiplication and stores the result on its calculation capacitor.
- the calculation units in the same column of the MAC array accumulate their 1-bit multiplication results through the shared bus connected to the top plates of their calculation capacitors; the voltage of an accumulation bus corresponds to the accumulated sum of the multiplications in that column of the MAC array.
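Under ideal charge sharing with equal unit capacitors, the bus voltage is the average of the bottom-plate voltages. A behavioural sketch (assumed names, VDD normalised to 1, parasitics and mismatch ignored):

```python
def column_mac(weights, inputs, vdd=1.0):
    """Idealised charge-sharing accumulation on one column bus.

    Each calculation capacitor (equal value C) holds V_btm = vdd*w*a.
    Shorting all n top plates onto the accumulation bus shares charge,
    so the bus settles at the average of the bottom-plate voltages:
        V_bus = (vdd / n) * sum(w_k * a_k)
    """
    n = len(weights)
    return vdd / n * sum(w * a for w, a in zip(weights, inputs))

# Dot product of 4 weight/input pairs = 2, so V_bus = 2/4 * VDD = 0.5
assert column_mac([1, 0, 1, 1], [1, 1, 1, 0]) == 0.5
```

The bus voltage is therefore a scaled analog encoding of the column's binary dot product, which the ADC then digitises.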
- a device-sharing scheme is adopted, i.e. multiple sub-units for 1-bit multiplication share one transistor and one calculation capacitor for computation and storage; compared with the prior art, in which every 1-bit-multiplication sub-unit must be connected to its own capacitor to store the multiplication result, this greatly increases the storage capacity per unit area.
- the MAC array thus includes more SRAM cells per unit area and can store more neural-network filter parameters at once, reducing data movement.
- the MAC array further includes differential complementary transmission gates, differential calculation capacitors, and a first P-type MOS transistor; in each calculation unit of the MAC array, the output terminal of the CMOS inverter formed by M3 and M4 in each traditional 6T SRAM cell is connected to the input terminal of a differential complementary transmission gate, and the output terminals of all the differential complementary transmission gates connected to the M3/M4 inverters are connected to the drain of the same first P-type MOS transistor; the drain of the first P-type MOS transistor is connected to the bottom plate of the differential calculation capacitor, and its source is connected to VDD; the differential multiplication result is stored as the voltage of the bottom plate of the differential calculation capacitor, and the top plates of the differential calculation capacitors in the same column are connected to the same differential accumulation bus.
- in another implementation, the MAC array further includes a first CMOS inverter and a differential calculation capacitor; the output terminals of all complementary transmission gates in each calculation unit of the MAC array are connected to the input terminal of the same first CMOS inverter, and the output terminal of the first CMOS inverter is connected to the bottom plate of a differential calculation capacitor; the differential multiplication result is stored as the voltage of the bottom plate of the differential calculation capacitor, and the top plates of all differential calculation capacitors in the same column are connected to the same differential accumulation bus.
- a bit-width-reconfigurable analog-digital hybrid in-memory computing module, including:
- a MAC array, in which the column-wise accumulated multiplication result is expressed as an analog voltage;
- the filter/ifmap module provides the filter parameters, or the activation values computed by the previous layer, to be written into and stored in the MAC array; the ifmap/filter module provides the input of the MAC array, which is multiplied with the stored neural-network filter parameters or previous-layer activation values; the analog-to-digital conversion module converts the analog voltage produced by the MAC array into a digital representation; the digital processing module performs at least operations such as multi-bit fusion, bias, scaling, or non-linearity, and its output is a partial sum or an activation value that can be used directly as the input of the next network layer.
- the filter parameters, or the activation values computed by the previous layer of the neural network, are written through the filter/ifmap module and stored in the MAC array; this follows the standard traditional 6T SRAM write process, so the SRAM in each sub-unit stores a logic 1 or 0 and performs multiply-accumulate operations with the input provided by the ifmap/filter module.
- the multiplication between the stored value in each subunit and the input is a digital operation, which is equivalent to AND.
- the result of the multiplication is stored in the calculation capacitor.
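One plausible reading of the "multi-bit fusion" performed by the digital processing module is a shift-and-add recombination of per-bit-plane 1-bit MAC results; the sketch below illustrates that principle under this assumption, not the patent's exact circuit:

```python
def fuse_bit_planes(partial_sums):
    """Recombine digitised 1-bit MAC results into a multi-bit result.

    partial_sums[i] is the column sum obtained when bit i of the input
    (LSB first) was applied to the array; weighting by 2**i fuses the
    bit planes back into the full-precision dot product.
    """
    return sum(p << i for i, p in enumerate(partial_sums))

# value = p0 + 2*p1 + 4*p2 + 8*p3
assert fuse_bit_planes([1, 0, 3, 2]) == 1 + 0 + 12 + 16   # = 29
```

Because the fusion is a pure digital post-step, the same 1-bit array supports different input bit widths, which is what makes the module's bit width reconfigurable.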
- the analog-to-digital conversion module adopts a SAR ADC, specifically a SAR ADC with a binary-weighted capacitor array.
- the sparsity of the input values and of the values stored in the MAC array allows some capacitors in the SAR DAC to skip their switching sequence, yielding higher energy efficiency and ADC conversion speed.
- the bit width of the SAR ADC of each column of the MAC array can be determined in real time from the sparsity of the input and stored values.
- the MAC DAC and the SAR DAC can be connected together.
- the MAC DAC refers to a column of calculation capacitors in the MAC array; that is, those calculation capacitors are connected in parallel with the capacitors in the SAR DAC.
- the MAC DAC is allowed to be multiplexed into a SAR DAC through bottom-plate sampling, so that the same capacitor array realizes both the MAC operation and the analog-to-digital conversion, avoiding the mismatch and accuracy loss caused by using different capacitor arrays for the MAC DAC in the MAC stage and the SAR DAC in the conversion stage; furthermore, it allows a fully differential SAR ADC to be realized, which better addresses the common-mode-dependent comparator input offset voltage.
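The N-step successive approximation itself can be sketched as a plain binary search (an idealised model with no mismatch; when the MAC capacitors are multiplexed as the SAR DAC, the same array performs both the accumulation and this conversion):

```python
def sar_adc(v_in, n_bits, v_ref=1.0):
    """Idealised SAR conversion: one binary-weighted trial per bit.

    Each step tentatively sets the next bit, compares v_in against the
    DAC level trial * v_ref / 2^N, and keeps the bit if v_in is higher.
    Converges in n_bits steps (vs 2^N - 1 for a ramp-based ADC).
    """
    code = 0
    for bit in range(n_bits - 1, -1, -1):
        trial = code | (1 << bit)
        if v_in >= trial * v_ref / (1 << n_bits):
            code = trial            # keep this bit set
    return code

assert sar_adc(0.5, 4) == 8         # mid-scale -> MSB only
assert sar_adc(0.3, 4) == 4         # floor(0.3 * 16)
```

When an input or weight column is sparse, the accumulated voltage is small and the high-order trials all fail, which is the sense in which sparsity lets switching steps be skipped.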
- FIG. 1 is a schematic diagram of the structure of a sub-unit for 1-bit multiplication in an embodiment of the present invention;
- FIG. 2 is a schematic diagram of the truth table of a 1-bit multiplication unit in an embodiment of the present invention;
- FIG. 3a is a schematic diagram of the arrangement of sub-units in a calculation unit in an embodiment of the present invention;
- FIG. 3b is a schematic diagram of a calculation unit composed of multiple sub-units in an embodiment of the present invention;
- FIG. 3c is the truth table of a calculation unit in operation in an embodiment of the present invention;
- FIG. 4a is a schematic diagram of a MAC array in an embodiment of the present invention;
- FIG. 4b is a schematic diagram of the bottom- and top-plate voltages of a calculation capacitor in an embodiment of the present invention;
- FIG. 5a is a schematic diagram of the 10T structure in an embodiment of the present invention;
- FIG. 5b is a schematic diagram of an expansion of a calculation unit in an embodiment of the present invention;
- FIG. 6a is a schematic diagram of the 8T structure in an embodiment of the present invention;
- FIG. 6b is a schematic diagram of the expansion of a calculation unit in another embodiment of the present invention;
- FIG. 6c is a schematic diagram of the MAC array structure under the differential scheme in an embodiment of the present invention;
- FIG. 7 is a schematic diagram of an in-memory computing module in an embodiment of the present invention;
- FIG. 8 is a schematic diagram of an analog-to-digital conversion module in an embodiment of the present invention;
- FIG. 9 is a schematic diagram of an analog-to-digital conversion module in another embodiment of the present invention;
- FIG. 10 is a schematic diagram of an analog-to-digital conversion module in another embodiment of the present invention;
- FIG. 11 is a schematic diagram of an analog-to-digital conversion module in another embodiment of the present invention;
- FIG. 12 is a schematic diagram of the differential structure of an analog-to-digital conversion module in another embodiment of the present invention;
- FIG. 13 is a schematic diagram of an architecture for reducing the energy consumption of analog-to-digital conversion in an embodiment of the present invention.
- the bit-width-reconfigurable analog-digital hybrid computing module provided by the embodiments of the present invention can be applied in visual and acoustic DNN architectures, and more specifically can be used for object detection, low-power acoustic feature extraction, and the like.
- the data to be processed is convolved with a filter composed of weights in the feature extractor, and then the corresponding feature map is output.
- Different filter selections will result in different extracted features.
- the convolution of the data to be processed with the filter consumes the most energy, and the energy waste caused by unconditionally driving the circuit must be avoided, especially when the data to be processed is a sparse matrix.
- FIG. 1 is a schematic diagram of an embodiment of the sub-unit structure used for 1-bit multiplication; it performs the 1-bit multiplication calculation and includes: a traditional 6T SRAM cell, a complementary transmission gate, a first N-type MOS transistor, and a calculation capacitor;
- the traditional 6T SRAM cell is composed of MOS transistors M1, M2, M3, M4, M5, and M6; the CMOS inverter formed by M1 and M2 and the CMOS inverter formed by M3 and M4 are cross-coupled, the two cross-coupled CMOS inverters store a 1-bit filter parameter, and MOS transistors M5 and M6 are the control switches for the bit lines used to read and write the filter parameter;
- the output terminal of the CMOS inverter composed of MOS transistors M 1 and M 2 in the traditional 6T SRAM cell is connected to the input terminal of the complementary transmission gate, and the output terminal of the complementary transmission gate is connected to the drain of the first N-type MOS transistor;
- the source of the first N-type MOS transistor is grounded, and the drain is connected to the bottom plate of the calculation capacitor;
- in operation, the gate of the N-type MOS transistor of the sub-unit's complementary transmission gate is connected to the input signal, and the gate of the P-type MOS transistor and the gate of the first N-type MOS transistor are driven at the same level;
- the multiplication result of the input signal and the filter parameter is stored as the voltage of the bottom plate of the calculation capacitor; multiple sub-units form a calculation unit, and the sub-units in the same calculation unit share the same first N-type MOS transistor and calculation capacitor.
- the input signals at the gates of the N-type and P-type transistors of the complementary transmission gate are A and nA, respectively, and the gate signal of the first N-type MOS transistor is B.
- during calculation, signals B and nA are the same.
- signal nA and signal B can share a node to provide the same level.
- the traditional 6T SRAM stores the written filter parameter w; the write follows the standard 6T SRAM cell write process, i.e. the word line WL is set to VDD and the bit lines BL and nBL are set to 0 or 1 according to the value being written.
- for example, with the word line WL at the high level VDD, MOS transistors M5 and M6 are turned on; with bit line BL set to 0 and nBL set to VDD, node W is discharged through M6 to logic 0, and node nW is charged through M5 to logic 1.
- the process for the subunit to perform one-bit multiplication calculation is as follows:
- the bottom-plate voltage V_btm of the calculation capacitor is either kept at 0 or charged to VDD, depending on the input and the stored weight.
- the output of the multiplication operation is the bottom-plate voltage of the calculation capacitor, expressed as VDD × w × A.
- the first N-type MOS transistor in the sub-unit plays a controlling role, and the result of the 1-bit multiplication of the input signal A with the filter parameter w stored in the SRAM cell is stored as the bottom-plate voltage of the capacitor.
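The sub-unit's behaviour reduces to the truth table V_btm = VDD · w · A; a behavioural sketch of that table (VDD normalised to 1; function and variable names are illustrative):

```python
VDD = 1.0

def subunit_multiply(w, a):
    """Behavioural model of the sub-unit's 1-bit multiply.

    w: filter parameter stored in the 6T SRAM cell (0 or 1)
    a: input signal A on the transmission gate's N-gate (0 or 1)

    A = 1 turns the transmission gate on and passes the stored weight
    to the capacitor bottom plate; A = 0 (so B = nA = 1) turns on the
    first NMOS, which pulls the bottom plate to ground.
    """
    return VDD * w if a == 1 else 0.0

# V_btm = VDD * w * A for every input combination
assert all(subunit_multiply(w, a) == VDD * w * a
           for w in (0, 1) for a in (0, 1))
```

Note that the bottom plate is only driven high when both A and w are 1, so a sparse (mostly zero) input never charges the capacitor, which is the energy-saving property claimed above.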
- the structure in which the SRAM cell is connected to the complementary transmission gate comprises 8 transistors and is called the 8T structure (8T sub-cell).
- the calculation subunit is an external expansion of the standard traditional 6T SRAM unit.
- on the one hand, implementing a standardized structure brings better economic benefits; on the other hand, it improves the scalability of the sub-unit.
- the complementary transmission gate is connected to the bottom plate of the calculation capacitor; compared with the prior art, which connects to the top plate, this minimizes calculation errors, especially those caused by clock feedthrough and charge injection when the MOS switch turns from on to off, by the non-linear parasitic capacitance at the drain/source of the transmission-gate transistors, and by leakage of the transistors themselves.
- a plurality of the sub-units form a calculation unit; the sub-units can be arranged in feasible configurations such as 2×2 or 4×2 (see FIG. 3a for the arrangement).
- FIG. 3b shows a calculation unit composed of four sub-units, where WL0 is the word line shared by sub-units a and b, WL1 is the word line shared by sub-units c and d, BL0 and nBL0 are the bit lines shared by sub-units a and c, BL1 and nBL1 are the bit lines shared by sub-units b and d, B0 is the gate signal, W0a–W0d and nW0a–nW0d denote the weight-storage nodes in the four sub-units, and V_btm0 is the bottom-plate voltage of the first calculation unit.
- Each sub-unit in the same computing unit retains its own 8T structure (8T sub-cell, 8 transistors), and all sub-units share the same first N-type MOS transistor and computing capacitor.
- the output terminal of the complementary transmission gate of each sub-unit in the same calculation unit is connected to the drain of the same first N-type MOS transistor, and the drain of the first N-type MOS transistor is connected to the bottom plate of a calculation capacitor.
- taking the calculation unit composed of 2×2 sub-units as an example, this saves 3 first N-type MOS transistors and 3 calculation capacitors. It should be understood that the more sub-units share the first N-type MOS transistor and calculation capacitor, the closer the transistor count per sub-unit gets to 8.
- since the area occupied by a single capacitor is generally several times that of an entire 6T SRAM cell, the gap is huge.
- the sub-units adopt a device-sharing scheme, i.e. multiple sub-units used for 1-bit multiplication share one capacitor to store their calculation results.
- in the prior art, each sub-unit needs its own capacitor to store the multiplication result; sharing greatly increases the storage capacity per unit area, i.e. more filter parameters or weights can be stored per unit area than in the prior art.
- the sub-units in the computing unit are activated in a time-division multiplexing manner, that is, when one sub-unit is activated, other sub-units in the same computing unit are deactivated.
- the 1-bit multiplication proceeds as described above; see FIG. 3c for the truth table of the calculation unit.
- the signals at the gates of the N-type and P-type MOS transistors of each sub-unit's complementary transmission gate are A_ij and nA_ij respectively, where i is the index of the cell column (a non-negative integer from 0 to n−1) and j is the index of the sub-unit within the cell; in a 2×2 arrangement, j takes the values a, b, c, d.
- the sub-units sharing one calculation capacitor and one first N-type MOS transistor means that a calculation unit contains multiple sub-units usable for multiply-accumulate operations; note that this differs from a single independent sub-unit.
- here the gate input signal B_i of the first N-type MOS transistor and the complementary input signal nA_ij at the P-gate of each sub-unit's complementary transmission gate are controlled separately: under time-division multiplexing, although nA_ij and B_i have the same level in the sub-unit that is active at a given moment, the two can no longer share a node.
- compared with the same number of independent sub-units, a calculation unit formed of n sub-units needs n−1 fewer calculation capacitors and n−1 fewer first N-type MOS transistors, and the per-sub-unit structure completing the 1-bit multiplication approaches 8 transistors.
- the area occupied by the calculation capacitor is several times that of the SRAM unit, reducing the number of calculation capacitors per unit area can increase the storage capacity of the array module composed of the calculation unit.
- a MAC array including the sub-units of the first aspect and possible implementations of the first aspect is provided to perform multiplication and addition operations.
- the MAC array includes multiple calculation units, and the top plates of the calculation capacitors of all calculation units in the same column are connected to the same accumulation bus.
- calculation units adopting the shared-capacitor-and-transistor mode allow the MAC array to store more neural-network parameters or values computed by the previous network layer. Specifically, each calculation unit completes a 1-bit multiplication and stores the result in its calculation capacitor, and the calculation units in the same column of the MAC array accumulate their respective 1-bit products through the shared accumulation bus connected to the capacitor top plates. A MAC array of a given area thus includes more SRAM cells and can store more filter parameters at once than the prior art. After one sub-unit's calculation is completed, the filter parameters stored in the other sub-units of the same unit can be used for in-memory computation immediately, with no need to move data in from outside and store it in the SRAM before computing; this improves throughput and reduces energy and area consumption.
- in general, since the area occupied by the calculation capacitor is several times that of a traditional 6T SRAM cell, reducing the number of capacitors in the calculation units improves the module's throughput and reduces its energy consumption.
- the top plates of all calculation capacitors in the same column are connected together by an accumulation bus, whose voltage is V_top. To be clear, multiple calculation units are distributed along a column; each calculation unit corresponds to one calculation capacitor and contains multiple sub-units as described in the first aspect or its embodiments. The parameters V_btm0 to V_btm(N-1) in Fig. 4b represent the bottom-plate voltages of the first through Nth calculation units.
- the MAC array performs multiplication and addition operations in the following "mode one":
- the filter parameters (or activation values computed by the previous network layer) are first written into the cells according to the 6T SRAM write process and stored in the sub-units;
- the top-plate voltage V_top of the calculation capacitors is reset to V_rst through the reset switch S_rst on the accumulation bus; V_rst may be 0;
- after the multiplications in each calculation unit of a column are complete, the bottom-plate voltage V_btmi of each calculation capacitor either remains at 0 or goes to VDD. Charge is then redistributed among the column's calculation capacitors, similar to the charge redistribution among the capacitors of a SAR DAC. Ignoring parasitic capacitance and other non-idealities, the analog output voltage V_top of a column of calculation capacitors represents the accumulated result, as shown in Fig. 4b.
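The ideal charge-redistribution result can be illustrated with a minimal numerical sketch. This is not the patent's circuit: it assumes identical capacitances, V_rst = 0, and no parasitics, so the shared top plate settles at the average of the bottom-plate voltages; the function name `column_mac_vtop` is ours.

```python
# Sketch (idealized, not the patent's circuit): charge redistribution on one
# column of N equal calculation capacitors after the 1-bit multiplications.
# Assumptions: V_rst = 0, identical capacitances, no parasitic capacitance.

VDD = 1.0

def column_mac_vtop(weights, inputs, vdd=VDD):
    """Each 1-bit product w*a sets a bottom plate to 0 or VDD; after charge
    sharing the top plate equals the average: V_top = (VDD/N) * sum(w_i*a_i)."""
    assert len(weights) == len(inputs)
    v_btm = [vdd * (w & a) for w, a in zip(weights, inputs)]  # digital AND
    return sum(v_btm) / len(v_btm)

# 4-unit column: products are 1, 0, 0, 1 -> V_top = 2 * VDD / 4 = 0.5
print(column_mac_vtop([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5
```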
- the MAC array can operate according to the following "mode two":
- the signals A_ij and nA_ij are activated in a time-division-multiplexed manner.
- after the multiplications in each calculation unit of a column are complete, the bottom-plate voltage V_btmi of each calculation capacitor either remains at 0 or goes to VDD. S_rst is then disconnected, the bottom-plate voltages V_btmi are set to 0 or VDD, and the MOS switches in the control module of each calculation unit run a successive-approximation algorithm for analog-to-digital conversion. Taking the case where all V_btmi are set to 0 as an example, the voltage V_top can be expressed as a weighted sum of the stored products, where W_ij represents the filter parameter of the jth sub-unit in the ith calculation unit.
- the MAC array described can be used for multi-bit-weight calculation. The calculation units of each column perform a bit-wise MAC operation, and the multi-bit-weight output is obtained by shifting and adding the digital representations after analog-to-digital conversion. For example, for a k-bit weight or filter parameter, each column performs the MAC of one bit: the first column may perform the MAC of the lowest bit (bit 0) with the input signal, and the kth column the MAC of the highest bit with the input signal. In effect, each column performs a MAC for one bit of a multi-bit binary weight; the MAC results of all participating columns form k elements, and after analog-to-digital conversion these k elements are shifted and added in the digital domain.
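The shift-and-add recombination of per-column bitwise MACs can be sketched as follows. The helper names are illustrative, and each column's ADC output is idealized as an exact count of 1×1 products.

```python
# Sketch: multi-bit-weight MAC via per-column bitwise MACs plus digital
# shift-and-add, assuming column c holds bit c of every k-bit weight.

def bitwise_column_macs(weights, inputs, k):
    """Return the k per-column bitwise MAC counts (idealized ADC outputs)."""
    return [sum(((w >> c) & 1) & a for w, a in zip(weights, inputs))
            for c in range(k)]

def shift_add(col_results):
    """Recombine the per-bit columns in the digital domain."""
    return sum(r << c for c, r in enumerate(col_results))

weights, inputs, k = [5, 3, 6], [1, 1, 0], 3     # 3-bit weights
cols = bitwise_column_macs(weights, inputs, k)   # [2, 1, 1]
print(shift_add(cols))                           # 5*1 + 3*1 + 6*0 = 8
assert shift_add(cols) == sum(w * a for w, a in zip(weights, inputs))
```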
- the MAC array further includes differential complementary transmission gates, differential calculation capacitors, and a first P-type MOS transistor. Within each calculation unit of the MAC array, the output terminal of the CMOS inverter formed by MOS transistors M3 and M4 in each traditional 6T SRAM cell is connected to the input terminal of a differential complementary transmission gate, and the output terminals of all such differential complementary transmission gates in the same calculation unit are connected to the drain of the same first P-type MOS transistor. The drain of the first P-type MOS transistor is connected to the bottom plate of the differential calculation capacitor, and its source is connected to VDD. The differential multiplication result is stored as the bottom-plate voltage of the differential calculation capacitor, and the top plates of the differential calculation capacitors of all differential units in the same column are connected to the same differential accumulation bus.
- for convenience of description, the structure in which the 6T SRAM is connected to both the differential complementary transmission gate and the complementary transmission gate is called a 10T structure (10T sub-cell, containing 10 transistors).
- this connection structure of the calculation units constituting the MAC array is called the first differential unit. The first differential unit is an extension of the calculation unit and, further, of the sub-units that compose it. Apart from the 10T structures, some transistors and capacitors within the first differential unit are shared — specifically the first N-type MOS transistor, the first P-type MOS transistor, the differential calculation capacitor, and the calculation capacitor — and the sub-units in the first differential unit are likewise activated in the time-division-multiplexed manner.
- in other embodiments, the MAC array further includes a first CMOS inverter and a differential calculation capacitor: the output terminals of all complementary transmission gates in each calculation unit constituting the MAC array are connected to the input terminal of the same first CMOS inverter, and the output terminal of the first CMOS inverter is connected to the bottom plate of a differential calculation capacitor. Similarly, for convenience of description, the 6T SRAM structure connected to the complementary transmission gate is an 8T structure (8T sub-cell, containing 8 transistors). Referring to Fig. 6b, this connection structure of the calculation units constituting the MAC array is called the second differential unit, and the differential multiplication result is stored as the bottom-plate voltage of the differential calculation capacitor. Apart from the 8T structures, some transistors and capacitors in the second differential unit are shared — specifically the first N-type MOS transistor, the first CMOS inverter, the differential calculation capacitor, and the calculation capacitor — and the sub-units in the second differential unit are likewise activated in the time-division-multiplexed manner.
- Fig. 6c is a schematic diagram of a differential MAC-array architecture composed of the aforementioned first or second differential units. The top plates of all calculation capacitors in the same column are connected to the same accumulation bus, and the top plates of all differential calculation capacitors are connected to the same differential accumulation bus. The parameters V_top_p_1, V_top_p_2, V_top_n_1 and V_top_n_2 in Fig. 6c represent the voltages produced by computation in the differential calculation units.
- a bit-width-reconfigurable analog-digital hybrid in-memory computing module is provided; see Fig. 7. It includes: the MAC array of the second aspect or any possible implementation thereof, in which the column-wise accumulation result after computation is expressed as an analog voltage, namely the capacitor top-plate voltage V_top of the above embodiments; a filter/ifmap module that provides the filter parameters written into and stored in the MAC array — the stored values may also be the outputs of the previous network layer; an ifmap/filter module that provides the inputs of the MAC array, specifically the inputs of the complementary transmission gates in the calculation units, which are multiplied and accumulated with the stored filter parameters or previous-layer activation values; an analog-to-digital conversion module that converts the analog voltage obtained by the MAC operation into a digital representation; and a digital processing module that performs at least operations such as multi-bit fusion, bias, scaling, or non-linearity on the digital representation output by the analog-to-digital conversion module, outputting a partial sum or an activation value (feature map) directly usable by the next network layer.
- when the module of this application is used for the MAC computation of a neural network, it generally contains more storage cells (6T SRAM cells) in the same area, so more filter parameters (weights) can be loaded at once. After the first network layer finishes computing, the partial sums or activation values (feature maps) it outputs for the next layer can immediately be MAC-computed with the filter parameters (weights) pre-loaded and stored in the module, reducing the waiting time and power consumption of off-chip data transfer. The large on-chip storage capacity of the module thus improves throughput. Optionally, the storage cells may also store in the MAC array the activation values (feature values) output by the current layer.
- in addition to sharing transistors and calculation capacitors within the calculation units of the MAC array as described in the first and second aspects, the calculation units also share, in the non-MAC-array area of the module, some of the transistors participating in analog-to-digital conversion and digital processing.
- the analog-to-digital conversion module may be a SAR ADC with a parallel-capacitor structure, which converts the top-plate voltage V_top output by a column of calculation units into a digital representation; it includes a MAC DAC, a SAR DAC, a comparator, a switch sequence, and SAR logic, where the SAR logic controls the switch sequence. A SAR ADC with a parallel-capacitor structure makes full use of the existing structure of the present invention, saving devices and reducing area. The MAC DAC is composed of the calculation capacitors of one column of calculation units of the aforementioned MAC array connected in parallel.
- in addition to the B capacitors whose capacitances decrease by a factor of 2, the SAR DAC includes a redundant capacitor equal in value to the lowest LSB capacitor. For example, the capacitance of C_0 is C/4, the reference-voltage ratios that can be distributed from MSB to LSB are 1/2, 1/4 and 1/8, and the capacitance of the redundant capacitor C_U is C/4. One end of each of the B capacitors and of the redundant capacitor are connected in parallel; the other end of each of the B capacitors is connected to the switch sequence, while the other end of the redundant capacitor is always grounded. The free ends of the switch sequence include a VDD terminal and a ground terminal, and the SAR logic controls the switch sequence.
- the output voltage V_top of the MAC DAC serves as the positive input V+ of the comparator, and the output V_SAR of the SAR DAC serves as the negative input V−. The SAR logic controls the switch sequence so that the negative input V− successively approximates the positive input V+, and the final SAR logic output is the digital representation of V+.
- the activation sparsity of the MAC array can spare some capacitors in the SAR DAC from switching, yielding higher energy efficiency and ADC conversion speed. For example, if the number of MAC capacitors whose bottom-plate voltage V_btmi is VDD is less than 25% — that is, if in a column of the MAC array the number of calculation units performing a 1×1 multiplication (as opposed to 1×0, 0×0, or 0×1) is less than 1/4 of the units in that column — then the switches S_(B-1) and S_(B-2) of the switch sequence corresponding to the first two SAR DAC capacitors C_(B-1) and C_(B-2) can be tied to ground, rather than unconditionally activating all SAR DAC capacitors for digital-to-analog conversion, which saves energy. It should be noted that the connections of the V+ and V− sides of the comparator shown in the drawings of the present invention are only for convenience of description; in fact, the V+ and V− connections can be interchanged.
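A behavioral sketch (not the patent's switch-level DAC) of how a known bound on the column's MAC result lets SAR logic keep MSB capacitors grounded: any MSB trial whose trial level already exceeds the bound would certainly overshoot, so its capacitor never switches. The function name and the trial counter are ours.

```python
# Sketch: binary-search SAR conversion in which a bound on the input lets the
# logic skip MSB trials that would always exceed the input. Illustrative only.

def sar_convert(v_in, bits, vdd=1.0, known_max=None):
    """Return (code, trials). Trials counts switch events actually performed;
    MSB trials proven impossible by known_max are skipped (capacitor grounded)."""
    code, trials = 0, 0
    for b in reversed(range(bits)):
        trial = code | (1 << b)
        if known_max is not None and trial * vdd / (1 << bits) > known_max:
            continue                      # capacitor stays grounded, no switching
        trials += 1
        if trial * vdd / (1 << bits) <= v_in:
            code = trial
    return code, trials

full = sar_convert(0.20, 4)                    # (3, 4): all 4 trials performed
sparse = sar_convert(0.20, 4, known_max=0.25)  # (3, 3): same code, one fewer
print(full, sparse)
```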
- in another implementation, the MAC DAC and the SAR DAC can be connected together — that is, all capacitors are connected in parallel — and the total voltage generated serves as the positive input V+ of the comparator, while the negative input V− of the comparator is V_ref; V_ref may be 0.
- optionally, a half-LSB capacitor is added in parallel to both the MAC DAC on the positive-input V+ side and the SAR DAC on the negative-input V− side of the comparator. The other end of the half-LSB capacitor on the V+ side is always grounded, while the other end of the half-LSB capacitor on the V− side can be connected to a switch sequence. This produces a half-LSB voltage difference between the discrete analog levels of the MAC DAC and the SAR DAC, providing additional error tolerance. The half-LSB capacitor may be implemented as two lowest-LSB capacitors connected in series to achieve good matching.
- in a further implementation, the MAC DAC is allowed to be multiplexed as a SAR DAC through bottom-plate sampling. The positive input V+ of the comparator is connected to the MAC DAC and a half-LSB capacitor; the capacitors of the 1st to (N−1)th units of the MAC DAC and the half-LSB capacitor can be connected to the VDD or ground terminal of the switch sequence, while the capacitor of the Nth unit can optionally be connected to the ground terminal. The negative input V− of the comparator is connected not to a capacitor but to the voltage V_ref. In this embodiment, the MAC DAC is thus also the SAR DAC.
- FIG. 12 shows a differential MAC architecture, which solves the problem of common-mode-dependent comparator input offset voltage. In the figure, nS_0–nS_(B-1) and S_BX–nS_BX all denote switches of the switch sequence. The positive input V+ of the comparator is connected to the MAC DAC and an additional LSB capacitor; the capacitors of the 1st to (N−1)th units of the MAC DAC and the additional LSB capacitor can be connected to the VDD or ground terminal of the switch sequence, and the capacitor of the Nth unit can be connected to the grounded switch sequence. The negative input V− of the comparator is connected to the differential MAC DAC and an additional differential LSB capacitor; the capacitors of the 1st to (N−1)th units of the differential MAC DAC and the additional differential LSB capacitor can all be connected to the switch sequence, and the capacitor of the Nth unit can optionally be connected to the grounded switch sequence. The differential MAC DAC comprises the differential calculation capacitor array in the MAC array. It should be noted that the differential MAC architecture must be implemented in combination with the aforementioned differential structure modules.
- the bit width of a column's SAR ADC can be determined in real time from the sparsity of the input data and of the values stored in the column, so that the average number of capacitors in the binary-weighted capacitor array that must be charged and discharged during analog-to-digital conversion can be greatly reduced, substantially saving conversion energy. The real-time bit width of the SAR ADC can be calculated as ceil(log2(min(X, W) + 1)), where ceil is the round-up function, min is the minimum function, X is the number of 1s in the 1-bit input vector (X_1–X_m denote the 1st to mth 1-bit input values, and X can be obtained through an adder-tree computation), and W is the number of 1s stored in a column of the calculation array (W_1–W_m denote the weight values stored in the 1st to mth cells of the column; W can be computed off-chip and is already stored in the SAR logic when the data are stored into the calculation array). Optionally, the min, log2 and ceil functions in the bit-width formula can be replaced by simple digital combinational logic yielding the same result.
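The bit-width formula can be checked numerically. The sketch below is illustrative: `sar_bitwidth` is our name, and the on-chip adder tree and off-chip precomputation are modeled as plain sums.

```python
# Sketch of the real-time bit-width formula ceil(log2(min(X, W) + 1)),
# where X counts 1s in the 1-bit input vector and W counts 1s stored
# in the column. A column MAC result can never exceed min(X, W), so
# this many bits always suffice to represent it.

import math

def sar_bitwidth(x_bits, w_bits):
    X = sum(x_bits)    # obtainable on-chip with an adder tree
    W = sum(w_bits)    # precomputed off-chip, held in the SAR logic
    return math.ceil(math.log2(min(X, W) + 1))

print(sar_bitwidth([1, 1, 0, 1], [1, 0, 0, 0]))  # min(3,1)+1 = 2 -> 1 bit
print(sar_bitwidth([1] * 16, [1] * 9))           # min(16,9)+1 = 10 -> 4 bits
```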
- the modules described are divided only according to functional logic and are not limited to the above division, as long as the corresponding functions can be realized; in addition, the specific names of the functional units are merely for ease of distinguishing them from one another and do not limit the protection scope of the present invention.
Abstract
An analog-digital hybrid in-memory computing sub-unit for 1-bit multiplication requires only 9 transistors. On this basis, multiple sub-units share a calculation capacitor and transistors to form one calculation unit, so that the average transistor count per sub-unit approaches 8. A MAC array for multiply-and-accumulate computation is further proposed, comprising multiple calculation units, with the sub-units within each unit activated in a time-division-multiplexed manner. Further, a differential MAC-array scheme is proposed to improve computational error tolerance. Further, an in-memory analog-digital hybrid computing module is proposed, which digitizes the parallel analog outputs of the MAC array and performs other digital-domain operations. The analog-to-digital conversion module in the computing module makes full use of the capacitors of the MAC array, reducing both the module's area and the computation error. Further, a method is proposed that fully exploits data sparsity to save the energy consumption of the analog-to-digital conversion module.
Description
This application claims priority to Chinese patent application No. 202010418649.0, entitled "Sub-unit, MAC array, and bit-width-reconfigurable analog-digital hybrid in-memory computing module", filed with the China Patent Office on May 18, 2020, the entire contents of which are incorporated herein by reference.
The present invention relates to the field of analog-digital hybrid in-memory computing, and more specifically to a sub-unit, a MAC array, and a bit-width-reconfigurable analog-digital hybrid in-memory computing module.
At present, emerging edge applications such as mobile and Internet-of-Things devices demand high energy efficiency and a high computation rate per unit area. High energy efficiency means longer battery life, while a high per-area computation rate means smaller area — and therefore lower cost — at a given computation rate. Feed-forward inference in deep neural networks (DNNs) is dominated by multiply-and-accumulate (MAC) computation, calling for energy-efficient, low-area MAC implementations that also reduce the amount of data movement. Conventional digital implementations of MAC offer strong noise immunity, high precision, good scalability, and mature design methodology, but digital circuits occupy a large chip area and consume considerable power, making large, energy-efficient neural networks difficult to realize. Moreover, the data-exchange bottleneck between memory and the central processing unit inherent in the von Neumann architecture of conventional digital circuits severely limits computation energy efficiency and speed under the large-scale data movement of DNN applications. Analog implementations of MAC are structurally simple and low-power, so analog and mixed-signal computation has the potential for high energy efficiency. To break the von Neumann bottleneck, in-memory computing — a research hotspot in recent years — fundamentally cannot be realized in purely digital form and requires analog assistance. Meanwhile, because DNNs tolerate computation errors, including those caused by circuit noise, relatively well, DNN application-specific integrated circuits (ASICs) are attracting renewed attention.
The papers "A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement", DOI:10.1109/VLSIC.2018.8502421 (hereinafter "Paper 1"), and "A Microprocessor implemented in 65nm CMOS with configurable and bit-scalable accelerator for programmable in-memory computing", arXiv:1811.04047 (hereinafter "Paper 2"), describe a 1-bit MAC whose multiplication stage is equivalent to an XNOR of a 1-bit weight and a 1-bit input, storing the XNOR result as a voltage on a capacitor; the addition stage uses charge sharing — each capacitor ends with the same charge while the total charge is conserved — to produce the 1-bit MAC result. Each 1-bit calculation cell of this scheme has 10 transistors. The prior art of Papers 1 and 2 has the following problems: (1) every addition operation unconditionally drives the transmission gate in every calculation cell, so input-data sparsity cannot be exploited to save energy; (2) every 1-bit multiplication cell is provided with its own capacitor, and the metal-oxide-metal (MOM) capacitors of the successive-approximation (SAR) analog-to-digital converter (ADC) sit outside the static random-access memory (SRAM) computing array because there is no room inside it, reducing area efficiency; (3) the charge-sharing addition stage requires connections to the top plates of the capacitors storing the XNOR results. This topology makes the addition vulnerable to non-ideal effects such as charge injection, clock feedthrough, non-linear parasitic capacitance at the drains/sources of the transmission-gate transistors, and leakage of the transistors connected to the capacitor top plates, causing computation errors. In addition, mismatch between the computation capacitors and the capacitors of the DAC inside the ADC due to physical-layout mismatch also causes computation errors.
The paper "An always-on 3.8μJ/86%CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28nm CMOS", DOI:10.1109/ISSCC.2018.8310264 (hereinafter "Paper 3"), proposes a computing module supporting only binary neural networks (BNNs) with binarized weights and activations. The shortcomings of the module in Paper 3 are: (1) the architecture supports only BNNs and cannot serve large DNN models for vision applications such as object detection, so its range of application is narrow; (2) the multiplication stage of its 1-bit MAC requires at least one OR gate, two XNOR gates, two NOR gates, and a latch, using many transistors and occupying a large area.
The paper "Conv-RAM: an energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications", DOI:10.1109/ISSCC.2018.8310397 (hereinafter "Paper 4"), proposes an energy-saving SRAM with embedded convolution computation. The shortcomings of the SRAM in Paper 4 are: (1) each 1-bit computing SRAM cell has 10 transistors, and the more transistors per cell, the lower the storage density; (2) it uses the parasitic capacitance of the bit lines to store charge for the subsequent averaging operation; compared with explicit capacitors such as MOM capacitors, bit-line parasitic capacitance is poorly modeled and may suffer larger mismatch, lowering computation accuracy; (3) the horizontal charge-averaging method used in the paper requires 6 extra transistors shared among several rows of cells, limiting throughput because not all rows can compute simultaneously; (4) the common-mode voltage on the differential charge-averaging lines Vp_AVG and Vn_AVG depends on the magnitude of the input data X_in and is unstable after the average is evaluated by the local MAV circuit, so efficient high-speed differential ADCs such as SAR ADCs are not applicable; the paper's scheme instead adopts a ramp-based ADC that takes up to 2^(N-1) steps to converge (N being the ADC resolution), slowing analog-to-digital conversion and lowering computation throughput; (5) the array input uses an extra DAC circuit to convert the input data X_in (usually a feature map) from digital to analog representation, and the DAC's non-ideal characteristics cause further precision loss as well as area and energy overhead.
In summary, in neural-network computation, the calculation cells performing 1-bit multiplication in prior-art MAC arrays use many transistors; the capacitors storing multiplication results for accumulation correspond one-to-one with the storage cells — the number of storage cells equals the number of capacitors — and a capacitor is generally much larger than an SRAM cell, especially in advanced process nodes, so the MAC array occupies a large area. There is also unconditional driving of transistors during multiply-accumulate, so computation energy efficiency is low; furthermore, high computation error rates limit the applicable scenarios.
Therefore, in the field of analog-digital hybrid in-memory computing, there is an urgent need for a small-area, energy-efficient, error-tolerant, bit-width-reconfigurable analog-digital hybrid in-memory computing module.
Summary of the Invention
In view of this, the present invention provides a sub-unit, a MAC array, and a bit-width-reconfigurable analog-digital hybrid in-memory computing module. To reduce computation error, an implementation of a differential MAC array is also provided. To achieve the above objectives, the present invention adopts the following technical solutions:
In a first aspect, an in-memory computing sub-unit for 1-bit multiplication is provided, comprising: one traditional 6T SRAM cell, one complementary transmission gate, one first N-type MOS transistor, and one calculation capacitor.
The traditional 6T SRAM cell is composed of MOS transistors M1, M2, M3, M4, M5 and M6, in which the CMOS inverter formed by M1 and M2 is cross-coupled with the CMOS inverter formed by M3 and M4; the two cross-coupled CMOS inverters store a 1-bit filter parameter, and M5 and M6 are the control switches of the bit lines used for reading and writing the filter parameter.
The output terminal of the CMOS inverter formed by M1 and M2 in the traditional 6T SRAM cell is connected to the input terminal of the complementary transmission gate, and the output terminal of the complementary transmission gate is connected to the drain of the first N-type MOS transistor.
The source of the first N-type MOS transistor is grounded, and its drain is connected to the bottom plate of the calculation capacitor.
During operation, the gate of the N-type MOS transistor of the sub-unit's complementary transmission gate is connected to the input signal, and the gate of its P-type MOS transistor is at the same level during operation as the input signal at the gate of the first N-type MOS transistor.
The product of the input signal and the filter parameter is stored as the bottom-plate voltage of the calculation capacitor. Multiple sub-units form one calculation unit, and every sub-unit within the same calculation unit shares the same first N-type MOS transistor and calculation capacitor.
In this scheme, the filter parameter/weight w is written into and stored in the SRAM cell; the input signal A is connected to the gate of the N-type MOS transistor of the complementary transmission gate, the gate of the P-type MOS transistor is connected to the complementary signal nA, and the gate of the first N-type MOS transistor is connected to signal B. In particular, for a single computing sub-unit, the level of the complementary input signal nA during computation is the same as that of signal B. This topology avoids unconditional driving of the complementary transmission gate and improves energy efficiency. For example, when B = 0, nA = 0, A = 1 and w = 1, the branch connecting the calculation capacitor to the first N-type MOS transistor is off while the branch connecting the complementary transmission gate to the calculation capacitor is on, and the product of filter parameter w and input signal A is stored as the bottom-plate voltage V_btm of the calculation capacitor. A sub-unit completing a 1-bit multiplication (of filter parameter w and input signal A) thus needs only 9 transistors, reducing cell area. The complementary transmission gate avoids being connected to the top plate of the charge-accumulating calculation capacitor, which minimizes computation errors — in particular those caused by clock feedthrough when MOS transistors are used as switches, charge injection at turn-off, non-linear parasitic capacitance at the drain/source of the transmission-gate transistors, and transistor leakage.
In some embodiments, multiple sub-units form one calculation unit; every sub-unit within the same calculation unit shares the same first N-type MOS transistor and calculation capacitor, and the sub-units are arranged in feasible patterns such as 2×2 or 4×2. Intuitively, this scheme reduces the numbers of first N-type MOS transistors and calculation capacitors; taking a calculation unit composed of 2×2 sub-units as an example, three first N-type MOS transistors and three calculation capacitors are eliminated.
In particular, the more sub-units share the first N-type MOS transistor and the calculation capacitor, the more the required device count and area are amortized over each sub-unit: the number of transistors per sub-unit approaches 8, and the capacitor area is likewise shared.
With reference to the first aspect, in some embodiments the sub-units within a calculation unit are activated in a time-division-multiplexed manner — for example, when one sub-unit is activated, the other sub-units in the same calculation unit are deactivated. After one sub-unit has completed its computation, the filter parameters stored in the SRAM cells of the other sub-units of the same calculation unit can be used immediately for in-memory computation, with no need to move data in from outside and store it in the SRAM before computing; this greatly increases computation speed, improves throughput, and reduces energy and area consumption.
In a second aspect, a MAC array including the first aspect and its possible implementations is provided for multiply-and-accumulate operations, comprising multiple calculation units, with the top plates of the calculation capacitors of all calculation units in the same column connected to the same accumulation bus.
In this scheme, compared with a MAC array composed of independent sub-units, calculation units adopting the shared-capacitor-and-transistor mode allow the MAC array to store more neural-network parameters or activation values for the next network layer. Specifically, a calculation unit completes a 1-bit multiplication and stores the result in its calculation capacitor; the calculation units in the same column of the MAC array accumulate their respective 1-bit products through the same bus connected to the capacitor top plates, and the voltage of each accumulation bus corresponds to the accumulated sum of the multiplications of that column of the MAC array.
In addition, since one capacitor generally occupies several times the area of one 6T SRAM cell, the device-sharing approach — multiple 1-bit-multiplication sub-units sharing one transistor and one calculation capacitor for computation and storage — greatly increases storage capacity per unit area relative to designs in which each sub-unit used for 1-bit multiplication needs its own capacitor to store its multiplication result. For in-memory computing, reducing data movement between inside and outside is one of the principal ways to cut energy consumption; in this scheme, a MAC array of a given area includes more SRAM cells and can store more neural-network filter parameters at once to reduce data movement.
With reference to the second aspect, in some embodiments the MAC array further includes differential complementary transmission gates, differential calculation capacitors, and a first P-type MOS transistor. Within each calculation unit of the MAC array, the output terminal of the CMOS inverter formed by M3 and M4 in each traditional 6T SRAM cell is connected to the input terminal of a differential complementary transmission gate, and the output terminals of all the differential complementary transmission gates connected to the M3/M4 inverters are connected to the drain of the same first P-type MOS transistor; the drain of the first P-type MOS transistor is connected to the bottom plate of the differential calculation capacitor, and its source is connected to VDD. The differential multiplication result is stored as the bottom-plate voltage of the differential calculation capacitor, and the top plates of the differential calculation capacitors of every differential unit in the same column are connected to the same differential accumulation bus.
With reference to the second aspect, in other embodiments the MAC array further includes a first CMOS inverter and a differential calculation capacitor; the output terminals of all complementary transmission gates in each calculation unit of the MAC array are connected to the input terminal of the same first CMOS inverter, whose output terminal is connected to the bottom plate of a differential calculation capacitor. The differential multiplication result is stored as the bottom-plate voltage of the differential calculation capacitor, and the top plates of all differential calculation capacitors in the same column are connected to the same differential accumulation bus.
In a third aspect, a bit-width-reconfigurable analog-digital hybrid in-memory computing module is proposed, comprising:
the MAC array of the second aspect and its possible implementations, in which the column-wise accumulated multiplication results are expressed as analog voltages;
a filter/ifmap module that provides the filter parameters, or the activation values computed by the previous layer, written into and stored in the MAC array; an ifmap/filter module that provides the inputs of the MAC array, which are multiplied with the neural network's filter parameters or the previous layer's activation values; an analog-to-digital conversion module that converts the analog voltages obtained from the MAC array into digital representations; and a digital processing module that performs at least operations such as multi-bit fusion, bias, scaling, or non-linearity on the output of the analog-to-digital conversion module, producing a partial sum or an activation value directly usable as input to the next network layer.
In this scheme, the filter parameters, or the activation values computed by the previous network layer, are written into and stored in the MAC array through the filter/ifmap module, following the standard traditional 6T SRAM write process so that the SRAM of each sub-unit stores logic 1 or 0, and multiply-accumulate is performed with the inputs provided by the ifmap/filter module. In this process, the multiplication of each sub-unit's stored value by the input is a digital operation equivalent to AND, and the multiplication result is stored in the calculation capacitor. In the addition stage, since the top plates of all calculation capacitors in the same column are connected together by the accumulation bus, the charges stored in the different calculation capacitors are shared over the accumulation bus, and the column-wise accumulated multiplication result is stored as an analog voltage. The analog result is then converted into a digital representation by the analog-to-digital conversion module, and that representation is finally processed to output a partial sum or an activation value usable as input to the next network layer. MAC dominates the energy consumed in neural-network computation; in this scheme, MAC uses analog-digital hybrid computation, which can greatly reduce energy consumption, while the low-area implementation of the MAC array improves energy efficiency and computation speed. Combining different computation styles for the different stages of the whole neural-network computation exploits the complementary advantages of analog and digital circuits, achieving low power, high energy efficiency, high speed, and high precision.
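The data flow just described — in-array AND multiply, charge-sharing accumulation, analog-to-digital conversion, then digital post-processing — can be sketched end to end under ideal assumptions. All names, the 3-bit ADC, and the scale/bias values below are illustrative, and the analog stages are modeled as exact arithmetic.

```python
# Idealized end-to-end sketch of the module's data flow; illustrative only.

def mac_column(weights, inputs, vdd=1.0):
    """In-array stage: 1-bit products (AND) averaged by charge sharing."""
    n = len(weights)
    return vdd * sum(w & a for w, a in zip(weights, inputs)) / n  # analog V_top

def adc(v_top, bits, vdd=1.0):
    """Quantize the analog column voltage to a digital code."""
    return min(round(v_top / vdd * ((1 << bits) - 1)), (1 << bits) - 1)

def digital_post(code, scale=1.0, bias=0.0):
    """Digital stage: scaling, bias, and a ReLU non-linearity."""
    return max(0.0, scale * code + bias)

v = mac_column([1, 1, 0, 1], [1, 0, 1, 1])   # 2 of 4 products are 1 -> 0.5
print(digital_post(adc(v, 3), scale=0.5))    # code round(0.5*7) = 4 -> 2.0
```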
With reference to the third aspect, in one possible implementation, the analog-to-digital conversion module adopts a SAR ADC, specifically a SAR ADC with a binary-weighted capacitor array.
With reference to the third aspect and the first possible implementation, in a second implementation, the sparsity of the MAC array's input values and stored values can spare the switch sequences of some capacitors in the SAR DAC from switching, yielding higher energy efficiency and ADC conversion speed. Put differently, the bit width of each column's SAR ADC in the MAC array can be determined in real time by the sparsity of the input and stored values.
With reference to the third aspect or its first or second possible implementation, in a third possible implementation, the MAC DAC and the SAR DAC can be connected together; it should be understood that the MAC DAC refers to a column of calculation capacitors in the MAC array, i.e. all the MAC-array capacitors are connected in parallel with the capacitors of the SAR DAC.
With reference to the third aspect or its first, second, or third implementation, in a fourth possible implementation, the MAC DAC is allowed to be multiplexed as the SAR DAC through bottom-plate sampling, so that the same capacitor array performs both the MAC operation and the analog-to-digital conversion, avoiding the mismatch and precision loss caused by using different capacitor arrays for the MAC DAC in the MAC stage and the SAR DAC in the conversion stage; further, this permits a fully differential SAR ADC implementation, better solving the problem of common-mode-dependent comparator input offset voltage.
Description of the Drawings
Fig. 1 is a schematic diagram of the structure of a sub-unit for 1-bit multiplication in an embodiment of the present invention;
Fig. 2 is a truth table of the 1-bit-multiplication sub-unit in an embodiment of the present invention;
Fig. 3a is a schematic diagram of the arrangement of sub-units in a calculation unit in an embodiment of the present invention;
Fig. 3b is a schematic diagram of a calculation unit composed of multiple sub-units in an embodiment of the present invention;
Fig. 3c is a truth table of a calculation unit in operation in an embodiment of the present invention;
Fig. 4a is a schematic diagram of a MAC array in an embodiment of the present invention;
Fig. 4b is a schematic diagram of the bottom- and top-plate voltages of the calculation capacitors in an embodiment of the present invention;
Fig. 5a is a schematic diagram of the 10T structure in an embodiment of the present invention;
Fig. 5b is a schematic diagram of a calculation-unit extension in an embodiment of the present invention;
Fig. 6a is a schematic diagram of the 8T structure in an embodiment of the present invention;
Fig. 6b is a schematic diagram of a calculation-unit extension in another embodiment of the present invention;
Fig. 6c is a schematic diagram of the MAC array structure under the differential scheme in an embodiment of the present invention;
Fig. 7 is a schematic diagram of the in-memory computing module in an embodiment of the present invention;
Fig. 8 is a schematic diagram of the analog-to-digital conversion module in an embodiment of the present invention;
Fig. 9 is a schematic diagram of the analog-to-digital conversion module in another embodiment of the present invention;
Fig. 10 is a schematic diagram of the analog-to-digital conversion module in another embodiment of the present invention;
Fig. 11 is a schematic diagram of the analog-to-digital conversion module in another embodiment of the present invention;
Fig. 12 is a schematic diagram of the differential structure of the analog-to-digital conversion module in another embodiment of the present invention;
Fig. 13 is a schematic diagram of an architecture for reducing the energy consumption of analog-to-digital conversion in an embodiment of the present invention.
To make the objectives, principles, technical solutions, and advantages of the invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that, as stated in the Summary, the specific embodiments described here are intended to explain the present invention and not to limit it.
It should be particularly noted that, for conciseness of the drawings, connections or positional relationships that can be determined from the text or technical content of the specification are partially omitted, or not all positional variants are drawn; omissions not explicitly stated in this specification cannot be regarded as unexplained, and for brevity they are not enumerated one by one in the detailed description, this statement applying throughout.
As a common application scenario, the bit-width-reconfigurable analog-digital hybrid computing module provided by embodiments of the present invention can be applied in visual and acoustic DNN architectures — more specifically, for object detection, low-power acoustic feature extraction, and the like.
Taking feature extraction as an example, the data to be processed are convolved with the filters, composed of weights, in the feature extractor, and the corresponding feature maps are output. Different filters extract different features. In this process, the convolution of the data to be processed with the filters consumes the most energy, and energy waste from situations such as unconditional circuit driving must be avoided, especially when the data to be processed form a sparse matrix.
Fig. 1 is a schematic diagram of an embodiment of the sub-unit structure for 1-bit multiplication, comprising: one traditional 6T SRAM cell, one complementary transmission gate, one first N-type MOS transistor, and one calculation capacitor. The traditional 6T SRAM cell is composed of MOS transistors M1, M2, M3, M4, M5 and M6, in which the CMOS inverter formed by M1 and M2 is cross-coupled with the CMOS inverter formed by M3 and M4; the two cross-coupled CMOS inverters store a 1-bit filter parameter, and M5 and M6 are the control switches of the bit lines used for reading and writing the filter parameter.
The output terminal of the CMOS inverter formed by M1 and M2 in the traditional 6T SRAM cell is connected to the input terminal of the complementary transmission gate, and the output terminal of the complementary transmission gate is connected to the drain of the first N-type MOS transistor.
The source of the first N-type MOS transistor is grounded, and its drain is connected to the bottom plate of the calculation capacitor.
The gate of the N-type MOS transistor of the operating sub-unit's complementary transmission gate is connected to the input signal, and the gate of its P-type MOS transistor is at the same level during operation as the input signal at the gate of the first N-type MOS transistor.
The product of the input signal and the filter parameter is stored as the bottom-plate voltage of the calculation capacitor; multiple computing sub-units form one calculation unit, and every sub-unit within the same calculation unit shares the same first N-type MOS transistor and calculation capacitor.
The input signals at the N-side and P-side gates of the complementary transmission gate are A and nA respectively, and the gate signal of the first N-type MOS transistor is B. In particular, as shown in Fig. 1, for a single sub-unit, signals B and nA are identical; in some possible implementations, signals nA and B may share a node so as to provide the same level. The traditional 6T SRAM stores the written filter parameter W; writing follows the standard 6T SRAM cell write process, i.e. word line WL is set to VDD, and bit lines BL and nBL are set according to whether the value to be written is 0 or 1. Taking writing "0" as an example, word line WL is set high to VDD, turning on both M5 and M6; BL is set to 0 and nBL to VDD; the voltage at W falls through M6 to logic 0, and the voltage at nW rises through M5 to logic 1.
Optionally, the sub-unit performs a 1-bit multiplication as follows:
1. The top-plate voltage V_top of the calculation capacitor is reset to V_rst through the reset switch S_rst on the accumulation bus;
2. The gate signal B of the first N-type MOS transistor in the sub-unit is raised to VDD, turning on the first N-type MOS transistor and resetting the capacitor's bottom-plate voltage V_btm to 0, while the input signals A and nA of the sub-unit's complementary transmission gate are held at 0 and VDD respectively. After V_btm is reset to 0, S_rst is disconnected;
3. During computation, the input signals A and nA are activated; the truth table of the 1-bit multiplication when the sub-unit is activated is shown in Fig. 2;
4. After the sub-unit's multiplication completes, the bottom-plate voltage V_btm of the calculation capacitor either remains at 0 or goes to VDD; the output of the multiplication is the bottom-plate voltage of the calculation capacitor, expressed as VDD × w × A.
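A minimal sketch of the sub-unit's 1-bit multiply (the Fig. 2 truth table): logically it is an AND of the stored filter parameter w and the input A, with the result appearing as 0 or VDD on the bottom plate. The function name is illustrative.

```python
# Sketch: the sub-unit's 1-bit multiply is logically AND(w, A), with the
# result scaled to VDD on the calculation capacitor's bottom plate V_btm.

VDD = 1.0

def subunit_multiply(w, A):
    """Return the bottom-plate voltage after the multiply: VDD * w * A."""
    return VDD * (w & A)

for w in (0, 1):
    for A in (0, 1):
        print(w, A, subunit_multiply(w, A))
# Only w=1, A=1 charges the bottom plate to VDD; every other case leaves 0.
```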
It can be understood that, for one sub-unit, the structure completing a 1-bit multiplication (of filter parameter w and input signal A) needs only 9 transistors, reducing sub-unit area and improving energy efficiency. It should be understood that the first N-type MOS transistor in the sub-unit plays a control role, and the result of the 1-bit multiplication of input signal A by the filter parameter w stored in the SRAM cell is stored as the bottom-plate voltage of the calculation capacitor. For convenience of description, the structure in which the SRAM cell is connected to the complementary transmission gate contains 8 transistors and is called an 8T structure (8T sub-cell). Moreover, the computing sub-unit is an external extension of the standard traditional 6T SRAM cell: on the one hand, implementing a standardized structure is more economical in practice; on the other hand, it improves the sub-unit's extensibility. Connecting the complementary transmission gate to the bottom plate of the calculation capacitor, rather than to the top plate as in the prior art, minimizes computation errors — in particular those caused by clock feedthrough when MOS transistors are used as switches, charge injection at turn-off, non-linear parasitic capacitance at the drain/source of the transmission-gate transistors, and leakage of the transistors themselves.
To further reduce the sub-unit's device count, in some embodiments multiple sub-units form one calculation unit, arranged in feasible patterns such as 2×2 or 4×2; see Fig. 3a for the arrangement. Fig. 3b shows a calculation unit composed of four sub-units, where WL0 denotes the word line shared by sub-units a and b, WL1 the word line shared by sub-units c and d, BL0 and nBL0 the bit lines shared by sub-units a and c, BL1 and nBL1 the bit lines shared by sub-units b and d, B0 the gate signal, W0a–W0d and nW0a–nW0d the weight-storage nodes of the four sub-units, and V_btm0 the bottom-plate voltage of the first calculation unit. Each sub-unit in a calculation unit retains its own 8T structure (8T sub-cell, 8 transistors), and all sub-units share the same first N-type MOS transistor and calculation capacitor. Specifically, the output terminals of the complementary transmission gates of all sub-units in the same calculation unit are connected to the drain of the same first N-type MOS transistor, whose drain is connected to the bottom plate of one calculation capacitor; it can be understood that a calculation unit contains only one first N-type MOS transistor and one calculation capacitor, with multiple sub-units sharing them for 1-bit multiplication. Intuitively, this scheme reduces the numbers of first N-type MOS transistors and calculation capacitors; taking a calculation unit composed of 2×2 sub-units as an example, three first N-type MOS transistors and three calculation capacitors are eliminated. It should be understood that the more sub-units share the first N-type MOS transistor and the calculation capacitor, the closer the amortized transistor count per sub-unit comes to 8.
In addition, since a single capacitor generally occupies several times the area of an entire 6T SRAM — a substantial gap — the sub-unit device-sharing approach, in which multiple 1-bit-multiplication sub-units share one capacitor to store their computation results, greatly increases the storage capacity per unit area compared with approaches in which each independent sub-unit used for 1-bit multiplication needs its own capacitor to store its multiplication result: within the same area, more filter parameters or weights can be stored at once than in the prior art.
Further, the sub-units within a calculation unit are activated in a time-division-multiplexed manner: when one sub-unit is activated, the other sub-units in the same calculation unit are deactivated. An activated sub-unit performs the 1-bit multiplication described above; see Fig. 3c for the unit's computation truth table. Specifically, in some embodiments the gate signals of the N-type and P-type MOS transistors of a sub-unit's complementary transmission gate are A_ij and nA_ij respectively, where i is the index of the unit column (a non-negative integer from 0 to n−1) and j is the index of the sub-unit within the unit; in a 2×2 unit, j = a, b, c, d. It can be understood that the sub-units sharing one calculation capacitor and one first N-type MOS transistor means that a calculation unit contains multiple sub-units usable for multiply-accumulate; note that, unlike a single independent sub-unit, when multiple sub-units form a calculation unit, the input signal B_i at the gate of the first N-type MOS transistor and the complementary input signal nA_ij at the P-side gate of each sub-unit's complementary transmission gate are controlled separately — under time-division multiplexing, although the nA_ij of the sub-unit active at a given moment is at the same level as B_i, the node-sharing arrangement no longer applies. It should be understood that, compared with the same number of independent sub-units, the numbers of calculation capacitors and first N-type MOS transistors needed in a calculation unit formed from n sub-units are each reduced by n−1, and the per-sub-unit structure completing a 1-bit multiplication approaches 8 transistors. Generally, owing to fabrication differences, a calculation capacitor occupies several times the area of an SRAM cell, so reducing the number of calculation capacitors per unit area increases the storage capacity of array modules composed of such calculation units.
It should be understood that if a sub-unit internally adopts a structure other than the traditional 6T SRAM but performs the same function of writing and reading a 1-bit filter parameter, the described device-sharing mode applies equally.
In a second aspect, a MAC array including the sub-units of the first aspect and its possible implementations is provided for multiply-accumulate operations. Referring to Fig. 4a, the MAC array includes multiple calculation units, with the top plates of the calculation capacitors of all calculation units in the same column connected to the same accumulation bus.
In this scheme, compared with a MAC array composed of independent sub-units, calculation units using the shared-capacitor-and-transistor mode allow the MAC array to store more neural-network parameters or values computed by the previous network layer. Specifically, a calculation unit completes a 1-bit multiplication and stores the result in its calculation capacitor, and the calculation units in the same column of the MAC array accumulate their respective 1-bit products through the same accumulation bus connected to the capacitor top plates.
In addition, for in-memory computing, reducing data movement between inside and outside is a direct way to cut energy consumption. It can be understood that, in this scheme, a MAC array of a given area includes more SRAM cells and can store more filter parameters at once than the prior art; after one sub-unit's computation finishes, the filter parameters stored in the other sub-units of the same unit can be used for in-memory computation immediately, without first moving data in from outside and storing it in the SRAM, which improves throughput and reduces energy and area consumption. Generally, a calculation capacitor occupies several times the area of a traditional 6T SRAM cell, so reducing the number of capacitors in the calculation units improves module throughput and lowers energy consumption.
Referring to Fig. 4b, in particular, the top plates of all calculation capacitors in the same column are connected together by the accumulation bus, whose voltage is V_top. To be clear, multiple calculation units are distributed along a column; one calculation unit corresponds to one calculation capacitor and contains multiple sub-units as described in the first aspect or its embodiments. The parameters V_btm0–V_btm(N-1) in Fig. 4b denote the bottom-plate voltages of the first through Nth calculation units.
In some embodiments, the MAC array performs multiply-accumulate in the following "mode one":
1. The filter parameters (or activation values computed by the previous network layer) are first written into the cells following the 6T SRAM write process and stored in the sub-units;
2. The top-plate voltage V_top of the calculation capacitors is reset to V_rst through the reset switch S_rst on the accumulation bus; V_rst may be 0;
3. The signal B_i in each calculation unit is raised to VDD, the bottom-plate voltage V_btmi of each calculation capacitor is reset to 0, and the signals A_ij and nA_ij in each calculation unit are held at 0 and VDD respectively. S_rst is disconnected;
4. During computation, the signals A_ij and nA_ij are activated in a time-division-multiplexed manner; for example, when A_0a and nA_0a are activated, A_0j and nA_0j (j = b, c, d) are deactivated, i.e. held at 0 and VDD respectively. It is worth noting that, during computation, a calculation unit's B_0 is at the same level as the nA_0j of the sub-unit activated at that moment.
5. After the multiplication in every calculation unit of a column completes, the bottom-plate voltage V_btmi of each calculation capacitor either remains at 0 or goes to VDD. Charge is redistributed among the column's calculation capacitors, similar to charge redistribution among the capacitors of a SAR DAC. Ignoring parasitic capacitance and other non-idealities, the analog output voltage V_top of a column of calculation capacitors represents the accumulated result, as shown in Fig. 4b.
In other embodiments, the MAC array may operate according to the following "Method Two":

1. The filter parameters (or activations computed by the previous network layer) are written into the respective sub-cells;
2. The top-plate voltage V_top of the compute capacitors is reset to V_rst through the reset switch S_rst on the accumulation bus; S_rst maintains the connection between V_top and V_rst;
3. The signal B_i in each compute cell is raised to VDD, the bottom-plate voltage V_btmi of the compute capacitor is reset to 0, and the signals A_ij and nA_ij in each compute cell are held at 0 and VDD respectively;
4. During computation, as before, the signals A_ij and nA_ij are activated in a time-division-multiplexed manner;
5. After the multiplication in every compute cell of a column completes, each compute capacitor's bottom-plate voltage V_btmi either remains at 0 or goes to VDD. S_rst is then opened, the bottom-plate voltages V_btmi are set to 0 or VDD, and the MOS switches in each compute cell's control module run a successive-approximation algorithm to perform analog-to-digital conversion. Taking the case where all V_btmi are set to 0 as an example, the voltage V_top can be expressed as:

where W_ij denotes the filter parameter of the j-th sub-cell in the i-th compute cell.
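The expression referenced in step 5 is likewise lost in extraction. Assuming ideal charge redistribution, and taking the stated case where every V_btmi is driven back to 0 after holding its product level, a hedged reconstruction consistent with the surrounding definitions is:

```latex
V_{top} = V_{rst} - \frac{V_{DD}}{N}\sum_{i=0}^{N-1} A_{ij}\,W_{ij}
```

Here A_ij W_ij in {0, 1} is the 1-bit product of the active sub-cell j in compute cell i: each bottom plate that was at VDD pulls the now-floating top plate down by V_DD/N when it is returned to 0.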
In particular, the MAC array can be used for multi-bit-weight computation. Each column of compute cells performs a bitwise MAC operation, and the multi-bit-weight output is obtained by shifting and adding the digital representations after analog-to-digital conversion. For example, for a k-bit weight or filter parameter, each column performs the MAC of one bit: the first column performs the MAC of the least significant bit, bit 0, with the input signal, and the k-th column performs the MAC of the most significant bit, bit k-1. In effect, each column independently performs a MAC on one bit of a multi-bit binary weight; the MAC results of all participating columns form k elements, and finally the k digitized elements are shifted and added in the digital domain.
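The per-column bitwise MAC followed by digital shift-and-add described above can be sketched as follows (a behavioral model of the digital post-processing only; the names `col_mac` and `shift_add` are illustrative, not from the patent):

```python
# Column c of the array holds bit c of every k-bit weight; col_mac[c] is the
# digitized accumulation of (input × weight-bit-c) for that column.
def shift_add(col_mac):
    """Fuse per-bit column MAC results (index 0 = LSB) by shift-and-add."""
    return sum(result << c for c, result in enumerate(col_mac))

# Example: weights 5 (0b101) and 3 (0b011), both inputs 1.
# bit-0 column: 1*1 + 1*1 = 2; bit-1 column: 0 + 1 = 1; bit-2 column: 1 + 0 = 1
assert shift_add([2, 1, 1]) == 8  # equals 1*5 + 1*3
```

The assertion checks the model against the direct multi-bit result, illustrating why one ADC conversion per column suffices before the digital fusion step.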
To reduce computation error, a differential MAC-array architecture may be used. In some embodiments, the MAC array further comprises differential complementary transmission gates, differential compute capacitors and a first P-type MOS transistor. Within each compute cell of the MAC array, the output of the CMOS inverter formed by MOS transistors M3 and M4 of each conventional 6T SRAM cell is connected to the input of a differential complementary transmission gate, and the outputs of all the differential complementary transmission gates connected to the M3/M4 inverters within the same compute cell are connected to the drain of the same first P-type MOS transistor. The drain of the first P-type MOS transistor is connected to the bottom plate of the differential compute capacitor, and its source to VDD. The differential multiplication result is stored as the bottom-plate voltage of the differential compute capacitor, and the top plates of the differential compute capacitors of every differential cell in the same column are connected to the same differential accumulation bus. For convenience of description, referring to Fig. 5a, the structure in which the 6T SRAM is connected to both a differential complementary transmission gate and a complementary transmission gate is called a 10T structure (10T sub-cell, containing 10 transistors). Referring to Fig. 5b, the compute-cell connection structure composing the MAC array is called the first differential cell. It can be understood that the first differential cell is an extension of the compute cell and, further, of the sub-cells composing it. In this embodiment, apart from the 10T structure, some transistors and capacitors within the first differential cell are shared, specifically the first N-type MOS transistor, the first P-type MOS transistor, the differential compute capacitor and the compute capacitor, and the sub-cells within the first differential cell are likewise activated in the time-division-multiplexed manner described above.
In other embodiments of the differential MAC-array architecture, the MAC array further comprises a first CMOS inverter and differential compute capacitors: within each compute cell of the MAC array, the outputs of all complementary transmission gates are connected to the input of the same first CMOS inverter, whose output is connected to the bottom plate of a differential compute capacitor. Similarly, for convenience of description, referring to Fig. 6a, the structure in which the 6T SRAM is connected to a complementary transmission gate is called an 8T structure (8T sub-cell, containing 8 transistors). Referring to Fig. 6b, the compute-cell connection structure composing the MAC array is called the second differential cell, and the differential multiplication result is stored as the bottom-plate voltage of the differential compute capacitor. In this embodiment, apart from the 8T structure, some transistors and capacitors within the second differential cell are shared, specifically the first N-type MOS transistor, the first CMOS inverter, the differential compute capacitor and the compute capacitor, and the sub-cells within the second differential cell are likewise activated in the time-division-multiplexed manner.
It should be noted that both the first and second differential cells are extensions of the compute cell; the names here are only for convenience in describing the circuit structure. Fig. 6c is a schematic diagram of a differential MAC-array architecture built from the aforementioned first or second differential cells: the top plates of all compute capacitors in the same column are connected to the same accumulation bus, and the top plates of all differential compute capacitors to the same differential accumulation bus. The parameters V_top_p_1, V_top_p_2, V_top_n_1 and V_top_n_2 in Fig. 6c denote the voltages produced by computation in the differential compute cells.
In a third aspect, a bit-width reconfigurable mixed analog-digital computing module is provided. Referring to Fig. 7, it comprises: the MAC array of the second aspect or any possible implementation thereof, whose column-wise accumulated result after computation is expressed as an analog voltage, i.e. the capacitor top-plate voltage V_top of the above embodiments; a filter/ifmap module, which provides the filter parameters written into and stored in the MAC array, it being understood that for a neural network the stored values may also be outputs of the previous layer's computation; an ifmap/filter module, which provides the input of the MAC array, specifically the inputs of the complementary transmission gates within the compute cells, for multiply-accumulate with the stored filter parameters or previous-layer activations; an analog-to-digital conversion module, which converts the analog voltage obtained from the MAC operation into a digital representation; and a digital processing module, which performs at least multi-bit fusion, bias, scaling or nonlinear operations on the digital representation output by the analog-to-digital conversion module, outputting partial sums or activations (feature maps) directly usable by the next network layer.
It can be understood that when the module of the present application is used for neural-network MAC computation, in general, for the same area the module contains more storage cells, i.e. 6T SRAM cells, which can be preloaded with filter parameters (weights) at once. After one network layer's computation completes, the output partial sums, or the final activations (feature maps) to be used by the next layer, can immediately be MAC-computed with the preloaded filter parameters (weights) stored in the module, reducing the waiting time and power consumption of off-chip data movement. Moreover, the module's large capacity improves on-chip storage capability: for example, besides filter parameters, the storage cells can also hold the activations (feature values) output by the current layer in the MAC array.

It should be understood that, beyond the transistor- and compute-capacitor-sharing within the compute cells and MAC array of the first and second aspects, in the non-MAC-array region of the module the compute cells in fact also share some transistors involved in analog-to-digital conversion and digital processing.
In the present invention, the analog-to-digital conversion module may be a SAR ADC with a parallel-capacitor structure, converting the top-plate voltage V_top output by a column of compute cells into a digital representation. It comprises a MAC DAC, a SAR DAC, a comparator, a switch sequence and SAR logic, the SAR logic controlling the switch sequence. Compared with SAR ADCs of other types, such as resistive or hybrid resistive-capacitive structures, a parallel-capacitor SAR ADC makes fuller use of the structures already present in this invention, saving devices and reducing area. The MAC DAC is formed by the parallel-connected capacitors of one column of compute cells of the aforementioned MAC array; it should be understood that the output voltage of the MAC DAC is V_top. The SAR DAC comprises (B+1) parallel capacitors, where B = log2(N) and N is the number of capacitors in the MAC DAC: B capacitors whose capacitance halves from the most significant bit (MSB) to the least significant bit (LSB), plus one capacitor equal in value to the LSB capacitor, serving as a redundant capacitor. For example, if the number of capacitors in the MAC DAC is N = 8, then B = 3: the MSB capacitor C_{B-1} has capacitance C, the next capacitor C_{B-2} has C/2, and the LSB capacitor C_0 has C/4; from MSB to LSB the assignable fractions of the SAR DAC reference voltage are 1/2, 1/4 and 1/8, and the redundant capacitor C_U has capacitance C/4. One end of each of the B capacitors and of the redundant capacitor are connected in parallel; the other ends of the B capacitors connect to the switch sequence, while the other end of the redundant capacitor is always grounded. The free ends of the switch sequence include a VDD terminal and a ground terminal, and the SAR logic controls the switch sequence.
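The capacitor sizing above can be sketched numerically (a sketch only; the function name `sar_dac_caps` and the unit-capacitance normalization are illustrative assumptions, not from the patent):

```python
import math

def sar_dac_caps(n_mac_caps, unit=1.0):
    """Capacitances of the (B+1)-capacitor SAR DAC paired with a MAC DAC of
    n_mac_caps capacitors: B = log2(N) binary-weighted caps (MSB = unit,
    halving toward the LSB) plus one redundant cap equal to the LSB cap."""
    b = int(math.log2(n_mac_caps))
    caps = [unit / (2 ** k) for k in range(b)]  # MSB .. LSB
    caps.append(caps[-1])                       # redundant cap = LSB value
    return caps

# Worked example from the text: N = 8 -> B = 3, caps C, C/2, C/4 and a
# redundant C/4; reference fractions 1/2, 1/4, 1/8 for MSB..LSB.
print(sar_dac_caps(8))  # [1.0, 0.5, 0.25, 0.25]
```

The redundant LSB-valued capacitor keeps the total SAR DAC capacitance at 2C, so the binary-weighted taps divide the reference exactly as the text's 1/2, 1/4, 1/8 fractions require.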
In one embodiment, as in Fig. 8, the output voltage V_top of the MAC DAC serves as the comparator's positive input V+, and the output V_SAR of the SAR DAC serves as the negative input V-. The SAR logic controls the switch sequence so that the negative input V- approximates the positive input V+, and the final SAR logic output is the digital representation of V+. In particular, activation sparsity in the MAC array can spare some capacitors in the SAR DAC from switching, yielding higher energy efficiency and ADC conversion speed. For example, if it is known that after the MAC operation the number of MAC capacitors whose bottom-plate voltage V_btmi is at VDD is below 25%, that is, in a column of compute cells the 1-bit multiplications are mostly 1×0, 0×0 or 0×1 while the 1×1 cases number less than 1/4 of the column, then the switches S_{B-1} and S_{B-2} of the top two SAR DAC capacitors C_{B-1} and C_{B-2} can be thrown to the ground terminal; rather than unconditionally activating all SAR DAC capacitors for the conversion, this saves energy. It should be noted that the V+-side and V--side connections of the comparator shown in the drawings of this invention are for convenience of illustration only; in practice the V+ and V- sides may be interchanged.
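The successive-approximation principle the SAR logic implements can be sketched behaviorally as follows (an idealized model, not the patented switched-capacitor circuit; the function name and the voltage normalization are assumptions):

```python
def sar_convert(v_in, vdd, bits):
    """Idealized SAR loop: trial-add binary-weighted reference fractions,
    MSB first, keeping each fraction whose addition does not overshoot v_in.
    Returns the digital code whose DAC voltage best under-approximates v_in."""
    code = 0
    v_dac = 0.0
    for k in range(bits - 1, -1, -1):                # MSB first
        trial = v_dac + vdd / (1 << (bits - k))      # vdd/2, vdd/4, ...
        if trial <= v_in:                            # comparator decision
            v_dac = trial
            code |= 1 << k
    return code

# 3-bit conversion of 0.6*VDD: keep VDD/2, reject +VDD/4 (0.75) and
# +VDD/8 (0.625) -> code 0b100
assert sar_convert(0.6, 1.0, 3) == 0b100
```

In the Fig. 8 arrangement the same decisions are made by the comparator while the switch sequence steers SAR DAC bottom plates between VDD and ground; the sparsity shortcut in the text corresponds to forcing the top comparisons' outcomes to 0 without switching.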
In another embodiment, referring to Fig. 9, the MAC DAC and SAR DAC may be connected together, i.e. all capacitors in parallel, the resulting total voltage serving as the comparator's positive input V+; the negative input V- is V_ref, and the SAR logic controls the switch sequence so that the positive input V+ approaches V_ref. It should be noted that this embodiment applies when the MAC operation follows the aforementioned "Method One". If V_rst = 0 and circuit non-idealities are not considered, the V_ref connected to the comparator's negative input V- may be 0 or VDD/2. For example, if V_ref = 0 and the capacitors in the SAR DAC are initially connected to VDD through the switches S_0 to S_{B-1}, then the SAR operation can return V+ to 0 while producing the digital representation, which corresponds to the V_rst = 0 required by the "Method One" step in which the capacitor top-plate voltage V_top is reset to 0 through the reset switch S_rst.
In the two embodiments of Figs. 8 and 9, when the comparator's positive input V+ and negative input V- come arbitrarily close to each other, the comparator readily suffers metastability during the conversion, i.e. for a brief time it cannot resolve the difference between V+ and V-. This is because the amplitude of the analog MAC result to be quantized is discrete rather than continuous, and the discrete amplitude levels align with those of the SAR DAC. To mitigate comparator metastability, as in Fig. 10, in another embodiment, relative to Fig. 8, a half-LSB capacitor in parallel with the other capacitors is added to both the MAC DAC on the positive-input V+ side and the SAR DAC on the negative-input V- side; the other end of the V+-side half-LSB capacitor is always grounded, while the other end of the V--side half-LSB capacitor can be connected to the switch sequence. This creates a half-LSB voltage offset between the discrete analog levels of the MAC DAC and the SAR DAC, providing extra error margin. The half-LSB capacitor may be realized as two series-connected LSB capacitors to achieve good matching.
In another embodiment, the MAC DAC is allowed to be reused as the SAR DAC through bottom-plate sampling. As in Fig. 11, the comparator's positive input V+ side connects the MAC DAC and one half-LSB capacitor; the capacitors of the 1st through (N-1)-th cells of the MAC DAC and the half-LSB capacitor can each be connected to the VDD or ground terminal of the switch sequence, while the capacitor of the N-th cell may be connected to ground. The comparator's negative input V- side connects no capacitor but the voltage V_ref. In effect, the MAC DAC of this embodiment is also the SAR DAC. It should be noted that this embodiment applies when the MAC computation follows "Method Two", and normally V_ref = V_rst. After the SAR conversion completes, the comparator's positive input voltage V+ returns to V_rst, corresponding to the V_rst required by the "Method Two" step in which the capacitor top-plate voltage V_top is reset to V_rst through the reset switch S_rst. Using the same capacitor array for both the MAC operation and the analog-to-digital conversion avoids the mismatch and precision loss that arise when the MAC-phase MAC DAC and the conversion-phase SAR DAC use different capacitor arrays, and it permits a fully differential SAR ADC implementation.
Building on the embodiment of Fig. 11, in another embodiment Fig. 12 shows a differential MAC architecture that solves the problem of common-mode-dependent comparator input offset voltage. In Fig. 12, nS_0 to nS_{B-1} and S_{B-X} to nS_{B-X} all denote switches of the switch sequence. The comparator's positive input V+ side connects the MAC DAC and one extra LSB capacitor; during the analog-to-digital conversion, the capacitors of the 1st through (N-1)-th cells of the MAC DAC and the extra LSB capacitor can each be connected to the VDD or ground terminal of the switch sequence, while the capacitor of the N-th cell may be connected to the grounding switch. The comparator's negative input V- side connects the differential MAC DAC and one extra differential LSB capacitor; during the conversion, the capacitors of the 1st through (N-1)-th cells of the differential MAC DAC and the extra differential LSB capacitor can be connected to the switch sequence, and the capacitor of the N-th cell may be connected to the grounding switch. The differential MAC DAC comprises the differential compute-capacitor array of the MAC array. It should be noted that this differential MAC architecture can only be realized in combination with the aforementioned differential-structure module.
In one embodiment, the bit width of a column's SAR ADC can be decided in real time by the input data and the sparsity of the values stored in that column, so that on average the number of capacitors in the binary-weighted array that must be charged and discharged during conversion may be greatly reduced, substantially saving conversion energy. In particular, as shown in Fig. 13, the real-time bit width of the SAR ADC can be computed as ceil(log2(min(X, W) + 1)), where ceil is the ceiling function and min the minimum function; X is the number of 1s in the 1-bit input vector, with X_1 to X_m denoting the values of the 1st through m-th 1-bit inputs, computable by an adder tree; W is the number of 1s stored in one column of the compute array, with W_1 to W_m denoting the weight values stored in the 1st through m-th cells of the column, which can be computed off-chip and are already held in the SAR logic by the time the data are stored into the compute array. The min, log2 and ceil functions in the bit-width formula can be replaced by simple digital combinational logic yielding the same result.
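The real-time bit-width formula above can be sketched directly (a behavioral sketch; the function name `realtime_bitwidth` and the list-based inputs are illustrative assumptions standing in for the adder tree and the counts preloaded into the SAR logic):

```python
import math

def realtime_bitwidth(x_bits, w_bits):
    """Real-time SAR ADC bit width per the text: ceil(log2(min(X, W) + 1)),
    where X is the count of 1s in the 1-bit input vector and W the count of
    1s stored in the column."""
    x = sum(x_bits)  # in hardware: adder-tree popcount of the inputs
    w = sum(w_bits)  # in hardware: precomputed off-chip, held in SAR logic
    return math.ceil(math.log2(min(x, w) + 1))

# A 16-cell column with only 3 active inputs and 9 stored 1s needs just
# ceil(log2(3 + 1)) = 2 conversion bits instead of a full 4-bit conversion.
assert realtime_bitwidth([1, 1, 1] + [0] * 13, [1] * 9 + [0] * 7) == 2
```

Since the column MAC sum can never exceed min(X, W), higher SAR bits are provably zero and their capacitors need not switch, which is the energy saving the text describes.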
It is worth noting that in the above embodiments the included modules are divided only by functional logic, and the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for ease of mutual distinction and are not intended to limit the protection scope of the present invention.

The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within its protection scope.
Claims (14)
- An in-memory mixed analog-digital computing sub-cell, characterized in that it performs 1-bit multiplication and comprises: one conventional 6T SRAM cell, one complementary transmission gate, one first N-type MOS transistor and one compute capacitor; the conventional 6T SRAM cell consists of MOS transistors M1, M2, M3, M4, M5 and M6, wherein the CMOS inverter formed by M1 and M2 is cross-coupled with the CMOS inverter formed by M3 and M4, the two cross-coupled CMOS inverters store a 1-bit filter parameter, and M5 and M6 are control switches of the bit lines used for reading and writing the filter parameter; the output of the CMOS inverter formed by M1 and M2 in the conventional 6T SRAM cell is connected to the input of the complementary transmission gate, and the output of the complementary transmission gate is connected to the drain of the first N-type MOS transistor; the source of the first N-type MOS transistor is grounded and its drain is connected to the bottom plate of the compute capacitor; the gate of the N-type MOS transistor of the complementary transmission gate is connected to an input signal, and the gate of its P-type MOS transistor is connected to a complementary input signal that is at the same level as the gate input signal of the first N-type MOS transistor during operation; the multiplication result of the input signal and the filter parameter is stored as the bottom-plate voltage of the compute capacitor; multiple said sub-cells form one compute cell, and every sub-cell within the same compute cell shares the same first N-type MOS transistor and the same compute capacitor.
- The in-memory mixed analog-digital computing sub-cell of claim 1, characterized in that the sub-cells within the compute cell are activated in a time-division-multiplexed manner, and the complementary input signal at the P-type MOS gate of the complementary transmission gate of whichever sub-cell in the same compute cell is active at a given moment is at the same level as the signal connected to the gate of the first N-type MOS transistor.
- A MAC array comprising the in-memory mixed analog-digital computing sub-cell of claim 2, performing multiply-accumulate operations, characterized by comprising multiple compute cells, wherein the outputs of the complementary transmission gates of all sub-cells within each compute cell are connected to the bottom plate of the same compute capacitor, the top plates of the compute capacitors of all compute cells in the same column are connected to the same accumulation bus, and the voltage of each accumulation bus corresponds to the accumulated sum of the multiplication results of one column of the MAC array.
- The MAC array of claim 3, characterized in that it further comprises multiple differential compute cells, each comprising a differential complementary transmission gate, a differential compute capacitor and a first P-type MOS transistor; within each compute cell of the MAC array, the output of the CMOS inverter formed by MOS transistors M3 and M4 of each conventional 6T SRAM cell is connected to the input of one differential complementary transmission gate, and the outputs of all the differential complementary transmission gates connected to the inverters formed by M3 and M4 are connected to the drain of the same first P-type MOS transistor; the drain of the first P-type MOS transistor is connected to the bottom plate of the differential compute capacitor, and the source of the first P-type MOS transistor is connected to VDD; the differential multiplication result is stored as the bottom-plate voltage of the differential compute capacitor, and the top plates of the differential compute capacitors of every differential compute cell in the same column are connected to the same differential accumulation bus.
- The MAC array of claim 3, characterized in that it further comprises a first CMOS inverter and a differential compute capacitor; the outputs of all complementary transmission gates within each compute cell composing the MAC array are connected to the input of the same first CMOS inverter, whose output is connected to the bottom plate of one differential compute capacitor; the differential multiplication result is stored as the bottom-plate voltage of the differential compute capacitor, and the top plates of all differential compute capacitors in the same column are connected to the same differential accumulation bus.
- A bit-width reconfigurable mixed analog-digital in-memory computing module, characterized by comprising: the MAC array of any one of claims 3 to 5, wherein the column-wise accumulated multiplication results of the MAC array are expressed as analog voltages; a filter/ifmap module, providing the filter parameters, or activations computed by the previous layer, that are written into and stored in the MAC array; an ifmap/filter module, providing the input of the MAC array for multiplication with the neural-network filter parameters or the previous layer's activations; an analog-to-digital conversion module, converting the analog voltages obtained from the MAC array's computation into digital representations; and a digital processing module, performing at least multi-bit fusion, bias, scaling or nonlinear operations on the output of the analog-to-digital conversion module, the output being partial sums or activations directly usable as the next layer's input.
- The computing module of claim 6, characterized in that the analog-to-digital conversion module is a SAR ADC with a binary-weighted capacitor array, the SAR ADC comprising: a MAC DAC, formed by one column of compute capacitors of the MAC array; a SAR DAC, an array formed by multiple binary-weighted capacitors and one redundant capacitor equal in value to the LSB capacitor; a comparator; and a switch sequence and SAR logic, the SAR logic controlling the switch sequence.
- The computing module of claim 7, characterized in that the output voltage of the MAC DAC serves as the input at one end of the comparator, and the output voltage of the SAR DAC serves as the input at the other end of the comparator.
- The computing module of claim 7, characterized in that the output voltage produced by the parallel connection of the capacitors in the MAC DAC and the SAR DAC serves as the input at one end of the comparator, and a comparison voltage V_ref serves as the input at the other end of the comparator.
- The computing module of claim 8, characterized in that a half-LSB capacitor is added at each of the comparator's positive input V+ terminal and negative input V- terminal; the output voltage of the MAC DAC in parallel with one half-LSB capacitor serves as the input at one end of the comparator, and the output voltage of the SAR DAC in parallel with the other half-LSB capacitor serves as the input at the other end of the comparator.
- The computing module of claim 7, characterized in that the MAC DAC and a half-LSB capacitor are all connected to the switch sequence and reused as the SAR DAC, the output voltage of this dual-purpose DAC serving as the input at one end of the comparator, and a comparison voltage V_ref serving as the input at the other end of the comparator.
- The computing module of claim 7, characterized in that the SAR ADC further comprises a differential MAC DAC, the differential MAC DAC being formed by one column of differential compute capacitors of the MAC array.
- The computing module of claim 12, characterized in that the MAC DAC and one additionally paralleled LSB capacitor are all connected to the switch sequence and reused as the SAR DAC, the output voltage of this dual-purpose DAC serving as the input at one end of the comparator; and the differential MAC DAC and one additionally paralleled differential LSB capacitor are all connected to the switch sequence and reused as a differential SAR DAC, the output voltage of this dual-purpose differential DAC serving as the input at the other end of the comparator.
- The computing module of claim 9, characterized in that the bit width of the SAR ADC is decided in real time according to the sparsity of the input data and of the data stored in the compute array, this real-time bit width being computed as ceil(log2(min(X, W) + 1)), where ceil is the ceiling function, min the minimum function, X the number of 1s in the 1-bit input vector and W the number of 1s stored in one column of the compute array; the real-time bit-width formula is implemented equivalently in circuit by digital combinational logic.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/762,447 US11948659B2 (en) | 2020-05-18 | 2021-03-30 | Sub-cell, mac array and bit-width reconfigurable mixed-signal in-memory computing module |
EP21808967.0A EP3989445A4 (en) | 2020-05-18 | 2021-03-30 | SUB-UNIT, MAC NETWORK, BIT WIDTH RECONFIGURABLE MEMORY RECONFIGURABLE HYBRID ANALOG-DIGITAL COMPUTER MODULE |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010418649.0A CN111431536B (zh) | 2020-05-18 | 2020-05-18 | 子单元、mac阵列、位宽可重构的模数混合存内计算模组 |
CN202010418649.0 | 2020-05-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021232949A1 true WO2021232949A1 (zh) | 2021-11-25 |
Family
ID=71551188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/084022 WO2021232949A1 (zh) | 2020-05-18 | 2021-03-30 | 子单元、mac阵列、位宽可重构的模数混合存内计算模组 |
Country Status (4)
Country | Link |
---|---|
US (1) | US11948659B2 (zh) |
EP (1) | EP3989445A4 (zh) |
CN (1) | CN111431536B (zh) |
WO (1) | WO2021232949A1 (zh) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114089950A (zh) * | 2022-01-20 | 2022-02-25 | 中科南京智能技术研究院 | 一种多比特乘累加运算单元及存内计算装置 |
WO2023113906A1 (en) * | 2021-12-15 | 2023-06-22 | Microsoft Technology Licensing, Llc. | Analog mac aware dnn improvement |
WO2023146613A1 (en) * | 2022-01-31 | 2023-08-03 | Microsoft Technology Licensing, Llc. | Reduced power consumption analog or hybrid mac neural network |
TWI822313B (zh) * | 2022-09-07 | 2023-11-11 | 財團法人工業技術研究院 | 記憶體單元 |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11170292B2 (en) * | 2017-09-21 | 2021-11-09 | The Trustees Of Columbia University In The City Of New York | Static random-access memory for deep neural networks |
CN113627601B (zh) * | 2020-05-08 | 2023-12-12 | 深圳市九天睿芯科技有限公司 | 子单元、mac阵列、位宽可重构的模数混合存内计算模组 |
CN111431536B (zh) | 2020-05-18 | 2023-05-02 | 深圳市九天睿芯科技有限公司 | 子单元、mac阵列、位宽可重构的模数混合存内计算模组 |
CN111816234B (zh) * | 2020-07-30 | 2023-08-04 | 中科南京智能技术研究院 | 一种基于sram位线同或的电压累加存内计算电路 |
CN111915001B (zh) * | 2020-08-18 | 2024-04-12 | 腾讯科技(深圳)有限公司 | 卷积计算引擎、人工智能芯片以及数据处理方法 |
JP2022049312A (ja) * | 2020-09-16 | 2022-03-29 | キオクシア株式会社 | 演算システム |
CN112116937B (zh) * | 2020-09-25 | 2023-02-03 | 安徽大学 | 一种在存储器中实现乘法和或逻辑运算的sram电路结构 |
CN112133348B (zh) * | 2020-11-26 | 2021-02-12 | 中科院微电子研究所南京智能技术研究院 | 一种基于6t单元的存储单元、存储阵列和存内计算装置 |
CN112711394B (zh) * | 2021-03-26 | 2021-06-04 | 南京后摩智能科技有限公司 | 基于数字域存内计算的电路 |
CN113364462B (zh) * | 2021-04-27 | 2022-09-02 | 北京航空航天大学 | 模拟存算一体多比特精度实现结构 |
CN113488092A (zh) * | 2021-07-02 | 2021-10-08 | 上海新氦类脑智能科技有限公司 | 基于sram实现多比特权重存储与计算的电路及存储与模拟计算系统 |
CN113658628B (zh) * | 2021-07-26 | 2023-10-27 | 安徽大学 | 一种用于dram非易失存内计算的电路 |
TWI788964B (zh) * | 2021-08-20 | 2023-01-01 | 大陸商深圳市九天睿芯科技有限公司 | 子單元、mac陣列、位寬可重構的模數混合存內計算模組 |
CN113672860B (zh) * | 2021-08-25 | 2023-05-12 | 恒烁半导体(合肥)股份有限公司 | 一种正负数兼容的存内运算方法、乘加运算装置及其应用 |
CN114300012B (zh) * | 2022-03-10 | 2022-09-16 | 中科南京智能技术研究院 | 一种解耦合sram存内计算装置 |
CN114546335B (zh) * | 2022-04-25 | 2022-07-05 | 中科南京智能技术研究院 | 一种多比特输入与多比特权重乘累加的存内计算装置 |
CN114816327B (zh) * | 2022-06-24 | 2022-09-13 | 中科南京智能技术研究院 | 一种加法器及全数字存内计算装置 |
CN114913895B (zh) * | 2022-07-19 | 2022-11-01 | 中科南京智能技术研究院 | 一种实现两比特输入单比特权重的存内计算宏单元 |
CN115756388B (zh) * | 2023-01-06 | 2023-04-18 | 上海后摩智能科技有限公司 | 多模式存算一体电路、芯片及计算装置 |
CN116402106B (zh) * | 2023-06-07 | 2023-10-24 | 深圳市九天睿芯科技有限公司 | 神经网络加速方法、神经网络加速器、芯片及电子设备 |
CN117608519B (zh) * | 2024-01-24 | 2024-04-05 | 安徽大学 | 基于10t-sram的带符号乘法与乘累加运算电路 |
CN117807021B (zh) * | 2024-03-01 | 2024-05-10 | 安徽大学 | 2t-2mtj存算单元和mram存内计算电路 |
CN118248193B (zh) * | 2024-05-27 | 2024-07-30 | 安徽大学 | 基于参考电路动态匹配的高可靠性存内计算电路、芯片 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101807923A (zh) * | 2009-06-12 | 2010-08-18 | 香港应用科技研究院有限公司 | 具有二进制加权电容器采样阵列和子采样电荷分配阵列的混合模数转换器(adc) |
US20130194120A1 (en) * | 2012-01-30 | 2013-08-01 | Texas Instruments Incorporated | Robust Encoder for Folding Analog to Digital Converter |
CN110941185A (zh) * | 2019-12-20 | 2020-03-31 | 安徽大学 | 一种用于二值神经网络的双字线6tsram单元电路 |
CN111144558A (zh) * | 2020-04-03 | 2020-05-12 | 深圳市九天睿芯科技有限公司 | 基于时间可变的电流积分和电荷共享的多位卷积运算模组 |
CN111431536A (zh) * | 2020-05-18 | 2020-07-17 | 深圳市九天睿芯科技有限公司 | 子单元、mac阵列、位宽可重构的模数混合存内计算模组 |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6850103B2 (en) * | 2002-09-27 | 2005-02-01 | Texas Instruments Incorporated | Low leakage single-step latch circuit |
US7741981B1 (en) * | 2008-12-30 | 2010-06-22 | Hong Kong Applied Science And Technology Research Institute Co., Ltd. | Dual-use comparator/op amp for use as both a successive-approximation ADC and DAC |
US8164943B2 (en) * | 2009-03-30 | 2012-04-24 | Manoj Sachdev | Soft error robust storage SRAM cells and flip-flops |
US7898837B2 (en) * | 2009-07-22 | 2011-03-01 | Texas Instruments Incorporated | F-SRAM power-off operation |
JP5623877B2 (ja) | 2010-11-15 | 2014-11-12 | ルネサスエレクトロニクス株式会社 | 半導体集積回路およびその動作方法 |
CN102332921A (zh) * | 2011-07-28 | 2012-01-25 | 复旦大学 | 一种适用于自动增益控制环路的逐次逼近型模数转换器 |
CN102394102B (zh) | 2011-11-30 | 2013-09-04 | 无锡芯响电子科技有限公司 | 一种采用虚拟地结构实现的近阈值电源电压sram单元 |
US8625334B2 (en) | 2011-12-16 | 2014-01-07 | Taiwan Semiconductor Manufacturing Company, Ltd. | Memory cell |
US9830964B2 (en) * | 2012-09-10 | 2017-11-28 | Texas Instruments Incorporated | Non-volatile array wakeup and backup sequencing control |
US8854858B2 (en) * | 2013-01-30 | 2014-10-07 | Texas Instruments Incorporated | Signal level conversion in nonvolatile bitcell array |
CN107733436B (zh) * | 2017-11-07 | 2018-11-30 | 深圳锐越微技术有限公司 | N位混合结构模数转换器及包含其的集成电路芯片 |
WO2019246064A1 (en) | 2018-06-18 | 2019-12-26 | The Trustees Of Princeton University | Configurable in-memory computing engine, platform, bit cells and layouts therefore |
US10381071B1 (en) | 2018-07-30 | 2019-08-13 | National Tsing Hua University | Multi-bit computing circuit for computing-in-memory applications and computing method thereof |
CN110414677B (zh) | 2019-07-11 | 2021-09-03 | 东南大学 | 一种适用于全连接二值化神经网络的存内计算电路 |
CN110598858A (zh) | 2019-08-02 | 2019-12-20 | 北京航空航天大学 | 基于非易失性存内计算实现二值神经网络的芯片和方法 |
CN111079919B (zh) * | 2019-11-21 | 2022-05-20 | 清华大学 | 支持权重稀疏的存内计算架构及其数据输出方法 |
US11372622B2 (en) | 2020-03-06 | 2022-06-28 | Qualcomm Incorporated | Time-shared compute-in-memory bitcell |
- 2020-05-18: CN CN202010418649.0A patent/CN111431536B/zh active Active
- 2021-03-30: US US17/762,447 patent/US11948659B2/en active Active
- 2021-03-30: WO PCT/CN2021/084022 patent/WO2021232949A1/zh unknown
- 2021-03-30: EP EP21808967.0A patent/EP3989445A4/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101807923A (zh) * | 2009-06-12 | 2010-08-18 | 香港应用科技研究院有限公司 | 具有二进制加权电容器采样阵列和子采样电荷分配阵列的混合模数转换器(adc) |
US20130194120A1 (en) * | 2012-01-30 | 2013-08-01 | Texas Instruments Incorporated | Robust Encoder for Folding Analog to Digital Converter |
CN110941185A (zh) * | 2019-12-20 | 2020-03-31 | 安徽大学 | 一种用于二值神经网络的双字线6tsram单元电路 |
CN111144558A (zh) * | 2020-04-03 | 2020-05-12 | 深圳市九天睿芯科技有限公司 | 基于时间可变的电流积分和电荷共享的多位卷积运算模组 |
CN111431536A (zh) * | 2020-05-18 | 2020-07-17 | 深圳市九天睿芯科技有限公司 | 子单元、mac阵列、位宽可重构的模数混合存内计算模组 |
Non-Patent Citations (1)
Title |
---|
See also references of EP3989445A4 |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023113906A1 (en) * | 2021-12-15 | 2023-06-22 | Microsoft Technology Licensing, Llc. | Analog mac aware dnn improvement |
US11899518B2 (en) | 2021-12-15 | 2024-02-13 | Microsoft Technology Licensing, Llc | Analog MAC aware DNN improvement |
CN114089950A (zh) * | 2022-01-20 | 2022-02-25 | 中科南京智能技术研究院 | 一种多比特乘累加运算单元及存内计算装置 |
WO2023146613A1 (en) * | 2022-01-31 | 2023-08-03 | Microsoft Technology Licensing, Llc. | Reduced power consumption analog or hybrid mac neural network |
TWI822313B (zh) * | 2022-09-07 | 2023-11-11 | 財團法人工業技術研究院 | 記憶體單元 |
Also Published As
Publication number | Publication date |
---|---|
EP3989445A4 (en) | 2022-12-21 |
EP3989445A1 (en) | 2022-04-27 |
CN111431536A (zh) | 2020-07-17 |
US11948659B2 (en) | 2024-04-02 |
US20220351761A1 (en) | 2022-11-03 |
CN111431536B (zh) | 2023-05-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021232949A1 (zh) | 子单元、mac阵列、位宽可重构的模数混合存内计算模组 | |
WO2021223547A1 (zh) | 子单元、mac阵列、位宽可重构的模数混合存内计算模组 | |
Biswas et al. | Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications | |
US11551745B2 (en) | Computation in-memory architecture for analog-to-digital conversion | |
US11875244B2 (en) | Enhanced dynamic random access memory (eDRAM)-based computing-in-memory (CIM) convolutional neural network (CNN) accelerator | |
CN111816234B (zh) | 一种基于sram位线同或的电压累加存内计算电路 | |
TWI750038B (zh) | 記憶體裝置、計算裝置及計算方法 | |
US11762700B2 (en) | High-energy-efficiency binary neural network accelerator applicable to artificial intelligence internet of things | |
Mu et al. | SRAM-based in-memory computing macro featuring voltage-mode accumulator and row-by-row ADC for processing neural networks | |
Ha et al. | A 36.2 dB high SNR and PVT/leakage-robust eDRAM computing-in-memory macro with segmented BL and reference cell array | |
CN105827244B (zh) | 桥电容为整数值的电容电阻三段式逐次逼近模数转换器 | |
CN115080501A (zh) | 基于局部电容电荷共享的sram存算一体芯片 | |
CN117130978A (zh) | 基于稀疏跟踪adc的电荷域存内计算电路及其计算方法 | |
Lee et al. | A charge-sharing based 8t sram in-memory computing for edge dnn acceleration | |
Xiao et al. | A 128 Kb DAC-less 6T SRAM computing-in-memory macro with prioritized subranging ADC for AI edge applications | |
Nasrin et al. | Memory-immersed collaborative digitization for area-efficient compute-in-memory deep learning | |
US20220108742A1 (en) | Differential charge sharing for compute-in-memory (cim) cell | |
Jeong et al. | A Ternary Neural Network computing-in-Memory Processor with 16T1C Bitcell Architecture | |
TWI788964B (zh) | 子單元、mac陣列、位寬可重構的模數混合存內計算模組 | |
Fan et al. | A 3-8bit Reconfigurable Hybrid ADC Architecture with Successive-approximation and Single-slope Stages for Computing in Memory | |
Zang et al. | 282-to-607 TOPS/W, 7T-SRAM based CiM with reconfigurable column SAR ADC for neural network processing | |
Lin et al. | A reconfigurable in-SRAM computing architecture for DCNN applications | |
Li et al. | A Column-Parallel Time-Interleaved SAR/SS ADC for Computing in Memory with 2-8bit Reconfigurable Resolution | |
Do Park et al. | 10.76 TOPS/W CNN Algorithm Circuit using Processor-In-Memory with 8T-SRAM | |
US20240135989A1 (en) | Dual-six-transistor (d6t) in-memory computing (imc) accelerator supporting always-linear discharge and reducing digital steps |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21808967 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2021808967 Country of ref document: EP Effective date: 20220121 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |