WO2021197073A1 - Multi-bit convolution operation module based on time-variable current integration and charge sharing - Google Patents

Multi-bit convolution operation module based on time-variable current integration and charge sharing

Info

Publication number
WO2021197073A1
Authority
WO
WIPO (PCT)
Prior art keywords
convolution operation
bit
current
capacitor
input
Prior art date
Application number
PCT/CN2021/081322
Other languages
English (en)
French (fr)
Inventor
莫尔加多阿隆索
刘洪杰
Original Assignee
深圳市九天睿芯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市九天睿芯科技有限公司
Publication of WO2021197073A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/065: Analogue means

Definitions

  • the invention relates to an analog operation module, in particular an analog operation module for convolution operations.
  • the invention also relates to an analog computation method for convolution operations.
  • for low-SNR quantization, analog operations are more power-efficient than traditional digital operations; therefore, digital quantities are usually converted into analog quantities before computing.
  • in medium- and large-scale hardware implementations of neural networks, traditional data is stored on disk and must be fetched into memory before computation. This requires a large amount of I/O, so access to conventional memory often consumes more power than the computation itself.
  • with analog in-memory or near-memory computing, the computation is sent to the data and executed locally, which greatly increases computation speed, saves storage area, and reduces data-transfer and computation power.
  • the present invention proposes an effective method for ultra-low-power analog in-memory or near-memory computing.
  • in the cited prior art, the analog computing circuit does not handle changes in the bit weight of the multiplier or multiplicand; it is limited to 1-bit multiplication at first-level inputs and cannot be used for analog convolution of multi-bit binary numbers.
  • existing multi-bit operations are implemented by modulating the control bus in the current domain, by capacitive charge sharing, by using pulse-width modulation (PWM) to control SRAM reads and writes, by modifying the SRAM cell, or by complex digital matrix-vector processing in in-memory or near-memory computing.
  • PWM: pulse-width-modulated
  • multi-bit analog multipliers and accumulators have always been controlled by very complex digital processing.
  • traditional digital operations consume far more power than analog operations; therefore, multi-bit operations under such digital control consume a large amount of energy.
  • the purpose of the present invention is to provide a multi-bit binary convolutional analog operation module based on time-variable current integration and charge sharing with ultra-low power consumption, compact structure and fast operation speed.
  • the module supports general convolution with two or more inputs, and the binary bit width is adjustable; in particular, it can serve as the analog in-memory computing unit of a neural-network convolution unit or of hardware accelerator.
  • the present invention proposes a multi-bit convolution operation module based on time-adjustable current integration and charge sharing.
  • the module includes: at least one digital input x_i; at least one digital-to-analog converter (DAC) that converts the digital input into a current transmitted in the circuit; at least one weight w_ji, where, when the weight is expressed as a binary number, w_ji,k is the value of its k-th bit; convolution operation units (i, j, k), each multiplying one bit-weighted 1-bit binary value w_ji,k by one multi-bit binary value x_i, which together form a convolution operation array that performs the multiplication and addition of the convolution; and at least one output y_j;
  • the DAC converts the digital input x_i into the current Ix_i at the DAC's given bit width.
  • the current Ix_i is mirrored or copied into the convolution operation array.
  • the currents within the same j*k plane are identical. Multi-bit inputs are allowed, and the current can be scaled in the DAC, so that the current reaches every switch at the same time.
  • each operation unit (i, j, k) includes the current Ix_i, a switch, an integration control module, a node a_ji,k, and at least one capacitor.
  • for the weight w_ji, w_ji,k is the value of the k-th bit of its binary representation, k ∈ [1, B].
  • each bit w_ji,k corresponds to one convolution operation unit, and the convolution operation units along the k direction are arranged by bit w_ji,k from the least to the most significant bit.
  • the AND-gate output of w_ji,k and the PWM signal in the control module controls the switch: when the output is 1, the switch is closed.
  • the bit-weight change of the multiplicand or multiplier in the multiplication stage is realized in the module by using the PWM signal to control the integration time of the current in the capacitor. Units corresponding to the same bit k of different weight values w_ji have the same PWM duration; for a given weight value, the PWM duration of each bit's convolution unit is twice that of the preceding bit. Since one end of the capacitor is grounded, the voltage across the capacitor is the voltage at its upper plate. PWM control is used because it increases the flexibility of the system.
  • SRAM: static random-access memory
  • for an AND gate, the PWM signal duration refers to the duration of the high level;
  • for an OR gate, the PWM signal duration refers to the duration of the low level.
  • the voltage at node a_ji,k is the multiplication result x_i·w_ji,k·2^(k-1). Its value depends on how long the node is connected to the capacitor's upper plate, and that time is set by the weight bit w_ji,k and the PWM duration. The combined voltage of the 1*k convolution units corresponding to x_i is the result of x_i·w_ji.
  • for a given j, y_j is the voltage at the combined node obtained by connecting all nodes a_ji,k of an i*k plane. Owing to capacitor discharge, the capacitors in different operation units share charge through their connected nodes. After charge sharing, every capacitor holds the same charge, and the total charge obtained by current integration in the multiplication stage is unchanged.
  • the accumulated voltage at the combined node is the result of Σ x_i·w_ji, which completes one convolution of the kernel with the input matrix;
  • the bias b_j is converted into a fixed current I_b. This current is an additional input alongside the given currents Ix_i and is processed separately by added bias operation units.
  • the bias unit array has size j*k.
  • each bias operation unit (j, k) includes the current I_b, a switch, an integration control module, a node a_j,k, and a capacitor of value C_u.
  • the bias b_j of y_j is the accumulated voltage sum of all nodes a_j,k of the 1*k group of units.
  • a counter or clock divider generates the PWM signals from the maximum-speed clock, which speeds up capacitor integration.
  • the switch is a dummy switch, a current device, or a non-switching element.
  • the present invention also includes a multi-bit convolution analog computation method based on time-variable current integration and charge sharing, including:
  • the DAC converts the digital input x_i, at the given bit width, into the analog current Ix_i, which is transmitted in the circuit;
  • when the current Ix_i reaches the switch, an integration control module containing a logic operation is provided;
  • the inputs of the logic operation are the k-th bit w_ji,k of the weight w_ji and the PWM signal modulated according to the bit weight; along the k direction, the PWM duration doubles from the least to the most significant bit;
  • the PWM duration of the k-th bit is 2^(k-1)·τ, where τ is the clock period of the PWM signal, and the output of this logic operation controls the closing of the switch;
  • when the switch is closed, the current Ix_i is integrated into the capacitor through the node a_ji,k connected to its upper plate, producing a voltage across the capacitor. When the switch stays open, no current passes through the node a_ji,k and the capacitor voltage remains 0. The integration time is the PWM duration, and the voltage at node a_ji,k is the multiplication result x_i·w_ji,k·2^(k-1) of the convolution;
  • FIG. 1 is a schematic diagram of the circuit implementation of the multiplication stage of the convolution in an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the integration control module in an embodiment of the present invention;
  • FIG. 3 is a schematic diagram of the output implementation of the addition stage of the convolution in an embodiment of the present invention. The ADC is not shown; when y_j must be converted into a digital output, an ADC can be added before each output y_j;
  • FIG. 5 is a schematic diagram of implementing the multiplication of bias operation units added to the convolution in an embodiment of the present invention;
  • FIG. 6 is a schematic diagram of the output after adding the bias in an embodiment of the present invention.
  • the result of the convolution is the feature extraction of one neural-network layer;
  • when w_ji is expressed as a multi-bit binary number, w_ji,k is the value of its k-th bit; the convolution of two multi-bit binary numbers, Σ x_i·w_ji, is divided into two stages:
  • Multiplication stage: the input x_i is multiplied by each bit of the weight w_ji and then by that bit's weight 2^(k-1), i.e., x_i·w_ji,k·2^(k-1), where w_ji,k is 0 or 1.
  • Addition stage: the results of all multiplications in the multiplication stage are accumulated to obtain the output y_j.
  • the weight matrix formed by the weights w_ji in the multiplication stage is shared as j varies from 1 to n-m+1.
  • the present invention must handle two things: the bit-weight change when the multiplicand is multiplied by each bit of the multiplier in the multiplication stage, and the accumulation of the multiplication results in the addition stage.
  • the embodiment of the present invention proposes an operation module 10 that implements the above multi-bit convolution based on time-adjusted current integration and charge accumulation.
  • the module 10 includes:
    • at least one digital input x_i;
    • at least one digital-to-analog converter (DAC) 101 that converts the digital input into a current Ix_i transmitted in the circuit;
    • at least one weight w_ji, where, when the weight is expressed as a binary number, w_ji,k is the value of its k-th bit;
    • a convolution operation array of size i*j*k composed of multiple convolution operation units 102.
    Each convolution operation unit 102 (i, j, k) includes the current Ix_i, a switch 1021, an integration control module 103, a node a_ji,k, and a capacitor 1022 of value C_u, one end of which is grounded. The capacitor
  • in the multiplication stage, the weight bit is ANDed with the PWM signal 1031 to realize the weighted multi-bit multiplication.
  • this embodiment implements an in-memory or near-memory convolution unit based on matrix cells. This reduces the power of processes related to memory access and also makes the physical implementation of the matrix more compact.
  • the digital-to-analog converter 101 converts the digital input x_i into the analog current Ix_i at the given bit width; the DAC resolution equals the bit width of the digital input x_i.
  • the current Ix_i is mirrored or copied by the current mirror to the j*k convolution operation units 102 corresponding to the same i.
  • the current integration of the convolution operation units 102 in the j direction can be performed simultaneously.
  • along the k direction the bit position of the weight w_ji increases, and the convolution operation units 102 are arranged by bit w_ji,k from the least to the most significant bit.
  • the current Ix_i produced by the DAC can first be scaled in the DAC and then transmitted in the circuit. This keeps the current below a given threshold and reduces transmission power loss.
  • the switch 1021 may be a dummy switch, a current device, or a non-switching element such as a dummy load.
  • the integration control module 103 controls whether the switch 1021 is on or off and for how long.
  • the logic operation included in the module may be an AND gate 1033.
  • the module includes a static random-access memory (SRAM) cell 1032. Across the entire convolution operation array, the SRAM cells may be identical 6T SRAM cells or different SRAM cells. Each cell stores one binary bit w_ji,k of w_ji, with k running from the least to the most significant bit of the weight w_ji. The inputs of the AND gate 1033 are w_ji,k and the PWM signal 1031 modulated according to the bit weight. The output of the AND gate 1033 turns the switch 1021 on and off, implementing the binary multiplication stage.
  • the PWM signal 1031 at one input of the AND gate 1033 varies with the bit of the weight w_ji to which the unit corresponds, and the duration of the PWM signal 1031 of the i*j units corresponding to adjacent bits is within
  • the duration of the PWM signal 1031 is 2^(k-1)·τ, where τ is the clock period of the PWM signal 1031.
  • the duration of the PWM signal 1031 refers to the duration of the high level.
    • When the bit w_ji,k is 1 and the PWM signal 1031 is high, the AND gate 1033 outputs 1 and the switch 1021 is closed. The current Ix_i then flows through the switch 1021 into the capacitor 1022 and is integrated, so the capacitor 1022 begins to store charge.
    • When the high-level duration of the PWM signal 1031 elapses, the signal goes low and the switch opens. The current Ix_i no longer passes, and integration in the capacitor 1022 stops.
  • the logic operation of the integration control module 103 can also be an OR gate.
  • in that case, the duration of the PWM signal 1031 is the duration of the low level, and the PWM signal 1031 is ORed with w_ji,k.
  • a counter or clock divider generates the PWM signal 1031 from the maximum-speed clock, making τ as small as possible. This speeds up the integration of the capacitor 1022 and hence each step of the multiplication.
  • the PWM signal 1031 is used to control the timing because it increases the flexibility of the system.
  • when the switch 1021 is closed, the current Ix_i reaches the node a_ji,k through the switch 1021. The node a_ji,k is connected to the upper plate of the capacitor 1022, so the current Ix_i flows into the capacitor 1022.
  • in the convolution operation, the capacitor 1022 must be reset to a given DC voltage before the current Ix_i flows in, clearing the previous result.
  • one end of the capacitor 1022 is grounded, so the voltage across the capacitor 1022 is the voltage at the node a_ji,k.
  • the charge stored in the capacitor 1022 increases with integration time: while the switch 1021 is closed, the current is continuously integrated and the voltage across the capacitor 1022 gradually rises.
  • the integration time is the time for which the switch 1021 is closed.
  • for the convolution operation units corresponding to bits w_ji,k with k = 1, 2, 3, ..., the durations of the PWM signal 1031 are τ, 2τ, 4τ, .... The PWM duration for the k-th bit is 2^(k-1)·τ, and the duration of the highest-order PWM signal 1031 is 2^(B-1)·τ.
  • the voltage at node a_ji,k in each convolution operation unit 102 is the voltage across the capacitor 1022, and its value is the multiplication result x_i·w_ji,k·2^(k-1).
  • the addition stage obtains the convolution output through charge sharing.
  • the k operation units corresponding to x_1 together compute x_1·w_11. In decomposed form, the input x_1 is multiplied by each bit w_11,k of the weight w_11 and by that bit's weight 2^(k-1), i.e., x_1·w_11,k·2^(k-1), and the individual results are then added.
  • the voltage at node a_ji,k of each convolution operation unit 102 in the i*1*k array is the corresponding multiplication result.
  • when these nodes a_ji,k are short-circuited, all the capacitors in the corresponding array are connected in parallel.
  • owing to capacitor discharge, the capacitors 1022 in the short-circuited array share charge. Each capacitor 1022 then holds the same charge, and the total charge is unchanged.
  • the resulting combined-node voltage is the accumulated sum of the voltages at the multiplication-result nodes a_ji,k of the multiplication stage, which is the output y_1.
  • the convolution kernels of different windows are the same, i.e., when the convolution results of different windows are computed, the weight formed by the multiplicand (weight w_ji)
  • the other outputs y_j are obtained in the same way by short-circuiting the arrays of the other j, as shown in Equation 1:
    $$y_j = \sum_{i} x_i\, w_{ji} \qquad (1)$$
  • the output y_j is an analog signal.
  • alternatively, the output y_j is a digital signal: an analog-to-digital converter (ADC) is added, and the output y_j obtained is then digital.
  • when the convolution operation module is applied to a convolutional neural network, the digital output y_j can be fed into the convolution operation array as a digital input to perform the convolution of the second network layer.
  • adding capacitors to each group of convolution operation units 102 requires more physical area, which hinders miniaturization. Instead, an additional attenuation capacitor 105 of value C_att is connected to the combined node when that node is formed. This adjusts the range of the accumulated voltage, scaling it into the input range required by the analog-to-digital converter.
  • the convolution operation module supports unit reuse.
  • the bit width of the weight w_ji is generally fixed, i.e., the size of k is fixed.
  • the units for higher-order bits that a weight does not use do not participate in the operation.
  • if the convolution operation units 102 corresponding to these unused high-order bits remain connected to the circuit, the circuit's power consumption increases.
  • a group of cells associated with the k-th bit of the weight is reused for input x_i or input x_ii. The corresponding current is Ix_i or Ix_ii, and the corresponding voltage signal is Vgx_i or Vgx_ii.
  • a multiplexer control signal set according to bit k selects the voltage signal for each unused cell, based on the remaining unused bits of that cell. The selected voltage V'gx_i equals Vgx_i or Vgx_ii, so the cell current I'x_i for bit k equals Ix_i or Ix_ii, respectively.
  • the current can be controlled through the diode of the current mirror via the voltage V'gx_i. The DAC can be reconfigured for a given input bit width, and the ADC may be used to quantize the output y_j. The resolution of the DAC or ADC can be matched to the bit width of the corresponding output.
  • the durations of the PWM signals 1031 in the array range from τ to 2^(B-1)·τ.
  • when the duration of every PWM signal 1031 is τ, all weights are quantized as single bits, instead of each bit of the 8-bit weight w_ji being quantized as in the previous case.
  • FIGS. 5 and 6 show an embodiment that adds bias operation units 1061 when the convolution operation units 102 of the present invention are used for convolutional neural network computation.
  • a bias b is typically added to make the convolution more efficient and accurate: for a given output y_j, a binary bias b_j is added. The corresponding convolution output y_j then changes from Equation 1 to Equation 2:
    $$y_j = \sum_{i} x_i\, w_{ji} + b_j \qquad (2)$$
  • FIG. 5 shows how this additional function is added in the multiplication stage. The bias bits are quantized in the same way as the weights in FIG. 1 or FIG. 2, so the bias is implemented as a fixed current I_b, an additional input alongside the given currents Ix_i.
  • each bias operation unit 1061 (j, k) includes the current I_b, a switch 1021, a bias integration control module 1062, a node a_j,k, and a capacitor 1022 of value C_u;
  • I_b is integrated in the capacitor 1022 as in the convolution stage, with the weight w_ji replaced by b_j. The inputs of the bias AND gate in the bias integration control module 1062 are b_j,k and the PWM signal 1031 modulated by the weight of bit k, and the output of the bias AND gate controls the closing time of the switch;
  • this PWM signal 1031 is the same as the PWM signal 1031 for the weight bit w_ji,k in the convolution operation unit 102.
  • the PWM duration refers to the duration of the high level.
    • When the bit b_j,k is 1 and the PWM signal 1031 is high, the bias AND gate outputs 1 and the switch 1021 closes. The current I_b is then integrated into the capacitor 1022 through the switch, and the capacitor stores charge.
    • When the high-level duration of the PWM signal 1031 elapses, the signal goes low and the switch 1021 opens. The current I_b no longer passes, and integration in the capacitor 1022 stops.
  • after the switch 1021 opens, the capacitor 1022 accumulates no new charge, so its stored charge is what accumulated during the high level.
    • When b_j,k is 0, the bias AND gate outputs 0 and the switch 1021 stays open. The current I_b does not pass, nothing is integrated in the capacitor 1022, and the stored charge is 0.
    • As in the convolution units, the voltage across the capacitor 1022 is the result of the bias operation unit 1061 in the multiplication stage.
  • FIG. 6 shows that, in the accumulation stage, the additional capacitors 1022 must be included in charge sharing and node accumulation.
  • the k unit nodes a_j,k corresponding to a given j are short-circuited. Owing to capacitor discharge, the capacitors 1022 in the short-circuited array share charge, so each capacitor 1022 holds the same charge while the total charge is unchanged.
  • the resulting combined-node voltage is the accumulated sum of the voltages at the multiplication-result nodes a_j,k. In other words, the bias b_j of y_j is the accumulated voltage sum of all nodes a_j,k of the 1*k group of units.
  • as shown in FIG. 6, the convolution operation units and the bias operation units are physically independent. To output the biased convolution result at the end, the corresponding nodes of the convolution operation units 102 and the bias operation units 1061 can be connected. The resulting combined-node voltage is then the biased convolution result.
  • the present invention also includes a multi-bit convolution analog computation method based on time-variable current integration and charge sharing, including:
  • the digital-to-analog converter 101 converts the digital input x_i into the analog current Ix_i at the given bit width and transmits it in the circuit;
  • the logic operation is located in the integration control module 103;
  • the inputs of the logic operation are the k-th bit w_ji,k of the weight w_ji and the PWM signal 1031 modulated according to the bit weight;
  • along the k direction of the convolution operation units, the duration of the PWM signal 1031 doubles from the least to the most significant bit;
  • the duration of the k-th PWM signal 1031 is 2^(k-1)·τ, where τ is the clock period of the PWM signal;
  • the output of the logic operation controls the closing of the switch 1021;
  • the current Ix_i enters the capacitor 1022 through the node a_ji,k connected to the capacitor's upper plate and is integrated, producing a voltage across the capacitor. When the switch is open, no current passes through the node a_ji,k and the voltage across the capacitor 1022 remains 0. The integration time is the duration of the PWM signal 1031, and the voltage at node a_ji,k is the multiplication result x_i·w_ji,k·2^(k-1) of the convolution.
  • the modules are divided only by functional logic and are not limited to this division, as long as the corresponding functions can be realized. The specific names of the functional units only distinguish them from one another and do not limit the protection scope of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Analogue/Digital Conversion (AREA)
  • Complex Calculations (AREA)

Abstract

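An analog operation module (10), in particular an analog operation module (10) for convolution operations, provides a set of analog multipliers and accumulators (MACs). Current integration in capacitors implements the multiplication step in the convolution of two multi-bit binary numbers, and charge sharing between capacitors implements the addition step. In the multiplication stage, PWM signals derived from the same clock period τ set the integration time of the current in the capacitors to τ, 2τ, 4τ, ..., 2^(B-1)·τ. Each bit k of a binary multiplier with a given bit width therefore carries its own bit weight during multiplication. The approach applies to a family of multi-bit convolutions with adjustable bit width and can implement general convolutions with two or more inputs. An array of bias operation units can also be added. The method can be used as the in-memory or near-memory computing unit of a neural-network convolution unit or of hardware accelerator.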

Description

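Multi-bit convolution operation module based on time-variable current integration and charge sharing
This application claims priority to Chinese patent application No. 202010257151.0, entitled "Multi-bit convolution operation module based on time-variable current integration and charge sharing", filed with the China Patent Office on April 3, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to an analog operation module, in particular an analog operation module for convolution operations, and also relates to an analog computation method for convolution operations.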
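Background
For quantization at a low signal-to-noise ratio, analog computation is more power-efficient than conventional digital computation, so digital quantities are commonly converted to analog quantities before computing. This applies especially to neural networks. In medium- and large-scale neural-network hardware, conventional data is stored on disk and must be fetched into memory for computation. This process requires a large amount of I/O, so accessing conventional memory often consumes more power than the computation itself. Analog in-memory and near-memory computing instead sends the computation to where the data resides. This greatly increases computation speed, saves storage area, and reduces data-transfer and computation power. The present invention proposes an effective implementation of ultra-low-power analog in-memory or near-memory computing.
The recent paper "A Mixed-Signal Binarized Convolutional-Neural-Network Accelerator Integrating Dense Weight Storage and Multiplication for Reduced Data Movement", Symp. VLSI Circuits, pp. 141-142, 2018, proposed analog in-memory or near-memory computing of 1-bit binary multiplication and showed high efficiency. In that work, 1-bit weights stored in static random-access memory (SRAM) cells are convolved with mixed-signal inputs, which greatly increases computing capability and reduces storage area. Its structure focuses on how 1-bit multiplication propagates through a neural network: from the input layer to the convolution layer, then to the pooling layer, and finally to the output. However, the analog computing circuit in that background document does not address bit-weight changes of the multiplier or multiplicand. It is limited to 1-bit multiplication at first-level inputs and cannot be used for analog convolution of multi-bit binary numbers.
Very few multi-bit operations involve changing the bit weight of the multiplier or multiplicand. Examples include the following papers:
(1) "In-Memory Computation of a Machine-Learning Classifier in a Standard 6T SRAM Array", JSSC, pp. 915-924, 2017;
(2) "A 481pJ/decision 3.4M decision/s multifunctional deep in-memory inference processor using standard 6T SRAM array", arXiv:1610.07501, 2016;
(3) "A Microprocessor implemented in 65nm CMOS with Configurable and Bit-scalable Accelerator for Programmable In-memory Computing", arXiv:1811.04047, 2018;
(4) "A Twin-8T SRAM Computation-In-Memory Macro for Multiple-Bit CNN-Based Machine Learning", ISSCC, pp. 396-398, 2018;
(5) "A 42 pJ/Decision 3.12TOPS/W Robust In-Memory Machine Learning Classifier with On-Chip Training", ISSCC, pp. 490-491, 2018.
However, these multi-bit operations are all implemented in one of the following ways:
- modulating the control bus in the current domain;
- capacitive charge sharing;
- using pulse-width modulation (PWM) to control SRAM reads and writes;
- modifying the SRAM cell;
- complex digital matrix-vector processing in in-memory or near-memory computing.
In these implementations, the multi-bit analog multipliers and accumulators have always relied on very complex digital control. For quantization at low signal-to-noise ratio, however, conventional digital computation consumes far more power than analog computation. Multi-bit operations under such digital control therefore incur large energy costs.
The binarized convolution proposed in CN201910068644 performs its XOR stage by modulating the control bus inside the SRAM to change the potential. However, the technical solution and teaching of that patent require complex digital control, which places high demands on the control module and consumes excessive energy. The field therefore needs a solution that applies analog computation to low-SNR signals to achieve ultra-low power consumption.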
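Summary of the Invention
In view of this, the object of the present invention is to provide a module for multi-bit binary convolution by analog computation, based on time-variable current integration and charge sharing. The module has ultra-low power consumption, a compact structure, and high computation speed. It supports general convolution with two or more inputs and an adjustable binary bit width. It is particularly suited as the analog in-memory computing unit of a neural-network convolution unit or of hardware accelerator.
Besides these advantages, the module's matrix-cell-based implementation suits in-memory or near-memory convolution units. It reduces the power of processes related to memory access and also makes the physical implementation of the matrix more compact. To achieve the above object, the following technical solution is adopted: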
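Based on the two stages of the convolution operation, the present invention proposes a multi-bit convolution module based on time-adjustable current integration and charge sharing. The module comprises:
- at least one digital input x_i;
- at least one digital-to-analog converter (DAC) that converts the digital input into a current transmitted in the circuit;
- at least one weight w_ji, where, when the weight is expressed as a binary number, w_ji,k is the value of its k-th bit;
- convolution operation units (i, j, k), each multiplying one bit-weighted 1-bit binary value w_ji,k by one multi-bit binary value x_i; together they form a convolution operation array that performs both the multiplication and the addition of the convolution;
- at least one output y_j:
$$y_j = \sum_{i} x_i\, w_{ji} \qquad (1)$$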
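In particular, the DAC produces the current Ix_i by converting the digital input x_i at the DAC's given bit width. The current Ix_i is mirrored or copied into the convolution operation array, and the currents within the same j*k plane are identical. Multi-bit inputs are allowed, and the current can be scaled within the DAC, so that the current arrives at every switch at the same time.
In particular, the convolution operation array has size i*j*k, and each operation unit (i, j, k) comprises the current Ix_i, a switch, an integration control module, a node a_ji,k, and at least one capacitor.
In particular, the integration control module controls the integration time of the current in the capacitor. Since U = Q/C, the voltage across the capacitor varies with the current integration time. For the weight w_ji, w_ji,k is the value of the k-th bit of its binary representation, k ∈ [1, B]. Each bit w_ji,k corresponds to one convolution operation unit, and the units along the k direction are arranged by bit w_ji,k from the least to the most significant bit.
In particular, the AND-gate output of w_ji,k and the PWM signal in the control module controls the switch: when the output is 1, the switch is closed. When binary numbers are multiplied, the bit-weight change of the multiplicand or multiplier in the multiplication stage is realized by using the PWM signal to control the integration time of the current in the capacitor. Units corresponding to the same bit k of different weight values w_ji have the same PWM duration. For a given weight value, the PWM duration of each bit's convolution unit is twice that of the preceding bit. Since one end of the capacitor is grounded, the voltage across the capacitor is the voltage at its upper plate. PWM control is used because it increases the flexibility of the system.
In particular, the logic operation of the integration control module may be an AND gate or an OR gate. The module includes a static random-access memory (SRAM), built from identical 6T SRAM cells or from different SRAM cells, each storing one bit w_ji,k. The inputs of the logic operation are w_ji,k and the PWM signal modulated according to that bit's weight. The PWM signal implements the bit-weight change of the multiplication, its duration doubling from one bit to the next. For k = 1, 2, 3, ..., the PWM durations are 1τ, 2τ, 4τ, ..., so the duration for the k-th bit is 2^(k-1)·τ, where τ is the clock period of the PWM signal. The output of the logic operation controls the closing of the switch. In operation units with w_ji,k = 0, no current passes through the switch into the capacitor, and the voltage at the node above the capacitor is 0.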
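Further, when the logic operation is an AND gate, the PWM duration refers to the duration of the high level; when it is an OR gate, the PWM duration refers to the duration of the low level.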
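The switching rule can be written out as a truth table. Below is a minimal Python sketch of it, assuming ideal logic levels (1 = high). The document does not specify the OR-gate details, so for that variant the sketch assumes an active-low PWM, a stored complemented weight bit, and an active-low switch that closes when the OR output is 0. This is the De Morgan dual of the AND gate. The function names are hypothetical.

```python
def and_gate_switch(w_bit: int, pwm_high: int) -> bool:
    """AND-gate control: the switch closes only when the weight bit is 1
    and the PWM signal is high (PWM duration = high-level time)."""
    return bool(w_bit & pwm_high)


def or_gate_switch(w_bit: int, pwm_level: int) -> bool:
    """Hypothetical OR-gate variant (assumptions, not from the document):
    the PWM is active-low, the gate sees the complemented bit (1 - w_bit),
    and the switch closes when the OR output is 0."""
    or_out = (1 - w_bit) | pwm_level
    return or_out == 0


# Both variants close the switch in exactly the same cases.
for w in (0, 1):
    for active in (0, 1):            # 1 = PWM is in its "active" phase
        pwm_high = active            # AND variant: active phase is high
        pwm_low_level = 1 - active   # OR variant: active phase is low
        assert and_gate_switch(w, pwm_high) == or_gate_switch(w, pwm_low_level)
        print(w, active, and_gate_switch(w, pwm_high))
```

Only the case with weight bit 1 during the active PWM phase lets the current integrate, in either variant.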
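Further, suppose w_ji,1 = w_ji,B = 1 for the same i and j. Because the current is integrated in the two capacitors for different times, they store different amounts of charge and show different voltages: the voltage of the k = B capacitor is 2^(B-1) times that of the k = 1 capacitor.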
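The 2^(B-1) ratio can be checked with a model of one cell. Below is a minimal Python sketch, assuming an ideal constant current source, an ideal switch, a capacitor reset to 0 V, and no leakage. The function name `cell_voltage` and all numeric values (1 uA, 1 ns, 100 fF, B = 4) are illustrative assumptions, not values given in this application.

```python
def cell_voltage(i_x: float, w_bit: int, k: int, tau: float, c_u: float,
                 v_reset: float = 0.0) -> float:
    """Voltage at node a_ji,k after one PWM window.

    The switch is closed for 2**(k-1) * tau seconds only if w_bit == 1.
    With a constant current the charge is Q = I * t, and U = Q / C.
    v_reset models the DC reset level (assumed 0 V here).
    """
    t_on = w_bit * (2 ** (k - 1)) * tau
    return v_reset + i_x * t_on / c_u


# Illustrative values (assumptions, not from the patent):
I_X = 1e-6     # 1 uA input current
TAU = 1e-9     # 1 ns PWM clock period
C_U = 100e-15  # 100 fF unit capacitor
B = 4          # 4-bit weight

v1 = cell_voltage(I_X, 1, 1, TAU, C_U)   # k = 1 cell, weight bit 1
vB = cell_voltage(I_X, 1, B, TAU, C_U)   # k = B cell, weight bit 1
print(v1, vB, vB / v1)                   # ratio is 2**(B-1)
```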
特别地,节点a ji,k处的电压是x i*w ji,k*2 (k-1)乘数结果,其值由该节点与电容上极板连接时间由权重各位上的值w ji,k和PWM信号的持续时间决定;x i对应的1*k个卷积运算单元的组合电压是x i*w ji的结果。
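Further, for a given j, y_j is the voltage of the combined node obtained by connecting all nodes a_ji,k of one i*k plane. Because the connected capacitors are placed in parallel, the capacitors of the different units share charge through their connected nodes: after charge sharing every capacitor holds the same charge, while the total charge produced by current integration in the multiplication stage is unchanged. The accumulated voltage at the combined node therefore represents
$$y_j = \sum_{i=1}^{N} \sum_{k=1}^{B} x_i\, w_{ji,k}\, 2^{k-1},$$
i.e. Σ x_i·w_ji, which completes one convolution of the kernel with the input matrix. Strictly, for n equal capacitors the shared voltage equals the sum of the individual node voltages divided by n, so the combined node gives this sum up to the fixed factor 1/n.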
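Charge conservation makes the sharing step easy to check. The sketch below assumes ideal capacitors and no parasitic capacitance on the shorted node; it shows that the combined voltage is the capacitance-weighted mean of the node voltages, which for equal unit capacitors C_u is the sum divided by the number of capacitors. The function name and the example values are hypothetical.

```python
import math


def shared_voltage(voltages: list, capacitances: list) -> float:
    """Voltage after shorting the top plates: total charge divided by total capacitance."""
    if len(voltages) != len(capacitances) or not voltages:
        raise ValueError("need one capacitance per node voltage")
    total_charge = sum(v * c for v, c in zip(voltages, capacitances))
    return total_charge / sum(capacitances)


c_u = 100e-15                      # hypothetical unit capacitance (F)
node_v = [0.01, 0.0, 0.04, 0.08]   # example voltages of nodes a_ji,k (V)
v = shared_voltage(node_v, [c_u] * len(node_v))
print(v)                                           # about 0.0325 V, i.e. 0.13 V / 4
print(math.isclose(v * len(node_v), sum(node_v)))  # True
```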
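Further, when the module is used as a neural-network unit, a bias usually has to be added. In the invention, the bias b_j is converted into a fixed current I_b, an additional input alongside the given currents Ix_i, and is computed separately by additional bias units. The bias unit array has size j*k; each bias unit (j,k) comprises the current I_b, a switch, an integration control module, a node a_j,k and a capacitor of value C_u.
Further, the bias b_j of y_j is the accumulated voltage of all nodes a_j,k of the 1*k group of units.
Further, a counter or clock divider is used to generate the PWM signal from the fastest available clock, speeding up capacitor integration.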
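One way to read this is that a single counter, clocked at the fastest available rate, can drive every PWM line: bit k stays high while the count is below 2^(k-1). The sketch assumes τ equals the divided clock period, divider / f_clk; `pwm_waveforms`, `multiply_phase_time` and the 1 GHz clock in the example are hypothetical.

```python
def pwm_waveforms(bits: int) -> dict:
    """Sample every PWM line once per period tau over one multiplication phase."""
    length = 2 ** (bits - 1)
    return {
        k: [1 if count < 2 ** (k - 1) else 0 for count in range(length)]
        for k in range(1, bits + 1)
    }


def multiply_phase_time(bits: int, f_clk_hz: float, divider: int = 1) -> float:
    """Length of the multiplication phase, 2^(B-1) * tau, with tau = divider / f_clk."""
    if divider < 1:
        raise ValueError("divider must be a positive integer")
    tau = divider / f_clk_hz
    return 2 ** (bits - 1) * tau


waves = pwm_waveforms(3)
print(waves[1], waves[3])           # [1, 0, 0, 0] [1, 1, 1, 1]
print(multiply_phase_time(8, 1e9))  # about 1.28e-07 s for 8-bit weights
```

With divider = 1 the phase is as short as the clock allows; a larger divider lengthens τ. Because the MSB pulse is the longest, the phase length grows as 2^(B-1).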
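Further, to reduce kickback or transient effects on the current mirror, the switch is a dummy switch, a current device, or another non-switching element.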
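The invention also provides a multi-bit convolution analog computation method based on time-variable current integration and charge sharing, comprising:
the DAC converts the digital input x_i, at a given bit width, into an analog current Ix_i transmitted in the circuit;
when the current Ix_i reaches the switch, an integration control module containing a logic operation acts on it; the inputs of the logic operation are the k-th bit w_ji,k of the weight w_ji and a PWM signal modulated by that bit's weight; the PWM duration of the units along the k direction doubles from the least significant to the most significant bit, the PWM duration of the k-th bit being 2^(k-1)·τ, where τ is the clock period of the PWM signal; the output of the logic operation controls the closing of the switch;
after the switch closes, the current Ix_i flows through the node a_ji,k connected to the capacitor's top plate and integrates on the capacitor, producing a voltage across it; while the switch is open, no current passes through node a_ji,k and the voltage across the capacitor remains 0; the integration time is the PWM duration, and the voltage of node a_ji,k is the product x_i·w_ji,k·2^(k-1) of the convolution;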
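the nodes a_ji,k in all convolution units of one i*k plane are shorted, the capacitors of the units share charge, and the voltage of the resulting combined node gives the convolution result y_j:
$$y_j = \sum_{i=1}^{N} \sum_{k=1}^{B} x_i\, w_{ji,k}\, 2^{k-1}$$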
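The steps of the method can be strung together in a behavioral model. The sketch below assumes ideal components: a linear DAC with a fixed LSB current, instantaneous switching, constant-current integration, and charge sharing among equal unit capacitors reset to 0 V. The class `ArrayParams`, its default values and `convolve_window` are hypothetical; they only illustrate that the combined-node voltage is proportional to Σ x_i·w_ji.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayParams:
    i_lsb: float = 1e-9    # hypothetical DAC LSB current (A)
    tau: float = 1e-9      # hypothetical PWM clock period (s)
    c_u: float = 100e-15   # hypothetical unit capacitance (F)


def convolve_window(x: list, w: list, bits: int, p: ArrayParams = ArrayParams()) -> float:
    """Combined-node voltage of one output y_j for inputs x_i and unsigned weights w_ji."""
    node_voltages = []
    for xi, wi in zip(x, w):
        current = xi * p.i_lsb                            # DAC: Ix_i = x_i * I_LSB
        for k in range(1, bits + 1):
            w_bit = (wi >> (k - 1)) & 1                   # bit w_ji,k held in the SRAM cell
            t_on = w_bit * 2 ** (k - 1) * p.tau           # AND(w_ji,k, PWM_k) on-time
            node_voltages.append(current * t_on / p.c_u)  # U = Q / C = I * t / C
    # Short all nodes a_ji,k: equal capacitors settle at the mean node voltage.
    return sum(node_voltages) / len(node_voltages)


x, w, bits = [3, 7, 1, 5], [2, 0, 15, 9], 4
v = convolve_window(x, w, bits)
p = ArrayParams()
scale = p.i_lsb * p.tau / p.c_u     # volts per unit of x_i * w_ji on a single capacitor
print(v * len(x) * bits / scale)    # about 66.0, i.e. sum(x_i * w_ji)
```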
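Description of Drawings
Fig. 1 is a schematic of the circuit implementation of the multiplication stage of the convolution in an embodiment of the invention;
Fig. 2 is a schematic of the integration control module in an embodiment of the invention;
Fig. 3 is a schematic of the output implementation of the addition stage of the convolution in an embodiment of the invention (the ADC is not drawn; it can be added before each output y_j when y_j has to be converted to a digital output);
Fig. 4 is a schematic of unit reuse in an embodiment of the invention;
Fig. 5 is a schematic of the multiplication with bias units added to the convolution in an embodiment of the invention;
Fig. 6 is a schematic of the output after the bias is added in an embodiment of the invention.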
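Description of main reference numerals
Module 10
Digital-to-analog converter 101
Convolution unit 102
Integration control module 103
PWM signal 1031
Static random-access memory 1032
AND gate 1033
Switch 1021
Capacitor 1022
Multiplexer 104
Attenuation capacitor 105
Bias unit array 106
Bias unit 1061
Bias integration control module 1062
Digital input x_i
Current Ix_i
Weight w_ji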
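Detailed Description
To make the objects, principles, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments.
It should be understood that, as stated in the Summary, the specific embodiments described here serve to explain the invention, but the invention can also be implemented in ways other than those described here. Those skilled in the art can make similar generalizations without departing from the spirit of the invention, so the invention is not limited by the specific embodiments disclosed below.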
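Referring to Fig. 1, consider the following general convolution:
The input matrix consists of multi-bit binary numbers x_i, with i from 1 to N; multiple weights w_ji form the convolution kernel, also called the weight matrix, where j denotes the corresponding j-th window once i is fixed. If the inputs form an n*n input matrix and the kernel is an m*m weight matrix, then with unit stride and no padding j runs from 1 to (n-m+1)^2 (the window slides whenever n > m). The output is y_j, and all y_j together form the result of one convolution, i.e. one layer of neural-network feature extraction.
When w_ji is expressed as a multi-bit binary number, w_ji,k is the value of its k-th bit, and the convolution Σ x_i·w_ji of two multi-bit binary numbers is divided into two stages:
Multiplication stage: the input x_i is multiplied by each bit of the weight w_ji and by that bit's weight 2^(k-1), i.e. x_i·w_ji,k·2^(k-1), where w_ji,k is 0 or 1.
Addition stage: the results of all multiplications of the multiplication stage are accumulated to obtain the output y_j.
For a given kernel size, when the module is used for neural-network convolution, the weight matrix formed by the weights w_ji is shared in the multiplication stage: as j runs from 1 to (n-m+1)^2, every window applies the same m*m kernel.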
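The window indexing above can be made concrete for a single-channel 2-D convolution, assuming unit stride and no padding (the text does not fix either). The sketch flattens the n*n input into x_i, builds one weight row w_j per window from one shared m*m kernel, and checks that y_j = Σ_i x_i·w_ji matches a direct sliding-window computation; zeros in a row simply mark inputs outside window j. The function names are hypothetical.

```python
def window_weights(kernel: list, n: int) -> list:
    """Rows w_j (one per window) over the flattened n*n input, all built from one kernel."""
    m = len(kernel)
    rows = []
    for r in range(n - m + 1):
        for c in range(n - m + 1):
            row = [0] * (n * n)
            for dr in range(m):
                for dc in range(m):
                    row[(r + dr) * n + (c + dc)] = kernel[dr][dc]
            rows.append(row)
    return rows


def direct_conv(image: list, kernel: list) -> list:
    """Reference sliding-window result (cross-correlation, as CNN layers usually compute it)."""
    n, m = len(image), len(kernel)
    return [
        sum(image[r + dr][c + dc] * kernel[dr][dc] for dr in range(m) for dc in range(m))
        for r in range(n - m + 1)
        for c in range(n - m + 1)
    ]


image = [[1, 2, 0, 3], [4, 1, 2, 0], [0, 3, 1, 2], [2, 0, 4, 1]]
kernel = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
x = [v for row in image for v in row]           # flattened inputs x_i, i = 1..N
w = window_weights(kernel, n=4)                 # (4 - 3 + 1)^2 = 4 windows
y = [sum(xi * wji for xi, wji in zip(x, row)) for row in w]
print(len(w), y == direct_conv(image, kernel))  # 4 True
```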
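For this multi-bit binary convolution, the invention has to handle the change of bit weight when the multiplicand is multiplied by each bit of the multiplier in the multiplication stage, and the accumulation of the products in the addition stage.
An embodiment of the invention provides a computation module 10, based on time-adjustable current integration and charge accumulation, for the above multi-bit convolution. The module 10 comprises: at least one digital input x_i; at least one digital-to-analog converter 101 (DAC) that converts the digital input into a current Ix_i transmitted in the circuit; at least one weight w_ji, where, when the weight is expressed as a binary number, w_ji,k is the value of the k-th bit of its binary representation; a convolution array of size i*j*k formed of multiple convolution units 102, each convolution unit 102 (i,j,k) comprising the current Ix_i, a switch 1021, an integration control module 103, a node a_ji,k and a capacitor 1022 of value C_u, with one end of capacitor 1022 grounded and capacitor 1022 reset to a given DC voltage before each convolution; and at least one output y_j. The array performs both the multiplications and the additions of the convolution.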
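Multiplication stage: as shown in Fig. 1, weighted multi-bit multiplication is realized by an AND operation with the PWM signal 1031. This embodiment is a matrix-cell-based implementation of an in-memory or near-memory convolution unit, which reduces the power of memory-access-related processes and makes the physical implementation of the matrix more compact. Specifically, the digital-to-analog converter 101 converts the digital input x_i, at a given bit width, into an analog current Ix_i; the DAC resolution matches the bit width of x_i. The current Ix_i is mirrored or copied by a current mirror into the j*k convolution units 102 belonging to the same i, so for different i*k planes the current integration of the units along the j direction can proceed simultaneously. Along the k direction the bit position of the weight w_ji increases, and the corresponding units 102 are arranged by bit w_ji,k from the least significant to the most significant bit. If needed, the current Ix_i produced by the DAC can first be scaled inside the DAC before it is transmitted, keeping the current below a threshold and reducing transmission power loss. The current Ix_i then passes through the switch 1021; to reduce kickback or transient effects on the current mirror, the switch 1021 may be a dummy switch or a non-switching element such as a current device or a dummy load.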
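The DAC step and the optional current scaling can be summarized as Ix_i = s·x_i·I_LSB. The sketch assumes an ideal linear DAC; `dac_current`, `scale_for_limit`, the LSB current and the current limit are hypothetical. Because the same factor s multiplies every Ix_i, it scales all products x_i·w_ji,k·2^(k-1) equally and leaves their ratios unchanged.

```python
def dac_current(x: int, bits: int, i_lsb: float, scale: float = 1.0) -> float:
    """Ideal DAC output current for an unsigned input x of the given resolution."""
    if not 0 <= x < 2 ** bits:
        raise ValueError("input does not fit the DAC resolution")
    return scale * x * i_lsb


def scale_for_limit(bits: int, i_lsb: float, i_max: float) -> float:
    """Largest scale <= 1 that brings the full-scale current down to i_max."""
    full_scale = (2 ** bits - 1) * i_lsb
    return min(1.0, i_max / full_scale)


s = scale_for_limit(bits=8, i_lsb=1e-9, i_max=100e-9)  # cap any mirrored current near 100 nA
print(dac_current(255, 8, 1e-9, scale=s))              # about 1e-07 A at full scale
print(dac_current(5, 4, 1e-9))                         # about 5e-09 A, no scaling
```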
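The integration control module 103 controls whether the switch 1021 is on or off and for how long. For example, its logic operation may be an AND gate 1033, and the module includes a static random-access memory (SRAM) cell 1032. Across the whole convolution array the SRAM cells may be identical SRAM 6T cells or different SRAM cells; each stores one bit w_ji,k of a binary number w_ji, and the k direction runs from the least significant to the most significant bit of w_ji. The inputs of the AND gate 1033 are w_ji,k and a PWM signal 1031 modulated by that bit's weight, and the output of the AND gate 1033 switches the switch 1021, which realizes the change of bit weight when the multiplicand is multiplied by each bit of the multiplier. Specifically, the PWM signal 1031 feeding the AND gate 1033 depends on which bit of the weight w_ji its unit holds: the PWM duration of the i*j units of each bit doubles along the k direction. For example, for k = 1, 2, 3 the PWM durations are 1τ, 2τ, 4τ; the PWM signal 1031 of a higher bit lasts twice as long as that of the next lower bit, and the PWM duration of the k-th bit is 2^(k-1)·τ, where τ is the clock period of the PWM signal 1031.
In this embodiment, the duration of the PWM signal 1031 is the duration of its high level. When bit w_ji,k is 1 and the PWM signal 1031 is high, the AND gate 1033 outputs 1, the switch 1021 is closed, and the current Ix_i flows through the switch 1021 into the capacitor 1022 and integrates there, so the capacitor 1022 starts storing charge. When the high-level duration of the PWM signal 1031 has elapsed, the signal goes low, the switch opens, the current Ix_i stops flowing and stops integrating on the capacitor 1022; after the switch 1021 opens the capacitor accumulates no new charge, and the stored charge is what was accumulated during the high level. Hence, from U = Q/C, for a convolution unit 102 with w_ji,k = 1 the voltage across the capacitor 1022 is determined by the charge stored by current integration. When w_ji,k is 0, the AND gate 1033 outputs 0 whether or not the PWM signal 1031 is high, the switch 1021 stays open, the current Ix_i does not flow, no current integrates on the capacitor 1022, the stored charge is 0, and the voltage across the capacitor 1022 is 0.
On the same principle, in another embodiment the logic operation of the integration control module 103 is an OR gate; in that embodiment the duration of the PWM signal 1031 is the duration of its low level, and the PWM signal 1031 is ORed with w_ji,k. In other embodiments, a counter or clock divider generates the PWM signal 1031 from the fastest available clock, i.e. makes τ as small as possible, which speeds up integration on the capacitor 1022 and thus shortens each step of the multiplication. Controlling the integration with the PWM signal 1031 improves the flexibility of the system.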
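Specifically, when the switch 1021 is closed, the current Ix_i flows through the switch 1021 to node a_ji,k, which is connected to the top plate of the capacitor 1022, and then into the capacitor 1022. For each convolution, the capacitor 1022 must be reset to a given DC voltage before the current Ix_i flows in, clearing the previous result. Because the capacitor 1022 is grounded, the voltage across it is the voltage at node a_ji,k. Once current enters the capacitor 1022, its stored charge grows with the integration time: while the switch 1021 is closed the current keeps integrating and the voltage across the capacitor 1022 rises. The integration time is the on-time of the switch 1021.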
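For example, suppose that in the units holding the bits w_ji,k of the binary representation of weight w_ji, w_ji,1 = w_ji,2 = w_ji,3 = ... = 1 with the same i and j; for k = 1, 2, 3 the durations of the PWM signal 1031 are τ, 2τ, 4τ, the duration for the k-th bit is 2^(k-1)·τ and that for the most significant bit is 2^(B-1)·τ; and all capacitors 1022 in the convolution units 102 have the same capacitance. After the current Ix_i has integrated on each capacitor 1022 for its respective time,
$$Q = \int_{0}^{t} I_{x_i}\, dt = I_{x_i}\, t, \qquad t = 2^{k-1}\tau$$
shows that, for the same current Ix_i, the charge stored on the capacitor 1022 is proportional to the integration time of Ix_i and doubles with each higher bit: for k = 1, 2, 3 the stored charges are Q, 2Q, 4Q. Further, from U = Q/C, with equal capacitances the voltage across each capacitor 1022 is proportional to its stored charge, so the voltages are U, 2U, 4U; each higher bit gives twice the voltage of the next lower bit, and the capacitor voltage in the unit 102 for k = B is 2^(B-1) times that in the unit 102 for k = 1. This realizes the bit-weight change when each bit of the weight w_ji (the multiplier) multiplies the input x_i (the multiplicand). Note that the above is only a special case of w_ji. In general, whether w_ji,k is 0 or 1, the integration time in the corresponding unit 102 equals the duration of the PWM signal 1031; a unit with w_ji,k = 0 integrates a current of 0, while a unit with w_ji,k = 1 integrates Ix_i. The duration of the PWM signal 1031 changes only by a factor of 2 per bit and does not depend on whether w_ji,k is 0 or 1.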
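The doubling can be checked with illustrative numbers. The values below (Ix_i = 1 µA, τ = 1 ns, C_u = 100 fF) are hypothetical and chosen only for round arithmetic: Q = I·t gives 1 fC per τ, and U = Q/C gives 10 mV per τ.

```python
I_X = 1e-6     # hypothetical current Ix_i (A)
TAU = 1e-9     # hypothetical PWM clock period tau (s)
C_U = 100e-15  # hypothetical unit capacitance (F)


def bit_charge_and_voltage(k: int, w_bit: int = 1) -> tuple:
    """Charge and voltage of the unit for bit k after integrating for 2^(k-1) * tau."""
    t = 2 ** (k - 1) * TAU   # PWM duration of bit k
    q = w_bit * I_X * t      # Q = Ix_i * t for a constant current
    return q, q / C_U        # U = Q / C


for k in (1, 2, 3):
    q, u = bit_charge_and_voltage(k)
    print(f"k={k}: Q={q:.1e} C, U={u * 1e3:.0f} mV")
# k=1: Q=1.0e-15 C, U=10 mV
# k=2: Q=2.0e-15 C, U=20 mV
# k=3: Q=4.0e-15 C, U=40 mV
```

With these numbers the MSB of an 8-bit weight would reach 2^7 × 10 mV = 1.28 V, so in practice the current, τ and C_u have to be chosen together so that the largest product stays within the available voltage headroom.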
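After current integration ends, because one end of the capacitor 1022 is grounded, the voltage at node a_ji,k in each convolution unit 102 is the voltage across the capacitor 1022, and this voltage represents the product x_i·w_ji,k·2^(k-1).
Addition stage: as shown in Fig. 3, the convolution output is obtained by charge sharing. Once all convolution units 102 have completed the current integration of the multiplication stage, for j = 1 the k units corresponding to x_1 have computed x_1·w_11. Broken down, x_1·w_11 multiplies the input x_1 by each bit w_11,k of the weight w_11 and by that bit's weight 2^(k-1), i.e. x_1·w_11,k·2^(k-1), and adds the results. Likewise, the k units corresponding to x_i compute x_i·w_1i, so all i*1*k arrays for j = 1, i ∈ [1, N] complete the multiplications of one convolution window, and the voltage of node a_ji,k in each convolution unit 102 of these arrays is a product. After the multiplications, the nodes a_ji,k above all capacitors 1022 of the array for j = 1 are shorted together, which connects all capacitors of that array in parallel. Because the capacitors 1022 hold different charges, the shorted capacitors 1022 share charge: afterwards every capacitor 1022 holds the same charge, while the total charge is unchanged. The voltage of the resulting combined node represents the accumulated sum of the product voltages of the nodes a_ji,k from the multiplication stage (for equal capacitors, this sum divided by the number of connected capacitors), which is the output y_1. In another embodiment, for a convolutional neural network with a shared weight matrix, the kernels of all windows are identical: when the results of different windows are computed, the weight matrix formed by the weights w_ji is the same, which reduces the number of parameters involved. Likewise, shorting the array of each other j yields the corresponding output y_j, as in Equation 1: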
$$y_j = \sum_{i=1}^{N} \sum_{k=1}^{B} x_i\, w_{ji,k}\, 2^{k-1} \tag{1}$$
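Optionally, the output y_j is converted. After the convolution array has accumulated the analog products, the output y_j is an analog signal. When a digital y_j is required, an analog-to-digital converter (ADC) is added before the output, yielding a digital y_j. For example, when the module is used in a convolutional neural network, the digital output y_j can in turn enter the convolution array as a digital input for the convolution of the second neural-network layer. In addition, if the accumulated voltage swings beyond the ADC input range or is too high, this can be solved by adding unit capacitors C_u in the multiplication stage of Fig. 1, but then each group of convolution units 102 needs more capacitors and more physical area, which hinders miniaturization. Instead, when the combined node is formed, an additional attenuation capacitor 105 of value C_att is connected into it, which rescales the accumulated voltage so that it fits within the ADC input range. Whenever y_j is output, the attenuation capacitor 105 is used: the node a_att,j above the attenuation capacitor is connected to the original nodes a_ji,k. This solution uses the physical area of the module more efficiently.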
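The effect of the attenuation capacitor follows from the same charge conservation: adding C_att to the shared node enlarges the total capacitance without adding charge. The sketch assumes C_att is reset to the same reference level as the unit capacitors (taken as 0 V) and that all unit capacitors equal C_u; the function names and values are hypothetical.

```python
def attenuated_voltage(node_voltages: list, c_u: float, c_att: float) -> float:
    """Combined-node voltage when C_att joins the charge sharing of the unit capacitors."""
    total_charge = sum(v * c_u for v in node_voltages)
    return total_charge / (len(node_voltages) * c_u + c_att)


def c_att_for_range(node_voltages: list, c_u: float, v_adc_max: float) -> float:
    """Smallest C_att that brings the shared voltage down to v_adc_max (0 if already inside)."""
    total_charge = sum(v * c_u for v in node_voltages)
    return max(0.0, total_charge / v_adc_max - len(node_voltages) * c_u)


c_u = 100e-15
node_v = [1.2, 0.9, 0.6, 0.9]                  # example node voltages; shared value would be 0.9 V
c_att = c_att_for_range(node_v, c_u, v_adc_max=0.6)
print(c_att)                                   # about 2e-13 F, i.e. 200 fF
print(attenuated_voltage(node_v, c_u, c_att))  # about 0.6 V
```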
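The convolution module supports unit reuse. In the physical implementation of the above two-stage convolution, the bit width of the weight w_ji is generally fixed, i.e. k has a fixed size. When the binary representation of the input or of the weight w_ji uses fewer bits, the higher-bit units do not take part in the computation, and keeping those higher-bit convolution units 102 connected would increase the power consumption. A simple remedy is, when computing y_j, to disconnect the array units of the unused high bits of the binary weight w_ji and connect only the convolution units 102 that take part in computing y_j, which helps reduce power. However, this leaves unused areas, particularly when the weights w_ji computed by the physical units have few bits. Therefore, the bit widths of the input and of the weight w_ji are reconfigured, which gives flexibility for the internal quantization of matrix inputs and weights and lets the unused units be reused. The reconfiguration proceeds as follows: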
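As shown in Fig. 4, a group of units associated with the k-th bit of the weight is reused for input x_i or input x_ii, whose currents are Ix_i or Ix_ii and whose corresponding voltage signals are Vgx_i or Vgx_ii. A multiplexer controlled by bit k selects, for the units of the remaining unused bits, the voltage signal that matches those units, i.e. the selected voltage V'gx_i equals Vgx_i or Vgx_ii. The current I'x_i in the units of bit k then equals Ix_i or Ix_ii. For example, suppose a module supporting 8-bit weights w_ji exists, but only a 1-bit weight w_ji needs to be convolved; then 7 (= 8 - 1) groups of convolution units 102 are unused. These 7 groups can each receive the same input as the original x_i (i.e. I'x_i = Ix_i) and perform 7 further convolutions with 1-bit weights. When the original input x_i or weight w_ji has 5 bits, the remaining 3 groups clearly cannot repeat the original convolution; instead they can compute another weight of at most 3 bits with input Ix_ii, so that I'x_i = Ix_ii. In another reuse implementation, since each group of units along the i direction is independent, when the range of i for the given inputs x_i is small, the unused units receive no input current and consume no power; when the range of i is large but the weights w_ji have few bits, the extra x_i can be fed into convolution units 102 of unused weight bits belonging to other inputs. In other embodiments, the current can be controlled through the diode of the current mirror via the voltage V'gx_i, and the DAC (for inputs of a given bit width) and the ADC (for the quantization that may be applied to the output y_j) can be reconfigured so that the DAC or ADC resolution matches the bit width of the corresponding output.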
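The multiplexer assignment in these examples can be sketched as a simple packing rule: the physical bit columns are grouped into consecutive slots of the required weight width, and each slot is fed from one input source. The packing order (lowest columns first), the name `mux_select` and the string labels standing in for x_i and x_ii are assumptions for illustration.

```python
def mux_select(physical_bits: int, weight_bits: int, sources: list) -> list:
    """Input source chosen by the multiplexer for each physical bit column k = 1..physical_bits."""
    if not 1 <= weight_bits <= physical_bits:
        raise ValueError("weight width must lie between 1 and the physical width")
    slots = -(-physical_bits // weight_bits)  # ceiling division: full copies plus a partial one
    if len(sources) < slots:
        raise ValueError(f"need {slots} sources, got {len(sources)}")
    return [sources[(k - 1) // weight_bits] for k in range(1, physical_bits + 1)]


# 8-bit array with 1-bit weights: the original column plus 7 reused columns, all fed from x_i.
print(mux_select(8, 1, ["x_i"] * 8))
# 8-bit array with 5-bit weights: columns 6-8 host a second weight of at most 3 bits fed from x_ii.
print(mux_select(8, 5, ["x_i", "x_ii"]))
```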
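After the multiplexer has selected the matching input I'x_i, the PWM durations associated with the weight w_ji are reconfigured. The unused units of the original physical implementation carry PWM signals 1031 tied to their original bit weights; when these units are reused their bit weights change, so the durations of their PWM signals 1031 must change as well, making the multiplication of bit k belong to input x_i or input x_ii. Two extreme examples illustrate this reconfiguration. First, assume a physical implementation supporting the maximum weight bit width, k = 8, with all convolution arrays as in Fig. 1; the PWM durations of this array then range from τ to 2^(B-1)·τ. When the weight bit width is 1, however, the units of bits 2 to 8 can be reused for inputs x_i, so up to 8 inputs run in parallel; all weight PWM pulse widths, i.e. all durations of the PWM signal 1031, are then τ, and every weight is quantized to a single bit, instead of each of the 8 bits of w_ji being quantized as in the first case.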
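The PWM reconfiguration that goes with this packing can be expressed as follows: a column that now holds logical bit k' of a packed weight receives the duration 2^(k'-1)·τ instead of its original 2^(k-1)·τ. The sketch keeps the lowest-columns-first packing assumption; `reconfigured_pwm` is a hypothetical name, and durations are given in units of τ.

```python
def reconfigured_pwm(physical_bits: int, weight_bits: int) -> list:
    """PWM duration (in units of tau) of each physical bit column after packing."""
    if not 1 <= weight_bits <= physical_bits:
        raise ValueError("weight width must lie between 1 and the physical width")
    durations = []
    for k in range(1, physical_bits + 1):
        logical_bit = (k - 1) % weight_bits + 1  # bit position inside its packed weight
        durations.append(2 ** (logical_bit - 1))
    return durations


print(reconfigured_pwm(8, 8))  # [1, 2, 4, 8, 16, 32, 64, 128]: original layout, tau .. 2^(B-1) tau
print(reconfigured_pwm(8, 1))  # [1, 1, 1, 1, 1, 1, 1, 1]: every pulse lasts tau
print(reconfigured_pwm(8, 5))  # [1, 2, 4, 8, 16, 1, 2, 4]: a 3-bit weight in columns 6-8
```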
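Figs. 5 and 6 show an embodiment in which bias units 1061 are added when the convolution units 102 are used for convolutional-neural-network computation. Adding a bias b makes the convolution more effective and accurate; typically, a binary bias b_j is added to a given output y_j, and the convolution output y_j then changes from Equation 1 to Equation 2: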
$$y_j = \sum_{i=1}^{N} \sum_{k=1}^{B} x_i\, w_{ji,k}\, 2^{k-1} + b_j, \qquad b_j = \sum_{k=1}^{B} b_{j,k}\, 2^{k-1} \tag{2}$$
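Fig. 5 shows how this extra function is added in the multiplication stage. Because the bias bits are quantized in the same way as the weights in Fig. 1 or Fig. 2, the bias is implemented as a fixed current I_b, an additional input alongside the given currents Ix_i.
In the invention, the bias b_j is converted into the fixed current I_b and computed separately by additional bias units 1061. The bias units 1061 form a bias array 106 of size j*k; each bias unit 1061 (j,k) comprises the current I_b, a switch 1021, a bias integration control module 1062, a node a_j,k and a capacitor 1022 of value C_u. The current I_b integrates on the capacitor 1022 as in the convolution stage, with the weight w_ji replaced by b_j: the inputs of the bias AND gate in the bias integration control module 1062 are b_j,k and the PWM signal 1031 modulated by the weight of bit b_j,k, and the output of the bias AND gate controls how long the switch 1021 stays closed, so the current integrates on the capacitor 1022 of bias unit (j,k) 1061 for b_j,k·2^(k-1)·τ. Bias units 1061 with the same k receive the same PWM signal 1031 as weight bit w_ji,k in the convolution units 102. In this embodiment, the duration of the PWM signal 1031 is the duration of its high level. When bit b_j,k is 1 and the PWM signal 1031 is high, the bias AND gate outputs 1, the switch 1021 is closed, and the current I_b flows through the switch into the capacitor 1022 and integrates, so the capacitor stores charge. When the high-level duration of the PWM signal 1031 has elapsed, the signal goes low, the switch 1021 opens, I_b stops flowing and stops integrating on the capacitor 1022; after the switch 1021 opens the capacitor accumulates no new charge, and the stored charge is what was accumulated during the high level. When b_j,k is 0, the bias AND gate outputs 0, the switch 1021 stays open, I_b does not flow, no current integrates on the capacitor 1022, and the stored charge is 0. As before, the voltage across the capacitor 1022 is the result of the multiplication stage of the bias unit 1061.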
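Fig. 6 shows that in the accumulation stage, additional capacitors 1022 take part in the charge sharing and node accumulation.
Likewise, the k unit nodes a_j,k of a given j are shorted, the shorted capacitors 1022 share charge, every capacitor 1022 ends up with the same charge while the total charge is unchanged, and the voltage of the resulting combined node corresponds to the accumulated sum of the product voltages of the nodes a_j,k from the multiplication stage; that is, the bias of y_j is the accumulated voltage of all nodes a_j,k of the 1*k group. As shown in Fig. 6, the convolution units and the bias units are physically independent, but when the final biased convolution result is output, the corresponding nodes of the convolution units 102 and the bias units 1061 can be connected, and the voltage of the resulting combined node is the biased convolution result.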
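The biased output can be modeled by adding the B bias capacitors to the same charge-sharing node. The sketch assumes that I_b equals the DAC LSB current, so that the bias counts in the same units as x_i·w_ji (the text only says that I_b is a fixed current), that b_j has the same bit width as the weights, and that all capacitors equal C_u; `biased_output` and its default values are hypothetical.

```python
def biased_output(x: list, w: list, b: int, bits: int,
                  i_lsb: float = 1e-9, tau: float = 1e-9, c_u: float = 100e-15) -> float:
    """Combined-node voltage after shorting the N*B convolution nodes and the B bias nodes."""
    i_b = i_lsb                                # assumption: bias current equals one DAC LSB
    volts = []
    for xi, wi in zip(x, w):                   # convolution units (i, j, k)
        for k in range(1, bits + 1):
            t_on = ((wi >> (k - 1)) & 1) * 2 ** (k - 1) * tau
            volts.append(xi * i_lsb * t_on / c_u)
    for k in range(1, bits + 1):               # bias units (j, k): on-time b_j,k * 2^(k-1) * tau
        t_on = ((b >> (k - 1)) & 1) * 2 ** (k - 1) * tau
        volts.append(i_b * t_on / c_u)
    return sum(volts) / len(volts)             # equal capacitors settle at the mean


x, w, b, bits = [3, 7, 1, 5], [2, 0, 15, 9], 6, 4
v = biased_output(x, w, b, bits)
n_caps = len(x) * bits + bits                  # 16 convolution + 4 bias capacitors
print(v * n_caps / (1e-9 * 1e-9 / 100e-15))    # about 72.0 = 66 + 6
```

Connecting the bias capacitors adds B capacitors to the shared node, so in this model the scale factor changes from 1/(N·B) to 1/(N·B + B), which the readout has to account for.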
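The invention also provides a multi-bit convolution analog computation method based on time-variable current integration and charge sharing, comprising:
The digital-to-analog converter 101 converts the digital input x_i, at a given bit width, into an analog current Ix_i transmitted in the circuit.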
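When the current Ix_i reaches the switch, a logic operation is performed in the integration control module 103; its inputs are the k-th bit w_ji,k of the weight w_ji and the PWM signal 1031 modulated by that bit's weight. The duration of the PWM signal 1031 of the units along the k direction doubles from the least significant to the most significant bit, the duration for the k-th bit being 2^(k-1)·τ, where τ is the clock period of the PWM signal, and the output of the logic operation controls the closing of the switch 1021. After the switch 1021 closes, the current Ix_i flows through node a_ji,k, which is connected to the capacitor's top plate, and integrates on the capacitor 1022, producing a voltage across the capacitor; while the switch is open, no current passes through node a_ji,k and the voltage across the capacitor 1022 remains 0. The integration time is the duration of the PWM signal 1031, and the voltage of node a_ji,k is the product x_i·w_ji,k·2^(k-1) of the convolution. The nodes a_ji,k in all convolution units 102 of one i*k plane are shorted, the capacitors 1022 of the convolution units 102 share charge, and the voltage of the resulting combined node gives the convolution result y_j:
$$y_j = \sum_{i=1}^{N} \sum_{k=1}^{B} x_i\, w_{ji,k}\, 2^{k-1}$$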
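It should be noted that the modules in the above embodiments are divided only by functional logic and are not limited to that division, as long as the corresponding functions can be realized; moreover, the specific names of the functional units serve only to distinguish them from one another and do not limit the scope of protection of the invention.
The above are only preferred embodiments of the invention and do not limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the invention shall fall within its scope of protection.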

Claims (16)

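  1. A multi-bit convolution operation module based on time-variable current integration and charge sharing, characterized by comprising:
    at least one digital input x_i, at least one digital-to-analog converter (DAC), at least one binary weight w_ji, a convolution array formed of multiple convolution units, and at least one output y_j;
    the digital input x_i is converted by the DAC, at a given bit width, into an analog current Ix_i transmitted in the circuit;
    for the binary weight w_ji, j indicates that it is a weight of the j-th window; w_ji,k is the value of the k-th bit of w_ji, w_ji,k is 0 or 1, and k ∈ [1, B], where B denotes the most significant bit; each bit w_ji,k corresponds to one convolution unit;
    the convolution array has size i*j*k, where the i direction is the input direction, the j direction is the convolution-window direction, and along the k direction the convolution units are arranged by the bits w_ji,k of the weight w_ji from the least significant to the most significant bit;
    each convolution unit comprises the input current Ix_i, a switch, an integration control module, a node a_ji,k and at least one capacitor, one end of the capacitor being grounded;
    the integration control module is a given logic operation whose inputs are w_ji,k and a PWM signal modulated by the bit weight of w_ji,k; the PWM duration of the convolution units along the k direction doubles from the least significant to the most significant bit, the PWM duration of the k-th bit being 2^(k-1)·τ, where τ is the clock period of the PWM signal; the output of the integration control module controls the closing of the switch;
    when the switch is closed, the current Ix_i flows through the node a_ji,k connected to the top plate of the capacitor and integrates on the capacitor; when the switch is open, the current Ix_i does not pass through node a_ji,k; the integration time is the PWM duration, and the voltage of node a_ji,k is the product x_i·w_ji,k·2^(k-1) of the convolution;
    y_j is the voltage of the combined node obtained by shorting the nodes a_ji,k in all convolution units of one i*k plane so that the capacitors of the convolution units share charge, this voltage being the result of the convolution.
  2. The module according to claim 1, characterized in that the combined voltage of the 1*k convolution units corresponding to x_i is the result of one x_i·w_ji, and the voltage of the combined node of the convolution units of one i*k plane is the result of Σ x_i·w_ji, completing one convolution of the kernel with the input matrix.
  3. The module according to claim 2, characterized in that the input x_i is a binary number of at least one bit, and the resolution of the DAC converting the input x_i is adjustable.
  4. The module according to claim 3, characterized in that the current Ix_i is mirrored or copied into the convolution array by a current mirror, the currents in the same j*k plane are identical, and the current Ix_i can be scaled in the digital-to-analog converter.
  5. The module according to claim 4, characterized in that the logic operation of the integration control module is an AND gate, one input of which is the bit w_ji,k stored in an SRAM cell and the other is a PWM signal whose duration doubles bit by bit with k, starting from τ; the output of the AND gate controls the closing of the switch; the convolution units corresponding to the same bit k of different weights w_ji receive PWM signals of the same duration, while the convolution units corresponding to different bits of the same weight w_ji receive different PWM durations, namely 2^(k-1)·τ.
  6. The module according to claim 5, characterized in that a counter or clock divider is used to generate the fastest PWM clock signal, speeding up capacitor integration.
  7. The module according to any one of claims 1 to 6, characterized in that the switch in the convolution unit is a dummy switch or a current device, reducing kickback or transient effects on the current mirror.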
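  8. The module according to claim 7, characterized in that the bit widths of the digital input x_i and the weight w_ji can be reconfigured for re-inputting the digital input x_i or a new input x_ii, comprising:
    a multiplexer receives the re-input x_i and x_ii and, according to the convolution units corresponding to the remaining unused bits of the weight w_ji, selects the input voltage signal that matches the unused units, the output voltage signal entering the convolution units;
    the PWM durations corresponding to the bit weights in the unused convolution units used for reuse are reconfigured.
  9. The module according to claim 8, characterized in that, in the reuse stage, the bit width of at least one multiplexer matches the bit width of the weight encoding, and the output of the multiplexer is controlled by the weight bit k.
  10. The module according to claim 9, characterized in that the convolution array further comprises a bias module, the bias module comprising:
    a bias unit array formed of multiple bias units, the bias unit array having size j*k, each bias unit (j,k) comprising a current I_b, a switch, an integration control module, a node a_j,k and a capacitor of value C_u;
    the current I_b is a fixed current additional to the currents Ix_i;
    b_j,k is the k-th bit of the multi-bit binary bias b_j, and the current integrates on the capacitor of bias unit (j,k) for b_j,k·2^(k-1)·τ;
    in the integration control module, b_j,k and the PWM signal modulated by the bit weight of b_j,k are ANDed, and the output controls the closing of the switch, thereby controlling the integration time of the current I_b on the capacitor of the bias unit;
    the bias of y_j is the accumulated voltage of all nodes a_j,k of the 1*k group of units.
  11. The module according to claim 10, characterized in that, when the accumulated voltage swing of the combined node exceeds the input range of the analog-to-digital converter or exceeds a threshold, an attenuation capacitor is connected in parallel at the output y_j before the analog-to-digital converter to adjust the full-scale range of the accumulated voltage.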
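  12. A multi-bit convolution operation method based on time-variable current integration and charge sharing, characterized by comprising the following steps:
    a DAC converts the digital input x_i, at a given bit width, into an analog current Ix_i transmitted in the circuit, where i ∈ [1, N] and N is a positive integer;
    when the current Ix_i reaches the switch, a logic operation is performed whose inputs are the k-th bit w_ji,k of the weight w_ji and a PWM signal modulated by the bit weight of w_ji,k, where j denotes the corresponding j-th window once i is fixed; the PWM duration of the convolution units along the k direction doubles from the least significant to the most significant bit, the PWM duration of the k-th bit being 2^(k-1)·τ, where τ is the clock period of the PWM signal; the output of the logic operation controls the closing of the switch;
    after the switch closes, the current Ix_i flows through the node a_ji,k connected to the top plate of the capacitor and integrates on the capacitor, producing a voltage across the capacitor; while the switch is open, no current passes through node a_ji,k and the voltage across the capacitor remains 0; the integration time is the PWM duration, and the voltage of node a_ji,k is the product x_i·w_ji,k·2^(k-1) of the convolution;
    the nodes a_ji,k in all convolution units of one i*k plane are shorted, the capacitors of the convolution units share charge, and the voltage of the resulting combined node is the convolution result
    $$y_j = \sum_{i=1}^{N} \sum_{k=1}^{B} x_i\, w_{ji,k}\, 2^{k-1},$$
    where B is the most significant bit of w_ji.
  13. The method according to claim 12, characterized in that the resolution of the DAC is adjusted before the DAC converts the digital input x_i.
  14. The method according to claim 13, characterized in that, before the logic operation is performed, a counter or clock divider is used to generate the fastest PWM clock signal, speeding up current integration.
  15. The method according to claim 14, characterized in that, after x_i has been input once, unused convolution units are reused, comprising:
    using a multiplexer to receive the re-input x_i and x_ii and, according to the convolution units corresponding to the remaining unused bits of the weight w_ji, selecting the input voltage signal that matches the unused units, the output voltage signal entering the convolution units; and, after the input voltage signal has been selected, reconfiguring the PWM durations corresponding to the bit weights in the unused convolution units.
  16. The method according to claim 15, characterized in that, before the output y_j is connected to the ADC, an attenuation capacitor is connected in parallel to adjust the full-scale range of the accumulated voltage so that the accumulated voltage swing of the combined node falls within the input range of the analog-to-digital converter.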
PCT/CN2021/081322 2020-04-03 2021-03-17 Multi-bit convolution operation module based on time-variable current integration and charge sharing WO2021197073A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010257151.0A CN111144558B (zh) 2020-04-03 2020-04-03 Multi-bit convolution operation module based on time-variable current integration and charge sharing
CN202010257151.0 2020-04-03

Publications (1)

Publication Number Publication Date
WO2021197073A1 true WO2021197073A1 (zh) 2021-10-07

Family

ID=70528805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/081322 WO2021197073A1 (zh) 2020-04-03 2021-03-17 Multi-bit convolution operation module based on time-variable current integration and charge sharing

Country Status (2)

Country Link
CN (1) CN111144558B (zh)
WO (1) WO2021197073A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11899518B2 (en) 2021-12-15 2024-02-13 Microsoft Technology Licensing, Llc Analog MAC aware DNN improvement

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
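CN111144558B (zh) * 2020-04-03 2020-08-18 深圳市九天睿芯科技有限公司 Multi-bit convolution operation module based on time-variable current integration and charge sharing
CN111431536B (zh) 2020-05-18 2023-05-02 深圳市九天睿芯科技有限公司 Sub-cell, MAC array, and bit-width reconfigurable mixed-signal in-memory computing module
CN112232501B (zh) * 2020-12-11 2021-09-28 中科南京智能技术研究院 In-memory computing device
CN113516172B (zh) * 2021-05-19 2023-05-12 电子科技大学 Image classification method based on error injection into a stochastic-computing Bayesian neural network
CN115048075A (zh) * 2022-04-27 2022-09-13 北京大学 SRAM compute-in-memory chip based on capacitive coupling
CN114723031B (zh) * 2022-05-06 2023-10-20 苏州宽温电子科技有限公司 Computing device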
US20230386565A1 (en) * 2022-05-25 2023-11-30 Stmicroelectronics International N.V. In-memory computation circuit using static random access memory (sram) array segmentation and local compute tile read based on weighted current

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5521857A (en) * 1992-12-15 1996-05-28 France Telecom Process and device for the analog convolution of images
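CN108629411A (zh) * 2018-05-07 2018-10-09 济南浪潮高新科技投资发展有限公司 Hardware implementation device and method for convolution operations
CN108805270A (zh) * 2018-05-08 2018-11-13 华中科技大学 Memory-based convolutional neural network system
CN110008440A (zh) * 2019-04-15 2019-07-12 合肥恒烁半导体有限公司 Convolution operation based on an analog matrix operation unit and its application
TW201935266A (zh) * 2018-02-12 2019-09-01 美商耐能股份有限公司 Convolution operation device and method for adjusting the convolution input of a convolutional neural network
CN111144558A (zh) * 2020-04-03 2020-05-12 深圳市九天睿芯科技有限公司 Multi-bit convolution operation module based on time-variable current integration and charge sharing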

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
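CN108629406B (zh) * 2017-03-24 2020-12-18 展讯通信(上海)有限公司 Computing device for convolutional neural networks
GB2568102B (en) * 2017-11-06 2021-04-14 Imagination Tech Ltd Exploiting sparsity in a neural network
CN108764467B (zh) * 2018-04-04 2021-08-17 北京大学深圳研究生院 Circuit for convolution and fully connected operations of convolutional neural networks
CN109460817B (zh) * 2018-09-11 2021-08-03 华中科技大学 On-chip learning system for convolutional neural networks based on non-volatile memory
CN109104197B (zh) * 2018-11-12 2022-02-11 合肥工业大学 Encoding/decoding circuit and method for non-restored sparse data in convolutional neural networks
CN109800876B (zh) * 2019-01-18 2021-06-01 合肥恒烁半导体有限公司 Data operation method for neural networks based on a NOR Flash module
CN110378193B (zh) * 2019-05-06 2022-09-06 南京邮电大学 Cashmere and wool identification method based on a memristor neural network
CN110543933B (zh) * 2019-08-12 2022-10-21 北京大学 Spiking convolutional neural network based on a flash compute-in-memory array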

Also Published As

Publication number Publication date
CN111144558B (zh) 2020-08-18
CN111144558A (zh) 2020-05-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21780953

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21780953

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.03.2023)
