CN115629734A

CN115629734A - In-memory computing device and electronic apparatus of parallel vector multiply-add device

Info

Publication number: CN115629734A
Application number: CN202211329832.9A
Authority: CN
Inventors: 朱夏宁; 艾力
Original assignee: Hangzhou Zhixinke Microelectronics Technology Co ltd
Current assignee: Hangzhou Zhixinke Microelectronics Technology Co ltd
Priority date: 2022-10-27
Filing date: 2022-10-27
Publication date: 2023-01-20

Abstract

The embodiments of the application provide an in-memory computing device of a parallel vector multiply-add device and an electronic apparatus, which relate to the technical field of in-memory computing and can improve computing speed. The in-memory computing device of the parallel vector multiply-add device comprises: L multiply-add modules, each multiply-add module comprising q multiplication units and an adder, q > 1, L > 1; the multiplication unit comprises a memory and a multiplication circuit; the adder comprises q addition input ends, the a-th addition input end is electrically connected with the output end of the a-th multiplication unit, and the value of a is 1, 2, …, q; and a shift accumulator, which comprises L accumulation input ends, wherein the b-th accumulation input end is electrically connected with the output end of the b-th multiply-add module, and the value of b is 1, 2, …, L; each accumulation input end has a corresponding weight value, and the shift accumulator is used for shifting and accumulating the numerical values of the L accumulation input ends based on the weight value corresponding to each accumulation input end.

Description

In-memory computing device and electronic apparatus of parallel vector multiplier-adder

Technical Field

The present disclosure relates to the field of in-memory computing technologies, and in particular, to an in-memory computing apparatus and an electronic device for a parallel vector multiply-add unit.

Background

In the traditional von Neumann architecture, a large share of performance and power consumption is spent on data transmission and on reading and writing data, so efficiency is low. To address this problem, the compute-in-memory (CIM) architecture has emerged, in which a computing unit and a memory unit are integrated on the same chip to form a memory unit with computing capability, and operations are completed within it. This extremely close layout eliminates the delay and power consumption of data movement and alleviates the "memory wall" and "power wall" problems, thereby improving the computing energy efficiency ratio compared with the conventional architecture. However, in current CIM architecture chips, the calculation speed is slow when a large amount of data is processed.

Disclosure of Invention

The present application provides an in-memory computing device of a parallel vector multiply-add device and an electronic apparatus, which are capable of increasing computing speed.

In a first aspect, an in-memory computing device of a parallel vector multiply-add device is provided, including: L multiply-add modules, each multiply-add module comprising q multiplication units and an adder, q > 1, L > 1; the multiplication unit comprises a memory and a multiplication circuit, the multiplication circuit comprises a first multiplication input end and a second multiplication input end, the second multiplication input end is electrically connected with the output end of the memory, and the multiplication circuit is used for multiplying the numerical values of the first multiplication input end and the second multiplication input end and outputting the result through the output end of the multiplication unit; the adder comprises q addition input ends, wherein the a-th addition input end is electrically connected with the output end of the a-th multiplication unit, the value of a is 1, 2, …, q, and the output end of each adder is the output end of the corresponding multiply-add module; and a shift accumulator, which comprises L accumulation input ends, wherein the b-th accumulation input end is electrically connected with the output end of the b-th multiply-add module, and the value of b is 1, 2, …, L; each accumulation input end has a corresponding weight value, and the shift accumulator is used for performing shift accumulation on the numerical values of the L accumulation input ends based on the weight value corresponding to each accumulation input end.

In one possible implementation, the in-memory computing device further includes: an input module corresponding to at least one multiply-add module, the input module comprising q-1 input units which are sequentially cascaded, each input unit being used for outputting, in the current cycle, the numerical value input in the previous cycle; the output end of the a-th input unit is electrically connected with the first multiplication input end of the (a+1)-th multiplication unit of the corresponding multiply-add module; the input end of the 1st input unit is electrically connected with the first multiplication input end of the 1st multiplication unit of the corresponding multiply-add module; the input end of the c-th input unit is electrically connected with the output end of the (c-1)-th input unit, and the value of c is 2, 3, …, q.

In one possible implementation, the in-memory computing device includes: m multiply-add module groups, wherein each multiply-add module group comprises p multiply-add modules and one input module, and p > 1; in the d-th multiply-add module group, the output end of the a-th input unit is electrically connected to the first multiplication input end of the (a+1)-th multiplication unit of each multiply-add module, the input end of the 1st input unit is electrically connected to the first multiplication input end of the 1st multiplication unit of each multiply-add module, and the value of d is 1, 2, …, m.

In one possible embodiment, the weight value corresponding to the accumulation input end electrically connected to the output end of the f-th multiply-add module in the e-th multiply-add module group is $2^{m-e+p-f}$, where e is 1, 2, …, m and f is 1, 2, …, p.

In one possible implementation, the in-memory computing device is a finite impulse response (FIR) filter.

In one possible implementation, the multiplication circuit is an AND gate multiplication circuit.

In one possible embodiment, the multiplication circuit is a NAND gate multiplication circuit; the adder is an inverting adder.

In one possible embodiment, the shift accumulator is further configured to perform a two's complement calculation on the result of the shift accumulation calculation.

In one possible implementation, the multiplication circuit includes: a first transistor, a first end of which is electrically connected to the multiplication output end, and a control end of which is electrically connected to the first multiplication input end; a second transistor, a first end of which is electrically connected to the second end of the first transistor, a second end of which is electrically connected to the low level output end, and a control end of which is electrically connected to the second multiplication input end; the first transistor and the second transistor are n-type transistors.

In a second aspect, an electronic apparatus is provided, which includes the in-memory computing device described above.

In the in-memory computing device of the parallel vector multiply-add device and the electronic apparatus in the embodiments of the application, single-bit multiplication is realized through the cooperation of the memory and the multiplication circuit in each multiplication unit; the adder adds the outputs of the different multiplication units and provides the sum as the output of the multiply-add module; and the shift accumulator shifts and accumulates the outputs of the plurality of multiply-add modules based on the corresponding weight values. The bitwise multiplications of two binary numbers by the multiplication units are completed in parallel, so that all multiplications can be completed within 1 clock cycle and the shift accumulation can be completed within 2 clock cycles; the computing speed is thereby improved through parallelism.

Drawings

FIG. 1 is a block diagram of an in-memory computing device of a parallel vector multiply-add device according to an embodiment of the present application;

FIG. 2 is a block diagram of another in-memory computing device of a parallel vector multiply-add device according to an embodiment of the present application;

FIG. 3 is a block diagram of a multiplication unit according to an embodiment of the present application;

FIG. 4 is a block diagram of an adder and other modules according to an embodiment of the present disclosure;

FIG. 5 is a simplified computational diagram of an FIR filter;

FIG. 6 is a block diagram of a multiply-add module in an embodiment of the present application;

FIG. 7 is a block diagram of an alternative in-memory computing device for a parallel vector multiply-add unit according to an embodiment of the present application;

FIG. 8 is a schematic diagram of the 2nd multiply-add module group in FIG. 7;

FIG. 9 is a schematic diagram of the 14th multiply-add module group in FIG. 7;

FIG. 10 is a simplified schematic diagram of an algorithm matrix corresponding to a parallel vector multiply-add device according to an embodiment of the present application.

Detailed Description

The terminology used in the description of the embodiments section of the present application is for the purpose of describing particular embodiments of the present application only and is not intended to be limiting of the present application.

An embodiment of the present application provides an in-memory computing device of a parallel vector multiply-add device, including: L multiply-add modules 1, each multiply-add module 1 comprising q multiplication units 11 and an adder 12, q > 1, L > 1. As shown in FIG. 2 and FIG. 3, the multiplication unit 11 includes a memory 111 and a multiplication circuit 112; the multiplication circuit 112 includes a first multiplication input terminal MIN1 and a second multiplication input terminal MIN2, and the second multiplication input terminal MIN2 is electrically connected to the output terminal of the memory 111. The multiplication circuit 112 is used for multiplying the values of the first multiplication input terminal MIN1 and the second multiplication input terminal MIN2 and outputting the result through the output terminal of the multiplication unit 11. The multiplication circuit 112 implements single-bit multiplication and may specifically be an AND gate or a NAND gate; its detailed structure and principle are described later. The memory 111 may be, for example, a read-only memory (ROM) or a static random access memory (SRAM); if the memory 111 is an SRAM, the multiplication circuit 112 may, in addition to being implemented as an AND gate or a NAND gate, be implemented by using the structural features of the SRAM itself. The multiplication unit 11 multiplies two binary numbers bit by bit and outputs the result to the adder 12. The adder 12 comprises q addition input ends, the a-th addition input end is electrically connected to the output end of the a-th multiplication unit, the value of a is 1, 2, …, q, and the output end of each adder 12 is the output end of the corresponding multiply-add module 1. The device further includes a shift accumulator 2; the shift accumulator 2 includes L accumulation input ends, the b-th accumulation input end is electrically connected to the output end of the b-th multiply-add module 1, and the value of b is 1, 2, …, L; each accumulation input end has a corresponding weight value, and the shift accumulator 2 is used for performing shift accumulation on the values of the L accumulation input ends based on the weight value corresponding to each accumulation input end.

Specifically, for the same multiply-add module 1, the 1st multiplication unit 11 is used for multiplying $X_0$ and $W_0$ and outputting the result $Out_0$ to the 1st addition input terminal $A_0$ of the adder 12; the 2nd multiplication unit 11 is used for multiplying $X_1$ and $W_1$ and outputting the result $Out_1$ to the 2nd addition input terminal $A_1$ of the adder 12; by analogy, the q-th multiplication unit 11 is used for multiplying $X_{q-1}$ and $W_{q-1}$ and outputting the result $Out_{q-1}$ to the q-th addition input terminal $A_{q-1}$ of the adder 12. The output end of the 1st multiply-add module 1 outputs the multiply-add result $S_0$ to the 1st accumulation input end of the shift accumulator 2; the output end of the 2nd multiply-add module 1 outputs the multiply-add result $S_1$ to the 2nd accumulation input end of the shift accumulator 2; by analogy, the output end of the L-th multiply-add module 1 outputs the multiply-add result $S_{L-1}$ to the L-th accumulation input end of the shift accumulator 2. The shift accumulator 2 performs shift accumulation on the values of the L accumulation input ends based on the weight value corresponding to each accumulation input end. The weight value may correspond to the bit position of a binary value, e.g. the weight value of the lowest-order bit is $2^0$, the weight value of the next lowest-order bit is $2^1$, and so on. According to the bit position corresponding to the output value of a multiply-add module 1, the weight value of the accumulation input end electrically connected to the output end of that multiply-add module 1 can be determined; therefore, the weight value corresponding to each accumulation input end can be preset.
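The data flow described above can be sketched in a few lines of Python. This is a minimal behavioural sketch, not the circuit itself: the function names `multiply_add_module` and `shift_accumulate`, the AND-gate form of the multiplication, and the toy sizes (q = 3, L = 2) are assumptions chosen for illustration.

```python
def multiply_add_module(x_bits, w_bits):
    """One multiply-add module: q single-bit products (AND) summed by the adder."""
    assert len(x_bits) == len(w_bits)
    return sum(x & w for x, w in zip(x_bits, w_bits))


def shift_accumulate(module_outputs, shifts):
    """Shift accumulator: weight each module output by 2**shift, then sum."""
    assert len(module_outputs) == len(shifts)
    return sum(s << k for s, k in zip(module_outputs, shifts))


# Hypothetical toy case: q = 3 products per module, L = 2 modules.
x_bits = [1, 0, 1]      # single-bit inputs X0..X2
w_hi = [1, 1, 1]        # higher-order bits of three 2-bit coefficients (weight 2**1)
w_lo = [0, 1, 1]        # lowest-order bits of the same coefficients (weight 2**0)

s0 = multiply_add_module(x_bits, w_hi)
s1 = multiply_add_module(x_bits, w_lo)
result = shift_accumulate([s0, s1], shifts=[1, 0])
```

In this toy case the three coefficients are 2, 3 and 3, so `result` equals the dot product 1·2 + 0·3 + 1·3 = 5, computed from single-bit products only.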

In the in-memory computing device of the parallel vector multiply-add device in the embodiments of the application, single-bit multiplication is realized through the cooperation of the memory and the multiplication circuit in each multiplication unit; the adder adds the outputs of the different multiplication units and provides the sum as the output of the multiply-add module; and the shift accumulator shifts and accumulates the outputs of the plurality of multiply-add modules based on the corresponding weight values. The bitwise multiplications of two binary numbers by the multiplication units are completed in parallel, so that all multiplications can be completed within 1 clock cycle and the shift accumulation can be completed within 2 clock cycles; the computing speed is thereby improved through parallelism.

In one possible implementation, as shown in FIG. 2, the multiplication circuit 112 is a NAND gate multiplication circuit. Since the output of a multiplication circuit 112 implemented as a NAND gate is the inverse of the product, an inverting stage is needed in the subsequent circuit to recover the product; for example, the adder 12 is an inverting adder.
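One way to picture how an inverting adder can undo the NAND inversion is sketched below. The text does not specify the adder's internals, so the rule used here (count of products = q minus the sum of the NAND outputs) and the function names are assumptions for illustration only.

```python
def nand(a, b):
    """Single-bit NAND: the inverse of the single-bit product a*b."""
    return 1 - (a & b)


def inverting_adder(nand_outputs):
    """Assumed behaviour: count the zeros, i.e. the number of products equal to 1."""
    return len(nand_outputs) - sum(nand_outputs)


def multiply_add_nand(x_bits, w_bits):
    """Multiply-add module built from NAND multiplication units and an inverting adder."""
    return inverting_adder([nand(x, w) for x, w in zip(x_bits, w_bits)])


# Same result as summing AND products directly.
example = multiply_add_nand([1, 0, 1, 1], [1, 1, 0, 1])
```

With this rule the module output is identical to that of an AND-gate implementation, so the shift accumulator does not need to know which gate type was used.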

In other possible implementations, the multiplication circuit 112 may be an AND gate multiplication circuit.

In a possible embodiment, the shift accumulator 2 is further configured to perform a two's complement calculation on the result of the shift accumulation, so as to support calculation with signed numbers.

In some embodiments, as shown in FIG. 2, each multiplication unit 11 includes a memory 111 and a corresponding multiplication circuit 112. The memory 111 may be a register: the input end D of the register is electrically connected to the signal line of the externally input coefficient W, so that the coefficient W can be written; the clock signal end clk of the register is electrically connected to the clock signal line clk; and the output end Q of the register is electrically connected to the second multiplication input end of the corresponding multiplication circuit 112. The first multiplication input end of the multiplication circuit 112 is electrically connected to the signal line of the external input signal X, and the output end of the multiplication circuit 112 is electrically connected to one addition input end of the corresponding adder 12. The output of the adder 12 is a multi-bit binary number. The multiplication circuit 112 in the example of FIG. 2 is a NAND gate.

In some embodiments, adder 12 is a digital adder.

In some embodiments, as shown in FIG. 4, the adder 12 includes an analog addition circuit and an analog-to-digital converter (ADC); the analog addition circuit comprises q capacitors C, a first end of the a-th capacitor C is the a-th addition input end, second ends of the q capacitors C are connected to an input end of the analog-to-digital converter ADC, and an output end of the analog-to-digital converter ADC serves as the output end of the adder 12. The analog addition circuit realizes the addition of single-bit values in the charge domain by means of capacitors, which saves area compared with a digital adder; however, an ADC is required to convert the analog addition result into a digital value, so that the subsequent shift accumulation can be carried out on a digital result.
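The charge-domain addition can be illustrated with an idealized model. The sketch below assumes q equal capacitors, no parasitic capacitance, a shared node that settles at the average of the driven voltages, and an ideal ADC with q + 1 levels; none of these details are stated in the text, and the names `charge_share_voltage` and `adc_count` are hypothetical.

```python
def charge_share_voltage(bits, vdd=1.0):
    """Ideal shared-node voltage when equal capacitors are driven to 0 or vdd."""
    return vdd * sum(bits) / len(bits)


def adc_count(voltage, q, vdd=1.0):
    """Ideal ADC: map the shared-node voltage back to an integer count in 0..q."""
    return round(voltage / vdd * q)


# Example with q = 15 single-bit products, 7 of which are 1.
q = 15
bits = [1] * 7 + [0] * (q - 7)
count = adc_count(charge_share_voltage(bits), q)
```

Under these assumptions the ADC output equals the number of products that are 1, which is the value a digital adder would produce.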

In one possible implementation, as shown in fig. 3, the multiplication circuit 112 includes: a first transistor m1 having a first end electrically connected to the multiplication output terminal MOUT and a control end electrically connected to the first multiplication input terminal MIN1; a second transistor m2, a first end of which is electrically connected to the second end of the first transistor m1, a second end of which is electrically connected to the low level output terminal V1, and a control end of which is electrically connected to the second multiplication input terminal MIN2; the first transistor m1 and the second transistor m2 are n-type transistors.

Specifically, the multiplication circuit 112 formed by the first transistor m1 and the second transistor m2 is in fact a NAND gate. In this embodiment, a high level represents 1 and a low level represents 0, and the low level output terminal V1 is used for outputting a low level representing 0. When either of the first transistor m1 and the second transistor m2 is turned off, the multiplication output terminal MOUT stays at a high level, i.e. outputs 1; only when the first transistor m1 and the second transistor m2 are both turned on is the multiplication output terminal MOUT pulled low by the low level output terminal V1, thereby becoming a low level, i.e. outputting 0. The first transistor m1 and the second transistor m2 are both n-type transistors, i.e. they are turned on by a high level and turned off by a low level. The corresponding values are shown in Table 1.

TABLE 1

MIN1	MIN2	MOUT
1	1	0
1	0	1
0	1	1
0	0	1

Table 1 lists the values at each terminal of the multiplication circuit 112 in FIG. 3 under the different input conditions. It can be seen that the value output by the multiplication output terminal MOUT is the inverse of the product of the values at the first multiplication input terminal MIN1 and the second multiplication input terminal MIN2; the product can be recovered by inversion in the subsequent circuit, for example at the adder. Therefore, the multiplication function can be realized by the multiplication circuit 112 composed of the first transistor m1 and the second transistor m2. It should be noted that the structure of the multiplication circuit 112 in FIG. 3 is only an example, and the specific structure of the multiplication circuit in the embodiments of the present application is not limited, as long as the multiplication of single-bit binary values can be realized.

In one possible implementation, as shown in FIG. 3, the memory 111 includes: a third transistor m3, a first end of which is electrically connected to the high level output terminal V2, the high level output terminal V2 being configured to output a high level representing 1; a fourth transistor m4, a first end of which is electrically connected to the second end of the third transistor m3, a second end of which is electrically connected to the low level output terminal V1, and a control end of which is electrically connected to the control end of the third transistor m3; a fifth transistor m5, a first end of which is electrically connected to the high level output terminal V2, a second end of which is the output terminal of the memory 111, and a control end of which is electrically connected to the second end of the third transistor m3, that is, the second end of the fifth transistor m5 is electrically connected to the second multiplication input terminal MIN2 of the multiplication circuit 112; a sixth transistor m6, a first end of which is electrically connected to the second end of the fifth transistor m5, a second end of which is electrically connected to the low level output terminal V1, and a control end of which is electrically connected to the control end of the fifth transistor m5, the control end of the sixth transistor m6 being electrically connected to the node Q; a seventh transistor m7, a first end of which is electrically connected to a write bit line (WBL), a second end of which is electrically connected to the second end of the third transistor m3, and a control end of which is electrically connected to a write word line (WWL); an eighth transistor m8, a first end of which is electrically connected to the inverted write bit line WBLB, a second end of which is electrically connected to the second end of the fifth transistor m5, and a control end of which is electrically connected to the write word line WWL, the inverted write bit line WBLB and the write bit line WBL carrying opposite signals; the third transistor m3 and the fifth transistor m5 are p-type transistors, and the fourth transistor m4, the sixth transistor m6, the seventh transistor m7, and the eighth transistor m8 are n-type transistors.

Specifically, the memory 111 shown in FIG. 3 is a static random access memory (SRAM). When input data is written into the memory 111, the write word line WWL is at a high level, the seventh transistor m7 and the eighth transistor m8 are controlled to be turned on, and the data on the write bit line WBL is transmitted to the node Q through the seventh transistor m7, thereby realizing data writing. The first multiplication input terminal MIN1 may be referred to as a read word line (RWL), and the multiplication output terminal MOUT may be referred to as a read bit line (RBL). It should be noted that the circuit structure of the memory 111 shown in FIG. 3 is only an example, and the specific structure of the memory 111 in the embodiments of the present application is not limited, as long as the memory function can be realized.

In one possible implementation, as shown in fig. 3, the in-memory computing device further includes: q precharge transistors m0 corresponding to the adder 12, the a-th addition input terminal Aa of the adder 12 is electrically connected to the first terminal of the a-th precharge transistor m0, and the second terminal of the precharge transistor m0 is electrically connected to the high level output terminal V2. Before each multiplication calculation by the multiplication circuit 112, the precharge transistor m0 is controlled to be turned on, so that the effect of charging the multiplication output terminal MOUT with a high level through the precharge transistor m0 is realized. Then, the pre-charge transistor m0 is controlled to be turned off, at this time, the multiplication circuit 112 may perform multiplication calculation, and if at least one of the first transistor m1 and the second transistor m2 is turned off, the multiplication output terminal MOUT outputs the previously pre-charged high level, that is, output 1; if the first transistor m1 and the second transistor m2 are both turned on, the multiplication output terminal MOUT is pulled low by the low level output terminal V1 to become a low level, i.e., output 0.
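The precharge-then-evaluate behaviour, and with it Table 1, can be reproduced with a small logic-level model. This is a sketch only: it treats the transistors as ideal switches and ignores timing and charge leakage, and the function name `multiplication_circuit` is hypothetical.

```python
def multiplication_circuit(min1, min2):
    """Logic-level model of the circuit in FIG. 3 (ideal switches assumed)."""
    mout = 1                  # precharge phase: m0 on, MOUT charged to high level
    m1_on = (min1 == 1)       # n-type transistor m1 conducts when MIN1 is high
    m2_on = (min2 == 1)       # n-type transistor m2 conducts when MIN2 is high
    if m1_on and m2_on:       # evaluate phase: series path to V1 pulls MOUT low
        mout = 0
    return mout


# Rebuild Table 1 as a mapping (MIN1, MIN2) -> MOUT.
truth_table = {(a, b): multiplication_circuit(a, b) for a in (1, 0) for b in (1, 0)}
```

Every entry of `truth_table` is the inverse of the product MIN1·MIN2, which matches the NAND behaviour listed in Table 1.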

In one possible implementation, the in-memory computing device is a finite impulse response (FIR) filter. As shown in FIG. 5, W0 to W14 are the 15 tap coefficients of the filter, each coefficient being a 14-bit binary number, and X0 is the input signal, which is also a 14-bit binary number. Each product X·W is therefore the product of two 14-bit binary numbers, which is computed by bitwise multiplication; there are 15 such products in total, and after the multiplications are finished the results are summed to obtain the result y. The embodiments of the application can compute the FIR filter in a fully parallel manner.

In one possible implementation, as shown in FIG. 6, the in-memory computing device further includes: an input module 3 corresponding to at least one multiply-add module 1, the input module 3 comprising q-1 input units 30 which are sequentially cascaded, each input unit 30 being used for outputting, in the current cycle, the value input in the previous cycle; the output end of the a-th input unit 30 is electrically connected to the first multiplication input end of the (a+1)-th multiplication unit of the corresponding multiply-add module 1; the input end of the 1st input unit 30 is electrically connected to the first multiplication input end of the 1st multiplication unit of the corresponding multiply-add module 1; the input end of the c-th input unit 30 is electrically connected to the output end of the (c-1)-th input unit 30, and the value of c is 2, 3, …, q. In this structure, the delayed input signals X required by the same multiply-add module 1 can be derived from a single input by the input module 3.

For example, the input module 3 includes 14 input units 30 cascaded in sequence, and the multiply-add module 1 includes 15 multiplication units and 15 corresponding addition input terminals $A_0$ to $A_{14}$. The output end of the 1st input unit 30 is electrically connected to the first multiplication input end of the 2nd multiplication unit, the output end of the 2nd input unit 30 is electrically connected to the first multiplication input end of the 3rd multiplication unit, and so on, until the output end of the 14th input unit 30 is electrically connected to the first multiplication input end of the 15th multiplication unit. The first multiplication input end of the 1st multiplication unit is electrically connected to the input end of the 1st input unit 30.
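The cascade of input units behaves like a delay line. The sketch below models it as a list of registers; the class name `InputModule`, the `step` method, and the reset value 0 are assumptions made for illustration, and a small q = 4 is used instead of 15 to keep the example short.

```python
class InputModule:
    """q-1 cascaded input units; each outputs the value it received in the last cycle."""

    def __init__(self, q):
        self.regs = [0] * (q - 1)    # assumption: all input units reset to 0

    def step(self, x_n):
        """Advance one cycle; return the q values at the first multiplication inputs."""
        taps = [x_n] + self.regs     # x_n, x_{n-1}, ..., x_{n-(q-1)}
        self.regs = [x_n] + self.regs[:-1]
        return taps


# Feed the single-bit sequence 1, 0, 1, 1 into a module with q = 4.
module = InputModule(4)
history = [module.step(x) for x in (1, 0, 1, 1)]
```

After four cycles the taps hold the newest sample first and the oldest last, so only the current sample has to be supplied from outside.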

In one possible implementation, as shown in FIG. 6 to FIG. 9, the in-memory computing device includes: m multiply-add module groups 10, wherein each multiply-add module group 10 comprises p multiply-add modules 1 and one input module 3, and p > 1; in the d-th multiply-add module group 10, the output end of the a-th input unit 30 is electrically connected to the first multiplication input end of the (a+1)-th multiplication unit of each multiply-add module 1 in the d-th multiply-add module group 10, the input end of the 1st input unit 30 is electrically connected to the first multiplication input end of the 1st multiplication unit of each multiply-add module 1 in the d-th multiply-add module group 10, and the value of d is 1, 2, …, m. FIG. 6 is a schematic structural diagram of the 1st multiply-add module group 101 of the in-memory computing device in FIG. 7, FIG. 8 is a schematic structural diagram of the 2nd multiply-add module group 102 in FIG. 7, and FIG. 9 is a schematic structural diagram of the 14th multiply-add module group 1014 in FIG. 7.

In one possible implementation, as shown in FIG. 6 to FIG. 9, the weight value corresponding to the accumulation input end electrically connected to the output end of the f-th multiply-add module in the e-th multiply-add module group is $2^{m-e+p-f}$, where e is 1, 2, …, m and f is 1, 2, …, p.
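The weight rule can be tabulated directly. In the sketch below, `accumulation_shift` implements the stated exponent m-e+p-f; `accumulator_index` additionally assumes that the accumulation inputs are numbered group by group, which is an illustrative assumption rather than something the rule itself states. Both function names are hypothetical.

```python
def accumulation_shift(e, f, m=14, p=14):
    """Exponent of the weight 2**(m-e+p-f) for module f of group e (both 1-based)."""
    return (m - e) + (p - f)


def accumulator_index(e, f, p=14):
    """0-based accumulation input index, assuming group-major numbering."""
    return (e - 1) * p + (f - 1)


# A few representative entries for m = p = 14.
shift_first = accumulation_shift(1, 1)     # first module of the first group
shift_last = accumulation_shift(14, 14)    # last module of the last group
```

For m = p = 14 the exponents run from 26 for the first module of the first group down to 0 for the last module of the last group, and several modules share the same exponent (for example group 1, module 2 and group 2, module 1).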

Specifically, the operation and principle of the in-memory computing device in the embodiments of the present application are described below based on a specific algorithm. For example, the formula that needs to be calculated is:

$$y_n = \sum_{j=0}^{14} c_j \, x_{n-j}$$

where $c_j$ and $x_{n-j}$ are both 14-bit binary numbers.

$c_j = W_{j,13} W_{j,12} W_{j,11} W_{j,10} W_{j,9} W_{j,8} W_{j,7} W_{j,6} W_{j,5} W_{j,4} W_{j,3} W_{j,2} W_{j,1} W_{j,0}$, where each W represents one binary digit, $W_{j,0}$ is the lowest-order bit (the 1st bit) and $W_{j,13}$ is the highest-order bit (the 14th bit); that is, $c_j$ represents a 14-bit binary number. The value of this binary number can be expressed as:

$c_j = 2^{13} W_{j,13} + 2^{12} W_{j,12} + \dots + 2^{1} W_{j,1} + 2^{0} W_{j,0}$, where each binary digit W is multiplied by the weight determined by its bit position: the first bit $W_{j,0}$ has weight $2^{0}$, the second bit $W_{j,1}$ has weight $2^{1}$, and by analogy the fourteenth bit $W_{j,13}$ has weight $2^{13}$.

$x_{n-j} = X_{n-j,13} X_{n-j,12} X_{n-j,11} X_{n-j,10} X_{n-j,9} X_{n-j,8} X_{n-j,7} X_{n-j,6} X_{n-j,5} X_{n-j,4} X_{n-j,3} X_{n-j,2} X_{n-j,1} X_{n-j,0}$, where each X represents one binary digit, $X_{n-j,0}$ is the lowest-order bit (the 1st bit) and $X_{n-j,13}$ is the highest-order bit (the 14th bit); that is, $x_{n-j}$ represents a 14-bit binary number. The value of this binary number can be expressed as:

$x_{n-j} = 2^{13} X_{n-j,13} + 2^{12} X_{n-j,12} + \dots + 2^{1} X_{n-j,1} + 2^{0} X_{n-j,0}$, where each binary digit X is multiplied by the weight determined by its bit position: the first bit $X_{n-j,0}$ has weight $2^{0}$, the second bit $X_{n-j,1}$ has weight $2^{1}$, and by analogy the fourteenth bit $X_{n-j,13}$ has weight $2^{13}$.

The product $c_j x_{n-j}$ finally consists of 14 × 14, i.e. 196, multiply-accumulate terms. The calculation of $c_j x_{n-j}$ is expressed as follows:

$c_j x_{n-j} = 2^{13} X_{n-j,13} (2^{13} W_{j,13} + \dots + 2^{0} W_{j,0}) + 2^{12} X_{n-j,12} (2^{13} W_{j,13} + \dots + 2^{0} W_{j,0}) + \dots + 2^{0} X_{n-j,0} (2^{13} W_{j,13} + \dots + 2^{0} W_{j,0})$,

$c_j x_{n-j} = (2^{26} X_{n-j,13} W_{j,13} + \dots + 2^{13} X_{n-j,13} W_{j,0}) + (2^{25} X_{n-j,12} W_{j,13} + \dots + 2^{12} X_{n-j,12} W_{j,0}) + \dots + (2^{13} X_{n-j,0} W_{j,13} + \dots + 2^{0} X_{n-j,0} W_{j,0})$.

The calculation of $y_n$ is expressed as follows:
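The equation itself is not reproduced in this text. Assuming that $y_n$ is the sum of the 15 products $c_j x_{n-j}$ expanded above, it can be written compactly as follows; the index names k (bit of x) and i (bit of c) are introduced here only for illustration.

```latex
\[
y_n \;=\; \sum_{j=0}^{14} c_j \, x_{n-j}
    \;=\; \sum_{k=0}^{13} \sum_{i=0}^{13} 2^{\,k+i}
          \left( \sum_{j=0}^{14} X_{n-j,k} \, W_{j,i} \right)
\]
```

In this form, each inner sum over j is a sum of 15 single-bit products, which is the kind of quantity one multiply-add module produces, and the 14 × 14 = 196 pairs (k, i) correspond to the 196 weighted terms handled by the shift accumulator.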

In connection with the embodiments of the present application, for example, q = 15, L = 196, m = 14, and p = 14 are set. A memory (not shown in FIG. 6 to FIG. 8) is used for storing W. The second multiplication input terminals of the 1st multiply-add module 1 in each multiply-add module group 10 receive the 1st bit $W_{j,0}$ of $c_j$; for the same multiply-add module 1, the W values input at the 15 second multiplication input terminals correspond to j = 0, 1, 2, …, 14 respectively, that is, j = a - 1. The input end of the 1st input unit 30 of the input module 3 in the 1st multiply-add module group 101 receives the 14th bit $X_{n-j,13}$ of $x_{n-j}$, the 1st input unit 30 of the input module 3 in the 2nd multiply-add module group 102 receives the 13th bit $X_{n-j,12}$ of $x_{n-j}$, and by analogy, the 1st input unit 30 of the input module 3 in the 14th multiply-add module group 1014 receives the 1st bit $X_{n-j,0}$ of $x_{n-j}$. In each input module 3, the cascade of input units 30 converts the input $X_n$ in turn into $X_{n-1}$, $X_{n-2}$, …, $X_{n-14}$, which are output respectively to the 2nd to 15th first multiplication input terminals of each multiply-add module 1 in the same multiply-add module group 10. That is, when j is greater than or equal to 1, $X_{n-j}$ is obtained through the input units 30, so that the external input interface only needs to provide $X_n$. The embodiments of the present application do not limit the number of $X_n$, which can be determined as required. In addition to obtaining $X_{n-j}$ by converting the input $X_n$ through the input module 3 as described above, in other possible embodiments each $X_{n-j}$ may be obtained by external input.

With reference to Figs. 6 to 10, the output of each multiplication unit corresponds to the product of one X and one W in Fig. 10, and the output of each multiply-add module 1 corresponds to the multiply-add result of one column of data in Fig. 10. Each multiply-add result is a 4-bit output. The resulting 196 4-bit values are output to the shift accumulator 2, which finally outputs a 32-bit result.
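The 4-bit and 32-bit widths can be sanity-checked with a short calculation. This is a sketch assuming unsigned 14-bit operands and one-bit products modeled as a logical AND; the function name `multiply_add` and the variable names are illustrative.

```python
Q = 15  # multiplication units per multiply-add module


def multiply_add(x_bits, w_bits):
    """Sum of q one-bit products; each product is modeled as a logical AND."""
    assert len(x_bits) == len(w_bits) == Q
    return sum(x & w for x, w in zip(x_bits, w_bits))


# Largest possible module output: every one-bit product equals 1.
worst_module = multiply_add([1] * Q, [1] * Q)

# Largest possible final result: 15 taps, each the product of two maximal 14-bit values.
worst_total = Q * (2**14 - 1) ** 2
```

Here `worst_module` is 15, which needs 4 bits, and `worst_total` is 4026040335, which needs 32 bits, in line with the widths stated above for unsigned data.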

X_{n-j}<13:0> and W_j<13:0> are multiplied bit by bit, which requires 196 multiplications. Each product carries a different weight, and these weights are implemented by the shift accumulator 2. In total, 15 sets of such multiplications are directly summed and input to the shift accumulator 2. All of these multiplications and additions are processed in parallel and require only one clock cycle to complete. In addition, the shift accumulator requires at least 2 clock cycles to complete the accumulation.

The weights in the shift accumulator 2 can be determined in advance according to the above formula. For example, the value obtained at the accumulation input end electrically connected to the output end of the 1st multiply-add module 1 in the 1st multiply-add module group 101 is S0, and from the multiply-add result corresponding to S0 it can be determined that the corresponding weight is 2^26 = 2^(14-1+14-1). The value obtained at the accumulation input end electrically connected to the output end of the 2nd multiply-add module 1 in the 1st multiply-add module group 101 is S1, and from the multiply-add result corresponding to S1 it can be determined that the corresponding weight is 2^25 = 2^(14-1+14-2). By analogy, the value obtained at the accumulation input end electrically connected to the output end of the 1st multiply-add module 1 in the 2nd multiply-add module group 102 is S14, and from the multiply-add result corresponding to S14 it can be determined that the corresponding weight is 2^25 = 2^(14-2+14-1). The value obtained at the accumulation input end electrically connected to the output end of the 2nd multiply-add module 1 in the 2nd multiply-add module group 102 is S15, and from the multiply-add result corresponding to S15 it can be determined that the corresponding weight is 2^24 = 2^(14-2+14-2). By analogy, the weight corresponding to the accumulation input end electrically connected to the output end of the f-th multiply-add module in the e-th multiply-add module group is 2^(m-e+p-f).
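The weighting rule can be exercised end to end in a small behavioral model. The sketch below assumes unsigned 14-bit data and, as an inference from the weight 2^(m-e+p-f) rather than a statement in the text, that the f-th module of the e-th group combines bit m-e of each x_{n-j} with bit p-f of each c_j. The function names `weight`, `accumulator_index`, and `shift_accumulate` are hypothetical.

```python
M, P, Q = 14, 14, 15  # module groups (bits of x), modules per group (bits of c), taps


def bit(value, k):
    """Return bit k (0 = least significant) of a non-negative integer."""
    return (value >> k) & 1


def weight(e, f):
    """Weight of the f-th module in the e-th group (both 1-based): 2^(m-e+p-f)."""
    return 1 << (M - e + P - f)


def accumulator_index(e, f):
    """Index i of S_i for the f-th module in the e-th group (S0, S1, ..., S195)."""
    return (e - 1) * P + (f - 1)


def shift_accumulate(xs, cs):
    """Model y_n, where xs[j] = x_{n-j} and cs[j] = c_j are unsigned values below 2**14."""
    total = 0
    for e in range(1, M + 1):        # group e is assumed to handle X bit m - e
        for f in range(1, P + 1):    # module f is assumed to handle W bit p - f
            s = sum(bit(xs[j], M - e) & bit(cs[j], P - f) for j in range(Q))
            total += s * weight(e, f)  # multiplying by a power of two is a left shift
    return total
```

Under these assumptions the model reproduces the weights listed above (S0 gets 2^26, S1 and S14 get 2^25, S15 gets 2^24), and `shift_accumulate(xs, cs)` equals the direct sum of the 15 products `xs[j] * cs[j]`.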

An embodiment of the present application further provides an electronic device, including the in-memory computing apparatus in any of the above embodiments. The electronic device may be a mobile phone, a tablet computer, a personal computer (PC), a personal digital assistant (PDA), a smart watch, a netbook, a wearable electronic device, an augmented reality (AR) device, a virtual reality (VR) device, an in-vehicle device, a smart car, a smart speaker, a robot, smart glasses, a smart television, or the like. The in-memory computing apparatus may be a chip in the electronic device.

In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and covers three cases; for example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" and similar expressions refer to any combination of the listed items, including any combination of single or multiple items. For example, at least one of a, b, and c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. An in-memory computing device for a parallel vector multiply-add unit, comprising:

L multiply-add modules, each of which comprises q multiplication units and an adder, q > 1, L > 1;

the multiplication unit comprises a memory and a multiplication circuit, the multiplication circuit comprises a first multiplication input end and a second multiplication input end, the second multiplication input end is electrically connected with the output end of the memory, and the multiplication circuit is used for carrying out multiplication calculation on the numerical values of the first multiplication input end and the second multiplication input end and outputting a result through the output end of the multiplication unit;

the adder comprises q addition input ends, the a-th addition input end is electrically connected with the output end of the a-th multiplication unit, the value of a is 1, 2, …, q, and the output end of each adder is the output end of the corresponding multiply-add module;

a shift accumulator comprising L accumulation input ends, wherein the b-th accumulation input end is electrically connected with the output end of the b-th multiply-add module, and the value of b is 1, 2, …, L;

each accumulation input end has a corresponding weight value, and the shift accumulator is used for carrying out shift accumulation calculation on the numerical values of the L accumulation input ends based on the weight value corresponding to each accumulation input end.

2. The apparatus of claim 1, further comprising:

an input module corresponding to at least one multiply-add module and comprising q-1 sequentially cascaded input units, wherein each input unit is used for outputting, in the current period, the numerical value that was input in the previous period;

the output end of the a-th input unit is electrically connected to the first multiplication input end of the a + 1-th multiplication unit of the corresponding multiplication and addition module;

the input end of the 1 st input unit is electrically connected to the first multiplication input end of the 1 st multiplication unit of the corresponding multiplication and addition module;

the input end of the c-th input unit is electrically connected with the output end of the (c-1)-th input unit, and the value of c is 2, 3, …, q.

3. The apparatus of claim 2, comprising:

m multiply-add module groups, wherein each multiply-add module group comprises p multiply-add modules and one input module, and p is more than 1;

in the d-th multiply-add module group, the output end of the a-th input unit is electrically connected to the first multiplication input end of the (a+1)-th multiplication unit of each multiply-add module, the input end of the 1st input unit is electrically connected to the first multiplication input end of the 1st multiplication unit of each multiply-add module, and the value of d is 1, 2, …, m.

4. The apparatus of claim 3,

the weight value corresponding to the accumulation input end electrically connected with the output end of the f-th multiply-add module in the e-th multiply-add module group is 2^(m-e+p-f), where e is 1, 2, …, m and f is 1, 2, …, p.

5. The device according to any one of claims 1 to 3,

the in-memory computing device is a finite impulse response (FIR) filter.

6. The device according to any one of claims 1 to 3,

the multiplication circuit is an AND gate multiplication circuit.

7. The device according to any one of claims 1 to 3,

the multiplication circuit is a NAND gate multiplication circuit;

the adder is an inverting adder.

8. The device according to any one of claims 1 to 3,

the shift accumulator is further used for performing two's complement calculation on the result of the shift accumulation calculation.

9. The device according to any one of claims 1 to 3,

the multiplication circuit includes:

a first transistor, a first end of which is electrically connected to the multiplication output end, and a control end of which is electrically connected to the first multiplication input end;

a second transistor, a first end of which is electrically connected to the second end of the first transistor, a second end of which is electrically connected to the low level output end, and a control end of which is electrically connected to the second multiplication input end;

the first transistor and the second transistor are n-type transistors.

10. An electronic device comprising the in-memory computing apparatus of any one of claims 1 to 9.