CN117521734A - In-memory computing circuit for realizing efficient multiplication operation - Google Patents

In-memory computing circuit for realizing efficient multiplication operation

Info

Publication number
CN117521734A
Authority
CN
China
Prior art keywords
memory
data
array
multiplicand
partial product
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311462442.3A
Other languages
Chinese (zh)
Inventor
王中风
张旭
邹丁阳
张高澈
王美琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202311462442.3A
Publication of CN117521734A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52: Multiplying; Dividing
    • G06F7/523: Multiplying only

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides an in-memory computing circuit for energy-efficient multiplication. The circuit comprises an in-memory Booth encoder array and an in-memory computing array, and the in-memory computing array comprises in-memory partial product generators. Each in-memory partial product generator stores a multiplicand. The two complementary latch nodes of a data latch unit represent the multiplicand and its negated (inverted) form, and the two complementary latch nodes of the adjacent data latch unit represent twice the multiplicand and its negated form. Inversion and shifting are therefore realized without extra transistor overhead, and all possible non-zero partial products of the radix-4 Booth algorithm are generated. The multiplier signal controls the in-memory Booth encoder array to generate encoding signals, and the encoding signals control a data selector to select one of the four non-zero partial products. The invention is suitable for multiplication of any bit width, improves the utilization of the data latch unit circuit and the symmetry of the compute-in-memory cell circuit, and allows the computational parallelism to be adjusted flexibly.

Description

In-memory computing circuit for realizing efficient multiplication operation
Technical Field
The present invention relates to an in-memory computing circuit for implementing energy-efficient multiplication operations.
Background
Artificial intelligence (AI) is widely applied in many fields and has ushered in the "computing-power era". AI algorithms such as deep neural networks (DNNs) and convolutional neural networks (CNNs) require large-scale data processing. However, most modern computing systems are based on the traditional von Neumann architecture, in which the computing units and the storage units are physically separate. Large amounts of data must be transferred repeatedly between the memory and the computing units while computing tasks execute, which causes significant latency and energy loss and limits data-processing efficiency. Because processors and memory devices have developed unevenly over time, the speed gap between them keeps widening; this gap is known as the "memory wall" and gives rise to the well-known von Neumann bottleneck. Only when the von Neumann bottleneck is resolved can AI be deployed on devices with strict energy and area limits (such as Internet-of-Things devices, mobile devices and wearable devices), achieving "AI everywhere". To overcome the computational limits imposed by the von Neumann architecture, computing in memory (CIM) has emerged. CIM removes the need to move data from the memory to the processor by integrating the computation directly into the memory array, which reduces both the transfer of intermediate data and the computational load on the processor. In a CIM architecture, bus bandwidth is no longer the limiting factor for throughput, so throughput and energy efficiency improve significantly.
Another significant advantage of in-memory computing is that multiple rows can be read at once, which reduces the number of memory accesses and increases data throughput. As in-memory computing has become a popular research area, more and more researchers have begun working in it, and multiplication, addition, subtraction and logic operations have all been implemented in memory. The memories used to implement in-memory computing include various volatile and non-volatile types, such as static random-access memory (SRAM), dynamic random-access memory (DRAM), resistive random-access memory (RRAM), phase-change random-access memory (PRAM) and flash memory. Because SRAM data latch units offer high stability, fast reads and high write endurance, and because the SRAM manufacturing process is compatible with advanced logic processes, SRAM-based in-memory computing has received extensive attention from academia and industry.
The prior art is as follows. Existing SRAM-based in-memory computing implementations can be divided into analog computing in memory (ACIM) and digital computing in memory (DCIM). ACIM generally converts the input digital signal into an analog signal and performs a single-bit analog multiplication with the logic value stored in the data latch unit, forming a multiply-accumulate unit. Several word lines are activated simultaneously so that the discharge currents generated by the multiply-accumulate units in the same column are summed; after an analog-to-digital converter turns the sum into a digital signal, it can be used by the following circuit stage. DCIM generally reuses the data latch unit as a logic operation unit: the single-bit multiplication (that is, a bitwise AND) is moved into the memory, and a shift accumulator and adder circuit at the periphery of the memory array complete the multiply-accumulate operation. ACIM has the advantage of high energy efficiency, but the nature of analog computation makes it sensitive to process variation, voltage fluctuation and circuit noise, so its computational precision is generally below 8 bits. The peripheral digital-to-analog and analog-to-digital converters in an ACIM architecture also introduce extra overhead. DCIM avoids the overhead of data converters and loses no computational precision, but its energy efficiency and area efficiency are typically worse than those of analog in-memory computing chips.
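The digital scheme described above can be illustrated with a minimal Python sketch. It assumes unsigned operands and one input bit processed per clock cycle; the function name `bit_serial_multiply` and the unsigned restriction are illustrative assumptions, not taken from the patent.

```python
def bit_serial_multiply(i_val: int, w_val: int, n: int) -> int:
    """Multiply an n-bit unsigned input by a stored weight, one input bit per cycle."""
    acc = 0
    for cycle in range(n):
        i_bit = (i_val >> cycle) & 1
        # In-array step: AND the single input bit with every stored weight bit.
        partial = w_val if i_bit else 0
        # Peripheral step: shift the partial product into place and accumulate it.
        acc += partial << cycle
    return acc
```

For an n-bit input this loop runs n times, which is the serial cost that the Booth-based scheme of the invention sets out to reduce.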
Disclosure of Invention
Purpose of the invention: to address the shortcomings of the prior art, the invention provides an in-memory computing circuit for energy-efficient multiplication, which comprises an in-memory Booth encoder array, a row decoder and word line driving circuit, a column decoder and read-write driving circuit, and an in-memory computing array.
Unlike conventional multiplier circuit designs, all of the above components are implemented within or in close proximity to the memory, which eliminates or reduces the overhead of data transfer between the multiplication unit and the memory.
The in-memory computing circuit has three working modes: read, write and compute.
In read and write modes, the row decoder selects a specific row or cell in the memory for reading or writing. When the row decoder selects a row, the word line driving circuit generates the control signals needed to activate that row so that data can be read from or written to it. The column decoder selects a specific column within the row already selected by the row decoder: when the memory system receives an address signal, the column decoder further decodes it to determine which memory column to access, ensuring that only data on the selected column is read or written. The read-write driving circuit controls the read and write operations on the data. For a read, it amplifies the data in the selected memory cell and outputs it to the data bus; for a write, it writes data from the data bus into the selected memory cell. The read-write driving circuit also handles the timing control of data writes and reads. The row decoder and column decoder thus work together to select a specific data unit in the memory, while the word line driving circuit and the read-write driving circuit activate and process the selected data unit for the read or write operation. These components ensure that data can be stored and retrieved efficiently.
In compute mode, the in-memory Booth encoder array receives the multiplier signal and outputs Booth encoding signals; under their control, the in-memory computing array inverts and shifts the multiplicand stored in the memory and outputs all partial product signals of the multiplication.
The in-memory computing array comprises n in-memory partial product generator arrays; each in-memory partial product generator array comprises m in-memory partial product generators, and each in-memory partial product generator comprises k compute-in-memory cells. Here n is the number of rows of the in-memory computing array, k is the bit width of the multiplicand, and k×m is the number of columns of the in-memory computing array.
Each compute-in-memory cell includes a data latch unit, one or more pairs of read-write ports, and a data selector array containing z data selectors.
The i-th data latch unit has two complementary data latch nodes Q[i] and QN[i], where i is the position of the data latch unit and 0 ≤ i ≤ k-1. The k data latch units store the k-bit multiplicand from right to left, in order from the least significant bit to the most significant bit.
The read-write port performs read-write operation on the in-memory computing array under the control of the peripheral circuit.
Each data selector includes four control signal inputs TWO, TWON, NEG and NEGN, four data inputs W, 2W, 2WN and WN, and one data output OUT.
TWO, TWON, NEG and NEGN are generated by the in-memory Booth encoder array; W, 2W, 2WN and WN are generated by the data latch units of the compute-in-memory cells.
Using the complementary latch signals already present in the memory, the k compute-in-memory cells are cascaded one by one from right to left, so that a one-bit left shift is realized inside the memory with zero transistor overhead and all possible non-zero partial product candidate signals of the Radix-4 Booth algorithm are generated.
One multiplicand is stored in each in-memory partial product generator. The two complementary latch nodes of a data latch unit represent the multiplicand and its negated (inverted) form, and the two complementary latch nodes of the adjacent data latch unit represent twice the multiplicand and its negated form. Inversion and shifting are thus realized without extra transistor overhead, and all possible non-zero partial products of the Radix-4 Booth algorithm are generated. The specific implementation is as follows. In the 0th compute-in-memory cell, the 2W inputs and the 2WN inputs of all data selectors in the data selector array are grounded, the W inputs are connected to node Q[0] of that cell, and the WN inputs are connected to node QN[0] of that cell. In the x-th compute-in-memory cell, where 1 ≤ x ≤ k-1, the 2W inputs of all data selectors are connected to node Q[x-1] of the (x-1)-th cell, the 2WN inputs are connected to node QN[x-1] of the (x-1)-th cell, the W inputs are connected to node Q[x] of the x-th cell, and the WN inputs are connected to node QN[x] of the x-th cell.
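The wiring described above can be modeled behaviorally in a few lines of Python. This is only a sketch under stated assumptions: the helper names `candidate_bits` and `to_int` are hypothetical, the latch nodes are modeled as ideal 0/1 values, and the 2WN input of cell 0 is tied to ground exactly as the text states.

```python
def candidate_bits(w: int, k: int):
    """Return the per-cell candidate inputs (W, WN, 2W, 2WN), each a list indexed by cell x."""
    q = [(w >> x) & 1 for x in range(k)]   # Q[x]: bit x of the stored multiplicand
    qn = [1 - b for b in q]                # QN[x]: complementary latch node
    cand_w = list(q)                       # W input of cell x  <- Q[x]
    cand_wn = list(qn)                     # WN input of cell x <- QN[x]
    cand_2w = [0] + q[:k - 1]              # 2W input of cell x <- Q[x-1]; cell 0 grounded
    cand_2wn = [0] + qn[:k - 1]            # 2WN input of cell x <- QN[x-1]; cell 0 grounded
    return cand_w, cand_wn, cand_2w, cand_2wn


def to_int(bits):
    """Pack a list of bits (index 0 = least significant) into an integer."""
    return sum(b << x for x, b in enumerate(bits))


if __name__ == "__main__":
    cw, cwn, c2w, c2wn = candidate_bits(0b00101101, 8)
    print(to_int(cw), to_int(cwn), to_int(c2w), to_int(c2wn))
```

The left shift costs no logic in this model: it is just cell x reading its neighbor's latch node instead of its own.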
The z data selectors of each compute-in-memory cell each have an independent set of control inputs, namely {NEG[0], NEGN[0], TWO[0], TWON[0]}, {NEG[1], NEGN[1], TWO[1], TWON[1]}, ..., {NEG[z-1], NEGN[z-1], TWO[z-1], TWON[z-1]}, where NEG[0], NEGN[0], TWO[0] and TWON[0] are the control signals of the 0th data selector, and so on.
Each data selector of a compute-in-memory cell has an independent output. The outputs of the z data selectors of the x-th cell are denoted {OUT_0[x], OUT_1[x], ..., OUT_{z-1}[x]}, where OUT_0[x] is the output signal of the 0th data selector of the x-th cell.
The partial product bits with the same index across the k cells form one complete partial product: {OUT_0[0], OUT_0[1], ..., OUT_0[k-1]} forms the 0th partial product, {OUT_1[0], OUT_1[1], ..., OUT_1[k-1]} forms the 1st partial product, and so on, until {OUT_{z-1}[0], OUT_{z-1}[1], ..., OUT_{z-1}[k-1]} forms the (z-1)-th partial product.
The in-memory partial product generator is capable of generating z partial products simultaneously in a single clock cycle.
Beneficial effects: (1) Differences from existing analog in-memory computing circuits: the invention is implemented in a fully digital manner. No digital-to-analog converter is needed to convert the input data into analog signals, and no analog-to-digital converter is needed to convert the operation result back into digital signals, which saves the area and power overhead of data converters. The proposed scheme is robust to interference, is not easily affected by process variation, voltage fluctuation or temperature change, and loses no computational precision.
(2) Differences from existing digital in-memory computing circuits: for the multiplication of two multi-bit numbers I × W, where I has n bits and W has m bits, existing all-digital schemes use an AND gate or a NOR gate to perform a single-bit multiplication in a single clock cycle; n clock cycles are needed to obtain the n partial products, which are then shifted and summed to obtain the final result. The invention implements the Radix-4 Booth algorithm in memory, and the degree of computational parallelism can be chosen according to the specific process and application requirements. The in-memory partial product generator obtains all partial products in at most n/2 clock cycles, so throughput improves by at least a factor of 2 compared with existing serial schemes. Booth encoding halves the total number of partial products, which effectively reduces energy consumption. The invention therefore increases computing speed and improves energy efficiency at the same time.
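The cycle counts in this comparison can be checked with a small Python sketch. It rests on two assumptions that go slightly beyond the text: that a generator with z selectors per cell emits z partial products per clock cycle (as stated elsewhere in the description), and that the cycle count is therefore the number of radix-4 partial products divided by z, rounded up. The function names are hypothetical.

```python
import math


def serial_cycles(n: int) -> int:
    """Cycles for a bit-serial scheme: one partial product per multiplier bit."""
    return n


def booth_cycles(n: int, z: int) -> int:
    """Cycles for radix-4 Booth with z partial products produced per cycle (assumed model)."""
    partial_products = math.ceil(n / 2)      # radix-4 recoding halves the count
    return math.ceil(partial_products / z)


if __name__ == "__main__":
    for z in (1, 2, 4):
        print(f"n=8, z={z}: serial={serial_cycles(8)} booth={booth_cycles(8, z)}")
```

With z = 1 the model reproduces the "at most n/2 clock cycles" bound; larger z trades cell area for fewer cycles.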
(3) Differences between the in-memory partial product generator and a conventional Booth multiplication circuit: the invention makes effective use of the complementary signals already present in the data latch unit and of the regular array structure of the memory, implementing the shift circuit and the multi-bit selector circuit at lower cost and obtaining the partial products with lower area and power consumption.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
FIG. 1 is an architecture diagram of the in-memory computing circuit of the present invention.
FIG. 2 is a circuit diagram of an in-memory Booth encoder.
FIG. 3 is a symbolic representation of the in-memory Booth encoder.
FIG. 4 is a schematic diagram of one row of the in-memory Booth encoder array.
FIG. 5 is a schematic diagram of the eight-row in-memory Booth encoder array.
FIG. 6 is a schematic diagram of a selector circuit.
FIG. 7 is a symbolic representation of the selector.
FIG. 8 is a schematic diagram of a compute-in-memory cell circuit.
FIG. 9 is a schematic diagram of an in-memory partial product generator circuit.
FIG. 10 is a symbolic representation of the in-memory partial product generator.
FIG. 11 is a schematic diagram of one row of the in-memory partial product generator array.
FIG. 12 is a schematic diagram of the eight-row in-memory partial product generator array.
Detailed Description
The invention provides an in-memory computing circuit for energy-efficient multiplication, which comprises a circuit structure that implements Radix-4 Booth multiplication in memory. The principle of the Booth algorithm is as follows. Let a multiplier I be multiplied by a multiplicand W. In two's-complement form, I is expressed as:
$$I = -I_{n-1}\cdot 2^{n-1} + \sum_{j=0}^{n-2} I_j\cdot 2^{j}$$
where n is the bit width of I and W, and I_j is the j-th bit of I. Appending a bit I_{-1} = 0 and expanding I further gives:
$$I = -I_{n-1}\cdot 2^{n-1} + I_{n-2}\cdot 2^{n-2} + I_{n-3}\cdot 2^{n-3} + \cdots + I_3\cdot 2^{3} + I_2\cdot 2^{2} + I_1\cdot 2^{1} + I_0\cdot 2^{0} + I_{-1}$$
$$= (-2I_{n-1} + I_{n-2} + I_{n-3})\cdot 2^{n-2} + (-2I_{n-3} + I_{n-4} + I_{n-5})\cdot 2^{n-4} + \cdots + (-2I_5 + I_4 + I_3)\cdot 2^{4} + (-2I_3 + I_2 + I_1)\cdot 2^{2} + (-2I_1 + I_0 + I_{-1})\cdot 2^{0}$$
w is represented as:
As the formula shows, the Radix-4 Booth transformation halves the number of partial products, which increases the speed of the multiplier and reduces its area and power consumption. The multiplier I can be recoded according to the above formula, as shown in Table 1; this recoding is called Radix-4 Booth encoding.
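The recoding above can be checked numerically. The following Python sketch computes the radix-4 digits (-2·I_{2j+1} + I_{2j} + I_{2j-1}) for an even bit width n and uses them to rebuild the product; the function names are hypothetical, and the appended bit I_{-1} is taken as 0, as is standard for Booth recoding.

```python
def booth_radix4_digits(i_val: int, n: int):
    """Recode an n-bit two's-complement multiplier (n even) into n/2 digits in {-2,...,2}."""
    assert n % 2 == 0, "this sketch assumes an even bit width"
    bits = [(i_val >> j) & 1 for j in range(n)]   # works for negative Python ints too

    def bit(j: int) -> int:
        return 0 if j < 0 else bits[j]            # I[-1] is the appended zero

    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1) for j in range(n // 2)]


def booth_multiply(i_val: int, w_val: int, n: int) -> int:
    """Sum the n/2 partial products digit * W * 4**j."""
    return sum(d * w_val * 4 ** j for j, d in enumerate(booth_radix4_digits(i_val, n)))


if __name__ == "__main__":
    print(booth_radix4_digits(-3, 8))   # digits for the 8-bit pattern 11111101
    print(booth_multiply(-3, 45, 8))
```

Each digit selects one of the multiples 0, ±W or ±2W, which is why only the four non-zero candidates need to be available in hardware.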
TABLE 1
According to Table 1, the invention provides an in-memory Booth encoder whose input is three consecutive bits of the multiplier, I_{2j+1}, I_{2j} and I_{2j-1}, and whose output is five signals, TWO, TWON, NEG, NEGN and ZERO; its truth table is shown in Table 2. TWON is the inverse of TWO, and NEGN is the inverse of NEG. Together, TWO, TWON, NEG and NEGN control the selector array in the in-memory partial product generator array to select the signal that represents the correct partial product. As shown in Table 2: if TWO is low and NEG is low, the multiplicand stored in the in-memory partial product generator array is selected, and the partial product is the multiplicand; if TWO is high and NEG is low, the stored multiplicand shifted left by one bit is selected, and the partial product is the multiplicand times 2; if TWO is high and NEG is high, the multiplicand inverted and shifted left by one bit is selected, and the partial product is the multiplicand times -2; if TWO is low and NEG is high, the inverted multiplicand is selected, and the partial product is the multiplicand times -1.
The ZERO signal has higher priority than the four signals TWO, TWON, NEG and NEGN: if ZERO is high, the partial product is zero, and the other four signals take effect only when ZERO is low.
TABLE 2
In Table 2, x denotes a don't-care value. If the ZERO signal is high, the peripheral circuitry of the in-memory computing circuit sets the partial product to zero.
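The logical expressions corresponding to Table 2 are:
$$\mathrm{NEG} = I_{2j+1}$$
$$\mathrm{TWON} = \overline{\mathrm{TWO}}, \qquad \mathrm{NEGN} = \overline{\mathrm{NEG}}$$
where $\overline{\mathrm{TWO}}$ denotes the inverse of TWO and $\overline{\mathrm{NEG}}$ denotes the inverse of NEG.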
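A behavioral Python sketch of the encoder is given below. Only NEG = I_{2j+1} and the complement relations for TWON and NEGN come from the text; the expressions used here for TWO (an XNOR of the two lower bits) and ZERO (all three bits equal) are assumptions chosen to be consistent with the selection behavior described for Table 2, and may differ from the circuit of FIG. 2. The function names are hypothetical.

```python
def booth_encode(i_hi: int, i_mid: int, i_lo: int) -> dict:
    """Encode bits (I[2j+1], I[2j], I[2j-1]) into the five control signals."""
    neg = i_hi                              # NEG = I[2j+1], as given in the text
    two = 1 - (i_mid ^ i_lo)                # ASSUMED: TWO = XNOR(I[2j], I[2j-1])
    zero = int(i_hi == i_mid == i_lo)       # ASSUMED: ZERO when all three bits are equal
    return {"TWO": two, "TWON": 1 - two, "NEG": neg, "NEGN": 1 - neg, "ZERO": zero}


def decoded_multiple(sig: dict) -> int:
    """Multiple of the multiplicand selected by the signals; ZERO has priority."""
    if sig["ZERO"]:
        return 0
    magnitude = 2 if sig["TWO"] else 1
    return -magnitude if sig["NEG"] else magnitude


if __name__ == "__main__":
    for bits in range(8):
        a, b, c = (bits >> 2) & 1, (bits >> 1) & 1, bits & 1
        print(a, b, c, decoded_multiple(booth_encode(a, b, c)))
```

Under these assumptions the decoded multiple equals -2·I_{2j+1} + I_{2j} + I_{2j-1} for all eight input combinations, matching the recoding formula.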
As shown in FIG. 1, the present invention provides a new in-memory multiplication circuit, which comprises an in-memory Booth encoder array, a row decoder and word line driving circuit, a column decoder and read-write driving circuit, and an in-memory computing array.
The in-memory computing array includes n in-memory partial product generator arrays; each in-memory partial product generator array includes m in-memory partial product generators, and each in-memory partial product generator includes k compute-in-memory cells. Here n is the number of rows of the in-memory computing array, k is the bit width of the multiplicand, and k×m is the number of columns of the in-memory computing array. The value of k is determined by the computational precision the application requires; the compute-in-memory circuit provided by the invention supports multiplication of any bit width.
Each compute-in-memory cell includes a data latch unit, one or more pairs of read-write ports, and a data selector array containing z data selectors.
Each data latch unit has two complementary data latch nodes Q[x] and QN[x], 0 ≤ x ≤ k-1, where x indicates that the data latch unit of the cell stores the x-th bit of the multiplicand. The k data latch units store the k-bit multiplicand from right to left, in order from the least significant bit to the most significant bit.
Each data selector includes four control signal inputs TWO, TWON, NEG and NEGN, four data inputs W, 2W, 2WN and WN, and one data output OUT. TWO, TWON, NEG and NEGN are generated by the in-memory Booth encoder; W, 2W, 2WN and WN are generated by the data latch units of the compute-in-memory cells.
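The selector's behavior can be sketched as a one-bit four-way multiplexer in Python. This is a behavioral model only: the AND-OR form, the function name `select_bit` and the argument order are assumptions, and the transistor-level circuit of FIG. 6 may be organized differently. The mapping of control combinations to data inputs follows the description of Table 2.

```python
def select_bit(two: int, twon: int, neg: int, negn: int,
               w: int, w2: int, w2n: int, wn: int) -> int:
    """Select one of the four candidate bits W, 2W, 2WN, WN for a single cell."""
    # The encoder supplies complementary pairs; reject inconsistent inputs.
    assert twon == 1 - two and negn == 1 - neg
    return ((twon & negn & w)      # TWO low,  NEG low  -> W
            | (two & negn & w2)    # TWO high, NEG low  -> 2W
            | (two & neg & w2n)    # TWO high, NEG high -> 2WN
            | (twon & neg & wn))   # TWO low,  NEG high -> WN


if __name__ == "__main__":
    # Candidate bits chosen so that each selection is distinguishable.
    print(select_bit(0, 1, 0, 1, w=1, w2=0, w2n=0, wn=0))
```

Because both polarities of each control signal are supplied, exactly one of the four product terms can be active at a time.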
The invention uses the complementary latch signals already present in the memory and cascades the k compute-in-memory cells one by one from right to left, so that a one-bit left shift is realized inside the memory with zero transistor overhead and all possible non-zero partial product candidate signals of the Radix-4 Booth algorithm shown in Tables 1 and 2 are generated. One multiplicand is stored in each in-memory partial product generator. The two complementary latch nodes of a data latch unit represent the multiplicand and its negated (inverted) form, and the two complementary latch nodes of the adjacent data latch unit represent twice the multiplicand and its negated form. Inversion and shifting are thus realized without extra transistor overhead, and all possible non-zero partial products of the Radix-4 Booth algorithm are generated. The specific implementation is as follows. In each in-memory partial product generator, the 2W inputs and the 2WN inputs of all data selectors in the data selector array of the 0th compute-in-memory cell are grounded, the W inputs are connected to node Q[0] of the 0th cell, and the WN inputs are connected to node QN[0] of the 0th cell. In the x-th compute-in-memory cell, where 1 ≤ x ≤ k-1, the 2W inputs of all data selectors are connected to node Q[x-1] of the (x-1)-th cell, the 2WN inputs are connected to node QN[x-1] of the (x-1)-th cell, the W inputs are connected to node Q[x] of the x-th cell, and the WN inputs are connected to node QN[x] of the x-th cell.
The z data selectors of each compute-in-memory cell each have an independent set of control inputs, namely {NEG[0], NEGN[0], TWO[0], TWON[0]}, {NEG[1], NEGN[1], TWO[1], TWON[1]}, ..., {NEG[z-1], NEGN[z-1], TWO[z-1], TWON[z-1]}, where NEG[0], NEGN[0], TWO[0] and TWON[0] are the control signals of the 0th data selector, and so on.
Each data selector of a compute-in-memory cell has an independent output. The outputs of the z data selectors of the x-th cell are denoted {OUT_0[x], OUT_1[x], ..., OUT_{z-1}[x]}, where 0 ≤ x ≤ k-1. OUT_0[x] is the output signal of the 0th data selector of the x-th cell, that is, the x-th bit of the 0th partial product; OUT_1[x] is the output signal of the 1st data selector of the x-th cell, that is, the x-th bit of the 1st partial product; and so on.
The partial product bits with the same index across the k cells form one complete partial product: {OUT_0[0], OUT_0[1], ..., OUT_0[k-1]} forms the 0th partial product, {OUT_1[0], OUT_1[1], ..., OUT_1[k-1]} forms the 1st partial product, and so on, until {OUT_{z-1}[0], OUT_{z-1}[1], ..., OUT_{z-1}[k-1]} forms the (z-1)-th partial product. That is, the in-memory partial product generator designed by the invention can generate z partial products simultaneously in a single clock cycle. The value of z is chosen according to the specific process and design targets: the larger z is, the fewer clock cycles one multiplication requires, but the larger the area of the compute-in-memory cell and the harder the routing.
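How one stored multiplicand yields z partial products at once can be sketched in Python as follows. The model is behavioral and makes several assumptions: each control set is reduced to a (TWO, NEG) pair because TWON and NEGN are their complements, ZERO is left to the peripheral circuitry as the description states, the shifted candidates of cell 0 are grounded as in the text, and the function name `partial_products` is hypothetical.

```python
def partial_products(w: int, k: int, controls):
    """Return one k-bit output word per (TWO, NEG) control pair for multiplicand w."""
    q = [(w >> x) & 1 for x in range(k)]    # Q[x] of each cell
    qn = [1 - b for b in q]                 # QN[x] of each cell
    outs = []
    for two, neg in controls:               # selector s of every cell shares control set s
        value = 0
        for x in range(k):                  # cell x contributes bit x of the partial product
            w2 = q[x - 1] if x > 0 else 0   # 2W input: neighbor's Q, grounded in cell 0
            w2n = qn[x - 1] if x > 0 else 0 # 2WN input: neighbor's QN, grounded in cell 0
            if two:
                bit = w2n if neg else w2
            else:
                bit = qn[x] if neg else q[x]
            value |= bit << x
        outs.append(value)
    return outs


if __name__ == "__main__":
    # z = 4 control sets selecting W, 2W, 2WN and WN for the stored value 45.
    print(partial_products(45, 8, [(0, 0), (1, 0), (1, 1), (0, 1)]))
```

All z output words are computed from the same latch contents, so in hardware they are available in the same cycle; the loop over control sets here only stands in for z parallel selectors.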
In one embodiment of the present invention, a compute-in-memory circuit based on static random-access memory (SRAM) is described, configured as follows:
1. The multiplication precision is 8 bits (that is, the multiplier I and the multiplicand W are both 8 bits wide), so one partial product generating circuit consists of 8 compute-in-memory cells;
2. Four partial products are generated simultaneously in one cycle, that is, each compute-in-memory cell circuit contains four selectors;
3. The in-memory partial product generator array of each row consists of 8 in-memory partial product generators, that is, each row stores 8 multiplicands;
4. There are 8 rows, giving a total storage capacity of 8 × 8 × 8 = 512 bits.
One embodiment of the in-memory Radix-4 Booth encoder is shown in FIG. 2. Unlike in conventional multiplier designs, the encoder is part of the memory and is placed in close proximity to the in-memory computing array. The symbolic representation of the Radix-4 Booth encoder is shown in FIG. 3.
As shown in FIG. 4, four copies of the above encoder form a Booth encoder array. The four encoders encode under the control of the same multiplier and produce four sets of control signals, {NEG[0], NEGN[0], TWO[0], TWON[0]}, ..., {NEG[3], NEGN[3], TWO[3], TWON[3]}. These four sets of control signals drive every in-memory partial product generator in one row of the in-memory partial product generator array so that each generator produces 4 partial products simultaneously; one row therefore produces 32 partial products simultaneously.
As shown in FIG. 5, the Booth encoder array is replicated 8 times. Each Booth encoder array is controlled by an independent multiplier and drives one of the 8 rows of in-memory partial product generator arrays of this embodiment, so that 256 partial products are generated simultaneously.
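The figures quoted for this embodiment follow from the configuration by simple arithmetic, as the short Python sketch below shows. The constant names are hypothetical; the values are those stated for the embodiment.

```python
K_BITS = 8               # bit width of each multiplicand (cells per generator)
GENERATORS_PER_ROW = 8   # multiplicands stored per row
ROWS = 8                 # rows in the in-memory computing array
SELECTORS_PER_CELL = 4   # partial products produced per generator per cycle

capacity_bits = ROWS * GENERATORS_PER_ROW * K_BITS          # total storage capacity
pp_per_row = GENERATORS_PER_ROW * SELECTORS_PER_CELL        # partial products per row per cycle
pp_total = ROWS * pp_per_row                                # partial products per cycle, all rows

if __name__ == "__main__":
    print(capacity_bits, pp_per_row, pp_total)
```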
One embodiment of the selector circuit is shown in FIG. 6; each selector outputs one partial product bit.
The symbolic representation of the selector of this embodiment is shown in FIG. 7.
An embodiment of the compute-in-memory cell is shown in FIG. 8. The cell comprises a data latch unit formed by cross-coupled inverters, with two complementary storage nodes Q and QN; a pair of read-write ports formed by NMOS transistors NM0 and NM1; and a selector array consisting of 4 selectors. BL and BLB are complementary bit lines running in the column direction, and WL is a word line running in the row direction.
As shown in FIG. 9, 8 compute-in-memory cells are cascaded from right to left to form an in-memory partial product generator. Under the control of the four sets of control signals {NEG[0], NEGN[0], TWO[0], TWON[0]}, ..., {NEG[3], NEGN[3], TWO[3], TWON[3]}, the in-memory partial product generator outputs OUT0<7:0>, OUT1<7:0>, ..., OUT3<7:0>.
The symbolic representation of the in-memory partial product generator is shown in FIG. 10.
As shown in FIG. 11, 8 in-memory partial product generators are replicated in the horizontal direction to form one row of the in-memory partial product generator array. OUT0<7:0>, OUT1<7:0>, ..., OUT3<7:0> constitute the partial product output signals of the 1st in-memory partial product generator; OUT0<15:8>, OUT1<15:8>, ..., OUT3<15:8> constitute the partial product output signals of the 2nd in-memory partial product generator; and so on, until OUT0<63:56>, OUT1<63:56>, ..., OUT3<63:56> constitute the partial product output signals of the 8th in-memory partial product generator.
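The bus indexing used above (generator 1 on bits <7:0>, generator 2 on <15:8>, up to generator 8 on <63:56>) can be expressed as a small Python helper. The function names are hypothetical, and the generators are counted from 1 as in the text.

```python
def generator_slice(g: int, k: int = 8):
    """Return (msb, lsb) of the row-bus slice driven by the g-th generator (g counted from 1)."""
    lsb = (g - 1) * k
    return lsb + k - 1, lsb


def extract(bus_value: int, g: int, k: int = 8) -> int:
    """Extract the k-bit partial product of the g-th generator from one OUTs row bus."""
    _, lsb = generator_slice(g, k)
    return (bus_value >> lsb) & ((1 << k) - 1)


if __name__ == "__main__":
    for g in (1, 2, 8):
        print(g, generator_slice(g))
```

The same slicing applies independently to each of the four buses OUT0 to OUT3.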
As shown in fig. 12, 8 in-memory partial product generator arrays are duplicated in the vertical direction, constituting 8-row in-memory partial product generator arrays.
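The "and so on" in the output naming can be spelled out as a small indexing rule: the generator at 0-based position g in a row owns bits 8g+7 down to 8g of each of the four output buses. The Python sketch below only enumerates these signal names for one row; the function names and constants are hypothetical labels chosen for this illustration.

```python
GENERATORS_PER_ROW = 8    # generators duplicated horizontally (FIG. 11)
BITS_PER_GENERATOR = 8    # integrated memory units per generator (FIG. 9)
SELECTORS_PER_UNIT = 4    # selectors per unit, one per partial product (FIG. 8)


def output_slice(generator_index: int) -> tuple[int, int]:
    """Return (high, low) bit positions owned by the generator at a 0-based index."""
    low = generator_index * BITS_PER_GENERATOR
    return low + BITS_PER_GENERATOR - 1, low


def row_signal_names() -> list[list[str]]:
    """List the four output bus slices of every generator in one row."""
    names = []
    for g in range(GENERATORS_PER_ROW):
        high, low = output_slice(g)
        names.append([f"OUT{j}<{high}:{low}>" for j in range(SELECTORS_PER_UNIT)])
    return names


row = row_signal_names()
```

Here `row[0]` lists the slices of the 1st generator and `row[7]` those of the 8th, so one row carries 8 × 4 = 32 partial products of 8 bits each.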
The present invention provides an in-memory computing circuit for implementing energy-efficient multiplication, and there are many methods and ways of implementing this technical solution. The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention. Components not explicitly described in this embodiment can be implemented using the prior art.

Claims (10)

1. An in-memory computing circuit for realizing efficient multiplication operation, characterized by comprising an in-memory Booth encoder array, a row decoder, a word line driving circuit, a column decoder, a read-write driving circuit and an in-memory computing array;
the in-memory computing circuit has three working modes, namely reading, writing and computing;
in the read and write modes, the row decoder selects a specific row of the memory for reading or writing; when the row decoder selects a row, the word line driving circuit generates the control signals needed to activate that row, and data is read from or written to that row; the column decoder selects a specific column within the row already selected by the row decoder: when the memory receives an address signal, the column decoder further decodes the address signal to determine the memory column to be accessed, ensuring that only data on the selected memory column is read or written; the read-write driving circuit controls the read and write operations: for a read operation, the read-write driving circuit amplifies the data in the selected memory cell and outputs it to the data bus; for a write operation, the read-write driving circuit writes data from the data bus into the selected memory cell;
in the computing mode, the in-memory Booth encoder array receives the multiplier signal and outputs Booth encoding signals; under the control of the Booth encoding signals, the in-memory computing array performs inversion and shift operations on the multiplicand stored in the memory and outputs all partial product signals generated by multiplying the multiplicand by the multiplier.
2. The in-memory computing circuit for implementing energy-efficient multiplication according to claim 1, wherein the in-memory computing array comprises n in-memory partial product generator arrays, each in-memory partial product generator array comprises m in-memory partial product generators, and each in-memory partial product generator comprises k integrated memory units; where n is the number of rows of the in-memory computing array, k is the bit width of the multiplicand, and k×m is the number of columns of the in-memory computing array.
3. The in-memory computing circuit for implementing energy-efficient multiplication according to claim 2, wherein each integrated memory unit comprises a data latch unit, one or more pairs of read-write ports, and a data selector array comprising z data selectors.
4. The in-memory computing circuit for implementing energy-efficient multiplication according to claim 3, wherein the i-th data latch unit has two complementary data latch nodes Q[i] and QN[i], where i denotes the position of the data latch unit, 0 ≤ i ≤ k-1, and the k data latch units store the k-bit multiplicand from right to left, in order from the low-order bit to the high-order bit.
5. The in-memory computing circuit for implementing energy-efficient multiplication according to claim 4, wherein the read-write ports perform read and write operations on the in-memory computing array under the control of the peripheral circuits.
6. The in-memory computing circuit for implementing energy-efficient multiplication according to claim 5, wherein each data selector comprises four control signal inputs TWO, TWON, NEG and NEGN, four data inputs 2W, W, 2WN and WN, and one data output OUT;
TWO, TWON, NEG and NEGN are generated by the in-memory Booth encoder array, and 2W, W, 2WN and WN are generated by the data latch units of the integrated memory units.
7. The in-memory computing circuit for implementing energy-efficient multiplication according to claim 6, wherein the complementary latch signals inherently present in the memory are utilized, and the k integrated memory units are cascaded one by one from right to left so that an in-memory one-bit left-shift circuit is implemented with zero transistor overhead, thereby generating all possible non-zero partial product candidate signals of the radix-4 Booth algorithm.
8. The in-memory computing circuit for implementing energy-efficient multiplication according to claim 7, wherein one multiplicand is stored in each in-memory partial product generator; the two complementary latch nodes of a data latch unit represent the multiplicand and the inverse of the multiplicand, and the two complementary latch nodes of the adjacent data latch unit represent twice the multiplicand and the inverse of twice the multiplicand; the inversion and shifting are performed without additional transistor overhead, yielding all possible non-zero partial products of the radix-4 Booth algorithm, specifically as follows: the 2W inputs of all data selectors in the data selector array of the 0th integrated memory unit are grounded, the 2WN inputs of all data selectors in the data selector array of the 0th integrated memory unit are grounded, the W inputs of all data selectors in the data selector array of the 0th integrated memory unit are connected to the Q[0] node of the 0th integrated memory unit, and the WN inputs of all data selectors in the data selector array of the 0th integrated memory unit are connected to the QN[0] node of the 0th integrated memory unit; the 2W inputs of all data selectors in the data selector array of the x-th integrated memory unit are connected to the Q[x-1] node of the (x-1)-th integrated memory unit, the 2WN inputs of all data selectors in the data selector array of the x-th integrated memory unit are connected to the QN[x-1] node of the (x-1)-th integrated memory unit, the W inputs of all data selectors in the data selector array of the x-th integrated memory unit are connected to the Q[x] node of the x-th integrated memory unit, and the WN inputs of all data selectors in the data selector array of the x-th integrated memory unit are connected to the QN[x] node of the x-th integrated memory unit, where 1 ≤ x ≤ k-1.
9. The in-memory computing circuit for implementing energy-efficient multiplication according to claim 8, wherein each of the z data selectors of each integrated memory unit has its own set of independent control inputs, {NEG[0], NEGN[0], TWO[0], TWON[0]}, {NEG[1], NEGN[1], TWO[1], TWON[1]}, ..., {NEG[z-1], NEGN[z-1], TWO[z-1], TWON[z-1]}; wherein NEG[0], NEGN[0], TWO[0] and TWON[0] are the control signals of the 0th data selector;
each data selector of each integrated memory unit has an independent output, and the outputs of the z data selectors of the x-th integrated memory unit are denoted {OUT_0[x], OUT_1[x], ..., OUT_{z-1}[x]}, where OUT_0[x] is the output signal of the 0th data selector of the x-th integrated memory unit.
10. The in-memory computing circuit for implementing energy-efficient multiplication according to claim 9, wherein the partial product bits with the same sequence number from the k integrated memory units form one complete partial product: {OUT_0[0], OUT_0[1], ..., OUT_0[k-1]} form the 0th partial product, {OUT_1[0], OUT_1[1], ..., OUT_1[k-1]} form the 1st partial product, and so on, until {OUT_{z-1}[0], OUT_{z-1}[1], ..., OUT_{z-1}[k-1]} form the (z-1)-th partial product;
the in-memory partial product generator is capable of generating z partial products simultaneously in a single clock cycle.
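The wiring recited in claims 6 to 10 can be checked with a small behavioural model. The Python sketch below is a simplified illustration under stated assumptions, not the claimed circuit: the function names are hypothetical, and each data selector is assumed to behave as a 4-to-1 multiplexer that picks W, 2W, WN or 2WN from the NEG and TWO signals (NEGN and TWON are taken to be their complements). The candidate wiring, including the grounded 2W and 2WN inputs of the 0th unit, follows claim 8.

```python
def candidate_inputs(multiplicand: int, k: int = 8) -> list[dict[str, int]]:
    """Per-unit selector data inputs, wired as recited in claim 8."""
    q = [(multiplicand >> i) & 1 for i in range(k)]   # latch nodes Q[i]
    qn = [1 - bit for bit in q]                        # complementary nodes QN[i]
    units = []
    for x in range(k):
        units.append({
            "W": q[x],
            "WN": qn[x],
            "2W": q[x - 1] if x > 0 else 0,    # grounded on unit 0
            "2WN": qn[x - 1] if x > 0 else 0,  # grounded on unit 0
        })
    return units


def select(inputs: dict[str, int], neg: int, two: int) -> int:
    """Assumed selector behaviour: a 4-to-1 multiplexer driven by NEG and TWO."""
    name = ("2" if two else "") + ("WN" if neg else "W")
    return inputs[name]


def partial_product(multiplicand: int, neg: int, two: int, k: int = 8) -> int:
    """Collect the k selector output bits OUT[0..k-1] into one k-bit word."""
    units = candidate_inputs(multiplicand, k)
    return sum(select(units[x], neg, two) << x for x in range(k))


def generator_outputs(multiplicand: int, controls: list[tuple[int, int]], k: int = 8) -> list[int]:
    """z partial products of one generator, one per (NEG, TWO) control pair."""
    return [partial_product(multiplicand, neg, two, k) for neg, two in controls]


outputs = generator_outputs(0b00101101, [(0, 0), (1, 0), (0, 1), (1, 1)])
```

In this model the 2WN candidate equals the inverted multiplicand shifted left by one bit, so its lowest bit is 0. The model reproduces only the k output bits of each partial product; sign extension and the correction terms needed for negative digits are outside its scope.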

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311462442.3A CN117521734A (en) 2023-11-06 2023-11-06 In-memory computing circuit for realizing efficient multiplication operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311462442.3A CN117521734A (en) 2023-11-06 2023-11-06 In-memory computing circuit for realizing efficient multiplication operation

Publications (1)

Publication Number Publication Date
CN117521734A (en) 2024-02-06

Family

ID=89759889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311462442.3A Pending CN117521734A (en) 2023-11-06 2023-11-06 In-memory computing circuit for realizing efficient multiplication operation

Country Status (1)

Country Link
CN (1) CN117521734A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination