CN116932456A

CN116932456A - Circuit, in-memory computing circuit and operation method

Info

Publication number: CN116932456A
Application number: CN202310613417.4A
Authority: CN
Inventors: 李嘉富; 吕承翰; 池育德; 张琮永; 陈炎辉; 李承恩; 赵威丞; 森阳纪; 藤原英弘
Original assignee: Taiwan Semiconductor Manufacturing Co TSMC Ltd
Current assignee: Taiwan Semiconductor Manufacturing Co TSMC Ltd
Priority date: 2022-06-28
Filing date: 2023-05-29
Publication date: 2023-10-24

Abstract

An embodiment of the present invention provides a circuit comprising: a multiplier circuit that receives the signed mantissa of each of a plurality of input data elements and a plurality of weight data elements and generates two's complement products by multiplying and reformatting some or all of the signed mantissas of the input data elements and some or all of the signed mantissas of the weight data elements; a summing circuit that receives the exponent of each of the plurality of input data elements and the plurality of weight data elements and generates sums by adding the exponent of each input data element to the exponent of each weight data element; a shift circuit that shifts each product by an amount equal to a difference between the corresponding sum and the maximum sum; and an adder tree that generates a mantissa sum from the shifted products. Embodiments of the invention also provide an in-memory computing circuit and a method of operating the same.

Description

Circuit, in-memory computing circuit and operation method

Technical Field

Embodiments of the present invention relate generally to the field of electronic circuits and, more particularly, to circuits, in-memory computing circuits, and methods of operation.

Background

Memory arrays are commonly used to store and access data for various types of computation, such as logical or mathematical operations. To perform these operations, data bits are moved between the memory array and the circuitry used to perform the calculations. In compute-in-memory (CIM) circuits, a memory array is combined with a computing circuit. In some cases, the computation includes multiple layers of operations, and the result of a first operation is used as input data for a second operation.

Disclosure of Invention

An embodiment of the present invention provides a circuit comprising: a multiplier circuit configured to receive a signed mantissa of each data element of a plurality of input data elements and a plurality of weight data elements, and to generate a plurality of two's complement products by multiplying and reformatting some or all of the signed mantissas of the plurality of input data elements and some or all of the signed mantissas of the plurality of weight data elements; a summing circuit configured to receive an exponent of each data element of the plurality of input data elements and the plurality of weight data elements, and to generate a plurality of sums by adding each exponent of the plurality of input data elements to each exponent of the plurality of weight data elements; a shift circuit configured to shift each product of the plurality of products by an amount equal to a difference between a respective sum of the plurality of sums and a maximum sum; and an adder tree configured to generate a mantissa sum from the shifted plurality of products.

Another embodiment of the present invention provides a method of operating a circuit, the method comprising: receiving a signed mantissa and an exponent for each data element of a plurality of input data elements and a plurality of weight data elements; generating a plurality of two's complement products by multiplying and reformatting some or all of the signed mantissas of the plurality of input data elements and some or all of the signed mantissas of the plurality of weight data elements; generating a plurality of sums by adding each exponent of the plurality of input data elements to each exponent of the plurality of weight data elements; shifting each product of the plurality of products by an amount equal to a difference between a respective sum of the plurality of sums and a maximum sum; and adding the shifted plurality of products to generate a mantissa sum.

Yet another embodiment of the present invention provides an in-memory computing circuit, comprising: a memory array configured to store a plurality of input data elements and a plurality of weight data elements; a multiply-accumulate unit configured to generate a sequence of partial sums from the plurality of input data elements and the plurality of weight data elements; an adder configured to generate a sequence of accumulated sums by adding each partial sum of the sequence of partial sums to a stored accumulated sum; and a buffer configured to: store each accumulated sum of the sequence of accumulated sums as the stored accumulated sum; output each stored accumulated sum to the adder; and output a final stored accumulated sum from the in-memory computing circuit.

Drawings

The various aspects of the invention are best understood from the following detailed description when read in connection with the accompanying drawings. It should be noted that the various components are not drawn to scale according to standard practice in the industry. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.

Fig. 1A and 1B are schematic diagrams of data calculation circuits according to some embodiments.

Fig. 2 is a schematic diagram of a shift circuit according to some embodiments.

Fig. 3 is a schematic diagram of data elements of a shift circuit according to some embodiments.

Fig. 4 is a schematic diagram of a shift circuit according to some embodiments.

Fig. 5 is a schematic diagram of data elements of a shift circuit in accordance with some embodiments.

Fig. 6 is a schematic diagram of data elements of a shift circuit in accordance with some embodiments.

Fig. 7 is a flow chart of a method of operating a circuit according to some embodiments.

FIG. 8 is a schematic diagram of circuitry for in-memory computing, according to some embodiments.

FIG. 9 is a flow chart of a method of operating a memory circuit according to some embodiments.

Detailed Description

The invention provides many different embodiments, or examples, for implementing different features of the disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to limit the invention. For example, in the following description, forming a first component over or on a second component may include embodiments in which the first component and the second component are formed in direct contact, and may also include embodiments in which additional components are formed between the first component and the second component, such that the first component and the second component are not in direct contact. Furthermore, the present invention may repeat reference numerals and/or characters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.

Moreover, spatially relative terms such as "beneath," "below," "lower," "above," "upper," and the like may be used herein for ease of description to describe one element or component's relationship to another element(s) or component(s) as illustrated. Spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. The device may be otherwise oriented (rotated 90 degrees or at other orientations), and the spatially relative descriptors used herein may be interpreted accordingly.

In various embodiments, a data computation circuit, such as a compute-in-memory (CIM) circuit, is configured to perform data computation by separating the sign and mantissa bits from the exponent bits of input and weight data elements. The multiplier circuit multiplies and reformats the sign and mantissa bits of the input and weight data elements to produce two's complement products, and the summing circuit adds the exponents of the input data elements to the exponents of the weight data elements to produce sums. The shift circuit shifts each product according to the difference between the corresponding sum and the maximum sum, and the adder tree adds the shifted products together to produce a partial sum. The data computation circuit is thus configured to perform floating-point partial sum calculations using less time, area, and power than other approaches, and without the quantization errors present in some approaches in which floating-point data is converted to fixed-point data. In some embodiments, the shift circuit employs a two-stage configuration to further reduce area and simplify routing requirements.

In some embodiments, the CIM circuit includes a partial sum buffer configured to accumulate a plurality of partial sum data elements prior to output (e.g., to an external memory array), thereby reducing access power and time requirements as compared to methods of outputting a single partial sum data element.

Each of fig. 1A and 1B is a schematic diagram of a data calculation circuit 100 according to some embodiments. In the embodiment depicted in fig. 1A, data computation circuit 100, also referred to in some embodiments as circuit 100 or memory circuit 100, includes a memory array 110, a multiplier circuit 120A, a summing circuit 130, a difference circuit 140, a shift circuit 150, an adder tree 160, and converters 170 and 180. In the embodiment depicted in FIG. 1B and discussed below, circuit 100 includes multiplier circuit 120B instead of multiplier circuit 120A, and is otherwise configured in accordance with the embodiment discussed below with respect to FIG. 1A.

In the embodiment depicted in FIG. 1A, memory array 110 is connected to each of multiplier circuit 120A and summing circuit 130; the summing circuit 130 is also connected to a difference circuit 140; the difference circuit 140 is also connected to each of the multiplier circuit 120A, the shift circuit 150, and the converter 180; the shift circuit 150 is also connected to the adder tree 160; adder tree 160 is also coupled to converter 170; and the converter 170 is also connected to the converter 180.

Two or more circuit elements are considered to be connected based on a direct electrical connection or on an electrical connection that includes one or more additional circuit elements and can thereby be controlled, for example, made resistive or open by one or more transistors or other switching devices.

The embodiment depicted in FIG. 1A is a non-limiting example provided for illustration purposes. In some embodiments, circuit 100 has a configuration that differs from that depicted in FIG. 1A and is otherwise capable of realizing the benefits discussed below. In some embodiments, circuit 100 does not include memory array 110, and each of multiplier circuit 120A and summing circuit 130 is connected to a circuit external to circuit 100, e.g., a memory array. In some embodiments, circuit 100 does not include one or both of converters 170 or 180. In some embodiments, difference circuit 140 is not connected to multiplier circuit 120A.

In some embodiments, circuit 100 includes circuit elements other than those depicted in FIG. 1A and discussed below, e.g., control circuits or additional instances of the depicted circuit elements. In some embodiments, circuit 100 includes adder 830 and buffer 840, which are configured as discussed below with respect to FIG. 8.

In some embodiments, the elements depicted in fig. 1A are part of a memory circuit that includes multiple instances of memory array 110, multiplier circuit 120A, summing circuit 130, difference circuit 140, shift circuit 150, adder tree 160, and/or converters 170 and/or 180.

In some embodiments, circuit 100 is included in a CIM circuit that includes elements configured to perform in-memory computations, such as those of a convolutional neural network (CNN), in which an array, such as memory array 110, stores weight data elements, such as the plurality of data elements WtDE, that are applied to one or more sets of input data elements, such as the plurality of data elements InDE, in multiply-and-accumulate (MAC) operations.

A memory array, such as memory array 110 or memory array 810 discussed below with respect to fig. 8, is a memory device that includes a plurality of memory elements, each including an electrical, electromechanical, electromagnetic or other device configured to store one or more data elements, each data element including one or more data bits represented by a logic state. In some embodiments, the logic state corresponds to a voltage level of charge stored in some or all of the storage elements. In some embodiments, the logic state corresponds to a physical property of some or all of the storage element, such as resistance or magnetic orientation.

In some embodiments, the storage element includes one or more static random access memory (SRAM) cells. In various embodiments, an SRAM cell, such as a five-transistor (5T), six-transistor (6T), eight-transistor (8T), or nine-transistor (9T) SRAM cell, includes from two to twelve transistors. In some embodiments, the SRAM cell comprises a multi-rail SRAM cell. In some embodiments, the SRAM cell has a length at least two times greater than its width.

In some embodiments, the memory elements include one or more Dynamic Random Access Memory (DRAM) cells, resistive Random Access Memory (RRAM) cells, magnetoresistive Random Access Memory (MRAM) cells, ferroelectric random access memory (FeRAM) cells, NOR flash cells, NAND flash cells, conductive Bridging Random Access Memory (CBRAM) cells, data registers, non-volatile memory (NVM) cells, 3D NVM cells, or other memory cell types capable of storing bit data.

The memory array 110 is configured to store data elements InDE, also referred to as input data elements InDE, and data elements WtDE, also referred to as weight data elements WtDE. In some embodiments in which circuit 100 is included in a CIM circuit, input data elements InDE and weight data elements WtDE correspond to respective input and weight data of one or more matrix computations.

In some embodiments, the plurality of input data elements InDE is one plurality of multiple pluralities of input data elements, and memory array 110 is configured to store each plurality of the multiple pluralities of input data elements. In some embodiments, the plurality of weight data elements WtDE is one plurality of multiple pluralities of weight data elements, and memory array 110 is configured to store each plurality of the multiple pluralities of weight data elements.

As the number of each data element and the number of bits per data element stored in memory array 110 increases, the complexity and power consumption of the circuit increases along with functional capabilities (e.g., increased weight data resolution).

In the embodiment depicted in FIG. 1A, memory array 110 includes four of each of data elements InDE and WtDE. In some embodiments, memory array 110 includes total numbers of data elements InDE and WtDE that are the same as or different from each other. In some embodiments, memory array 110 includes fewer or greater than four data elements of one or both of data elements InDE and WtDE. In some embodiments, memory array 110 includes a number of data elements of one or both of data elements InDE and WtDE ranging from 2 to 1024. In some embodiments, memory array 110 includes a number of data elements of one or both of data elements InDE and WtDE ranging from 2 to 128.

In the embodiment depicted in FIG. 1A, memory array 110 is configured to store each of data elements InDE and WtDE having a bit number equal to 16. In some embodiments, memory array 110 is configured to store each of data elements InDE and WtDE ranging in number of bits from 4 to 256. In some embodiments, memory array 110 is configured to store each of data elements InDE and WtDE ranging in number of bits from 8 to 64. In some embodiments, memory array 110 is configured to store each of data elements InDE and WtDE having a number of bits equal to 16.

In the embodiment depicted in fig. 1A, each of the data elements InDE and WtDE has a binary floating point computer digital format including a sign bit, a plurality of exponent bits, and a plurality of mantissa bits (also referred to as a plurality of fraction bits in some embodiments).

In some embodiments, each of the data elements InDE and WtDE has a BF16 format, also referred to in some embodiments as a bfloat format or brain floating-point format, in which the first bit represents the sign of the floating-point number, the next eight bits represent the exponent of the floating-point number, and the last seven bits represent the mantissa or fraction of the floating-point number. Because the mantissa is configured to start with a non-zero value, the last seven bits of each stored data element represent an eight-bit mantissa whose most significant bit (MSB) is equal to 1.

In some embodiments, each of the data elements InDE and WtDE has an FP16 format, also referred to as a half-precision format in some embodiments, in which the first bit represents the sign of the floating-point number, the next five bits represent the exponent of the floating-point number, and the last ten bits represent the mantissa or fraction of the floating-point number. In this case, the last ten bits of each stored data element represent an 11-bit mantissa whose MSB is equal to 1.
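The two bit layouts described above can be illustrated with a short Python sketch. It is only an illustration of the field boundaries, assuming normalized values (hidden MSB equal to 1); the names `Fields`, `split_fields`, `split_bf16`, and `split_fp16` are hypothetical and do not appear in the disclosure.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    sign: int      # 1 bit
    exponent: int  # exponent field exactly as stored
    mantissa: int  # stored fraction with the hidden MSB prepended


def split_fields(word: int, exp_bits: int, frac_bits: int) -> Fields:
    """Split a floating-point word into sign, exponent, and mantissa.

    Assumes a normalized value, so the hidden MSB is taken to be 1.
    """
    frac = word & ((1 << frac_bits) - 1)
    exponent = (word >> frac_bits) & ((1 << exp_bits) - 1)
    sign = (word >> (frac_bits + exp_bits)) & 1
    mantissa = (1 << frac_bits) | frac  # prepend the hidden bit
    return Fields(sign, exponent, mantissa)


def split_bf16(word: int) -> Fields:
    # BF16: 1 sign bit, 8 exponent bits, 7 stored mantissa bits
    return split_fields(word, exp_bits=8, frac_bits=7)


def split_fp16(word: int) -> Fields:
    # FP16: 1 sign bit, 5 exponent bits, 10 stored mantissa bits
    return split_fields(word, exp_bits=5, frac_bits=10)


# 0xC0A0 encodes -5.0 in BF16; 0xC500 encodes -5.0 in FP16.
bf = split_bf16(0xC0A0)
fp = split_fp16(0xC500)
```

With the hidden bit included, the BF16 mantissa occupies 8 bits and the FP16 mantissa occupies 11 bits, matching the widths stated above.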

In some embodiments, each of the data elements InDE and WtDE has a floating point format other than BF16 or FP16 formats, for example, another 16-bit format, 32-bit, 64-bit, 128-bit or 256-bit format, or a 40-bit or 80-bit extended precision format.

In some embodiments, the sign and mantissa of a data element representing a floating point number are collectively referred to as the signed mantissa of the floating point number. In some embodiments, the MSB of the mantissa is referred to as a hidden bit or a hidden MSB.

The memory array 110 includes one or more I/O connections (not shown) through which logic states are programmed in write operations and accessed in read operations. The memory array 110 is configured to: in a read operation, part or all of each of data elements InDE and WtDE are output to each of multiplier circuit 120A and summing circuit 130 on one or more data buses (not shown). In some embodiments, memory array 110 is configured to output all of each of data elements InDE and WtDE to each of multiplier circuit 120A and summing circuit 130. In some embodiments, memory array 110 is configured to output only the signed mantissa of each data element to multiplier circuit 120A and the exponent of each data element to summing circuit 130.

Multiplier circuit 120A is an electronic circuit, such as an integrated circuit (IC), configured to receive, for example from memory array 110, the sign bit InS and mantissa InM (collectively, signed mantissa InS/InM) of each data element of data elements InDE, and the sign bit WtS and mantissa WtM (collectively, signed mantissa WtS/WtM) of each data element of data elements WtDE. Summing circuit 130 is an electronic circuit, such as an IC, configured to receive, for example from memory array 110, the exponent InE of each data element of data elements InDE and the exponent WtE of each data element of data elements WtDE.

Multiplier circuit 120A includes one or more data registers (not shown) configured to receive instances of signed mantissas InS/InM and WtS/WtM. In the embodiment depicted in FIG. 1A, multiplier circuit 120A is configured to receive instances of signed mantissas InS/InM and WtS/WtM corresponding to data elements InDE and WtDE, each totaling four data elements. In some embodiments, multiplier circuit 120A is configured to receive instances of signed mantissas InS/InM and WtS/WtM corresponding to one or both of data elements InDE and WtDE totaling fewer or greater than four data elements.

In some embodiments, multiplier circuit 120A includes one or more data registers configured to receive signed mantissa InS/InM and/or WtS/WtM instances including the hidden MSB. In some embodiments, multiplier circuit 120A includes one or more data registers configured to add the hidden MSB to the received signed mantissa InS/InM and/or WtS/WtM instance.

Multiplier circuit 120A includes logic circuitry (not shown) configured to, in operation, reformat each instance of signed mantissa InS/InM into a two's complement mantissa InTC, also referred to as reformatted mantissa InTC, and reformat each instance of signed mantissa WtS/WtM into a two's complement mantissa WtTC, also referred to as reformatted mantissa WtTC. The reformatted mantissa InTC has the same number of bits as the signed mantissa InS/InM, and the reformatted mantissa WtTC has the same number of bits as the signed mantissa WtS/WtM.

Multiplier circuit 120A includes one or more logic gates M1 configured to, in operation, multiply some or all instances of the reformatted mantissa InTC with some or all instances of the reformatted mantissa WtTC, thereby producing N+1 products P[0]-P[N]. In various embodiments, the one or more logic gates M1 include one or more AND or NOR gates or other circuitry suitable for performing some or all of a multiplication operation. The one or more logic gates M1 are configured to generate, in operation, each of products P[0]-P[N] as a two's complement data element having a number of bits equal to twice the number of bits of each of the reformatted mantissas InTC and WtTC, minus 1.

Multiplier circuit 120A is configured to generate, in operation, a number N+1 of products P[0]-P[N] equal to the number of data elements InDE multiplied by the number of data elements WtDE. In the embodiment depicted in FIG. 1A, multiplier circuit 120A is configured to generate a number N+1 of products P[0]-P[N] equal to 16. In some embodiments, multiplier circuit 120A is configured to generate a number N+1 of products P[0]-P[N] less than or greater than 16. In some embodiments, multiplier circuit 120A is configured to generate the number N+1 of products P[0]-P[N] conditioned on a first difference threshold, as discussed below with respect to difference circuit 140 and shift circuit 150.

In some embodiments, for example, those in which the data elements InDE and WtDE have the BF16 format, multiplier circuit 120A is configured to generate each of products P[0]-P[N] having a total of 17 bits based on each of signed mantissas InS/InM and WtS/WtM and reformatted mantissas InTC and WtTC having 9 bits. In some embodiments, for example, those in which the data elements InDE and WtDE have the FP16 format, multiplier circuit 120A is configured to generate each of products P[0]-P[N] having a total of 23 bits based on each of signed mantissas InS/InM and WtS/WtM and reformatted mantissas InTC and WtTC having 12 bits. Embodiments in which multiplier circuit 120A is configured to generate products P[0]-P[N] having other total numbers of bits based on each of signed mantissas InS/InM and WtS/WtM and reformatted mantissas InTC and WtTC having other total numbers of bits are within the scope of the present disclosure.
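The reformatting and multiplication steps described for multiplier circuit 120A can be sketched in Python as follows. This is a behavioral illustration only, not a description of logic gates M1; the function names are hypothetical, and the operand values are arbitrary examples. It shows why a 9-bit signed mantissa pair yields a product that fits in 2 × 9 − 1 = 17 bits.

```python
def to_twos_complement(sign: int, mantissa: int, width: int) -> int:
    """Encode a sign-magnitude mantissa as a `width`-bit two's complement pattern."""
    value = -mantissa if sign else mantissa
    return value & ((1 << width) - 1)


def from_twos_complement(pattern: int, width: int) -> int:
    """Decode a `width`-bit two's complement pattern into a signed integer."""
    if pattern & (1 << (width - 1)):
        return pattern - (1 << width)
    return pattern


def multiply(in_tc: int, wt_tc: int, width: int) -> int:
    """Multiply two `width`-bit two's complement operands.

    The result is returned as a (2 * width - 1)-bit two's complement pattern.
    """
    product = from_twos_complement(in_tc, width) * from_twos_complement(wt_tc, width)
    return product & ((1 << (2 * width - 1)) - 1)


BF16_WIDTH = 9  # 1 sign bit + 8 mantissa bits (hidden MSB included)

in_tc = to_twos_complement(sign=1, mantissa=0b10100000, width=BF16_WIDTH)  # -160
wt_tc = to_twos_complement(sign=0, mantissa=0b11000000, width=BF16_WIDTH)  # +192
p = multiply(in_tc, wt_tc, BF16_WIDTH)  # 17-bit pattern representing -30720
```

Because an 8-bit mantissa magnitude is at most 255, the product magnitude is at most 65025, which is below 2^16 and therefore representable in 17-bit two's complement; the same argument gives 23 bits for the 12-bit FP16 case.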

Multiplier circuit 120A is thus configured to, in operation, multiply and reformat the sign and mantissa bits of input data elements InDE and weight data elements WtDE to produce the two's complement products P[0]-P[N].

Multiplier circuit 120A is configured to output products P[0]-P[N] to shift circuit 150 over a data bus (not labeled), as discussed further below.

Summing circuit 130 includes one or more data registers (not labeled) configured to receive instances of exponents InE and WtE corresponding to the numbers of data elements InDE and WtDE discussed above with respect to multiplier circuit 120A.

Summing circuit 130 includes one or more logic gates A1 configured to, in operation, add each instance of exponent InE to each instance of exponent WtE. In various embodiments, the one or more logic gates A1 include one or more full adder gates, half adder gates, ripple-carry adder circuits, carry-save adder circuits, carry-select adder circuits, carry-lookahead adder circuits, or other circuits suitable for performing some or all of an addition operation. The one or more logic gates A1 are configured to generate sums S[0]-S[N] as data elements having a total number of bits equal to the number of bits of each of exponents InE and WtE plus 1.

Summing circuit 130 is configured to generate, in operation, sums S[0]-S[N] having a total number N+1 and an ordering of data elements corresponding to those of products P[0]-P[N] discussed above with respect to multiplier circuits 120A and 120B. Accordingly, for a total of N+1 combinations of data elements InDE and WtDE, each nth combination corresponds to both the nth sum S[n] of sums S[0]-S[N] and the nth product P[n] of products P[0]-P[N].

In some embodiments, for example, those in which the data elements InDE and WtDE have the BF16 format, summing circuit 130 is configured to generate each of sums S[0]-S[N] having a total of 9 bits based on each of exponents InE and WtE having 8 bits. In some embodiments, for example, those in which the data elements InDE and WtDE have the FP16 format, summing circuit 130 is configured to generate each of sums S[0]-S[N] having a total of 6 bits based on each of exponents InE and WtE having 5 bits. Embodiments in which summing circuit 130 is configured to generate each of sums S[0]-S[N] having other total numbers of bits based on each of exponents InE and WtE having other total numbers of bits are within the scope of the present disclosure.
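As a behavioral illustration of summing circuit 130, the following Python sketch adds every input exponent to every weight exponent. The function name `exponent_sums`, the input-major ordering of the results, and the example exponent values are assumptions made for illustration; the disclosure only requires that the ordering of the sums match that of the products.

```python
from itertools import product


def exponent_sums(in_exps, wt_exps, exp_bits):
    """Add every input exponent to every weight exponent.

    Returns the sums S[0]..S[N] in input-major order (an assumed ordering)
    together with the bit width needed to hold any sum.
    """
    sum_bits = exp_bits + 1  # one extra bit absorbs the carry
    sums = [e_in + e_wt for e_in, e_wt in product(in_exps, wt_exps)]
    if any(s >= (1 << sum_bits) for s in sums):
        raise ValueError("an exponent does not fit in exp_bits")
    return sums, sum_bits


# Four BF16 input exponents and four BF16 weight exponents (8-bit stored fields).
in_e = [129, 127, 130, 120]
wt_e = [126, 128, 127, 131]
s, width = exponent_sums(in_e, wt_e, exp_bits=8)
```

Four input exponents and four weight exponents give 16 sums, and the largest possible sum of two 8-bit fields, 255 + 255 = 510, fits in the 9 bits stated above.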

Summing circuit 130 is configured to output sums S[0]-S[N] to difference circuit 140 over a data bus (not labeled).

The difference circuit 140 is an electronic circuit, such as an IC, comprising one or more logic gates L1 and one or more logic gates B1, each configured to receive the sums S[0]-S[N] from the summing circuit 130.

The one or more logic gates L1 are configured to generate, in operation, a maximum sum MaxExp as a data element having a value equal to the maximum value of sums S[0]-S[N] and a number of bits equal to that of each of sums S[0]-S[N]. The one or more logic gates L1 are configured to output the maximum sum MaxExp to the one or more logic gates B1 and to converter 180, as described below.

The one or more logic gates B1 are configured to, in operation, generate differences D[0]-D[N] by subtracting each data element of the sums S[0]-S[N] from the maximum sum MaxExp. Thus, the differences D[0]-D[N] have a total number N+1 and an ordering of data elements corresponding to the total number and ordering of the sums S[0]-S[N] and products P[0]-P[N] discussed above.
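The roles of logic gates L1 and B1 can be summarized by a minimal Python sketch. The function name and the example sums are hypothetical; the sketch only mirrors the arithmetic described above and says nothing about how the gates are implemented.

```python
def max_and_differences(sums):
    """Return (MaxExp, [D[0], ..., D[N]]) for a list of exponent sums."""
    max_exp = max(sums)                  # role of logic gates L1
    diffs = [max_exp - s for s in sums]  # role of logic gates B1
    return max_exp, diffs


example_sums = [255, 257, 256, 261]
max_exp, d = max_and_differences(example_sums)
```

Every difference is non-negative, and the combination that produced the maximum sum has a difference of zero, so its product is not shifted.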

In the embodiment depicted in FIG. 1A, the one or more logic gates B1 are configured to output differences D[0]-D[N] to each of multiplier circuit 120A and shift circuit 150 over a data bus (not labeled).

In the embodiment depicted in FIG. 1A, multiplier circuit 120A is configured to, in operation, receive differences D[0]-D[N], compare each instance D[n] of differences D[0]-D[N] to a first difference threshold (not shown in FIG. 1A), and generate the corresponding instance P[n] of products P[0]-P[N] by performing a multiplication operation only if the difference D[n] is less than the first difference threshold.

For a given instance D[n] of differences D[0]-D[N], a difference D[n] less than the first difference threshold indicates that the corresponding instance S[n] of sums S[0]-S[N] has a value greater than the maximum sum MaxExp minus the first difference threshold. As discussed below with respect to shift circuit 150, only those instances P[n] of products P[0]-P[N] that correspond to such instances S[n] can affect the subsequent summation operation performed by adder tree 160. Thus, by performing the multiplication operation only when the difference is less than the first difference threshold, less power is consumed than in embodiments in which the multiplication operation is always performed.
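The conditional multiplication can be sketched behaviorally in Python as follows. The function name `gated_products`, the threshold value of 20, and the use of 0 as the placeholder for a skipped product are all assumptions for illustration; the disclosure does not specify what the multiplier outputs when a multiplication is skipped.

```python
def gated_products(operand_pairs, diffs, threshold):
    """Multiply each signed mantissa pair only if its difference is below `threshold`.

    Returns the products and the number of multiplications actually performed.
    Skipped entries are filled with 0, which is an assumed placeholder.
    """
    products = []
    performed = 0
    for (in_m, wt_m), d in zip(operand_pairs, diffs):
        if d < threshold:
            products.append(in_m * wt_m)
            performed += 1
        else:
            products.append(0)  # multiplier not exercised for this pair
    return products, performed


pairs = [(-160, 192), (200, 130), (255, 255)]
diffs = [0, 25, 3]
prods, count = gated_products(pairs, diffs, threshold=20)  # threshold is hypothetical
```

In this example the second pair is skipped because its difference of 25 is not below the assumed threshold, so only two of the three multiplications are performed.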

In some embodiments, the one or more logic gates B1 are not configured to output differences D[0]-D[N] to multiplier circuit 120A, and multiplier circuit 120A is therefore configured to generate each instance P[n] of products P[0]-P[N] by always performing a multiplication operation. In such embodiments, multiplier circuit 120A is less complex than in embodiments in which the multiplication operation is performed only if the difference D[n] is less than the first difference threshold.

The shift circuit 150 is an electronic circuit, such as an IC, including one or more registers and/or logic gates, configured to perform a shift operation on each instance P[n] of products P[0]-P[N] based on the value of the corresponding instance D[n] of differences D[0]-D[N].

Each instance P[n] of the products P[0]-P[N] is based on the sign and mantissa of a respective combination of data elements InDE and WtDE, and each instance D[n] of the differences D[0]-D[N] is based on the sum of the exponents of the same combination. The shift circuit 150 is configured to shift each instance P[n] of the products P[0]-P[N] to the right by an amount equal to the corresponding difference D[n] to produce shifted products SP[0]-SP[N], in which the sign and mantissa bits are aligned according to the sums of the exponents used to produce the differences D[0]-D[N]. Based on this alignment, the shift circuit 150 is configured to generate each instance SP[n] of shifted products SP[0]-SP[N] having the same exponent, using the maximum sum MaxExp as a baseline.
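The reason a right shift by D[n] places every product on the common exponent MaxExp can be written out as a short identity. This is an explanatory sketch rather than text from the disclosure; it treats each product as the value P[n]·2^{S[n]} and ignores the fixed binary-point scaling and exponent bias, which are common to all terms.

```latex
\begin{aligned}
D[n] &= \mathrm{MaxExp} - S[n] \;\ge\; 0, \\
\sum_{n=0}^{N} P[n]\, 2^{S[n]}
  &= \sum_{n=0}^{N} P[n]\, 2^{\mathrm{MaxExp} - D[n]}
   = 2^{\mathrm{MaxExp}} \sum_{n=0}^{N} P[n]\, 2^{-D[n]} .
\end{aligned}
```

Multiplying P[n] by 2^{-D[n]} corresponds to shifting it right by D[n] bit positions, apart from the bits that are shifted out, so the aligned terms can be added as plain integers that share the exponent MaxExp.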

To compensate for the right shift operation, the shift circuit 150 is configured to add instances of the sign bit (zero or one) of each product P[n] as the leftmost bits of the corresponding shifted product SP[n]. The number of added sign bit instances is equal to the amount of right shift determined by the corresponding difference D[n].

In some embodiments, shift circuit 150 is configured to generate shifted products SP[0]-SP[N] having a number of bits greater than the number of bits of products P[0]-P[N]. In such embodiments, the shift circuit 150 is configured to add trailing zero bits as the rightmost bits of the shifted products SP[0]-SP[N] as needed. The difference between the number of bits of the shifted products SP[0]-SP[N] and the number of bits of the products P[0]-P[N] defines a second difference threshold, such that a difference D[n] equal to or less than the second difference threshold corresponds to the case in which one or more zero bits are added, and a difference D[n] greater than the second difference threshold corresponds to the case in which no zero bits are added.

As the difference between the number of bits of the shifted products SP[0]-SP[N] and that of the products P[0]-P[N] increases, the resolution of the shifted products SP[0]-SP[N] increases and the complexity of the shift circuit 150 increases. In some embodiments, the difference between the number of bits of the shifted products SP[0]-SP[N] and that of the products P[0]-P[N] has a value from two to eight. In some embodiments, the difference between the number of bits of the shifted products SP[0]-SP[N] and that of the products P[0]-P[N] is equal to four.

The number of mantissa bits of each data element of shifted products SP[0]-SP[N] corresponds to the first difference threshold discussed above with respect to multiplier circuit 120A and difference circuit 140. A difference D[n] less than the first difference threshold corresponds to the case in which at least one bit of the corresponding product P[n] is included in the corresponding shifted product SP[n], whereas a difference D[n] greater than or equal to the first difference threshold corresponds to the case in which no bit of the corresponding product P[n] can be included in the corresponding shifted product SP[n]. Thus, the shift circuit 150 is configured to, in operation, generate each shifted product SP[n] from the respective product P[n] based on the respective difference D[n] being less than the first difference threshold, and generate each shifted product SP[n] as a zero-value data element (all zero bits) based on the respective difference D[n] being greater than or equal to the first difference threshold.

In some embodiments, for example, in embodiments where the data elements InDE and WtDE have the BF16 format, the shift circuit 150 is configured to generate shift products SP [0] -SP [ N ] having a total of 21 bits based on each of the products P [0] -P [ N ] having a total of 17 bits, as discussed below with respect to FIGS. 3 and 5. In some embodiments, for example, where the data elements InDE and WtDE have the FP16 format, the shift circuit 150 is configured to generate shift products SP [0] -SP [ N ] having a total of 27 bits based on each of the products P [0] -P [ N ] having a total of 23 bits, as discussed below with respect to FIGS. 6 and 7. It is within the scope of the present disclosure for shift circuit 150 to be configured to generate each of shift products SP [0] -SP [ N ] having another total number of bits based on each of products P [0] -P [ N ] having another total number of bits.
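For illustration only, the alignment described above for the BF16 case can be modeled in software. The following Python sketch is not part of the disclosed circuit: the function name align_product and its parameters are hypothetical, and the product is assumed to be held as a signed integer rather than as a 17-bit pattern. It appends the four trailing zero bits, right shifts by the difference D [ n ] with sign extension, and returns a zero-value data element when the difference reaches the first difference threshold.

```python
def align_product(product: int, diff: int, in_bits: int = 17, out_bits: int = 21) -> int:
    """Model of the shift applied to one product P[n] (hypothetical helper).

    product : signed value of the two's complement product
    diff    : difference D[n] between the maximum sum and this sum
    Returns the signed value of the shifted product SP[n].
    """
    extra = out_bits - in_bits          # number of trailing zero bits (4 for BF16)
    first_threshold = out_bits - 1      # mantissa bits of the shifted product (20)
    if diff >= first_threshold:
        return 0                        # no product bit survives: all-zero element
    # Appending zeros is a left shift; Python's >> on int is an arithmetic
    # shift, so the sign bit is replicated on the left.
    return (product << extra) >> diff


print(align_product(5, 0))     # 80
print(align_product(-5, 6))    # -2
print(align_product(-5, 20))   # 0
```

In this model, a negative product shifted by 20 or more yields zero rather than the -1 that a plain arithmetic shift would give, matching the zero-value data element described above.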

In some embodiments, the shift circuit 150 includes the shift circuit 200 discussed below with respect to fig. 2. In some embodiments, the shift circuit 150 includes the shift circuit 400 discussed below with respect to fig. 4.

Based on the products P [0] -P [ N ] having a two's complement format, the shift circuit 150 is configured to generate shift products SP [0] -SP [ N ] having a two's complement format.

The shift circuit 150 is configured to output shift products SP [0] -SP [ N ] to the adder tree 160 over a data bus (not labeled).

Adder tree 160 is an electronic circuit, e.g., an IC, comprising multiple layers of one or more logic gates (not shown), e.g., one or more logic gates A1 as discussed above, wherein a first layer is configured to receive shift products SP [0] -SP [ N ], and a last layer is configured to generate sum PSTC as a data element corresponding to the sum of shift products SP [0] -SP [ N ]. In some embodiments, one or more successive layers between the first layer and the last layer are configured to receive a first number of data elements (sum data elements) generated by the previous layer and generate a second number of data elements from the first number of data elements, the second number being half of the first number. Thus, the total number of layers includes the first layer, the last layer, and each successive layer, if present.

Adder tree 160 is configured to generate, in operation, sum PSTC, also referred to in some embodiments as partial sum PSTC or mantissa sum PSTC, having a total number of bits corresponding to the number of bits and the number of data elements of shift products SP [0] -SP [ N ]. In some embodiments, the number of bits of the sum PSTC is equal to the number of bits of the shift products SP [0] -SP [ N ] plus the number of bits capable of representing the number of data elements of the shift products SP [0] -SP [ N ]. In some embodiments, the number of bits of the sum PSTC is equal to the number of bits of the shift products SP [0] -SP [ N ] plus four bits capable of representing the 16 data elements of the shift products SP [0] -SP [ N ].

In some embodiments, for example, in those embodiments in which the data elements InDE and WtDE have the BF16 format, adder tree 160 is configured to generate sum PSTC having a total of 25 bits based on each of the shift products SP [0] -SP [ N ] having a total of 21 bits. In some embodiments, for example, in embodiments where the data elements InDE and WtDE have the FP16 format, adder tree 160 is configured to generate sum PSTC having a total of 31 bits based on each of the shift products SP [0] -SP [ N ] having a total of 27 bits. It is within the scope of the present disclosure for adder tree 160 to be configured to generate sum PSTC having another total number of bits based on each of the shift products SP [0] -SP [ N ] having another total number of bits.
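The bit counts above can be checked with a short software model. The following Python sketch is illustrative only; the names sum_bits and adder_tree are hypothetical, and the layer count it reports is the number of pairwise addition steps in this model, which is not necessarily the layer numbering of adder tree 160. It computes the width of the sum from the width and number of shift products and confirms that sixteen worst-case 21-bit values fit in 25 bits.

```python
def sum_bits(sp_bits: int, count: int) -> int:
    """Bits needed for the sum of `count` two's complement values of sp_bits each."""
    return sp_bits + (count - 1).bit_length()


def adder_tree(values):
    """Pairwise reduction; returns (sum, number of pairwise addition layers)."""
    layer, layers = list(values), 0
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(0)                       # pad a layer of odd length
        layer = [layer[i] + layer[i + 1] for i in range(0, len(layer), 2)]
        layers += 1
    return layer[0], layers


# Sixteen copies of the most negative 21-bit two's complement value.
worst_case, layers = adder_tree([-(1 << 20)] * 16)

print(sum_bits(21, 16), sum_bits(27, 16))   # 25 31
print(worst_case == -(1 << 24), layers)     # True 4
```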

Based on the shifted products SP [0] -SP [ N ] in two's complement format, adder tree 160 is configured to generate a sum PSTC in two's complement format.

In the embodiment depicted in fig. 1A, adder tree 160 is configured to output the sum PSTC to converter 170 on a data bus (not labeled). In some embodiments, adder tree 160 is configured to output the sum PSTC to a circuit (not shown) external to circuit 100.

The converter 170 is an electronic circuit, e.g., an IC, comprising logic circuitry configured to receive, in operation, the sum PSTC from the adder tree 160 and to convert the sum PSTC from the two's complement format to a sum PSSM having a sign-plus-mantissa format. The converter 170 is configured to generate the sum PSSM having the same number of bits as the sum PSTC.
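As a non-limiting illustration, the format conversion described above can be sketched in Python as follows. The function name and the 25-bit default width (the BF16 case) are assumptions made for this sketch, and the handling of the most negative value, which has no sign-plus-mantissa representation at the same width, is a choice of the sketch rather than something stated in the disclosure.

```python
def twos_complement_to_sign_magnitude(pattern: int, bits: int = 25) -> int:
    """Convert a `bits`-wide two's complement pattern to sign-plus-magnitude.

    The result has the same width: the top bit is the sign and the
    remaining bits - 1 bits hold the magnitude.
    """
    mask = (1 << bits) - 1
    pattern &= mask
    sign = pattern >> (bits - 1)
    # Negate (invert and add one) when the sign bit is set.
    magnitude = ((~pattern + 1) & mask) if sign else pattern
    if magnitude >> (bits - 1):
        raise ValueError("most negative value has no sign-magnitude form at this width")
    return (sign << (bits - 1)) | magnitude


minus_five = (1 << 25) - 5                   # 25-bit two's complement pattern of -5
print(bin(twos_complement_to_sign_magnitude(minus_five)))
# 0b1000000000000000000000101
```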

In the embodiment depicted in fig. 1A, the converter 170 is configured to output the sum PSSM to the converter 180 over a data bus (not labeled). In some embodiments, the converter 170 is configured to output the sum PSSM to a circuit (not shown) external to the circuit 100. In some embodiments, the converter 170 is configured to output the sum PSSM to the adder 830 discussed below with respect to fig. 8.

The converter 180 is an electronic circuit, e.g., an IC, including logic circuitry, configured to receive the sum PSSM from the converter 170 and the maximum sum MaxExp from the difference circuit 140 in operation, and to convert the sum PSSM from a sign-mantissa format to a sum PS having an output format (e.g., a floating point format as described above) that is different from the sign-mantissa format based on the sum PSSM and MaxExp. In some embodiments, converter 180 is configured to generate a sum PS configured to be compatible with circuitry (not shown) external to circuit 100.

In the embodiment depicted in fig. 1A, converter 180 is configured to output sum PS to circuitry (not shown) external to circuit 100, e.g., a memory array that is part of a CNN, or another instance of circuit 100. In some embodiments, converter 180 is configured to output sum PS to adder 830 discussed below with respect to fig. 8.

In the embodiment depicted in fig. 1B, circuit 100 is configured as discussed above with respect to fig. 1A, except that multiplier circuit 120B is included instead of multiplier circuit 120A.

Multiplier circuit 120B is an electronic circuit configured to receive sign bits InS and WtS and mantissas InM and WtM and to generate products P [0] -P [ N ] in accordance with each of the embodiments discussed above with respect to multiplier circuit 120A, e.g., with respect to the various data element formats and to conditionally performing multiplication operations according to differences D [0] -D [ N ].

Multiplier circuit 120B includes one or more logic gates M1 discussed above with respect to multiplier circuit 120A, and also includes an exclusive-OR (XOR) gate X1.

XOR gate X1 comprises two input terminals configured to receive sign bits InS and WtS and is configured, in operation, to generate sign bit SB on an output terminal based on the exclusive or logic of table 1.

InS	WtS	SB
0	0	0
0	1	1
1	0	1
1	1	0

TABLE 1

As depicted in fig. 1B, multiplier circuit 120B includes one or more logic gates M1 configured to receive mantissas InM and WtM and generate mantissa product MP by multiplying mantissas InM and WtM.

Multiplier circuit 120B includes one or more logic circuits configured to, in operation, convert sign bit SB in combination with mantissa product MP into the two's complement format for a given product P [ n ] of products P [0] -P [ N ].

Multiplier circuit 120B is thus configured to multiply and reformat, in operation, the sign and mantissa bits of input data element InDE and weight data element WtDE to generate the two's complement products P [0] -P [ N ]. Multiplier circuit 120B is capable of performing the multiplication and reformatting operations at higher speed, using less power, and/or using less area than multiplier circuit 120A.
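The operations of multiplier circuit 120B can be illustrated with the following Python sketch, which is a software model under stated assumptions and not the disclosed circuit. The function name is hypothetical, and each mantissa is assumed to be an unsigned integer of mant_bits bits that already includes the implicit leading one (eight bits for BF16), which is consistent with the 16 mantissa bits and 17 total bits of each product discussed in this disclosure.

```python
def multiply_to_twos_complement(in_s: int, in_m: int, wt_s: int, wt_m: int,
                                mant_bits: int = 8) -> int:
    """Return the two's complement bit pattern of one product P[n].

    in_s, wt_s : sign bits InS and WtS (0 or 1)
    in_m, wt_m : unsigned mantissas InM and WtM of mant_bits bits each
    """
    sb = in_s ^ wt_s                      # sign bit SB per Table 1 (XOR gate X1)
    mp = in_m * wt_m                      # unsigned mantissa product MP
    width = 2 * mant_bits + 1             # 17 bits for BF16, 23 bits for FP16
    signed_value = -mp if sb else mp
    return signed_value & ((1 << width) - 1)


print(multiply_to_twos_complement(0, 0x80, 1, 0xC0))   # 106496
print(multiply_to_twos_complement(1, 3, 1, 5))         # 15
```

Multiplying the unsigned mantissas first and applying the sign once, to the product, is what distinguishes this ordering from reformatting each signed mantissa before multiplication.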

In each of the embodiments discussed above, the circuit 100 is thus configured to perform data calculations by separating the sign bits and mantissa bits of the input data element InDE and the weight data element WtDE from the exponent bits. Multiplier circuit 120A or 120B multiplies and reformats the sign and mantissa bits of the input and weight data to produce the two's complement products P [0] -P [ N ], and summing circuit 130 adds the exponents of the input data to the exponents of the weight data to produce sums S [0] -S [ N ]. The shift circuit 150 shifts the products according to the differences between the sums S [0] -S [ N ] and the maximum sum MaxExp, and the adder tree 160 adds together the shift products SP [0] -SP [ N ] to produce the partial sum PSTC. Thus, circuit 100 is configured to perform floating-point partial sum calculations using less time, area, and power than other approaches, and without the quantization errors that occur in some approaches in which floating-point data are converted to fixed-point data.
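The data flow summarized above can be traced end to end with a small numerical example. The following Python sketch is illustrative only: the function name partial_sum is hypothetical, exponents are assumed to be plain unbiased integers, each eight-bit mantissa is assumed to include the implicit leading one, and the four trailing zero bits and the threshold of 20 are taken from the BF16 case.

```python
def partial_sum(inputs, weights, extra_bits=4, first_threshold=20):
    """inputs, weights: lists of (sign, exponent, mantissa) tuples.

    Returns (pstc, max_exp); the represented value is approximately
    pstc * 2 ** (max_exp - extra_bits).
    """
    products, sums = [], []
    for (in_s, in_e, in_m), (wt_s, wt_e, wt_m) in zip(inputs, weights):
        magnitude = in_m * wt_m                           # mantissa product
        products.append(-magnitude if in_s ^ wt_s else magnitude)
        sums.append(in_e + wt_e)                          # exponent sum S[n]
    max_exp = max(sums)                                   # maximum sum MaxExp
    pstc = 0
    for p, s in zip(products, sums):
        d = max_exp - s                                   # difference D[n]
        if d < first_threshold:                           # otherwise contributes zero
            pstc += (p << extra_bits) >> d                # aligned shift product SP[n]
    return pstc, max_exp


inputs = [(0, 0, 128), (1, -2, 192)]
weights = [(0, 0, 128), (0, 1, 128)]
pstc, max_exp = partial_sum(inputs, weights)
value = pstc * 2.0 ** (max_exp - 4)
print(pstc, max_exp, value)   # 65536 0 4096.0
```

The printed value agrees with the direct calculation 128 x 128 x 2^0 - 192 x 128 x 2^-1 = 4096, since no product bit is shifted out in this example.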

Fig. 2 is a schematic diagram of a shift circuit 200 according to some embodiments. The shift circuit 200 may be used as part or all of the shift circuit 150 discussed above with respect to fig. 1A. In the embodiment depicted in FIG. 2, shift circuit 200 corresponds to data elements InDE and WtDE having BF16 format and shift products SP [0] -SP [ N ] having a total of 21 bits.

In the embodiment depicted in FIG. 2, shift circuit 200 includes selection circuits S0-S19, where selection circuits S0, S18, and S19 are depicted in FIG. 2. Each of the selection circuits S0-S19 is configured to receive a signal DIFF [4:0] representative of the difference D [ n ] discussed above with respect to FIG. 1A.

Each of the selection circuits S0-S19 is configured to receive some or all of the signals M [0] -M [15] corresponding to the 16 mantissa bits of a product P [ n ] of products P [0] -P [ N ], each of the products P [0] -P [ N ] having a total of 17 bits including the sign bit S. Each of the selection circuits S0-S19 is configured to also receive zero bit 0, and each of the selection circuits S1-S19 is configured to also receive the sign bit S of the product P [ n ].

The selection circuit S0 is configured to receive all of the signals M [0] -M [15], while the selection circuit S19 is configured to receive only signal M [15]. Each of the selection circuits S1-S18 is configured to receive a subset of signals M [0] -M [15], the number of signals in the subset decreasing by one as the number of the selection circuit increases. The signal with the highest index number is included in each subset, as indicated by the selection circuit S18 being configured to receive signals M [14] and M [15].

Based on signals M [0] -M [15], zero bit 0, and sign bit S, selection circuits S0-S19 are configured to, in operation, generate respective signals O [0] -O [19] corresponding to the mantissa bits of the shift product SP [ n ] corresponding to the product P [ n ]. In response to signals DIFF [4:0], selection circuit S0 is configured to generate signal O [0] by selecting one of signals M [0] -M [15] or zero bit 0, selection circuit S19 is configured to generate signal O [19] by selecting one of signal M [15], zero bit 0, or sign bit S, and each of selection circuits S1-S18 is configured to generate a respective one of signals O [1] -O [18] by selecting one of a respective subset of signals M [1] -M [15], zero bit 0, or sign bit S.

In the embodiment depicted in FIG. 2, the shift product SP [ n ] having 20 mantissa bits corresponds to a first difference threshold equal to 20, as discussed above with respect to shift circuit 150. Thus, the selection circuits S0-S19 and the signals DIFF [4:0] are configured to generate each of the signals O [0] -O [19] as zero bit 0 in response to the difference D [ n ] having a value greater than or equal to 20.
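The selection described above can be modeled bit by bit. The following Python sketch is illustrative only and its names are hypothetical. The routing rule used, in which O [k] takes M [k - 4 + D] when that index exists, zero bit 0 below the range, and sign bit S above it, is inferred from the inputs described for selection circuits S0, S18, and S19 and is an assumption of the sketch rather than a statement of the figure.

```python
def select_outputs(m, s, diff):
    """Model selection circuits S0-S19 for one product.

    m    : list of 16 bits, m[i] standing for signal M[i]
    s    : sign bit S of the product
    diff : difference D[n], as carried by DIFF[4:0]
    Returns the list [O[0], ..., O[19]].
    """
    out = []
    for k in range(20):
        src = k - 4 + diff          # assumed index of the M signal routed to O[k]
        if diff >= 20 or src < 0:
            out.append(0)           # zero bit 0 (trailing zero or zeroed product)
        elif src > 15:
            out.append(s)           # leading sign bit S
        else:
            out.append(m[src])
    return out


def as_signed(out, s):
    """Read O[0]-O[19] plus a sign bit O[20] = s as a 21-bit two's complement value."""
    return sum(bit << k for k, bit in enumerate(out)) - (s << 20)


product = -5                                   # example 17-bit product
s = 1 if product < 0 else 0
m = [(product >> i) & 1 for i in range(16)]    # signals M[0]-M[15]
print(as_signed(select_outputs(m, s, 6), s))   # -2
```

Under this rule, O [0] can draw on any of M [0] -M [15] while O [19] can draw only on M [15], matching the input counts described for selection circuits S0 and S19.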

The number of selection circuits and signal configurations depicted in fig. 2 is a non-limiting example provided for purposes of illustration. In some embodiments, the shift circuit 200 includes a fewer or greater number of selection circuits that are similarly configured to generate shift signals from signals corresponding to formats other than the BF16 format, as discussed above with respect to fig. 1A.

With the configuration discussed above, the shift circuit 200 is capable of performing the operations discussed above with respect to the shift circuit 150 and fig. 1A, such that a circuit including the shift circuit 200 is capable of achieving the benefits discussed above with respect to the data calculation circuit 100.

Fig. 3 is a diagram of a data element 300 of a shift circuit, e.g., shift circuit 200, in accordance with some embodiments. Data element 300 corresponds to an embodiment in which data elements InDE and WtDE have the BF16 format and shift products SP [0] -SP [ N ] have 21 bits, as discussed above with respect to FIGS. 1A and 2.

The data element 300 includes a signal DIFF [8:0] corresponding to the difference D [ n ], signal values DIFF having nine bits, and, for each signal value DIFF, output signals O [0] -O [20] corresponding to the shift product SP [ n ]. The output signals O [0] -O [19] correspond to the outputs of the selection circuits S0-S19 depicted in FIG. 2, and the output signal O [20] corresponds to the sign bit S of the product P [ n ].

In the embodiment depicted in FIG. 3, the second difference threshold is four, and one or more of the signals O [0] -O [3] include zero bit 0 in accordance with the signal value DIFF being less than or equal to four. For signal values DIFF greater than or equal to the first difference threshold of 20, each of signals O [0] -O [20] includes zero bit 0. For signal values DIFF less than the first difference threshold of 20, signals O [0] -O [20] include signals M [0] -M [15] shifted right by signal value DIFF and leading sign bits S. The number of sign bits S corresponds to the sign bit S of the product P [ n ] plus a number of sign bits S equal to the signal value DIFF.

Fig. 4 is a schematic diagram of a shift circuit 400 according to some embodiments. The shift circuit 400 may be used as part or all of the shift circuit 150 discussed above with respect to fig. 1A. In the embodiment depicted in FIG. 4, shift circuit 400 corresponds to data elements InDE and WtDE having BF16 format and shift products SP [0] -SP [ N ] having a total of 21 bits.

The shift circuit 400 is functionally equivalent to the shift circuit 200 discussed above and has a different configuration, as described below. FIG. 4 depicts signals M [0] -M [15], O [0] -O [19], and DIFF [4:0], along with sign bit S and zero bit 0, each discussed above with respect to FIGS. 2 and 3.

The shift circuit 400 includes a first stage and a second stage. In the embodiment depicted in FIG. 4, the first stage includes selection circuits FS1-FS19 (FS 1, FS2, FS18, and FS19 depicted in FIG. 4), each configured to receive signals DIFF [1:0], and the second stage includes selection circuits SS1-SS5, each configured to receive signals DIFF [4:2].

Each of the selection circuits FS1-FS19 is configured to receive a subset of the signals M [0] -M [15] and, in some cases, sign bit S or zero bit 0. Each of the selection circuits FS1-FS19 includes a total of four inputs configured to receive a subset of from one to four of the signals M [0] -M [15] and, in some cases, one or more instances of sign bit S or zero bit 0, as depicted in FIG. 4. In operation, each of the selection circuits FS1-FS19 is thus configured to output a predetermined one of the signals M [0] -M [15], the sign bit S, or the zero bit 0 as a corresponding one of the signals INT [1] -INT [19] (also referred to as intermediate signals or data elements INT [1] -INT [19] in some embodiments) in response to the signals DIFF [1:0]. In addition to the signals INT [1] -INT [19], the intermediate signals INT [0] -INT [19] include a signal INT [0] equal to zero bit 0.

Each of the selection circuits SS1-SS5 is configured to receive a subset of the intermediate signals INT [0] -INT [19] and zero bit 0, and each of the selection circuits SS2-SS5 is configured to also receive sign bit S, as depicted in FIG. 4. In operation, each of the selection circuits SS1-SS5 is configured to output, in response to signals DIFF [4:2], a respective four-bit portion of signals O [0] -O [19]: O [3:0], O [7:4], O [11:8], O [15:12], or O [19:16].

The four-bit portions of signals O [0] -O [19] correspond to the data pattern illustrated in FIG. 5, which is a diagram of data element 300 discussed above with respect to FIG. 3. As shown in the first highlighted portion of the data element 300 depicted in FIG. 5, the signals DIFF [1:0] correspond to four possible ranges of values for the signal value DIFF and four possible combinations of the intermediate signals INT [0] -INT [19] depicted in FIG. 4.

The additional highlighted portions correspond to the five four-bit portions of signals O [0] -O [19] output by selection circuits SS1-SS5 in response to the four combinations of intermediate signals INT [0] -INT [19] and the combinations of signals DIFF [4:2]. Each highlighted portion is a block of four rows that is right shifted within signals O [0] -O [19] based on intermediate signals INT [0] -INT [19] and according to signals DIFF [4:2]. Accordingly, blocks having the same bit configuration are arranged along a descending diagonal.

As depicted in fig. 4 and illustrated in fig. 5, the shift circuit 400 includes: a first stage configured to generate intermediate signals INT [0] -INT [19] having four combinations corresponding to the two LSBs, DIFF [1:0]; and a second stage configured to generate from one to five blocks of signals O [0] -O [19] by right shifting based on the four combinations and in accordance with the remaining bits, DIFF [4:2].
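The equivalence between the two-stage arrangement and a single shift can be illustrated with a short integer-level model. The following Python sketch is not the disclosed selection-circuit wiring: the function name is hypothetical, the product is assumed to be held as a signed integer, and the first stage is modeled as a shift by DIFF [1:0] while the second stage is modeled as a shift by four times DIFF [4:2].

```python
def two_stage_shift(product: int, diff: int, extra: int = 4,
                    first_threshold: int = 20) -> int:
    """Shift one product in two steps, fine then coarse (integer-level model)."""
    if diff >= first_threshold:
        return 0                               # zero-value data element
    fine = diff & 0b11                         # DIFF[1:0]: shift by 0 to 3 bits
    coarse = (diff >> 2) & 0b111               # DIFF[4:2]: shift by 0, 4, 8, ... bits
    intermediate = (product << extra) >> fine  # first stage
    return intermediate >> (4 * coarse)        # second stage, in four-bit steps


def single_shift(product: int, diff: int, extra: int = 4,
                 first_threshold: int = 20) -> int:
    """Reference: one shift by the full difference."""
    return 0 if diff >= first_threshold else (product << extra) >> diff


print(two_stage_shift(-24576, 13), single_shift(-24576, 13))   # -48 -48
```

Splitting the shift this way limits the first stage to four choices per output bit and the second stage to whole four-bit steps, which is the structural reason a two-stage arrangement can need less routing than one wide selection per bit.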

In some embodiments, the shift circuit 400 includes first and second stages (not shown) configured in accordance with data elements InDE and WtDE having the FP16 format. FIG. 6 is a diagram of a data element 600 of a shift circuit (e.g., shift circuit 200 or 400), where data elements InDE and WtDE have FP16 format and shift products SP [0] -SP [ N ] have 27 bits, according to some embodiments, as discussed above with respect to FIGS. 1A and 2.

The data element 600 includes a signal DIFF [5:0] corresponding to the difference D [ n ], signal values DIFF having six bits, and, for each signal value DIFF, output signals O [0] -O [26] corresponding to the shift product SP [ n ]. The output signals O [0] -O [25] correspond to the outputs of the shift circuit 200 or 400, while the output signal O [26] corresponds to the sign bit S of the product P [ n ].

In the embodiment depicted in FIG. 6, the second difference threshold is four, and one or more of the signals O [0] -O [3] include zero bit 0 in accordance with the signal value DIFF being less than or equal to four. For signal values DIFF greater than or equal to the first difference threshold of 26, each of signals O [0] -O [26] includes zero bit 0. For signal values DIFF less than the first difference threshold of 26, signals O [0] -O [25] include signals M [0] -M [21] shifted right by signal value DIFF and leading sign bits S. The number of sign bits S corresponds to the sign bit S of the product P [ n ] plus a number of sign bits S equal to the signal value DIFF.

Similar to the data element 300 depicted in FIG. 5, the data element 600 depicted in FIG. 6 includes a first highlighted portion corresponding to the four combinations of DIFF [1:0] and additional highlighted portions corresponding to six four-bit portions of signals O [2] -O [25] output by the second stage of the shift circuit 400 in response to the four combinations of intermediate signals (not shown) and the combinations of signals DIFF [4:2]. Each highlighted portion is a four-row block based on the intermediate signals and right shifted within signals O [2] -O [25] according to signals DIFF [4:2]. Data element 600 also includes signals O [0] and O [1] output by the second stage of shift circuit 400.

In some embodiments, the data element 600 depicted in FIG. 6 thus corresponds to an embodiment of the shift circuit 200, including one or more selection circuits in addition to the selection circuits S0-S19 depicted in FIG. 2.

In some embodiments, the data element 600 depicted in FIG. 6 thus corresponds to an embodiment of the shift circuit 400 including one or more selection circuits in the first stage in addition to selection circuits FS1-FS19, and one or more selection circuits in the second stage in addition to selection circuits SS1-SS5, as depicted in FIG. 4.

In some embodiments, the shift circuit 400 includes a fewer or greater number of selection circuits similarly configured to generate shift signals from signals corresponding to formats other than BF16 or FP16 formats, as discussed above with respect to fig. 1A.

With the configuration discussed above, the shift circuit 400 is capable of performing the operations discussed above with respect to the shift circuit 150 and fig. 1A, such that a circuit including the shift circuit 400 is capable of achieving the benefits discussed above with respect to the data calculation circuit 100. By including the first and second stage configurations discussed above, the shift circuit 400 is also capable of having less complex signal routing and an area reduced by up to 50% compared with approaches that do not include the first and second stage configurations (e.g., shift circuit 200 discussed above).

Fig. 7 is a flow chart of a method 700 of operating a circuit according to some embodiments. The method 700 may be used with a data calculation circuit (e.g., the data calculation circuit 100 discussed above with respect to fig. 1A and 1B).

The order of the operations of the method 700 depicted in fig. 7 is for illustration only; the operations of method 700 may be performed in a different order than depicted in fig. 7. In some embodiments, operations are performed before, between, during, and/or after those described in fig. 7 in addition to those described in fig. 7. In some embodiments, the operations of method 700 are a subset of the methods that perform CIM operations (e.g., as part of CNN operations).

At operation 710, a signed mantissa and exponent for each data element of a plurality of input data elements and a plurality of weighted data elements are received. Receiving the signed mantissa includes receiving the sign bit and the mantissa bit at a multiplier circuit, such as multiplier circuit 120A or 120B, and receiving the exponent includes receiving the exponent at a summing circuit, such as summing circuit 130, each discussed above with respect to fig. 1A and 1B.

In some embodiments, receiving the signed mantissa and exponent for each data element of the plurality of input data elements and the plurality of weighted data elements includes receiving the signed mantissa InS/InM and WtS/WtM and exponents InE and WtE for each data element of the data elements InDE and WtDE discussed above with respect to fig. 1A and 1B.

In some embodiments, receiving the signed mantissa and exponent of each data element includes receiving each data element having a plurality of input data elements and a plurality of weight data elements in BF16 format or FP16 format as discussed above with respect to fig. 1A-6.

In some embodiments, the circuit includes a memory array, such as memory array 110 discussed above with respect to fig. 1A, and receiving the signed mantissa and exponent includes receiving each data element from a plurality of input data elements and a plurality of weight data elements stored in the memory array.

At operation 720, each signed mantissa is reformatted into a two's complement. Reformatting each signed mantissa into a two's complement includes using a multiplier circuit.

In some embodiments, reformatting each signed mantissa into a two's complement includes reformatting the signed mantissa InS/InM into a reformatted mantissa InTC, and reformatting the signed mantissa WtS/WtM into a reformatted mantissa WtTC, as discussed above with respect to fig. 1A.

At operation 720, a plurality of two's complement products is generated by performing multiplication and reformatting operations on some or all of the signed mantissas of the plurality of input data elements and some or all of the signed mantissas of the plurality of weight data elements. Generating the plurality of two's complement products by performing the multiplication and reformatting operations includes using a multiplier circuit, such as multiplier circuit 120A discussed above with respect to fig. 1A or multiplier circuit 120B discussed above with respect to fig. 1B.

In some embodiments, generating the plurality of products includes reformatting each signed mantissa into a two's complement and multiplying some or all of the reformatted mantissas of the plurality of input data elements with some or all of the reformatted mantissas of the plurality of weight data elements. In some embodiments, reformatting each signed mantissa into a two's complement and multiplying some or all of the reformatted mantissas of the plurality of input data elements with some or all of the reformatted mantissas of the plurality of weight data elements includes reformatting the signed mantissa InS/InM into a reformatted mantissa InTC, reformatting the signed mantissa WtS/WtM into a reformatted mantissa WtTC, and generating the products P [0] -P [ N ] by multiplying the reformatted mantissa InTC with the reformatted mantissa WtTC, as discussed above with respect to fig. 1A.

In some embodiments, generating the plurality of products includes generating a plurality of sign bits by performing an exclusive-OR (XOR) operation on the sign bits of some or all of the signed mantissas of the plurality of input data elements and the plurality of weight data elements, generating a corresponding plurality of mantissa products by multiplying the mantissa bits of some or all of the signed mantissas of the plurality of input data elements with the mantissa bits of some or all of the signed mantissas of the plurality of weight data elements, and reformatting the plurality of sign bits and mantissa products into two's complements. In some embodiments, generating the plurality of products includes generating a plurality of sign bits SB by performing an XOR operation on sign bits InS and WtS, generating a corresponding plurality of mantissa products MP by multiplying mantissas InM of some or all of input data elements InDE with mantissas WtM of some or all of weight data elements WtDE, and reformatting the sign bits SB and mantissa products MP into two's complements, as discussed above with respect to fig. 1B.

In some embodiments, generating the plurality of products includes, for each difference between a respective sum of the plurality of sums and a maximum sum (e.g., based on a difference D [ n ] of the maximum sum MaxExp discussed above with respect to fig. 1A), performing a multiplication and reformatting operation on the respective input and weight data elements only if the difference is less than a difference threshold (e.g., the first difference threshold discussed above with respect to fig. 1A-6).

At operation 730, a plurality of sums is generated by adding each exponent of the plurality of input data elements to each exponent of the plurality of weight data elements. Generating the plurality of sums includes using a summing circuit, e.g., summing circuit 130 including one or more logic gates A1, discussed above with respect to FIG. 1A.

Generating a plurality of sums by adding each exponent of the plurality of input data elements to each exponent of the plurality of weighted data elements includes generating a plurality of sums having a total number and an ordering of data elements corresponding to the plurality of products generated in operation 720.

In some embodiments, generating the plurality of sums by adding each exponent of the plurality of input data elements to each exponent of the plurality of weight data elements includes generating sums S [0] -S [ N ] by adding exponents InE to exponents WtE, as discussed above with respect to FIG. 1A.

In some embodiments, generating the plurality of sums includes using a difference circuit to determine a maximum sum of the plurality of sums and to generate a plurality of differences by subtracting each sum of the plurality of sums from the maximum sum. In some embodiments, using the difference circuit includes using the difference circuit 140 to determine the maximum sum MaxExp and to generate the differences D [0] -D [ N ] by subtracting the sums S [0] -S [ N ] from MaxExp, as discussed above with respect to fig. 1A.
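As a brief illustration of the sum and difference generation described above, the following Python sketch computes the sums, the maximum sum, and the differences for a small set of exponents. The function name and the use of plain integer exponents are assumptions made for this sketch.

```python
def exponent_differences(in_exps, wt_exps):
    """Return (sums S[n], maximum sum MaxExp, differences D[n])."""
    sums = [in_e + wt_e for in_e, wt_e in zip(in_exps, wt_exps)]
    max_exp = max(sums)
    diffs = [max_exp - s for s in sums]   # each difference is non-negative
    return sums, max_exp, diffs


print(exponent_differences([3, 1, 5], [2, 2, -1]))
# ([5, 3, 4], 5, [0, 2, 1])
```

Because each difference is measured from the maximum sum, the product with the largest exponent sum is not shifted and every other product is shifted right.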

At operation 740, each product of the plurality of products is shifted by an amount equal to a difference between a respective sum of the plurality of sums and the maximum sum. Shifting each of the plurality of products includes generating the plurality of shifted products using a shift circuit, e.g., using shift circuit 150 discussed above with respect to fig. 1A or shift circuit 200 or 400 discussed above with respect to fig. 2-6, to generate shifted products SP [0] -SP [ N ].

In some embodiments, shifting each product of the plurality of products by an amount equal to the difference between the respective sum of the plurality of sums and the maximum sum includes shifting each of the products P [0] -P [ N ] by an amount equal to the respective difference D [0] -D [ N ].

In some embodiments, shifting each product of the plurality of products includes generating an intermediate data element from the product based on the two least significant bits of the respective difference value, and generating a respective shifted product from the intermediate data element based on the other bits of the respective difference value. In some embodiments, generating the intermediate data elements includes generating signals INT [0] -INT [19] using a first stage of the shift circuit 400, and generating the corresponding shift products includes generating signals O [0] -O [19] using a second stage, as discussed above with respect to FIGS. 4-6.

In some embodiments, shifting each product of the plurality of products by an amount equal to a difference between a respective sum of the plurality of sums and a maximum sum includes generating a zero value data element based on the difference being greater than or equal to a difference threshold (e.g., the first difference threshold discussed above with respect to fig. 1A-6).

In some embodiments, shifting each product of the plurality of products by an amount equal to a difference between a respective sum of the plurality of sums and a maximum sum comprises: right shifting the product by the amount; adding a number of leading sign bits to the shifted product, the number being equal to the amount; and adding one or more trailing zero bits based on the amount being less than a difference threshold, e.g., adding the sign bits S and zero bits 0 based on the second difference threshold, as discussed above with respect to fig. 1A-6.

At operation 750, the plurality of shift products are summed to produce a mantissa sum. Summing the plurality of shifted products to produce a mantissa sum includes using an adder tree, such as adder tree 160 discussed above with respect to fig. 1A.

In some embodiments, summing the plurality of shift products to produce a mantissa sum includes summing shift products SP [0] -SP [ N ] to produce a sum PSTC, as discussed above with respect to FIG. 1A.

At operation 760, in some embodiments, the mantissa sum is converted into sign bits plus a plurality of mantissa bits. Converting the mantissa sum into sign bits plus a plurality of mantissa bits includes using a converter, such as converter 170 discussed above with respect to fig. 1A.

In some embodiments, converting the mantissa sum into the sign bit plus the plurality of mantissa bits includes converting the sum PSTC into a sum PSSM, as discussed above with respect to fig. 1A.

At operation 770, in some embodiments, the sign bit plus the plurality of mantissa bits is converted to an output format. Converting the sign bit plus the plurality of mantissa bits into the output format includes using a converter, such as converter 180 discussed above with respect to fig. 1A.

In some embodiments, converting the sign bit plus the plurality of mantissa bits into an output format includes converting the sum PSSM into a sum PS, as discussed above with respect to fig. 1A.

In some embodiments, converting the sign bit plus the plurality of mantissa bits into an output format includes outputting a sum having the output format from the circuit to an external circuit, e.g., a memory array that is part of a CNN operation.

By performing some or all of the operations of method 700, data computation is performed by separating the exponent bits of the input and weight data elements from the sign and mantissa bits, multiplying and reformatting the sign and mantissa bits of the input and weight data elements to produce two's complement products, summing the exponents of the input data elements with the exponents of the weight data elements to produce sums, shifting each product according to the difference between the corresponding sum and the maximum sum, and summing the shifted products to produce a partial sum, thereby achieving some or all of the benefits described above with respect to data calculation circuit 100 and shift circuits 200 and 400.
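
The overall data flow summarized above can be sketched end to end in Python for BF16 operands. This is a simplified model under several assumptions that are not taken from the description: all function names are hypothetical, operands are normal numbers or zero (no NaN, infinity, or subnormal handling), four trailing guard bits are appended before shifting, no difference threshold is applied, and the caller decides which input and weight elements are paired.

```python
import struct


def float_to_bf16(x: float) -> int:
    """Demo helper: truncate a float to a 16-bit BF16 bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def bf16_fields(bits: int):
    """Split BF16 bits into (sign, biased exponent, mantissa with hidden 1)."""
    sign = (bits >> 15) & 1
    exp = (bits >> 7) & 0xFF
    frac = bits & 0x7F
    mant = (0x80 | frac) if exp else 0   # assumption: exponent 0 is treated as zero
    return sign, exp, mant


def mantissa_sum(pairs, guard_bits: int = 4):
    """Return (mantissa sum, maximum exponent sum) for a non-empty list of
    (input_bits, weight_bits) pairs."""
    products, exp_sums = [], []
    for a, b in pairs:
        sa, ea, ma = bf16_fields(a)
        sb, eb, mb = bf16_fields(b)
        magnitude = ma * mb                                     # mantissa product
        products.append(-magnitude if sa ^ sb else magnitude)   # XOR of sign bits
        exp_sums.append(ea + eb)                                # exponent sum
    max_sum = max(exp_sums)
    # Align every product to the maximum exponent sum, then add.
    shifted = [(p << guard_bits) >> (max_sum - e)
               for p, e in zip(products, exp_sums)]
    return sum(shifted), max_sum


def to_float(msum: int, max_sum: int, guard_bits: int = 4) -> float:
    """Scale the integer mantissa sum back to a real value for checking.
    Each mantissa has 7 fraction bits and each exponent a bias of 127."""
    return msum * 2.0 ** (max_sum - 2 * 127 - 2 * 7 - guard_bits)


pairs = [(float_to_bf16(1.5), float_to_bf16(2.0)),
         (float_to_bf16(-0.5), float_to_bf16(4.0)),
         (float_to_bf16(3.0), float_to_bf16(0.25))]
msum, max_sum = mantissa_sum(pairs)
print(to_float(msum, max_sum))   # compare with 1.5*2.0 - 0.5*4.0 + 3.0*0.25
```

Because only integer multiplications, integer additions, and shifts are used between decoding and the final scaling, the sketch mirrors the separation of mantissa and exponent paths, although small products lose low-order bits when they are shifted far to the right.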

Fig. 8 is a schematic diagram of a CIM circuit 800 according to some embodiments. In the embodiment depicted in fig. 8, CIM circuitry 800, also referred to in some embodiments as circuitry 800 or memory circuitry 800, includes a memory array 810, a MAC unit 820, an adder 830, and a buffer 840.

In the embodiment depicted in fig. 8, the output port of memory array 810 is connected to the input port of MAC unit 820, the output port of MAC unit 820 is connected to the input port of adder 830, the output port of adder 830 is connected to the input port of buffer 840, and the output port of buffer 840 is connected to the input port of adder 830. The output port of buffer 840 is also connected to external circuitry (not shown), such as a memory array.

The embodiment depicted in fig. 8 is a non-limiting example provided for illustration purposes. In some embodiments, circuit 800 has a different structure than that shown in FIG. 8, whereby the advantages discussed below can be achieved. In some embodiments, circuit 800 does not include memory array 810, and the input port of MAC unit 820 is connected to circuitry external to circuit 800, e.g., a memory array. In some embodiments, the input port of the MAC unit is connected to one of adder tree 160, converter 170, or converter 180 discussed above with respect to fig. 1A.

In some embodiments, circuit 800 includes circuit elements other than those depicted in fig. 8 and discussed below, e.g., control circuits or additional instances of the depicted circuit elements. In some embodiments, the elements depicted in fig. 8 are part of a memory circuit that includes multiple instances of memory array 810, MAC unit 820, adder 830, and buffer 840.

In some embodiments, circuit 800 is included in a circuit that includes elements configured to perform a series of in-memory calculations, e.g., CNNs, where multiple arrays, e.g., memory array 810, include stored weight data elements that are applied to one or more sets of input data elements in a MAC operation.

The memory array 810 is configured to store data elements DE, also referred to as input data elements IDE and weight data elements WDE. The input data elements IDE and the weight data elements WDE correspond to respective input and weight data of one or more matrix calculations.

The MAC unit 820 is an electronic circuit, e.g., an IC, comprising one or more data registers and logic circuitry configured to receive input data elements IDE and weight data elements WDE, e.g., from the memory array 810, and perform a series of MAC computations based on the input data elements IDE and weight data elements WDE.

The MAC unit 820 is configured to store W weight data elements WDE on one or more data registers (also referred to as weight buffers in some embodiments). Each input data element IDE has K bits and the MAC unit 820 is configured to receive and store the kth bit of each input data element IDE in one or more data registers. In various embodiments, the MAC unit 820 includes a selection circuit configured to sequentially select each kth bit in operation or to sequentially receive the kth bit of the input data element IDE in operation. In various embodiments, MAC unit 820 is configured to select or receive the kth bit in LSB-to-MSB order or MSB-to-LSB order.

The MAC unit 820 includes one or more logic circuits configured to, in operation, select each weight data element WDE in turn and, for each selected weight data element WDE, perform a partial sum computation by multiplying the selected weight data element WDE with each sequentially received and stored kth bit of the K bits of each input data element IDE to produce a series of K products.

The MAC unit 820 includes one or more logic circuits configured to, in operation, sequentially add each product to the sum of the previously generated products, shifted in accordance with the LSB-to-MSB or MSB-to-LSB order of the kth bits. The one or more logic circuits are configured to output the sum of the K products as a partial sum PSUM.

The MAC unit 820 is configured to repeat the partial sum computation for each of the W weight data elements WDE and is thereby configured to generate and output a sequence of W partial sums PSUM.
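
The bit-serial partial sum computation can be illustrated with the following Python sketch. It is a behavioral model under assumptions that go beyond the description: the name `bit_serial_partial_sum` is hypothetical, the input data elements are unsigned K-bit integers, and each selected weight data element is modeled as a list holding one weight per input data element.

```python
def bit_serial_partial_sum(inputs, weights, k_bits: int, msb_first: bool = False) -> int:
    """Compute sum(x * w) one input bit position at a time.

    inputs    -- unsigned integers, each k_bits wide (assumption)
    weights   -- one signed weight per input (assumption)
    msb_first -- process bit positions MSB-to-LSB instead of LSB-to-MSB
    """
    acc = 0
    order = range(k_bits - 1, -1, -1) if msb_first else range(k_bits)
    for k in order:
        # Product of the kth bit of every input with its weight.
        product = sum(((x >> k) & 1) * w for x, w in zip(inputs, weights))
        if msb_first:
            acc = (acc << 1) + product   # shift the running sum, then add
        else:
            acc += product << k          # weight the product by its bit position
    return acc


inputs, weights = [3, 5, 7], [2, -1, 4]
print(bit_serial_partial_sum(inputs, weights, k_bits=3))
print(bit_serial_partial_sum(inputs, weights, k_bits=3, msb_first=True))
```

Both bit orders give the same partial sum; they differ only in whether the running sum or the new product is shifted at each step.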

As the number of input data elements IDE increases, both the computational capability and the circuit complexity of the MAC unit 820 increase. In some embodiments, the number of input data elements IDE ranges from 8 to 256. In some embodiments, the number of input data elements IDE ranges from 16 to 128. In some embodiments, the number of input data elements IDE is equal to 72.

Adder 830 is an electronic circuit, e.g., an IC, comprising one or more logic circuits configured to, in operation, receive each partial sum PSUM of the sequence of partial sums PSUM and to generate and output a corresponding sequence of accumulated sums ASUM by adding each partial sum PSUM to a stored accumulated sum SUM.

Buffer 840, also referred to in some embodiments as partial sum buffer 840, is an electronic circuit, e.g., an IC, including one or more data registers and/or latches configured to, in operation, receive each accumulated sum ASUM and store each accumulated sum ASUM as the stored accumulated sum SUM. Buffer 840 is configured to output the stored accumulated sum SUM to an input port of adder 830 and to circuitry (not shown) external to CIM circuit 800, e.g., a memory array.

In a partial sum accumulation operation, buffer 840 is configured to generate the stored accumulated sum SUM having an initial value of 0. After a total of W partial sum computations corresponding to the W weight data elements WDE have been performed, buffer 840 is thereby configured to store and output the stored accumulated sum SUM having the value of the final accumulated sum ASUM of the sequence of accumulated sums ASUM, the final accumulated sum ASUM being equal to the sum of the respective partial sums PSUM.

In some embodiments, the total number of bits of each of the accumulated sum ASUM and the stored accumulated sum SUM is greater than the number of bits of the partial sum PSUM. In some embodiments, the total number of bits of each of the accumulated sum ASUM and the stored accumulated sum SUM is equal to the number of bits of the partial sum PSUM plus W.
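
The interaction between the adder and the buffer can be sketched as a simple loop. The function name `accumulate` is hypothetical, and the values are modeled as unbounded Python integers rather than fixed-width registers, so this is a behavioral sketch rather than a description of the hardware.

```python
def accumulate(partial_sums):
    """Model the adder and buffer feedback loop.

    Returns the sequence of accumulated sums and the final stored sum.
    """
    stored_sum = 0                    # buffer starts with an initial value of 0
    accumulated = []
    for psum in partial_sums:
        asum = stored_sum + psum      # adder: partial sum plus stored accumulated sum
        stored_sum = asum             # buffer: store the new accumulated sum
        accumulated.append(asum)
    return accumulated, stored_sum


sequence, final_sum = accumulate([3, -1, 4])
print(sequence, final_sum)
```

Only the final stored value needs to leave the circuit, which is the point at which an access to external memory would occur in this model.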

As the number W of weight data elements WDE increases, both the computational power and the circuit complexity of CIM circuit 800 increase. In some embodiments, the number W has a value from 4 to 64. In some embodiments, the number W has a value from 8 to 32. In some embodiments, the number W is equal to 16.

With the configuration discussed above, CIM circuit 800 includes a partial sum buffer 840 configured to accumulate a plurality of partial sum data elements before they are output to, for example, an external memory array, thereby reducing access power and time requirements as compared to a method in which a single partial sum data element is output to an external memory array.

FIG. 9 is a flow chart of a method 900 of operating a memory circuit according to some embodiments. The method 900 may be used with a CIM circuit, such as the CIM circuit 800 discussed above with respect to FIG. 8.

The order of the operations of the method 900 depicted in fig. 9 is for illustration only; the operations of method 900 may be performed in a different order than depicted in fig. 9. In some embodiments, operations are performed before, between, during, and/or after those described in fig. 9 in addition to those described in fig. 9. In some embodiments, the operations of method 900 are a subset of the methods that perform CNN operations.

In operation 910, a plurality of input data elements and a plurality of weight data elements are received at a MAC unit of a CIM circuit. In some embodiments, receiving the plurality of input data elements and the plurality of weight data elements at a MAC unit of the CIM circuitry includes receiving the input data elements IDE and the weight data elements WDE at a MAC unit 820 of the CIM circuitry 800 discussed above with respect to fig. 8.

In some embodiments, receiving the plurality of input data elements and the plurality of weight data elements includes receiving the plurality of input data elements and the plurality of weight data elements from a memory array of a CIM circuit, e.g., memory array 810 discussed above with respect to fig. 8.

In operation 920, a MAC unit is used to generate a sequence of partial sums based on the plurality of input data elements and the plurality of weight data elements. In some embodiments, using the MAC unit to generate the sequence of partial sums based on the plurality of input data elements and the plurality of weight data elements includes using the MAC unit 820 to generate the partial sums PSUM based on the input data elements IDE and the weight data elements WDE, as discussed above with respect to fig. 8.

In operation 930, an adder is used to generate a sequence of accumulated sums by adding each partial sum of the sequence of partial sums to a stored accumulated sum. In some embodiments, using the adder to generate the sequence of accumulated sums by adding each partial sum of the sequence of partial sums to the stored accumulated sum includes using adder 830 to generate the sequence of accumulated sums ASUM by adding each partial sum PSUM to the stored accumulated sum SUM, as discussed above with respect to fig. 8.

At operation 940, a buffer is used to store each accumulated sum of the sequence of accumulated sums as the stored accumulated sum, output each stored accumulated sum to the adder, and output a final stored accumulated sum from the CIM circuit. In some embodiments, using the buffer to store each accumulated sum of the sequence of accumulated sums as the stored accumulated sum, output each stored accumulated sum to the adder, and output the final stored accumulated sum from the CIM circuit includes using buffer 840 to store each accumulated sum ASUM as the stored accumulated sum SUM, output each stored accumulated sum SUM to adder 830, and output the final stored accumulated sum SUM from CIM circuit 800, as discussed above with respect to fig. 8.

By performing some or all of the operations of method 900, partial sum data elements are accumulated before being output, e.g., to an external memory array, thereby achieving the benefits discussed above with respect to CIM circuit 800.

In some embodiments, a circuit includes: a multiplier circuit configured to receive a signed mantissa of each data element of a plurality of input data elements and a plurality of weight data elements, and to generate a plurality of two's complement products by multiplying and reformatting some or all of the signed mantissas of the plurality of input data elements and some or all of the signed mantissas of the plurality of weight data elements; a summing circuit configured to receive an exponent of each data element of the plurality of input data elements and the plurality of weight data elements, and to generate a plurality of sums by adding each exponent of the plurality of input data elements to each exponent of the plurality of weight data elements; a shift circuit configured to shift each product of the plurality of products by an amount equal to a difference between a respective sum of the plurality of sums and a maximum sum; and an adder tree configured to generate a mantissa sum from the plurality of shifted products. In some embodiments, the multiplier circuit and the summing circuit are configured to receive each data element of the plurality of input data elements and the plurality of weight data elements having a BF16 format, the multiplier circuit is configured to generate the plurality of products as 17-bit data elements, the summing circuit is configured to generate the plurality of sums as 9-bit data elements, the shift circuit is configured to generate the plurality of shifted products as 21-bit data elements, and the adder tree is configured to generate the mantissa sum as a 25-bit data element. 
In some embodiments, the multiplier circuit and the summing circuit are configured to receive each data element of the plurality of input data elements and the plurality of weight data elements having an FP16 format, the multiplier circuit is configured to generate the plurality of products as 23-bit data elements, the summing circuit is configured to generate the plurality of sums as 6-bit data elements, the shift circuit is configured to generate the plurality of shifted products as 27-bit data elements, and the adder tree is configured to generate the mantissa sum as a 31-bit data element. In some embodiments, the multiplier circuit is configured to perform the multiplying and reformatting operations by reformatting the some or all of the signed mantissas of the plurality of input data elements and the plurality of weight data elements into two's complement, and multiplying the some or all of the reformatted mantissas of the plurality of input data elements with the some or all of the reformatted mantissas of the plurality of weight data elements. In some embodiments, the multiplier circuit is configured to perform the multiplying and reformatting operations by XORing sign bits of the some or all of the signed mantissas of the plurality of input data elements and the plurality of weight data elements to generate a plurality of sign bits, generating a corresponding plurality of mantissa products by multiplying mantissa bits of the some or all of the signed mantissas of the plurality of input data elements with mantissa bits of the some or all of the signed mantissas of the plurality of weight data elements, and reformatting the plurality of sign bits and the plurality of mantissa products into two's complement. 
In some embodiments, the multiplier circuit and the summing circuit are configured to receive each data element of the plurality of input data elements and the plurality of weight data elements, each plurality having a total of four data elements, the multiplier circuit is configured to perform a total of sixteen or fewer multiplication operations on the plurality of input data elements and the plurality of weight data elements, and the summing circuit is configured to perform a total of sixteen summing operations on the plurality of input data elements and the plurality of weight data elements. In some embodiments, the shift circuit is configured to, for each product of the plurality of products, right shift the product by a shift amount, add a number of leading sign bits to the shifted product, the number being equal to the shift amount, and add one or more trailing zero bits corresponding to the shift amount being less than a difference threshold. In some embodiments, the shift circuit includes a first stage configured to generate a plurality of intermediate data elements from the plurality of products based on the two least significant bits of the respective differences, and a second stage configured to generate the plurality of shifted products from the plurality of intermediate data elements based on the other bits of the respective differences. In some embodiments, the circuit includes a difference circuit configured to determine the maximum sum of the plurality of sums, calculate each difference by subtracting a corresponding sum of the plurality of sums from the maximum sum, and output each difference to the shift circuit. 
In some embodiments, the shift circuit is configured to, for each difference, generate a respective shifted product of the plurality of shifted products from the respective product of the plurality of products based on the difference being less than a difference threshold, or generate the respective shifted product of the plurality of shifted products as a zero-value data element based on the difference being greater than or equal to the difference threshold. In some embodiments, the multiplier circuit is configured to receive each difference from the difference circuit and, for each difference, to multiply and reformat the corresponding input and weight data elements only if the difference is less than a difference threshold. In some embodiments, the circuit is configured to convert the mantissa sum into a sign bit plus a plurality of mantissa bits.
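
A two-stage shifter of the kind summarized above can be sketched as follows. The split of the difference into its two least significant bits and its remaining bits follows the description; the function name `two_stage_shift` and the modeling of products as signed Python integers are assumptions made for illustration.

```python
def two_stage_shift(product: int, diff: int) -> int:
    """Right shift a signed product by diff in two stages.

    Stage 1 uses the two least significant bits of diff (shift by 0 to 3).
    Stage 2 uses the remaining bits of diff (shift by a multiple of 4).
    """
    if diff < 0:
        raise ValueError("diff must be non-negative")
    intermediate = product >> (diff & 0b11)    # first stage: fine shift
    return intermediate >> (diff & ~0b11)      # second stage: coarse shift


# The two stages together are equivalent to a single arithmetic shift.
print(two_stage_shift(-1000, 6), -1000 >> 6)
```

Splitting the shift this way means each stage only has to select among a small number of shift amounts, which is a common motivation for staged shifter designs.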

In some embodiments, a method of operating a circuit includes receiving a signed mantissa and an exponent of each data element of a plurality of input data elements and a plurality of weight data elements, generating a plurality of two's complement products by multiplying and reformatting some or all of the signed mantissas of the plurality of input data elements and some or all of the signed mantissas of the plurality of weight data elements, generating a plurality of sums by adding each exponent of the plurality of input data elements to each exponent of the plurality of weight data elements, shifting each product of the plurality of products by an amount equal to a difference between a corresponding sum of the plurality of sums and a maximum sum, and adding the plurality of shifted products to generate a mantissa sum. In some embodiments, the circuit includes a memory array, and receiving the signed mantissa and exponent of each data element includes receiving each data element of the plurality of input data elements and the plurality of weight data elements stored in the memory array. In some embodiments, receiving the signed mantissa and exponent of each data element includes receiving each data element of the plurality of input data elements and the plurality of weight data elements having a BF16 format or an FP16 format. In some embodiments, generating the plurality of products includes, for each difference between a respective sum of the plurality of sums and the maximum sum, performing the multiplying and reformatting operations on the respective input and weight data elements only if the difference is less than a difference threshold. In some embodiments, shifting each product of the plurality of products includes generating an intermediate data element from the product based on the two least significant bits of the respective difference, and generating a respective shifted product from the intermediate data element based on the other bits of the respective difference. 
In some embodiments, the method includes converting the mantissa sum into a sign bit plus a plurality of mantissa bits.
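
The difference calculation and the threshold-gated multiplication mentioned in this summary can be sketched together. The function names, the representation of signed mantissas as Python integers, and the threshold value used in the example are all hypothetical.

```python
def differences(exp_sums):
    """Return the maximum exponent sum and each sum's distance from it."""
    max_sum = max(exp_sums)
    return max_sum, [max_sum - s for s in exp_sums]


def gated_products(mantissa_pairs, exp_sums, threshold: int):
    """Multiply only the pairs whose difference is below the threshold.

    Pairs at or above the threshold are not multiplied and contribute a
    zero-value data element instead.
    """
    _, diffs = differences(exp_sums)
    return [a * b if d < threshold else 0
            for (a, b), d in zip(mantissa_pairs, diffs)]


pairs = [(2, 3), (4, 5), (-1, 6), (7, 7)]
sums = [10, 7, 10, 3]
print(differences(sums))
print(gated_products(pairs, sums, threshold=5))
```

Skipping the multiplication for pairs that would be shifted out entirely avoids work whose result would not affect the mantissa sum.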

In some embodiments, a CIM circuit includes a memory array configured to store a plurality of input data elements and a plurality of weight data elements, a MAC unit configured to generate a sequence of partial sums based on the plurality of input data elements and the plurality of weight data elements, an adder configured to generate a sequence of accumulated sums by adding each partial sum of the sequence of partial sums to a stored accumulated sum, and a buffer configured to store each accumulated sum of the sequence of accumulated sums as the stored accumulated sum, output each stored accumulated sum to the adder, and output a final stored accumulated sum from the CIM circuit. In some embodiments, the buffer is configured to output the final stored accumulated sum to a memory array of another CIM circuit.

The foregoing outlines features of several embodiments so that those skilled in the art may better understand the aspects of the present disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.

Claims

1. A circuit, comprising:

a multiplier circuit configured to:

receive a signed mantissa of each data element of a plurality of input data elements and a plurality of weight data elements, and

generate a plurality of two's complement products by multiplying and reformatting part or all of the signed mantissas of the plurality of input data elements and part or all of the signed mantissas of the plurality of weight data elements;

a summing circuit configured to:

receive an exponent of each data element of the plurality of input data elements and the plurality of weight data elements, and

generate a plurality of sums by adding each exponent of the plurality of input data elements to each exponent of the plurality of weight data elements;

a shift circuit configured to shift each product of the plurality of products by an amount equal to a difference between a respective sum of the plurality of sums and a maximum sum; and

an adder tree configured to generate a mantissa sum from the shifted plurality of products.

2. The circuit of claim 1, wherein,

the multiplier circuit and the summing circuit are configured to receive each data element of the plurality of input data elements and the plurality of weight data elements having BF16 format,

the multiplier circuit is configured to generate the plurality of products as 17-bit data elements,

the summing circuit is configured to generate the plurality of sums as 9-bit data elements,

the shift circuit is configured to generate the plurality of shifted products as 21-bit data elements, and

the adder tree is configured to generate the mantissa sum as a 25-bit data element.

3. The circuit of claim 1, wherein,

the multiplier circuit and the summing circuit are configured to receive each data element of the plurality of input data elements and the plurality of weight data elements having FP16 format,

the multiplier circuit is configured to generate the plurality of products as 23-bit data elements,

the summing circuit is configured to generate the plurality of sums as 6-bit data elements,

the shift circuit is configured to generate the plurality of shifted products as 27-bit data elements, and

the adder tree is configured to generate the mantissa sum as a 31-bit data element.

4. The circuit of claim 1, wherein the multiplier circuit is configured to perform the multiplication and reformatting operations by:

reformatting the part or all of the signed mantissas of the plurality of input data elements and the plurality of weight data elements into two's complement, and

multiplying the part or all of the reformatted mantissas of the plurality of input data elements with the part or all of the reformatted mantissas of the plurality of weight data elements.

5. The circuit of claim 1, wherein the multiplier circuit is configured to perform the multiplication and reformatting operations by:

generating a plurality of sign bits by XORing sign bits of the part or all of the signed mantissas of the plurality of input data elements and the plurality of weight data elements,

generating a corresponding plurality of mantissa products by multiplying mantissa bits of the part or all of the signed mantissas of the plurality of input data elements with mantissa bits of the part or all of the signed mantissas of the plurality of weight data elements, and

reformatting the plurality of sign bits and the plurality of mantissa products into two's complement.

6. The circuit of claim 1, wherein,

the multiplier circuit and the summing circuit are configured to receive each data element of the plurality of input data elements and the plurality of weight data elements, each plurality having a total of 4 data elements,

the multiplier circuit is configured to perform a total of 16 or fewer multiplication operations on the plurality of input data elements and the plurality of weight data elements, and

the summing circuit is configured to perform a total of 16 summing operations on the plurality of input data elements and the plurality of weight data elements.

7. A method of operating a circuit, the method comprising:

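receiving a signed mantissa and an exponent of each data element of a plurality of input data elements and a plurality of weight data elements;

generating a plurality of two's complement products by multiplying and reformatting part or all of the signed mantissas of the plurality of input data elements and part or all of the signed mantissas of the plurality of weight data elements;

generating a plurality of sums by adding each exponent of the plurality of input data elements to each exponent of the plurality of weight data elements;

shifting each product of the plurality of products by an amount equal to a difference between a respective sum of the plurality of sums and a maximum sum; and

adding the plurality of shifted products to generate a mantissa sum.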

8. The method of claim 7, wherein,

the circuit includes a memory array, and

receiving the signed mantissa and the exponent of each data element includes receiving each data element of the plurality of input data elements and the plurality of weight data elements stored in the memory array.

9. An in-memory computing circuit comprising:

a memory array configured to store a plurality of input data elements and a plurality of weight data elements;

a multiply-accumulate (MAC) unit configured to generate a sequence of partial sums based on the plurality of input data elements and the plurality of weight data elements;

an adder configured to generate a sequence of accumulated sums by adding each partial sum of the sequence of partial sums to a stored accumulated sum; and

a buffer configured to:

store each accumulated sum of the sequence of accumulated sums as the stored accumulated sum,

output each stored accumulated sum to the adder, and

output a final stored accumulated sum from the in-memory computing circuit.

10. The in-memory computing circuit of claim 9, wherein the buffer is configured to output the final stored accumulated sum to a memory array of another in-memory computing circuit.