JP2021527886A

JP2021527886A - Configurable in-memory computing engine, platform, bit cell, and layout for it

Info

Publication number: JP2021527886A
Application number: JP2020570472A
Authority: JP
Inventors: Naveen Verma; Hossein Valavi; Hongyang Jia
Original assignee: The Trustees of Princeton University
Priority date: 2018-06-18
Filing date: 2019-06-18
Publication date: 2021-10-14

Abstract

Various embodiments include systems, methods, architectures, mechanisms, or apparatus for providing programmable or pre-programmed in-memory computing operations. An in-memory computing architecture comprises: a reshaping buffer configured to reshape a sequence of received data words to form massively parallel bit-wise input signals; a compute-in-memory (CIM) array whose CIM channel output signals are processed to provide a sequence of multi-bit output words; control circuitry configured to cause the CIM array to perform multi-bit computing operations on the input and accumulation signals using single-bit internal circuits and signals; and a near-memory computing path configured to provide the sequence of multi-bit output words as a computing result.

Description

This application claims the benefit of U.S. Provisional Patent Application No. 62/686,296, filed June 18, 2018; U.S. Provisional Patent Application No. 62/702,629, filed July 24, 2018; U.S. Provisional Patent Application No. 62/754,805, filed November 2, 2018; and U.S. Provisional Patent Application No. 62/756,951, filed November 7, 2018, each of which is incorporated herein by reference in its entirety.

The present invention relates to the fields of in-memory computing and matrix-vector multiplication.

In recent years, charge-domain in-memory computing has emerged as a robust and scalable way to perform in-memory computing. Here, compute operations within the memory bit cells provide their results as charge, typically via voltage-to-charge conversion through a capacitor. Bit-cell circuits therefore involve appropriate switching of a local capacitor in a given bit cell, where that local capacitor is also appropriately coupled to other bit-cell capacitors to yield an aggregated compute result across the coupled bit cells. In-memory computing is well suited to implementing matrix-vector multiplication, where matrix elements are stored in the memory array and vector elements are broadcast in parallel across the memory array.

Provisional Patent Application No. 62/555,959, "Analog Switched-Capacitor Neural Network", filed September 8, 2017

Various deficiencies in the prior art are addressed by systems, methods, architectures, mechanisms, or apparatus that provide programmable or pre-programmed in-memory computing operations.

One embodiment provides an in-memory computing architecture comprising: a reshaping buffer configured to reshape a sequence of received data words to form massively parallel bit-wise input signals; a compute-in-memory (CIM) array of bit cells configured to receive the massively parallel bit-wise input signals via a first CIM array dimension and to receive one or more accumulation signals via a second CIM array dimension, wherein each of a plurality of bit cells associated with a common accumulation signal forms a respective CIM channel configured to provide a respective output signal; analog-to-digital converter (ADC) circuitry configured to process the plurality of CIM channel output signals and thereby provide a sequence of multi-bit output words; control circuitry configured to cause the CIM array to perform a multi-bit computing operation on the input and accumulation signals using single-bit internal circuits and signals; and a near-memory computing path configured to provide the sequence of multi-bit output words as a computing result.

Additional objects, advantages, and novel features of the invention will be set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the general description given above and the detailed description of the embodiments given below, serve to explain the principles of the invention.

A diagram illustrating a typical structure of an in-memory computing architecture.
A high-level block diagram of an exemplary architecture according to an embodiment.
A high-level block diagram of an exemplary compute-in-memory unit (CIMU) suitable for use in the architecture of FIG. 2.
A high-level block diagram of an input-activation vector reshaping buffer (IA BUFF) suitable for use in the architecture of FIG. 2, according to an embodiment.
A high-level block diagram of a CIMA read/write buffer suitable for use in the architecture of FIG. 2, according to an embodiment.
A high-level block diagram of a near-memory datapath (NMD) module suitable for use in the architecture of FIG. 2, according to an embodiment.
A high-level block diagram of a direct memory access (DMA) module suitable for use in the architecture of FIG. 2, according to an embodiment.
A high-level block diagram of different embodiments of CIMA channel digitization/weighting suitable for use in the architecture of FIG. 2.
A high-level block diagram of different embodiments of CIMA channel digitization/weighting suitable for use in the architecture of FIG. 2.
A flow diagram of a method according to an embodiment.
A circuit diagram of a multiplying bit cell (M-BC).
A circuit diagram of three M-BCs configured to perform an XNOR function.
A circuit diagram of an M-BC according to an embodiment.
A diagram illustrating an exemplary IC of the M-BC of FIG. 13.
A block diagram of a bit cell having a switch-based coupled structure.
A block diagram of a bit cell having a switch-free coupled structure.
A circuit diagram of a bit-cell circuit having a switch-free coupled structure, according to an embodiment.
A circuit diagram of two-way interleaving of bit-cell layouts, according to an embodiment.

It should be understood that the accompanying drawings are not necessarily to scale and present a somewhat simplified representation of various features illustrative of the basic principles of the invention. The specific design features of the sequence of operations disclosed herein, including, for example, specific dimensions, orientations, locations, and shapes of various illustrated components, will be determined in part by the particular intended application and use environment. Certain features of the illustrated embodiments have been enlarged or distorted relative to others to facilitate visualization and clear understanding. In particular, thin features may be drawn thicker for clarity or illustration.

Before the present invention is described in further detail, it is to be understood that the invention is not limited to the particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.

Where a range of values is provided, it is understood that each intervening value between the upper and lower limits of that range, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges and are likewise encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present invention, a limited number of exemplary methods and materials are described herein. It must be noted that, as used herein and in the appended claims, the singular forms "a", "an", and "the" include plural referents unless the context clearly dictates otherwise.

The disclosed embodiments include systems and methods that enable programmability of a highly efficient linear-algebra computing architecture for modern VLSI implementations. The computing architecture disclosed herein is referred to as "in-memory computing", and programmability of this architecture enables its broad use across a range of applications, especially machine learning and artificial intelligence. The disclosed approach introduces a range of configurability features around in-memory computing arrays, which in turn enables their integration in programmable platforms to address the energy and throughput needs of a broad range of machine-learning and artificial-intelligence applications.

Linear-algebra operations, most notably matrix-vector multiplications, have become highly prominent in emerging workloads, especially those from machine learning (ML) and artificial intelligence (AI). The dimensions of the matrices and vectors can often be quite large (typically greater than 100). Because of the way elements are reused in such operations, the elements are typically stored in embedded or off-chip memory (depending on the number of elements that must be stored). Implementations in modern VLSI technologies have shown that the energy and delay of accessing data from such memories substantially exceed those of the actual computation on the data. This has limited the energy and delay reductions achievable by conventional accelerators, which separate memory and compute, and has motivated the paradigm of in-memory computing. In an in-memory computing system, raw data is not accessed from memory; instead, a compute result over many bits of raw data is accessed, thereby amortizing the access energy and delay.

The disclosed systems, methods, and portions thereof enable configurability and programmability of a highly efficient linear-algebra computing architecture for modern VLSI implementations and the like, leading to integrated-circuit implementations. The computing architecture disclosed herein may be broadly referred to as "in-memory computing", and programmability of this architecture enables broad use across a range of applications, including matrix-vector computations such as those used in machine learning, artificial intelligence, and other applications. In various embodiments, the disclosed systems, methods, and portions thereof enable configurability and programmability of an in-memory computing architecture using a hybrid analog/digital computing scheme that employs parallel and serial operations.

The disclosed approach can introduce a particular form of quantization noise in the course of the computations it performs. This noise is controlled through a number of architectural features disclosed herein, in some cases enabling operation that exhibits quantization noise like that of standard fixed-point-precision computing.

The disclosed systems and portions thereof have been implemented as integrated circuits, and various features thereof have been implemented on field-programmable gate arrays (FPGAs) as well as simulated using Verilog/transistor-level simulations.

The disclosed approach may be applied to a range of in-memory computing architectures, including the architecture disclosed in Provisional Patent Application No. 62/555,959, "Analog Switched-Capacitor Neural Network", filed September 8, 2017, which is incorporated herein by reference in its entirety as if fully set forth herein.

The disclosed approach enables configurability and programmability of in-memory computing, which is regarded as a highly energy-efficient approach to linear-algebra computations. Such configurability and programmability enable use of the disclosed architecture across a broad range of applications.

Further disclosed embodiments provide a bit-parallel/bit-serial approach to multi-bit in-memory computing. In particular, disclosed herein are in-memory computing architectures in which memory bit cells perform computations on 1-b operands, and in which the hardware is extended to enable operations on multi-bit operands using a bit-parallel/bit-serial (BP/BS) scheme: multiple bits of one operand are mapped to parallel bit cells, and multiple bits of the other operand are input serially. The disclosed approach contemplates: in-memory computing in which the compute outputs of the bit cells are digitized and then fed to further computation, possibly across different bit-cell output computations in time and space; in-memory computing in which multi-bit operand computation is performed in a BP/BS manner on the digitized outputs; in-memory computing in which the BP/BS approach uses two's-complement number representation, for example by employing bit-wise AND computation by the bit cells; and in-memory computing in which the BP/BS approach uses a different number representation, for example one in which 1/0 bits are taken to have mathematical values of +1/-1, to enable the use of XNOR computation by the bit cells.
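By way of illustration only, the relationship between bit-wise XNOR and multiplication in the +1/-1 representation can be sketched in Python as follows. The function names (`to_pm1`, `xnor`, `pm1_dot`) are hypothetical and do not appear in the disclosure; the sketch assumes ideal, noise-free bit-cell outputs.

```python
def to_pm1(bit):
    """Map a stored bit to its assumed mathematical value: 1 -> +1, 0 -> -1."""
    return 2 * bit - 1


def xnor(a, b):
    """Single-bit XNOR: 1 when the two bits are equal, otherwise 0."""
    return 1 - (a ^ b)


def pm1_dot(x_bits, w_bits):
    """Dot product of two +1/-1 vectors, computed only from bit-wise XNOR results."""
    matches = sum(xnor(x, w) for x, w in zip(x_bits, w_bits))
    # Each match contributes +1 and each mismatch contributes -1.
    return 2 * matches - len(x_bits)


# XNOR on the bits corresponds to multiplication on the +1/-1 values.
for a in (0, 1):
    for b in (0, 1):
        assert to_pm1(xnor(a, b)) == to_pm1(a) * to_pm1(b)
```

In this sketch the count of XNOR matches plays the role of the value accumulated over a set of bit cells; an actual implementation would obtain it from an analog accumulation followed by digitization.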

Configurable In-Memory Computing Engine and Platform
Various embodiments are directed to integrating configurability and hardware support around in-memory computing accelerators to enable the programmability and virtualization required for broadening to practical applications. Generally, in-memory computing implements matrix-vector multiplication, where matrix elements are stored in the memory array and vector elements are broadcast in parallel over the memory array. Several aspects of the embodiments are directed toward enabling programmability and configurability of such architectures.

In-memory computing typically involves a 1-b representation for the matrix elements, the vector elements, or both. This is because the memory stores data in independent bit cells, to which broadcast is done in a parallel, homogeneous fashion, without provision for the different binary-weighted coupling between bits that multi-bit computation requires. In this invention, extension to multi-bit matrix and vector elements is achieved via a bit-parallel/bit-serial (BPBS) scheme.

To enable the common compute operations that often surround matrix-vector multiplication, a highly configurable/programmable near-memory-computing data path is included. This data path both enables the computations required to extend from the bit-wise computations of in-memory computing to multi-bit computations and, for generality, supports multi-bit operations that are no longer constrained to the 1-b representations inherent to in-memory computing. Because programmable/configurable and multi-bit computing is more efficient in the digital domain, in this invention analog-to-digital conversion is performed following in-memory computing. In the particular embodiment, the configurable data path is multiplexed among eight ADCs/in-memory-computing channels, though other multiplexing ratios can be employed. This also aligns well with the BPBS scheme employed for multi-bit matrix-element support, where support for up to 8-b operands is provided in this embodiment.

Because input-vector sparsity is common in many linear-algebra applications, this invention incorporates support for energy-proportional sparsity control. This is achieved by masking the broadcast of bits from the input vector that correspond to zero-valued elements (such masking is done for all bits in the bit-serial process). This saves broadcast energy as well as compute energy within the memory array.
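As a purely illustrative sketch of the masking described above, the following Python fragment counts how many input-line broadcasts remain when zero-valued elements are masked for every bit-serial step. The function name `masked_bit_serial_steps` and the use of unsigned integer elements are assumptions made for the example, not details taken from the disclosure.

```python
def masked_bit_serial_steps(x, n_bits):
    """Return, per bit position (LSB first), the (row, bit) pairs that are broadcast."""
    # One mask flag per element, computed once and reused for every bit-serial step.
    active = [v != 0 for v in x]
    steps = []
    for b in range(n_bits):
        steps.append([(row, (v >> b) & 1) for row, v in enumerate(x) if active[row]])
    return steps


x = [0, 5, 0, 3]                       # two of the four elements are zero
steps = masked_bit_serial_steps(x, n_bits=3)
broadcasts = sum(len(s) for s in steps)
unmasked_broadcasts = len(x) * 3       # what would be driven without masking
```

Here the number of broadcasts scales with the number of non-zero elements rather than with the vector length, which is the sense in which the energy cost is proportional to sparsity in this sketch.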

Given the internal bit-wise compute architecture of in-memory computing and the external digital-word architecture of typical microprocessors, data-reshaping hardware is used both for the compute interface, through which input vectors are provided, and for the memory interface, through which matrix elements are written and read.

FIG. 1 illustrates a typical structure of an in-memory computing architecture. Consisting of a memory array (which may be based on standard bit cells or modified bit cells), in-memory computing involves two additional, "orthogonal" sets of signals, namely (1) input lines and (2) accumulation lines. Referring to FIG. 1, a two-dimensional array of bit cells is depicted, where each of a plurality of in-memory-computing channels 110 comprises a respective column of bit cells, and each of the bit cells of a channel is associated with a common accumulation line and bit line (column) and with a respective input line and word line (row). It is noted that the columns and rows of signals are described herein as "orthogonal" to each other simply to indicate the row/column relationship within the context of an array of bit cells, such as the two-dimensional array depicted in FIG. 1. The term "orthogonal" as used herein is not intended to convey any specific geometric relationship.

The input/bit and accumulation/bit sets of signals may be physically combined with existing signals within the memory (e.g., word lines, bit lines) or may be separate. To implement matrix-vector multiplication, the matrix elements are first loaded into the memory cells. Then multiple input-vector elements (possibly all) are applied at once via the input lines. This causes a local compute operation, typically some form of multiplication, to occur at each of the memory bit cells. The results of the compute operations are then driven onto the shared accumulation lines. In this way, each accumulation line represents a compute result over the multiple bit cells activated by the input-vector elements. This is in contrast to standard memory access, where bit cells are accessed via bit lines one at a time, activated by a single word line.
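The sequence just described can be modeled, in a deliberately idealized way, by the following Python sketch. It assumes 1-b matrix and vector elements, a bit-wise AND as the local multiplication, and an exact sum on each accumulation line; the function name `cim_matvec_1b` is hypothetical.

```python
def cim_matvec_1b(weights, x_bits):
    """Idealized 1-b in-memory matrix-vector product.

    weights[r][c] is the bit stored in row r of column (channel) c;
    x_bits[r] is the bit applied on the input line of row r.
    """
    n_cols = len(weights[0])
    acc = [0] * n_cols                      # one value per accumulation line
    for r, x in enumerate(x_bits):          # in hardware, all rows are driven at once
        for c in range(n_cols):
            acc[c] += weights[r][c] & x     # local 1-b multiply inside the bit cell
    return acc


stored = [[1, 0],
          [1, 1],
          [0, 1]]
result = cim_matvec_1b(stored, [1, 1, 0])
```

The nested loops are only a software stand-in: the point of the architecture is that every row contributes to its column's accumulation line in the same step, so a single readout per column replaces many individual bit-cell accesses.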

In-memory computing as described has a number of important attributes. First, the computation is typically analog. This is because the constrained structure of the memory and bit cells requires richer compute models than are enabled by simple digital switch-based abstractions. Second, the local operation at the bit cell typically involves computation with a 1-b representation stored in the bit cell. This is because the bit cells in a standard memory array do not couple with each other in any binary-weighted fashion; any such coupling must be achieved through the methods of bit-cell access/readout from the periphery. Below, the extensions to in-memory computing proposed in this invention are described.

Extension to Near-Memory and Multi-Bit Computation
While in-memory computing has the potential to address matrix-vector multiplication in a manner that conventional digital acceleration cannot, typical compute pipelines involve a range of other operations surrounding matrix-vector multiplication. Such operations are typically well addressed by conventional digital acceleration; nonetheless, it may be of high value to place such acceleration hardware near the in-memory-computing hardware, in an appropriate architecture that addresses the parallelism, the high throughput (and thus the need for high communication bandwidth), and the general compute patterns associated with in-memory computing. Because much of the surrounding operation is preferably done in the digital domain, analog-to-digital conversion via an ADC is included following each of the in-memory-computing accumulation lines, which are thus referred to as in-memory-computing channels. A primary challenge is integrating the ADC hardware within the pitch of each in-memory-computing channel, but the layout approaches taken in this invention enable this.

Introducing an ADC following each compute channel enables an efficient way of extending in-memory computing to support multi-bit matrix elements and multi-bit vector elements, via bit-parallel and bit-serial computation, respectively (together, BPBS computation). Bit-parallel computation involves loading the different matrix-element bits in different in-memory-computing columns. The ADC outputs from the different columns are then appropriately bit shifted to represent the corresponding bit weighting, and digital accumulation over all of the columns is performed to yield the multi-bit matrix-element compute result. Bit-serial computation, on the other hand, involves applying each bit of the vector elements one at a time, storing the ADC outputs each time, and appropriately bit shifting the stored outputs before digital accumulation with the next outputs, which correspond to subsequent input-vector bits. Such a BPBS approach, enabling a hybrid of analog and digital computation, is highly efficient because it exploits the high-efficiency, low-precision regime of analog (1-b) computation together with the high-efficiency, high-precision regime of digital (multi-bit) computation, while overcoming the access costs associated with conventional memory operations.
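As a minimal sketch of the BPBS scheme described above, the following Python function computes a dot product of multi-bit vectors using only 1-b column operations, bit shifts, and digital accumulation. It assumes unsigned operands, a bit-wise AND in the bit cell, and an ideal ADC with no quantization; the name `bpbs_dot` and its parameters are hypothetical and chosen for the example.

```python
def bpbs_dot(w, x, w_bits, x_bits):
    """Dot product of unsigned integer vectors w (matrix column) and x (input vector)."""
    total = 0
    for xb in range(x_bits):                      # bit-serial: one input-vector bit per step
        x_plane = [(v >> xb) & 1 for v in x]
        for wb in range(w_bits):                  # bit-parallel: one column per matrix-element bit
            column = [(v >> wb) & 1 for v in w]
            # Idealized channel output: count of rows where both bits are 1.
            adc_out = sum(c & i for c, i in zip(column, x_plane))
            # Shift by the combined bit weight, then accumulate digitally.
            total += adc_out << (wb + xb)
    return total


w = [3, 5, 2]
x = [1, 4, 7]
value = bpbs_dot(w, x, w_bits=3, x_bits=3)
```

In this sketch the inner loop over `wb` stands for columns that operate simultaneously in hardware, while the outer loop over `xb` corresponds to successive bit-serial steps. A two's-complement variant would additionally give the most significant bit a negative weight, which is not shown here.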

While a range of near-memory-computing hardware can be considered, details of the hardware integrated in the present embodiment of the invention are described below. To ease the physical layout of such multi-bit digital hardware, eight in-memory-computing channels are multiplexed to each near-memory-computing channel. We note that this enables the highly parallel operation of in-memory computing to be throughput matched with the high-frequency operation of digital near-memory computing (highly parallel analog in-memory computing operates at a lower clock frequency than digital near-memory computing). Each near-memory-computing channel then includes digital barrel shifters, multipliers, and accumulators, as well as look-up-table (LUT) and fixed non-linear function implementations. Additionally, configurable finite-state machines (FSMs) associated with the near-memory-computing hardware are integrated to control computation through the hardware.

Input Interfacing and Bit-Scalability Control
To integrate in-memory computing with a programmable microprocessor, the internal bit-wise operations and representations must be appropriately interfaced with the external multi-bit representations employed in typical microprocessor architectures. Thus, data reshaping buffers are included at both the input-vector interface and the memory read/write interface, through which matrix elements are stored in the memory array. Details of the design employed for the embodiment of the invention are described below. The data reshaping buffers enable bit-width scalability of the input-vector elements, while maintaining maximal bandwidth of data transfer to the in-memory computing hardware, between it and external memories, as well as to other architectural blocks. The data reshaping buffers consist of register files that serve as line buffers, receiving incoming parallel multi-bit data element-by-element for an input vector and providing outgoing parallel single-bit data for all vector elements.
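The word-to-bit-plane reshaping described above can be sketched as follows. This is only a functional illustration, not the register-file design: the packing order (least significant element first within each 32-bit word) and the names `unpack_words` and `bit_planes` are assumptions.

```python
def unpack_words(words, elem_bits):
    """Split 32-bit words into unsigned elements of elem_bits each.

    Assumption: elements are packed least significant first.
    """
    per_word = 32 // elem_bits
    mask = (1 << elem_bits) - 1
    return [(w >> (k * elem_bits)) & mask for w in words for k in range(per_word)]


def bit_planes(elements, elem_bits):
    """Return one list per bit position holding that bit of every element."""
    return [[(e >> b) & 1 for e in elements] for b in range(elem_bits)]


words = [0x04030201, 0x08070605]          # eight 8-b elements: 1 through 8
elements = unpack_words(words, elem_bits=8)
planes = bit_planes(elements, elem_bits=8)
print(elements)
print(planes[0])                          # bit 0 of all elements, sent in one step
```

Each entry of `planes` corresponds to one bit-serial step in which a single bit of every vector element is presented in parallel.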

In addition to word-wise/bit-wise interfacing, hardware support is also included for convolutional operations applied to input vectors. Such operations are prominent in convolutional neural networks (CNNs). In this case, matrix-vector multiplication is performed with only a subset of new vector elements needing to be provided (the other input-vector elements are stored in the buffers and simply shifted appropriately). This mitigates bandwidth constraints for getting data to the high-throughput in-memory-computing hardware. In the embodiment of the invention, the convolutional support hardware, which must also perform proper bit-serial sequencing of the multi-bit input-vector elements, is implemented within specialized buffers whose output readout properly shifts data for configurable convolutional striding.

Dimensionality and Sparsity Control
For programmability, two additional considerations must be addressed by the hardware: (1) matrix/vector dimensions can vary from application to application; and (2) in many applications the vectors are sparse.

Regarding dimensionality, in-memory computing hardware often integrates control to enable/disable tiled portions of an array, so as to consume energy only for the dimensionality levels desired in an application. In the BPBS approach employed, however, input-vector dimensionality has important implications for computation energy and SNR. Regarding SNR, with bit-wise compute in each in-memory-computing channel, and presuming that the computation between each input (provided on an input line) and the data stored in a bit cell yields a one-bit output, the number of distinct levels possible on an accumulation line is equal to N+1, where N is the input-vector dimensionality. This suggests the need for a log2(N+1)-bit ADC. However, an ADC has an energy cost that scales strongly with the number of bits. Thus, to reduce the relative contribution of the ADC energy, it may be beneficial to support a very large N but fewer than log2(N+1) bits in the ADC. The result of doing this is that the signal-to-quantization-noise ratio (SQNR) of the compute operation differs from that of standard fixed-precision compute, and is reduced with the number of ADC bits. Thus, to support varying application-level dimensionality and SQNR requirements, with corresponding energy consumption, hardware support for configurable input-vector dimensionality is essential. For instance, if reduced SQNR can be tolerated, large-dimensional input-vector segments should be supported. On the other hand, if high SQNR must be maintained, lower-dimensional input-vector segments should be supported, with inner-product results from multiple input-vector segments combinable from different in-memory-computing banks (in particular, the input-vector dimensionality could thus be reduced to a level set by the number of ADC bits, to ensure compute ideally matched with standard fixed-precision operation). The hybrid analog/digital approach taken in the invention enables this. Namely, input-vector elements can be masked to filter broadcast to only the desired dimensionality. This saves broadcast energy and bit-cell compute energy in proportion to the input-vector dimensionality.
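The relation between input-vector dimensionality and ADC resolution can be made concrete with a small calculation. The sketch below only restates the N+1 level count from the text; the helper names are hypothetical, and splitting a vector into equal-resolution segments is one possible reading of the segment-combining option, not a description of the actual bank control.

```python
def adc_bits_for_exact_readout(n):
    """Bits needed to distinguish the n + 1 levels on an accumulation line.

    For n >= 0, n.bit_length() equals ceil(log2(n + 1)) without float rounding.
    """
    return n.bit_length()


def max_exact_dimension(adc_bits):
    """Largest N whose N + 1 levels fit within the ADC resolution."""
    return (1 << adc_bits) - 1


def segments_needed(n, adc_bits):
    """Number of segments if each segment must be read out without loss."""
    limit = max_exact_dimension(adc_bits)
    return -(-n // limit)                 # ceiling division


print(adc_bits_for_exact_readout(2304))   # bits for the full 3x3x256 vector
print(max_exact_dimension(8))             # exact range of an 8-b ADC
print(segments_needed(2304, 8))           # segments to keep an 8-b readout exact
```

Under these assumptions a full 2304-element vector would need a 12-bit ADC for a lossless readout, whereas an 8-bit ADC resolves at most 255 elements per segment exactly.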

Regarding sparsity, the same masking approach can be applied throughout the bit-serial operations to prevent broadcasting of all input-vector element bits that correspond to zero-valued elements. We note that the BPBS approach employed is particularly conducive to this. This is because, while the expected number of non-zero elements is often known in sparse-linear-algebra applications, the input-vector dimensionalities can be large. The BPBS approach thus allows the input-vector dimensionality to be increased while still ensuring that the number of levels required to be supported on the accumulation lines is within the ADC resolution, thereby ensuring high computational SQNR. While the expected number of non-zero elements is known, it is still essential to support a variable number of actual non-zero elements, which can differ from input vector to input vector. This is readily achieved in the hybrid analog/digital approach, since the masking hardware has simply to count the number of zero-valued elements for the given vector and then apply a corresponding offset to the final inner-product result, in the digital domain after the BPBS operation.
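One way to picture the mask-count offset is the single-bit XNOR case. The sketch below assumes weights in {-1, +1} and inputs in {-1, 0, +1}, with zero inputs masked so that their bit cells contribute nothing to the accumulation line; the encoding and the function name `masked_xnor_dot` are illustrative assumptions rather than the specification's circuit behavior.

```python
def masked_xnor_dot(weights, inputs):
    """Inner product of +/-1 weights with inputs in {-1, 0, +1}.

    Assumption: masked (zero) inputs add nothing to the column sum, and the
    count of masked elements sets the digital offset applied afterwards.
    """
    mask = [x == 0 for x in inputs]
    w_bits = [(w + 1) // 2 for w in weights]      # -1 -> 0, +1 -> 1
    x_bits = [(x + 1) // 2 for x in inputs]       # ignored where masked
    # Column sum of XNOR outputs over the unmasked positions ("analog" part).
    count = sum(1 for wb, xb, m in zip(w_bits, x_bits, mask)
                if not m and wb == xb)
    n, zeros = len(inputs), sum(mask)
    # Each unmasked XNOR output of 1 is worth +1 and each 0 is worth -1.
    return 2 * count - (n - zeros)


w = [1, -1, 1, 1, -1]
x = [1, 0, -1, 0, -1]
print(masked_xnor_dot(w, x), sum(a * b for a, b in zip(w, x)))
```

The offset `n - zeros` changes from vector to vector, which is why the mask count has to be taken per input vector rather than fixed at design time.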

Exemplary Integrated Circuit Architecture
FIG. 2 depicts a high-level block diagram of an exemplary architecture according to an embodiment. Specifically, the exemplary architecture of FIG. 2 was implemented as an integrated circuit using VLSI fabrication techniques, with specific components and functional elements, so as to test the various embodiments herein. It will be appreciated that further embodiments with different components (e.g., larger or more powerful CPUs, memory elements, processing elements and so on) are contemplated by the inventors to be within the scope of this disclosure.

As depicted in FIG. 2, the architecture 200 comprises a central processing unit (CPU) 210 (e.g., a 32-bit RISC-V CPU), a program memory (PMEM) 220 (e.g., a 128 KB program memory), a data memory (DMEM) 230 (e.g., a 128 KB data memory), an external memory interface 235 (e.g., configured to access, illustratively, one or more 32-bit external memory devices (not shown) to thereby extend accessible memory), a bootloader module 240 (e.g., configured to access an 8 KB off-chip EEPROM (not shown)), a compute-in-memory unit (CIMU) 300 including various configuration registers 255 and configured to perform in-memory computing and various other functions in accordance with the embodiments described herein, a direct-memory-access (DMA) module 260 including various configuration registers 265, and various support/peripheral modules, such as a universal asynchronous receiver/transmitter (UART) module 271 for receiving/transmitting data, a general-purpose input/output (GPIO) module 273, various timers 274 and the like. Other elements not depicted herein, such as an SoC configuration module (not shown), may also be included in the architecture 200 of FIG. 2.

The CIMU 300 is very well suited to matrix-vector multiplication and the like; however, other types of computations/calculations may be more suitably performed by non-CIMU computational apparatus. Therefore, in various embodiments a close-proximity coupling between the CIMU 300 and near memory is provided, such that the selection of the computational apparatus tasked with specific computations and/or functions may be controlled to provide a more efficient compute function.

FIG. 3 depicts a high-level block diagram of an exemplary compute-in-memory unit (CIMU) 300 suitable for use in the architecture of FIG. 2. The following discussion relates to the architecture 200 of FIG. 2 as well as the exemplary CIMU 300 suitable for use within the context of that architecture 200.

Generally speaking, the CIMU 300 comprises various structural elements, including a computation-in-memory array (CIMA) of bit cells configured via, illustratively, various configuration registers, to thereby provide programmable in-memory computational functions such as matrix-vector multiplication and the like. In particular, the exemplary CIMU 300 is configured as a 590 kb, 16-bank CIMU tasked with multiplying an input matrix X by an input vector A to produce an output matrix Y.

Referring to FIG. 3, the CIMU 300 is depicted as including a computation-in-memory array (CIMA) 310, an input-activation vector reshaping buffer (IA BUFF) 320, a sparsity/AND-logic controller 330, a memory read/write interface 340, a row decoder/WL driver 350, a plurality of A/D converters 360, and a near-memory-computing multiply-shift-accumulate data path (NMD) 370.

The illustrative computation-in-memory array (CIMA) 310 comprises a 256×(3×3×256) computation-in-memory array arranged as 4×4 clock-gateable 64×(3×3×64) in-memory-computing arrays, thus having a total of 256 in-memory-computing channels (e.g., memory columns); 256 ADCs 360 are also included to support the in-memory-computing channels.
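The array dimensions quoted above can be cross-checked with simple arithmetic. The sketch assumes that the 4×4 tiles stack four-wide along the channel dimension and four-high along the input dimension; the function name `cima_geometry` is hypothetical.

```python
def cima_geometry(grid=(4, 4), tile_channels=64, tile_inputs=3 * 3 * 64):
    """Derive overall array size from the tile size (assumed tiling layout)."""
    tiles_high, tiles_wide = grid
    channels = tiles_wide * tile_channels        # in-memory-computing columns
    inputs = tiles_high * tile_inputs            # input-vector dimensionality
    return {
        "tiles": tiles_high * tiles_wide,
        "channels": channels,
        "inputs": inputs,
        "bit_cells": channels * inputs,
    }


geometry = cima_geometry()
print(geometry)
```

Under this reading the 16 tiles give 256 channels by 2304 inputs, or 589,824 bit cells, which is consistent with the 590 kb, 16-bank figure and with one ADC per channel.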

The IA BUFF 320 operates to receive a sequence of, illustratively, 32-bit data words and to reshape these 32-bit data words into a sequence of high-dimensionality vectors suitable for processing by the CIMA 310. It is noted that data words of 32 bits, 64 bits or any other width may be reshaped to conform to the available or selected size of the compute-in-memory array 310, which itself is configured to operate on high-dimensionality vectors comprising elements that may be 2 to 8 bits, 1 to 8 bits or some other size, and that are applied in parallel across the array. It is also noted that the matrix-vector multiplication operation described herein is depicted as utilizing the entirety of the CIMA 310; however, in various embodiments only a portion of the CIMA 310 is used. Further, in various other embodiments the CIMA 310 and associated logic circuitry are adapted to provide an interleaved matrix-vector multiplication operation, wherein parallel portions of the matrix are simultaneously processed by respective portions of the CIMA 310.

In particular, the IA BUFF 320 reshapes the sequence of 32-bit data words into highly parallel data structures that may be added to the CIMA 310 at once (or at least in larger chunks) and properly sequenced in a bit-serial manner. For example, a four-bit compute having eight vector elements may be associated with a high-dimensionality vector of over 2000 n-bit data elements. The IA BUFF 320 forms this data structure.

As depicted herein, the IA BUFF 320 is configured to receive the input matrix X as a sequence of, illustratively, 32-bit data words and to resize/reposition the sequence of received data words in accordance with the size of the CIMA 310, illustratively to provide a data structure comprising 2303 n-bit data elements. Each of these 2303 n-bit data elements, along with a respective masking bit, is communicated from the IA BUFF 320 to the sparsity/AND-logic controller 330.

The sparsity/AND-logic controller 330 is configured to receive the, illustratively, 2303 n-bit data elements and respective masking bits, and to responsively invoke a sparsity function wherein zero-valued data elements (such as indicated by respective masking bits) are not propagated to the CIMA 310 for processing. In this manner, the energy otherwise necessary for the processing of such bits by the CIMA 310 is conserved.

In operation, the CPU 210 reads the PMEM 220 and bootloader 240 through a direct data path implemented in a standard manner. The CPU 210 may access the DMEM 230, IA BUFF 320 and memory read/write buffer 340 through a direct data path implemented in a standard manner. All of these memory modules/buffers, the CPU 210 and the DMA module 260 are connected by an AXI bus 281. Chip configuration modules and other peripheral modules are grouped by an APB bus 282, which is attached to the AXI bus 281 as a slave. The CPU 210 is configured to write to the PMEM 220 through the AXI bus 281. The DMA module 260 is configured to access the DMEM 230, IA BUFF 320, memory read/write buffer 340 and NMD 370 through dedicated data paths, and to access all other accessible memory space through the AXI/APB bus, such as per a DMA controller 265. The CIMU 300 performs the BPBS matrix-vector multiplication described above. Further details of these and other embodiments are provided below.

Thus, in various embodiments, the CIMA operates in a bit-serial bit-parallel (BPBS) manner to receive vector information, perform matrix-vector multiplication, and provide a digitized output signal (i.e., Y = AX) which may be further processed by another compute function as appropriate to provide a compound matrix-vector multiplication function. Generally speaking, the embodiments described herein provide an in-memory computing architecture comprising: a reshaping buffer configured to reshape a sequence of received data words to form massively parallel bit-wise input signals; a compute-in-memory (CIM) array of bit cells configured to receive the massively parallel bit-wise input signals via a first CIM array dimension and to receive one or more accumulation signals via a second CIM array dimension, wherein each of a plurality of bit cells associated with a common accumulation signal forms a respective CIM channel configured to provide a respective output signal; analog-to-digital converter (ADC) circuitry configured to process the plurality of CIM channel output signals to thereby provide a sequence of multi-bit output words; control circuitry configured to cause the CIM array to perform a multi-bit computing operation on the input and accumulation signals using single-bit internal circuits and signals; and a near-memory computing path configured to provide the sequence of multi-bit output words as a computing result.

Memory Map and Programming Model
Since the CPU 210 is configured to access the IA BUFF 320 and memory read/write buffer 340 directly, these two memory spaces look similar to the DMEM 230 from the perspective of a user program and in terms of latency and energy, especially for structured data such as array/matrix data and the like. In various embodiments, when the in-memory computing feature is not activated or is only partially activated, the memory read/write buffer 340 and CIMA 310 may be used as normal data memory.

FIG. 4 depicts a high-level block diagram of an input-activation vector reshaping buffer (IA BUFF) 320 according to an embodiment and suitable for use in the architecture of FIG. 2. The depicted IA BUFF 320 supports input-activation vectors with element precision from 1 bit to 8 bits; other precisions may also be accommodated in various embodiments. According to the bit-serial flow mechanism discussed herein, a particular bit of all elements in an input-activation vector is broadcast at once to the CIMA 310 for a matrix-vector multiplication operation. However, the highly parallel nature of this operation requires that the elements of the high-dimensionality input-activation vector be provided with maximum bandwidth and minimum energy; otherwise the throughput and energy-efficiency benefits of in-memory computing would not be harnessed. To achieve this, the input-activation reshaping buffer (IA BUFF) 320 may be constructed as follows, so that in-memory computing can be integrated in a 32-bit (or other bit-width) microprocessor architecture, whereby the hardware for the corresponding 32-bit data transfers is maximally utilized for the highly parallel internal organization of in-memory computing.

Referring to FIG. 4, the IA BUFF 320 receives 32-bit input signals, which may contain input-vector elements of bit precision from 1 to 8 bits. The 32-bit input signals are thus first stored in 4×8-b registers 410, of which there are a total of 24 (denoted herein as registers 410-0 through 410-23). These registers 410 provide their contents to 8 register files (denoted as register files 420-0 through 420-8), each of which has 96 columns, and in which an input vector, with dimensionality up to 3×3×256 = 2304, is arranged with its elements in parallel columns. In the case of 8-b input elements, this is done by the 24 4×8-b registers 410 providing 96 parallel outputs to one of the register files 420; in the case of 1-b input elements, it is done by the 24 4×8-b registers 410 providing 1536 parallel outputs to all eight register files 420 (with intermediate configurations for other bit precisions). The height of each register-file column is 2×4×8-b, allowing each input vector (with element precision up to 8 bits) to be stored in 4 segments and enabling double buffering for cases where all input-vector elements are to be loaded. On the other hand, for cases where as few as one-third of the input-vector elements are to be loaded (i.e., a CNN with a stride of 1), one out of every four register-file columns serves as a buffer, allowing data from the other three columns to be propagated forward to the CIMU for computation.

Thus, of the 96 columns output by each register file 420, only 72 are selected by a respective circular barrel-shifting interface 430, giving a total of 576 outputs across the 8 register files 420 at once. These outputs correspond to one of the four input-vector segments stored in the register files. Thus, four cycles are required to load all the input-vector elements into the sparsity/AND-logic controller 330, within 1-b registers.

To exploit sparsity in the input-activation vector, a mask bit is generated for each data element while the CPU 210 or DMA 260 writes into the reshaping buffer 320. The masked input activation prevents charge-based computation operations in the CIMA 310, which saves computation energy. The mask vector is also stored in SRAM blocks, organized similarly to the input-activation vector but with a one-bit representation.

The 4-to-3 barrel shifter 430 is used to support VGG-style (3×3 filter) CNN computation. Only one of three of the input-activation vectors needs to be updated when moving to the next filtering operation (convolutional reuse), which saves energy and enhances throughput.
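The 96-to-72 circular selection can be pictured as choosing three of four column groups with wrap-around, so that the remaining group can be refilled while the other three feed the array. The grouping into four blocks of 24 columns and the name `circular_select` are assumptions made for illustration only.

```python
def circular_select(columns, start_group, groups=4, keep=3):
    """Pick `keep` consecutive groups of columns, wrapping around the end.

    Assumption: the columns of a register file split into equal groups.
    """
    size = len(columns) // groups
    picked = []
    for k in range(keep):
        g = (start_group + k) % groups
        picked.extend(columns[g * size:(g + 1) * size])
    return picked


columns = list(range(96))                  # one register file's column indices
selected = circular_select(columns, start_group=2)
print(len(selected), selected[0], selected[-1])
print(8 * len(selected), 4 * 8 * len(selected))
```

Eight register files at 72 selected columns each give the 576 outputs per cycle mentioned in the text, and four such cycles cover the 2304-element vector.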

FIG. 5 depicts a high-level block diagram of a CIMA read/write buffer 340 according to an embodiment and suitable for use in the architecture of FIG. 2. The depicted CIMA read/write buffer 340 is organized as, illustratively, a 768-bit-wide static random-access memory (SRAM) block 510, while the word width of the depicted CPU is 32 bits in this example; the read/write buffer 340 is used to interface between them.

The read/write buffer 340 as depicted contains a 768-bit write register 511 and a 768-bit read register 512. The read/write buffer 340 generally acts like a cache to the wide SRAM block in the CIMA 310; however, some details are different. For example, the read/write buffer 340 writes back to the CIMA 310 only when the CPU 210 writes to a different row, while reading a different row does not trigger a write-back. When the read address matches the tag of the write register, the modified bytes (indicated by dirty bits) in the write register 511 are bypassed to the read register 512, instead of being read from the CIMA 310.
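The write-back and bypass rules described above can be modelled behaviorally. The sketch below is a simplification under stated assumptions: it uses byte-granular accesses instead of 32-bit words, a single write register with a row tag, and a hypothetical class name `RowBuffer`; it omits the read register and all timing.

```python
class RowBuffer:
    """Behavioral model of a one-row write buffer in front of a wide array."""

    ROW_BYTES = 96                                 # 768 bits per row

    def __init__(self, n_rows):
        self.array = [bytearray(self.ROW_BYTES) for _ in range(n_rows)]
        self.write_tag = None                      # row held in the write register
        self.write_reg = bytearray(self.ROW_BYTES)
        self.dirty = [False] * self.ROW_BYTES
        self.write_backs = 0

    def _write_back(self):
        row = self.array[self.write_tag]
        for i, is_dirty in enumerate(self.dirty):
            if is_dirty:
                row[i] = self.write_reg[i]
        self.dirty = [False] * self.ROW_BYTES
        self.write_backs += 1

    def write(self, row, offset, value):
        # Only a write to a different row triggers a write-back.
        if self.write_tag is not None and row != self.write_tag:
            self._write_back()
        self.write_tag = row
        self.write_reg[offset] = value
        self.dirty[offset] = True

    def read(self, row, offset):
        # Dirty bytes of the tagged row bypass the array; reads never write back.
        if row == self.write_tag and self.dirty[offset]:
            return self.write_reg[offset]
        return self.array[row][offset]


buf = RowBuffer(n_rows=4)
buf.write(1, 0, 0xAB)
print(buf.read(1, 0), buf.array[1][0], buf.write_backs)
```

In this model a read of another row leaves the pending bytes in the write register, and the array row is updated only once a write targets a different row.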

Accumulation-line analog-to-digital converters (ADCs). The accumulation lines from the CIMA 310 each have an 8-bit SAR ADC, fitting into the pitch of the in-memory-computing channel. To save area, a finite-state machine (FSM) controlling the bit-cycling of the SAR ADCs is shared among the 64 ADCs required in each in-memory-computing tile. The FSM control logic consists of 8+2 shift registers, generating pulses to cycle through the reset, sampling, and then 8 bit-decision phases. The shift-register pulses are broadcast to the 64 ADCs, where they are locally buffered and used to trigger the local comparator decision, store the corresponding bit decision in the local ADC-code register, and then trigger the next capacitor-DAC configuration. High-precision metal-oxide-metal (MOM) capacitors may be used to enable a small size for each ADC's capacitor array.
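The bit-cycling performed under FSM control follows the usual successive-approximation search, which can be sketched behaviorally. The code below models an ideal 8-bit SAR conversion and the 8+2 phase sequence; the reference voltage, the function names and the phase labels are assumptions, and no capacitor-DAC non-idealities are modelled.

```python
def sar_convert(v_in, v_ref=1.0, n_bits=8):
    """Ideal successive approximation: decide one bit per phase, MSB first."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)                  # tentatively set this bit
        if v_in >= trial * v_ref / (1 << n_bits):  # comparator decision
            code = trial                           # keep the bit
    return code


def fsm_phases(n_bits=8):
    """Phase sequence produced by an (n_bits + 2)-stage shift register."""
    return ["reset", "sample"] + [f"bit{b}" for b in reversed(range(n_bits))]


print(sar_convert(0.25), sar_convert(0.5))
print(fsm_phases())
```

Because every ADC in a tile steps through the same phase sequence, one shared phase generator can serve all 64 converters while each keeps its own comparator and code register.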

FIG. 6 depicts a high-level block diagram of a near-memory datapath (NMD) module 600 according to an embodiment and suitable for use in the architecture of FIG. 2, though digital near-memory computing with other features can be employed. The NMD module 600 depicted in FIG. 6 shows a digital computation data path after the ADC output which supports multi-bit matrix multiplication via the BPBS scheme.

In the particular embodiment, the 256 ADC outputs are organized into groups of 8 for the digital computation flow. This enables support of up to 8-bit matrix-element configurations. The NMD module 600 thus contains 32 identical NMD units. Each NMD unit consists of multiplexers 610/620 to select from 8 ADC outputs 610 and corresponding biases 621, multiplicands 622/623, shift numbers 624 and accumulation registers; an adder 631 with an 8-bit unsigned input and a 9-bit signed input to subtract the global bias and mask count; a signed adder 632 to compute the local bias for neural-network tasks; a fixed-point multiplier 633 to perform scaling; a barrel shifter 634 to compute the exponent of the multiplicand and perform shifts for different bits in weight elements; a 32-bit signed adder 635 to perform accumulation; eight 32-bit accumulation registers 640 to support weights with 1-, 2-, 4- and 8-bit configurations; and a ReLU unit 650 for neural-network applications.
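The sequence of operations in one NMD unit can be sketched as a plain integer pipeline. The ordering of the stages follows the order in which the components are listed above, which is an assumption; bit widths, saturation and register selection are not modelled, and the name `nmd_unit` and its parameters are hypothetical.

```python
def nmd_unit(adc_outputs, shifts, global_bias=0, mask_count=0,
             local_bias=0, scale=1, relu=True):
    """Subtract bias, add local bias, scale, shift and accumulate ADC outputs.

    Assumption: one shift amount per ADC output, matching its bit weight.
    """
    acc = 0
    for adc, shift in zip(adc_outputs, shifts):
        v = adc - (global_bias + mask_count)   # adder: remove bias and mask count
        v = v + local_bias                     # signed adder: local bias
        v = v * scale                          # fixed-point multiplier: scaling
        acc += v << shift                      # barrel shifter, then accumulation
    return max(acc, 0) if relu else acc        # optional ReLU on the result


print(nmd_unit([10, 20], shifts=[0, 1], global_bias=5, mask_count=1))
print(nmd_unit([1], shifts=[0], global_bias=5, relu=False))
```

Grouping eight ADC outputs per unit lets the eight bit columns of one 8-bit weight be combined by a single unit, or several lower-precision weights be accumulated in separate registers.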

FIG. 7 depicts a high-level block diagram of a direct-memory-access (DMA) module 700 according to an embodiment and suitable for use in the architecture of FIG. 2. The depicted DMA module 700 comprises, illustratively, two channels to support simultaneous data transfer from/to different hardware resources, and five independent data paths from/to the DMEM, IA BUFF, CIMU R/W BUFF, NMD result and AXI4 bus, respectively.

Bit-Parallel/Bit-Serial (BPBS) Matrix-Vector Multiplication
The BPBS scheme for multi-bit MVM (y = Ax) is shown in FIG. 8, where B_A is the number of bits used for the matrix elements a_m,n, B_x is the number of bits used for the input-vector elements x_n, and N is the dimensionality of the input vector, which can be up to 2304 in the hardware of this embodiment (M_n is a mask bit used for sparsity and dimensionality control). The multiple bits of a_m,n are mapped to parallel CIMA columns, and the multiple bits of x_n are input serially. Multi-bit multiplication and accumulation can then be achieved via in-memory computing using either bit-wise XNOR or bit-wise AND, both of which are supported by the multiplying bit cell (M-BC) of this embodiment. In particular, bit-wise AND differs from bit-wise XNOR in that its output must remain low when the input-vector-element bit is low. The M-BC of this embodiment takes the input-vector-element bits (one at a time) as a differential signal. The M-BC implements XNOR, where each logic "1" output in the truth table is achieved by driving to V_DD via the true and complement signals, respectively, of the input-vector-element bit. AND is therefore easily achieved simply by masking the complement signal, so that the output remains low, yielding the truth table corresponding to AND.
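The relationship between the XNOR and AND modes can be illustrated with a small behavioral model. This is only a sketch, not the circuit itself: the function names and the assumption that the true signal drives the output when the stored bit is 1 (and the complement signal when it is 0) are illustrative choices, not taken from the source.

```python
def mbc_output(w: int, ia: int, iab: int) -> int:
    """Behavioral model of one multiplying bit cell (M-BC).

    w   : stored matrix-element bit (0 or 1)
    ia  : true input-vector bit signal
    iab : complement input-vector bit signal
    Assumption: the output is driven to logic 1 through the true signal
    when w = 1, or through the complement signal when w = 0.
    """
    return (ia & w) | (iab & (1 - w))


def xnor_mode(w: int, x: int) -> int:
    # Differential drive: IAb is the complement of IA.
    return mbc_output(w, ia=x, iab=1 - x)


def and_mode(w: int, x: int) -> int:
    # Complement signal masked (held low), so the output stays low when x = 0.
    return mbc_output(w, ia=x, iab=0)


# Truth tables keyed by (stored bit, input bit).
xnor_table = {(w, x): xnor_mode(w, x) for w in (0, 1) for x in (0, 1)}
and_table = {(w, x): and_mode(w, x) for w in (0, 1) for x in (0, 1)}
```

In this model the only difference between the two modes is whether the complement input is driven, which mirrors the masking described above.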

Bit-wise AND can support the standard 2's complement number representation for multi-bit matrix and input-vector elements. This involves properly applying a negative sign, in the digital domain after the ADC, to the column computations corresponding to the most-significant-bit (MSB) elements, before adding the digitized outputs to those of the other column computations.
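The sign handling can be written out explicitly. The following is a sketch in notation that does not appear in the source: a_{m,n}^{(i)} and x_n^{(j)} denote bit i of a matrix element and bit j of an input-vector element, and s^A_i, s^x_j are assumed sign factors equal to -1 for the MSB and +1 for every other bit.

```latex
a_{m,n} = \sum_{i=0}^{B_A-1} s^{A}_{i}\, 2^{i}\, a_{m,n}^{(i)}, \qquad
x_{n} = \sum_{j=0}^{B_x-1} s^{x}_{j}\, 2^{j}\, x_{n}^{(j)}, \qquad
s^{A}_{i} = \begin{cases} -1 & i = B_A - 1 \\ +1 & \text{otherwise} \end{cases}, \quad
s^{x}_{j} = \begin{cases} -1 & j = B_x - 1 \\ +1 & \text{otherwise} \end{cases}

y_{m} = \sum_{n=1}^{N} a_{m,n}\, x_{n}
      = \sum_{i=0}^{B_A-1} \sum_{j=0}^{B_x-1} s^{A}_{i}\, s^{x}_{j}\, 2^{i+j}
        \sum_{n=1}^{N} a_{m,n}^{(i)}\, x_{n}^{(j)}
```

Because the product of two 1/0 bits equals their logical AND, the innermost sum is what a single column accumulates for one input bit position; the outer sums correspond to the digital weighting and signed addition applied after the ADC.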

Bit-wise XNOR requires a slight modification of the number representation: element bits map to +1/-1 rather than 1/0, so two bits with equivalent LSB weighting are needed to represent zero properly. This is done as follows. First, each B-bit operand (in standard 2's complement representation) is decomposed into a (B+1)-bit signed integer. For example, an operand y is decomposed into B+1 plus/minus-one bits, which are then combined with their binary weightings to recover y (the two expressions given at this point in the original publication are not reproduced in this text).

With 1/0-valued bits mapped to the mathematical values +1/-1, bit-wise in-memory-computing multiplication may be realized via a logical XNOR operation. An M-BC that performs logical XNOR using a differential signal for the input-vector element can thus enable signed multi-bit multiplication by bit-weighting and adding the digitized outputs from the column computations.
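Two properties of the +1/-1 mapping can be checked numerically: XNOR on the 1/0 bits equals multiplication of the mapped values, and zero only becomes representable once a second LSB-weighted bit is added. The sketch below is illustrative only. The operand width B = 3 and the specific weight list (binary weights with the unit weight duplicated) are assumptions made for the demonstration, since the source does not reproduce its expressions.

```python
from itertools import product


def to_pm1(bit: int) -> int:
    """Map a stored 1/0 bit to the mathematical value +1/-1."""
    return 2 * bit - 1


def xnor(a: int, b: int) -> int:
    return 1 - (a ^ b)


# XNOR on the 1/0 bits equals multiplication on the +1/-1 values.
xnor_is_product = all(
    to_pm1(xnor(a, b)) == to_pm1(a) * to_pm1(b)
    for a, b in product((0, 1), repeat=2)
)


def representable(weights):
    """All values reachable as sum(w * s) with each s in {+1, -1}."""
    return {
        sum(w * s for w, s in zip(weights, signs))
        for signs in product((+1, -1), repeat=len(weights))
    }


B = 3  # hypothetical operand width
# B plus/minus-one bits with plain binary weights 1, 2, 4.
plain = representable([2**i for i in range(B)])
# B+1 plus/minus-one bits: assumed weights 2, 4 plus two bits of weight 1.
with_dup_lsb = representable([2**i for i in range(1, B)] + [1, 1])

zero_without_dup = 0 in plain
zero_with_dup = 0 in with_dup_lsb
```

With plain binary weights every reachable sum is odd, so zero is missing; the pair of unit-weight bits can cancel, which is one way to read the requirement for two bits with equivalent LSB weighting.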

AND-based and XNOR-based M-BC multiplication represent two options, but other options are also possible by using appropriate number representations with the logical operations possible in the M-BC. Such alternatives are beneficial. For example, XNOR-based M-BC multiplication is preferred for binarized (1-b) computations, while AND-based M-BC multiplication enables a more standard number representation that eases integration within digital architectures. Further, the two approaches yield slightly different signal-to-quantization-noise ratios (SQNR), so the choice can be made based on application needs.

Heterogeneous Computing Architecture and Interfaces
Various embodiments described herein contemplate different aspects of charge-domain in-memory computing, in which a bit cell (or multiplying bit cell, M-BC) drives an output voltage corresponding to a computation result onto a local capacitor. The capacitors from an in-memory-computing channel (column) are then coupled to realize accumulation via charge redistribution. As noted above, such capacitors may be formed using particular geometries that are very easy to replicate in a VLSI process, for example via wires that are simply placed close to each other and are thus coupled through an electric field. A local bit cell formed with such a capacitor thus stores a charge representing a one or a zero, and summing the charges of many such capacitors or bit cells locally implements the multiply and accumulate/sum functions that are the core operations of matrix-vector multiplication.
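Accumulation by charge redistribution can be sketched with an idealized model. The assumptions here are not from the source: all local capacitors are equal, there are no parasitics, and the function name, V_DD value, and example column are hypothetical.

```python
def charge_share(bit_outputs, v_dd=1.0, c_unit=1.0):
    """Ideal charge redistribution across equal local capacitors.

    Each bit cell drives its capacitor to v_dd (logic 1) or 0 V (logic 0).
    Shorting the capacitors together conserves total charge, so the shared
    voltage is the total charge divided by the total capacitance.
    """
    total_charge = sum(c_unit * v_dd * b for b in bit_outputs)
    total_capacitance = c_unit * len(bit_outputs)
    return total_charge / total_capacitance


column = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical bit-cell outputs in one column
v_out = charge_share(column)       # four ones out of eight cells
```

Under these assumptions the shared voltage is proportional to the number of bit cells whose 1-b product is one, which is the quantity the column ADC then digitizes.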

The various embodiments described above advantageously provide improved bit-cell-based architectures, computing engines, and platforms. Matrix-vector multiplication is one operation that standard digital processing or digital acceleration does not perform efficiently. Performing this one type of computation in memory therefore offers a large advantage over existing digital designs. However, various other types of operations are performed using digital designs.

Various embodiments contemplate mechanisms for connecting/interfacing these bit-cell-based architectures, computing engines, platforms, and the like to more conventional digital computing architectures and platforms so as to form a heterogeneous computing architecture. In this manner, computations well suited to bit-cell architecture processing (e.g., matrix-vector processing) are processed as described above, while other computations well suited to traditional computer processing are processed via a traditional computer architecture. That is, various embodiments provide a computing architecture that includes the highly parallel processing mechanism described herein, where this mechanism is connected to a plurality of interfaces so that it can be externally coupled to a more conventional digital computing architecture. The digital computing architecture can thus be directly and efficiently aligned with the in-memory-computing architecture, allowing the two to be placed in close proximity so as to minimize data-movement overhead between them. For example, a machine learning application may comprise 80% to 90% matrix-vector computations, which still leaves 10% to 20% of other types of computations/operations to be performed. By combining the in-memory computing discussed herein with near-memory computing, which is more conventional in architecture, the resulting system provides exceptional configurability for performing many types of processing. Various embodiments therefore contemplate near-memory digital computation in combination with the in-memory computing described herein.

The in-memory computations discussed herein are massively parallel but single-bit operations. For example, a bit cell often stores only one bit: a one or a zero. The signal driven to the bit cell is typically an input vector (i.e., each matrix element is multiplied by each vector element in a 2D vector multiplication operation). The vector element is placed on a signal that is likewise digital and only one bit wide, so that the vector element is one bit as well.

Various embodiments extend the matrices/vectors from 1-bit elements to multi-bit elements using a bit-parallel/bit-serial approach.

FIGS. 8A and 8B depict high-level block diagrams of different embodiments of CIMA channel digitization/weighting suitable for use in the architecture of FIG. 2. Specifically, FIG. 8A depicts a digital binary weighting and summation embodiment similar to that described above with respect to various other figures. FIG. 8B depicts an analog binary weighting and summation embodiment, in which various circuit elements are modified to enable the use of fewer analog-to-digital converters than the embodiment of FIG. 8A and/or other embodiments described herein.

As discussed above, various embodiments contemplate that a compute-in-memory (CIM) array of bit cells is configured to receive massively parallel bit-wise input signals via a first CIM array dimension (e.g., rows of a 2D CIM array) and to receive one or more accumulation signals via a second CIM array dimension (e.g., columns of a 2D CIM array), where each of a plurality of bit cells associated with a common accumulation signal (depicted as, e.g., a column of bit cells) forms a respective CIM channel configured to provide a respective output signal. Analog-to-digital converter (ADC) circuitry is configured to process the plurality of CIM channel output signals to provide a sequence of multi-bit output words. Control circuitry is configured to cause the CIM array to perform a multi-bit computing operation on the input and accumulation signals using single-bit internal circuits and signals, such that an operably engaged near-memory computing path may be configured to provide the sequence of multi-bit output words as a computing result.

Referring to FIG. 8A, a digital binary weighting and summation embodiment performing the ADC circuitry function is depicted. In particular, a two-dimensional CIMA 810A receives matrix input values at a first (row) dimension (i.e., via a plurality of buffers 805) and vector input values at a second (column) dimension, and operates in accordance with control circuitry and the like (not shown) to provide various channel output signals CH-OUT.

The ADC circuitry of FIG. 8A provides, for each CIM channel, a respective ADC 760 configured to digitize the CIM channel output signal CH-OUT, and a respective shift register 865 configured to impart a respective binary weighting to the digitized CIM channel output signal CH-OUT, thereby forming a respective portion of a multi-bit output word 870.

Referring to FIG. 8B, an analog binary weighting and summation embodiment performing the ADC circuitry function is depicted. In particular, a two-dimensional CIMA 810B receives matrix input values at a first (row) dimension (i.e., via a plurality of buffers 805) and vector input values at a second (column) dimension, and operates in accordance with control circuitry and the like (not shown) to provide various channel output signals CH-OUT.

The ADC circuitry of FIG. 8B provides four controllable (or preset) banks of switches 815-1, 815-2, and so on within the CIMA 810B. These banks operate to couple and/or decouple the capacitors formed therein, thereby implementing an analog binary weighting scheme for each of one or more subgroups of channels. Each channel subgroup provides a single output signal, so that only one ADC 860B is needed to digitize the weighted analog sum of the CIM channel output signals of the respective subset of CIM channels and thereby form a respective portion of a multi-bit output word.

FIG. 9 depicts a flow diagram of a method according to an embodiment. Specifically, the method 900 of FIG. 9 is directed to the various processing operations implemented by the architectures, systems, and so on described herein, in which an input matrix/vector is extended so as to be computed in a bit-parallel/bit-serial approach.

At step 910, the matrix and vector data are loaded into the appropriate memory locations.

At step 920, each of the vector bits (MSB through LSB) is processed sequentially. Specifically, the MSB of the vector is multiplied by the MSB of the matrix, by the MSB-1 of the matrix, by the MSB-2 of the matrix, and so on down to the LSB of the matrix. The resulting analog charge results are then digitized for each of the MSB-through-LSB vector multiplications, and the results are latched. This process is repeated for the vector MSB-1, the vector MSB-2, and so on through the vector LSB, until each of the vector bits (MSB through LSB) has been multiplied by each of the MSB-through-LSB elements of the matrix.

At step 930, the bits are shifted to apply the proper weighting, and the results are added together. Note that in some embodiments where analog weighting is used, the shifting operation of step 930 is unnecessary.
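The flow of steps 910 through 930 can be sketched in software for the AND-based, 2's complement case. This is a minimal functional model under stated assumptions, not the hardware implementation: the function names and the example matrix and vector are hypothetical, the column result is modeled as an exact count (no ADC quantization), and mask bits are omitted.

```python
def to_bits(value: int, width: int):
    """Two's complement bits of value, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def bit_weight(i: int, width: int) -> int:
    """Binary weight of bit i; the MSB carries a negative sign."""
    return -(1 << i) if i == width - 1 else (1 << i)


def bpbs_mvm(matrix, vector, b_a: int, b_x: int):
    """Bit-parallel/bit-serial matrix-vector product using bitwise AND."""
    result = [0] * len(matrix)
    # Bit-serial over input-vector bits, MSB first (step 920).
    for j in reversed(range(b_x)):
        x_bits = [to_bits(x, b_x)[j] for x in vector]
        for m, row in enumerate(matrix):
            # Bit-parallel over matrix-element bits: one column per bit.
            for i in range(b_a):
                a_bits = [to_bits(a, b_a)[i] for a in row]
                # Column result: count of 1-bit AND products, standing in
                # for the digitized charge accumulated on that column.
                column_count = sum(a & x for a, x in zip(a_bits, x_bits))
                # Weight by bit position and add (step 930).
                result[m] += bit_weight(i, b_a) * bit_weight(j, b_x) * column_count
    return result


A = [[3, -2, 1], [-4, 0, 2]]  # hypothetical 4-bit signed matrix
x = [1, -3, 2]                # hypothetical 4-bit signed input vector
y = bpbs_mvm(A, x, b_a=4, b_x=4)
expected = [sum(a * v for a, v in zip(row, x)) for row in A]
```

In hardware the inner loop over matrix bits runs in parallel across columns and only the loop over input bits is sequential; the nested software loops are used here purely to make the weighting explicit.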

Various embodiments enable highly stable and robust computations to be performed within the circuitry used to store data in dense memories. Further, various embodiments advance the computing engines and platforms described herein by enabling higher density for the memory bit-cell circuitry. The density can be increased both because of a more compact layout and because of the enhanced compatibility of that layout with the highly aggressive design rules used for memory circuitry (i.e., push rules). Various embodiments substantially enhance the performance of processors for machine learning and other linear algebra.

Disclosed is a bit-cell circuit that can be used within an in-memory-computing architecture. The disclosed approach enables highly stable/robust computation to be performed within the circuitry used to store data in dense memories. The disclosed approach for robust in-memory computing enables higher density for the memory bit-cell circuitry than known approaches. The density is higher both because of a more compact layout and because of the enhanced compatibility of that layout with the highly aggressive design rules used for memory circuitry (i.e., push rules). The disclosed device can be fabricated using standard CMOS integrated-circuit processing.

Memory accesses dominate energy and delay in many compute workloads. Memory energy and delay arise because, in standard memories, raw data is accessed row by row, incurring a communication cost in moving the data from the point of storage to the point of compute outside the memory array. In-memory-computing architectures, on the other hand, access a computational result over many bits of data stored across the rows, thereby accessing many rows at once and amortizing the communication cost.

While such amortization reduces energy and delay by a multiplicative factor (roughly, the number of rows accessed simultaneously), the primary challenge is that the computational signal-to-noise ratio (SNR) is reduced by a corresponding factor. This is because computation over a large number of bits generally increases the required dynamic range, and fitting this range within the limited swing of the existing bit lines in a memory squeezes the SNR. In particular, the computational noise of in-memory architectures is dominated by the variations and non-linearities of the compute operations performed by the bit cells. In standard memories, bit cells provide an output current. This has made current-domain computation a natural choice for in-memory-computing architectures, with the aim of minimizing changes to the high-density bit cells used as standard.

However, bit-cell current is subject to the high levels of variation and non-linearity that affect the bit-cell transistors. This limits the SNR of in-memory computing, and thus its scalability. Improvements are obtained in accordance with various embodiments by using charge-domain computation. Here, the computation output from the bit cell is stored as charge on a capacitor. For example, various embodiments contemplate the use of metal-finger capacitors laid out above the bit-cell transistors; such capacitors incur no additional area, thus allowing the bit cells to maintain a dense structure. Importantly for computational SNR, such capacitors also exhibit very good linearity, as well as high stability in the presence of process and temperature variations. This has substantially increased the scalability of in-memory computing.

While the metal-finger capacitor can be laid out above the bit cell, some circuit changes within the bit cell are necessary for switched-capacitor charge-domain computation. The present invention focuses on circuits and layouts that enable high density of charge-domain bit-cell computation in static random access memory (SRAM). In particular, circuits and layouts that enhance compatibility with the aggressive push design rules used for SRAM bit cells are described.

FIG. 10 depicts a circuit diagram of a multiplying bit cell. The bit cell of FIG. 10 can perform operations such as storage, writing, and reading of 1-b data, and additionally enables multiplication between the stored 1-b data and the 1-b IA/IAb signal (differential). The structure is thus referred to as a multiplying bit cell (M-BC). Note that 1-b multiplication corresponds to a logical XNOR operation. To implement this, PMOS transistors are added, coupled to the bit-cell storage nodes and driven by the IA/IAb signal, as shown in FIG. 10.

FIG. 11 depicts a circuit diagram of three M-BCs configured to perform the XNOR function. Specifically, the operation of the circuit of FIG. 11 comprises, for the three M-BCs: (1) the M-BC capacitors are unconditionally discharged by asserting the capacitor shorting switches (TSHORT/TSHORTb) and the discharge NMOS transistor (PRE); (2) TSHORT/TSHORTb are de-asserted and IA/IAb are driven, storing the M-BC XNOR outputs on the local M-BC capacitors; and (3) IA/IAb are de-asserted and TSHORT is asserted, accumulating the charge from all of the XNOR results to give a multiply-accumulate computation. Note that, in addition to the eight M-BC transistors, two additional NMOS/PMOS transistors are required per M-BC to implement TSHORT/TSHORTb. In the previous design, these NMOS/PMOS transistors were laid out outside of each M-BC, together with three M-BCs, to allow efficient sharing of nodes.

FIGS. 11A and 11B depict an exemplary integrated circuit (IC) layout of the 8-transistor M-BC (FIG. 11A) alongside that of a standard SRAM bit cell (FIG. 11B). The differences in PCB size and complexity between the two types of bit cells can be seen by inspection.

FIG. 12 depicts an exemplary IC layout of a group of three M-BCs (the overlaying metal-finger capacitors are also shown) and the TSHORT NMOS/PMOS transistors. Note that the PMOS transistors used for bit-cell computation lead to balanced use of NMOS and PMOS transistors within the M-BC, which results in an IC layout very different from that of a standard 6-transistor (6T) SRAM bit cell. This affects the potential for push-rule compatibility.

Various embodiments contemplate a new M-BC circuit that adds NMOS transistors for bit-cell computation. The embodiments thereby provide an IC layout that is both denser and closer to the standard 6T bit cell, and that also enhances push-rule compatibility.

FIG. 13 depicts a circuit diagram of an M-BC according to an embodiment. Specifically, the M-BC 1300 of FIG. 13 implements 1-b charge-domain multiplication (XNOR) using NMOS transistors. Here, the NMOS inputs IA/IAb are low during the unconditional discharge of the local capacitor and are then driven differentially for computation.

FIG. 14 depicts an exemplary layout of the M-BC of FIG. 13. Specifically, the layout of FIG. 14 contemplates compactly including the NMOS/PMOS TSHORT switches within a single M-BC (the metal-finger capacitor is laid out above the bit cell). In the layout 1400 of FIG. 14, the signals WL and IA/IAb run horizontally, while the signals BL, BLb, PA, VDD, and GND run vertically. This layout has roughly twice the area of a standard 6T cell and exploits the opportunity to share several nodes with surrounding M-BCs.

While the disclosed approach uses more area than standard memory bit-cell circuitry, much of the bit cell has been demonstrated with push rules, and the extensions to it are similar to other structures that have successfully used push rules. The disclosed approach substantially enhances the performance of processors for machine learning and other linear algebra. Such gains have been demonstrated experimentally for an earlier architecture, and the disclosed approach substantially advances that architecture.

As discussed above, the compute operation within a memory bit cell typically provides its result as charge, using voltage-to-charge conversion through a capacitor. The bit-cell circuitry therefore involves appropriate switching of the local capacitor in a given bit cell, where that local capacitor is also appropriately coupled to the other bit-cell capacitors so as to yield an aggregated compute result across the coupled bit cells.

Disclosed herein are charge-injection-robust bit cells and bit-cell layouts for reconfigurable charge-domain in-memory computing. The disclosed device, specifically a bit-cell circuit, can be used within an in-memory-computing architecture. The disclosed approach enables highly stable/robust computation, as well as reconfigurable computation, to be performed within the circuitry used to store data in dense memories. The disclosed approach enables greater robustness and reconfigurability for in-memory computing than prior approaches. The disclosed device may be fabricated using standard CMOS integrated-circuit processing. The disclosed approach is considered to have significant utility to the semiconductor industry, as it can substantially enhance the performance of processors for machine learning and other linear algebra.

The approach disclosed herein pertains to two new aspects of bit-cell circuits: (1) a configuration in which the coupling across bit-cell capacitors can be achieved without the need for an explicit switch (non-switched coupling structure); and (2) a physical layout in which coupled bit cells are interleaved with other coupled bit cells (interleaved layout).

The non-switched coupling structure pertains to a bit cell that provides its computation result on one of the capacitor plates, with the coupling across capacitors achieved via the other capacitor plate. This is in contrast to a switched coupling structure, in which the bit-cell circuit provides the computation result on the same capacitor plate that is ultimately coupled to the other capacitors, typically through a switch.

FIG. 15A depicts a block diagram of a bit cell with a switched coupling structure, while FIG. 15B depicts a block diagram of a bit cell with a non-switched coupling structure. In both cases, the coupled capacitors first need to be reset (removing the charge on them), for example by shorting the output node at which the capacitors are coupled. A compute operation f(.) is then performed locally in the bit cell. This is illustrated for two operands a and b, where one operand is stored in the bit cell and the other is provided externally from the periphery of the bit cell; in general, however, structures with more operands are possible. The compute operation then drives the plate of the local capacitor, which is coupled to the other capacitors either through a switch on the same plate (switched coupling structure) or without a switch on the other plate (non-switched coupling structure). Advantageously, the non-switched coupling structure of FIG. 15B avoids the need for a coupling switch in the bit cell and, further, has the potential to reduce the effects of charge-injection errors, such as those that arise when the switch is implemented with MOSFETs, which can absorb/release a variable amount of charge (depending on voltage level) and thereby slightly corrupt the charge-domain computation.

FIG. 16 depicts a circuit diagram of a bit cell circuit having a non-switched coupling structure according to one embodiment. Other variations of this circuit are also possible within the context of the disclosed embodiments. The bit cell 1600 of FIG. 16 enables implementation of either an XNOR or an AND operation between the stored data W/Wb (held in the six-transistor cross-coupled circuit formed by MN1-3/MP1-2) and the input data IA/IAb. For an XNOR operation, for example, IA/IAb can be driven complementarily after reset, so that the bottom plate of the local capacitor is pulled up or down according to IA XNOR W. For an AND operation, on the other hand, only IA is driven after reset (IAb stays low), so that the bottom plate of the local capacitor is pulled up or down according to IA AND W. Advantageously, this structure can reduce the total switching energy of the capacitors, owing to the series pull-up/pull-down charging structure that results among all of the coupled capacitors, and can reduce the effect of switch charge-injection errors, owing to the elimination of the coupling switch at the output node.
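The way the IA/IAb drive pattern selects between XNOR and AND can be sketched behaviorally. The model below is an assumption made for illustration, not the transistor-level circuit of FIG. 16: it treats the capacitor plate as pulled high when IA coincides with W or IAb coincides with Wb. The function names are hypothetical.

```python
def bit_cell_output(w, ia, iab):
    """Behavioral model (assumed): plate is high if IA and W, or IAb and Wb."""
    wb = 1 - w  # complementary stored value
    return (ia & w) | (iab & wb)


def xnor_mode(w, x):
    """XNOR: IA and IAb are driven complementarily after reset."""
    return bit_cell_output(w, ia=x, iab=1 - x)


def and_mode(w, x):
    """AND: only IA is driven after reset; IAb stays low."""
    return bit_cell_output(w, ia=x, iab=0)


if __name__ == "__main__":
    # Print the truth table for both modes.
    for w in (0, 1):
        for x in (0, 1):
            print(w, x, xnor_mode(w, x), and_mode(w, x))
```

In this model the same cell yields either operation purely through how the inputs are driven, which mirrors the configurability described for the bit cell.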

FIG. 17 depicts a circuit diagram of a two-way interleaving of the bit cell layout according to one embodiment. Specifically, an interleaved layout of bit cells refers to a layout in which the capacitors may be coupled together in two or more sets. FIG. 17 illustrates the case of two-way interleaving, but higher degrees of interleaving are also contemplated in various embodiments. Further, while the capacitors are laid out to the side of the column, in practice they may also be placed above the bit cell transistors and/or at other locations adjacent to the bit cell transistors. A particular benefit of this structure is improved configurability. Because the outputs are provided on two different nodes, couplings A and B can be used to implement separate computations. Alternatively, couplings A and B can be used to implement a joint computation, for example by appropriately combining the outputs on the different nodes through suitable peripheral circuitry.
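The separate and joint uses of couplings A and B can be sketched with the same idealized equal-capacitor model. This is an illustration under stated assumptions: cells are assigned to sets in round-robin order, each set's node settles to the mean of its cells' local results, and the joint computation is modeled as ideal charge sharing across all sets, which is only one possible peripheral combination. The function names are hypothetical.

```python
def split_into_sets(cell_outputs, ways=2):
    """Assign bit cells to coupling sets in round-robin order (assumed)."""
    if ways < 1 or len(cell_outputs) < ways:
        raise ValueError("need at least one cell per coupling set")
    return [cell_outputs[k::ways] for k in range(ways)]


def interleaved_outputs(cell_outputs, ways=2):
    """Separate computations: one normalized output per coupling set."""
    return [sum(s) / len(s) for s in split_into_sets(cell_outputs, ways)]


def joint_output(cell_outputs, ways=2):
    """Joint computation: ideal merging of all set nodes (assumed)."""
    sets = split_into_sets(cell_outputs, ways)
    return sum(sum(s) for s in sets) / sum(len(s) for s in sets)


if __name__ == "__main__":
    results = [1, 0, 1, 1]  # local bit cell results, in column order
    print(interleaved_outputs(results))  # outputs of sets A and B
    print(joint_output(results))         # combined output
```

With equal capacitors, the joint result equals the output of a single non-interleaved column, so interleaving adds configurability without changing the combined computation in this model.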

It is noted that the functions depicted and described herein may be implemented in hardware or in a combination of software and hardware, for example using a general-purpose computer, one or more application-specific integrated circuits (ASICs), or any other hardware equivalents. It is contemplated that some of the steps discussed herein may be implemented within hardware, for example as circuitry that cooperates with a processor to perform various method steps. Portions of the functions/elements described herein may be implemented as a computer program product in which computer instructions, when processed by a computing device, adapt the operation of the computing device such that the methods or techniques described herein are invoked or otherwise provided. Instructions for invoking the inventive methods may be stored in tangible and non-transitory computer-readable media, such as fixed or removable media or memory, or stored within a memory in a computing device operating according to the instructions.

Various modifications may be made to the systems, methods, apparatus, mechanisms, techniques, and portions thereof described herein with respect to the various figures, and such modifications are contemplated as being within the scope of the invention. For example, while a specific order of steps or arrangement of functional elements is presented in the various embodiments described herein, various other orders or arrangements of steps or functional elements may be used within the context of the various embodiments. Further, while modifications to embodiments may be discussed individually, various embodiments may use multiple modifications contemporaneously or in sequence, compound modifications, and the like.

While specific systems, apparatus, methodologies, mechanisms, and the like have been disclosed as discussed above, it should be apparent to those skilled in the art that many more modifications besides those already described are possible without departing from the inventive concepts herein. The inventive subject matter is therefore not to be restricted except in the spirit of the disclosure. Moreover, in interpreting the disclosure, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the terms "comprises" and "comprising" should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, utilized, or combined with other elements, components, or steps that are not expressly referenced. In addition, the references listed herein are also part of the application and are incorporated by reference in their entirety as if fully set forth herein.

Aspects of various embodiments are specified in the claims and/or in the following numbered clauses.

1. A bit cell circuit configuration comprising: a bit cell storage circuit coupled to at least one bit cell compute device; and a bit cell capacitor coupled to the bit cell compute device, wherein the bit cell capacitor is further coupled to one or more additional capacitors without a switch between the bit cell capacitor and the additional capacitors.

2. The bit cell circuit configuration of clause 1, wherein a cathode plate of the bit cell capacitor is coupled to the bit cell compute device.

3. The bit cell circuit configuration of clause 1, wherein an anode plate of the bit cell capacitor is coupled to the additional capacitors.

4. The bit cell circuit configuration of clause 1, wherein the bit cell compute device is configured to perform a compute operation on two operands.

5. A bit cell circuit configuration as illustrated in FIG. 11.

6. The bit cell circuit configuration of clause 5, wherein the configuration enables implementation of an XNOR or AND operation between stored data and input data.

7. An interleaved layout for at least two of the bit cell configurations of clause 1, wherein the bit cell capacitors are coupled together in at least two sets.

8. The interleaved layout of clause 7, wherein the sets of coupled bit cell capacitors are placed above one or more bit cell transistors.

9. A charge-domain in-memory computing bit cell that drives one plate of a local capacitor, wherein coupling to other bit cell capacitors is achieved at the other plate.

10. A charge-domain in-memory computing bit cell capable of implementing an XNOR or AND operation between stored data and input data.

11. An interleaved layout for charge-domain in-memory computing bit cells, wherein the bit cell capacitors are coupled in multiple different sets.

12. An interleaved layout for charge-domain in-memory computing bit cells, wherein the bit cell capacitors are coupled in multiple different sets such that there are X sets for X-way interleaving, where X is an integer greater than 1.

13. A layout for charge-domain in-memory computing bit cells, wherein the different sets of coupled capacitors are placed above the bit cell transistors.

14. A multiplying bit cell (M-BC) configured to perform a charge-domain computation between data stored in the bit cell and a 1-b input signal.

15. The M-BC of clause 14, wherein one or more NMOS transistors are used to perform the charge-domain computation.

16. The M-BC of clause 14, further comprising a capacitor that is a metal structure placed above the bit cell.

17. The M-BC of clause 14, wherein the M-BC is configured to implement a logic operation.

18. The M-BC of clause 17, wherein the logic operation includes an XNOR operation, a NAND operation, an AND operation, and other logic operations.

19. The M-BC of clause 14, further comprising a layout as illustrated in FIG. 12.

20. The M-BC of clause 14, further comprising an extended layout of a 6T cell.

21. The M-BC of clause 20, further comprising transistors having a regular poly structure.

22. The M-BC of clause 14, further comprising a layout as illustrated in FIG. 13.

Although various embodiments that incorporate the teachings of the present invention have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments that still incorporate these teachings. Thus, while the foregoing is directed to various embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof.

Claims

An in-memory computing architecture, comprising: a compute-in-memory (CIM) array of bit cells configured to receive massively parallel bit-wise input signals via a first CIM array dimension and one or more accumulation signals via a second CIM array dimension, wherein each plurality of bit cells associated with a common accumulation signal forms a respective CIM channel configured to provide a respective output signal; and
control circuitry configured to cause the CIM array to perform multi-bit computing operations on the input and accumulation signals using single-bit internal circuits and signals.

The in-memory computing architecture of claim 1, further comprising a reshaping buffer configured to reshape a sequence of received data words to form the massively parallel bit-wise input signals.

The in-memory computing architecture of claim 1, further comprising analog-to-digital converter (ADC) circuitry configured to process the plurality of CIM channel output signals to provide a sequence of multi-bit output words.

The in-memory computing architecture of claim 1, further comprising a near-memory computing path configured to provide the sequence of multi-bit output words as a computing result.

The in-memory computing architecture of claim 3, wherein the ADC circuitry comprises, for each CIM channel, a respective ADC configured to digitize the CIM channel output signal and a respective shift register configured to apply a binary weighting to the digitized CIM channel output signal to form a respective portion of a multi-bit output word.

The in-memory computing architecture of claim 3, wherein the ADC circuitry comprises, for each of a plurality of subsets of the CIM channels, a respective ADC configured to digitize a weighted analog sum of the CIM channel output signals of the respective subset of CIM channels to form a respective portion of a multi-bit output word.

The in-memory computing architecture of claim 1, further comprising a sparsity controller configured to mask zero-valued elements of the massively parallel bit-wise input signals so that the multi-bit computing operations avoid processing those zero-valued elements.

The in-memory computing architecture of claim 1, wherein the input signals and the accumulation signals are combined with existing signals in the memory.

The in-memory computing architecture of claim 1, wherein the input signals and the accumulation signals are separate from existing signals in the memory.

The in-memory computing architecture of claim 3, wherein each ADC and its respective accumulation signal form an in-memory computing channel.

The in-memory computing architecture of claim 4, wherein the near-memory computing path includes one or more of a digital barrel shifter, a multiplexer, an accumulator, a look-up table, and a non-linear function element.

The in-memory computing architecture of claim 10, wherein the multi-bit computing operation of the CIM array (CIMA) includes bit-parallel/bit-serial (BPBS) computing,
wherein the bit-parallel computing comprises:
loading different matrix-element bits into respective in-memory computing channels;
barrel-shifting the digitized outputs from the computing channels, using respective barrel shifters, to implement the corresponding bit weighting; and
performing digital accumulation over all of the computing channels, using respective accumulators, to yield a multi-bit matrix-element compute result; and
wherein the bit-serial computing comprises:
applying each bit of the vector elements separately to the stored matrix-element bits and storing each resulting digitized output; and
barrel-shifting the stored digitized output associated with each vector-element bit, using respective barrel shifters, prior to digital accumulation with the stored digitized outputs corresponding to subsequent input-vector bits.

The in-memory computing architecture of claim 4, wherein the near-memory computing path is physically aligned with the in-memory computing architecture to increase throughput through the in-memory computing architecture.

The in-memory computing architecture of claim 1, further comprising one or more configurable finite state machines (FSMs) configured to control computational operations of the in-memory computing architecture.

The in-memory computing architecture of claim 14, wherein the FSM controlling the computational operations is configured to control highly parallel computing hardware used by some or all of a plurality of in-memory computing channels.

The in-memory computing architecture of claim 14, wherein the FSM controls operations according to software instructions loaded into local memory.

The in-memory computing architecture of claim 2, wherein the reshaping buffer is configured to convert a first precision external digital word into a higher dimensional input vector.

The in-memory computing architecture of claim 2, wherein the reshaping buffer is configured to supply the CIMA with bits of the input vector elements in order and in parallel.

The in-memory computing architecture of claim 2, wherein the reshaping buffer is configured to align input data so as to ensure the desired utilization and throughput of in-memory computing operations.

The in-memory architecture of claim 2, wherein the reshaping buffer is configured to allow reuse and shift of input data according to a convolutional neural network operation.

The architecture of claim 1, wherein each bit cell in the CIM array of bit cells has a bit cell circuit configuration comprising:
a bit cell storage circuit coupled to at least one bit cell compute device; and
a bit cell capacitor coupled to the bit cell compute device, the bit cell capacitor being further coupled to one or more additional capacitors, each corresponding to another bit cell capacitor, without a switch between the bit cell capacitors.

The in-memory architecture of claim 21, wherein the bit cell circuit configuration further comprises a cathode plate of the bit cell capacitor coupled to the bit cell compute device.

The in-memory architecture of claim 21, wherein the bit cell circuit configuration further comprises an anode plate of the bit cell capacitor coupled to the additional capacitors.

The in-memory architecture of claim 21, wherein the bit cell compute device of the bit cell circuit configuration is configured to perform a compute operation on two operands.

The in-memory architecture of claim 21, wherein the bit cell circuit configuration allows implementation of an XNOR or AND operation between stored data and input data.

The in-memory architecture of claim 21, wherein the bit cell circuit configuration includes an interleaved layout for at least two bit cells, in which the respective bit cell capacitors are coupled into at least two separate sets so as to allow respective separate operations.

The in-memory architecture of claim 24, wherein the set of coupled bit cell capacitors is placed above one or more bit cell transistors.

The in-memory architecture of claim 26, wherein the bit cells comprise charge-domain in-memory computing bit cells that each drive one plate of a respective local capacitor, and wherein coupling to other bit cell capacitors is achieved at the other plate of the respective local capacitor.

The in-memory architecture of claim 26, wherein the charge-domain in-memory computing bit cell is configured to implement an XNOR or AND operation between stored data and input data.

The in-memory architecture of claim 26, wherein an interleaved layout of charge-domain in-memory computing bit cells is provided such that the bit cell capacitors are coupled into different sets configured to be selectively coupled via one or more peripheral switches.

The in-memory architecture of claim 26, wherein an interleaved layout of charge-domain in-memory computing bit cells is provided such that the bit cell capacitors are coupled into different sets, with X sets for X-way interleaving, where X is an integer greater than 1.

The in-memory architecture of claim 26, wherein a layout of charge-domain in-memory computing bit cells is provided such that the different sets of coupled capacitors are placed above the bit cell transistors.
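
The bit-parallel/bit-serial (BPBS) computing recited in the claims above can be illustrated with a short numerical sketch. It rests on assumptions that are not in the claims: operands are unsigned integers, each channel computes an AND between stored and input bits, and the ADC is ideal, so it returns the exact count of matching bit pairs. The name `bpbs_dot` and its parameters are hypothetical.

```python
def bpbs_dot(weights, inputs, w_bits, x_bits):
    """Dot product of two unsigned integer vectors, computed BPBS style."""
    acc = 0
    for j in range(x_bits):  # bit-serial: one input-vector bit per step
        x_slice = [(x >> j) & 1 for x in inputs]
        partial = 0
        for i in range(w_bits):  # bit-parallel: one channel per weight bit
            w_slice = [(w >> i) & 1 for w in weights]
            # Ideal digitized channel output: count of 1 AND 1 pairs.
            count = sum(a & b for a, b in zip(w_slice, x_slice))
            # Barrel shift applies the weight-bit significance 2**i,
            # then the accumulator sums across channels.
            partial += count << i
        # Barrel shift applies the input-bit significance 2**j before
        # accumulating with the results for the other input bits.
        acc += partial << j
    return acc


if __name__ == "__main__":
    print(bpbs_dot([3, 5, 2], [1, 4, 7], w_bits=3, x_bits=3))
```

For the vectors shown, the call returns 37, which matches 3·1 + 5·4 + 2·7. Signed operands or XNOR-based channels would require a different weighting scheme and are outside this sketch.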