TWI811300B

TWI811300B - Computational memory device and simd controller thereof

Info

Publication number: TWI811300B
Application number: TW108104822A
Authority: TW
Inventors: 達里克韋伯; 威廉馬丁斯內爾格羅夫
Original assignee: 加拿大商溫德特爾人工智慧有限公司
Priority date: 2018-02-23
Filing date: 2019-02-13
Publication date: 2023-08-11
Also published as: TW201937490A; WO2019162738A1

Abstract

An example device includes a plurality of computational memory banks. Each computational memory bank of the plurality of computational memory banks includes an array of memory units and a plurality of processing elements connected to the array of memory units. The device further includes a plurality of single instruction, multiple data (SIMD) controllers. Each SIMD controller of the plurality of SIMD controllers is contained within at least one computational memory bank of the plurality of computational memory banks. Each SIMD controller is to provide instructions to the at least one computational memory bank.

Description

Computing memory device and its SIMD controller

(Related applications are cross-referenced)

本申請案主張於2018年3月26日申請之US 62/648,074的優先權，其係藉由引用形式而併入本文。本申請案是於2018年2月23日申請之US 15/903,754的部分接續案，其係藉由引用形式而併入本文。 This application claims priority from US 62/648,074, filed on March 26, 2018, which is incorporated herein by reference. This application is a partial continuation of US 15/903,754 filed on February 23, 2018, which is incorporated herein by reference.

本發明是關於計算記憶體和神經網路。 This invention relates to computational memory and neural networks.

深度學習已經被證明是一種可執行長期受抑制的人工智慧方法的功能的有力技術。舉例而言，深度學習可應用於識別雜亂圖像中的物件、語音理解和翻譯、醫學診斷、遊戲和機器人。深度學習技術一般應用對感興趣的任務進行訓練(因而「學習」)的神經網路的許多層(因而「深度」)。一旦經過訓練，神經網路可執行「推理(inference)」，亦即從新輸入資料推論出與其已經學習者一致的輸出。 Deep learning has proven to be a powerful technology that can perform functions that have long been inhibited by artificial intelligence methods. For example, deep learning can be applied to identifying objects in cluttered images, speech understanding and translation, medical diagnosis, games, and robotics. Deep learning techniques generally apply many layers (hence "depth") of a neural network that is trained (hence "learning") on a task of interest. Once trained, a neural network can perform "inference," that is, deducing outputs from new input data that are consistent with what it has learned.

神經網路(也稱為神經網)執行類似於生物神經元操作的計算，一般是計算加權總和(或內積)，並利用無記憶非線性來修改結果。然而，通常情況下還需要更一般的功能，例如記憶體、乘法的非線性和「池化(pooling)」。 Neural networks (also called neural nets) perform computations similar to the operations of biological neurons, typically computing weighted sums (or inner products) and exploiting memoryless nonlinearity to modify the results. Often, however, more general features are required, such as memory, multiplicative nonlinearity, and "pooling."

在許多電腦架構類型中，由於在記憶體和處理元件之間物理移動資料，因此功率消耗是非常重要的，而且通常是電源的主要用途。此功率消耗一般是因對佈線電容充電和放電所需要的能量，其大致上與佈線的長度成比例，也因此與記憶體和處理元件之間的距離成比例。因此，處理這種架構中的大量消耗(如深度學習和神經網路一般所需者)通常需要相對大量的功率。在較佳適合處理深度學習和神經網路的架構中，會產生其他的不足，例如複雜性增加、處理時間增加、以及晶片面積需求更大。 In many computer architecture types, power consumption is very important due to the physical movement of data between memory and processing elements, and is often the primary use of the power supply. This power consumption is generally due to the energy required to charge and discharge the wiring capacitance, which is roughly proportional to the length of the wiring, and therefore to the distance between the memory and the processing element. Therefore, processing the large consumption in such architectures (such as those commonly required by deep learning and neural networks) usually requires relatively large amounts of power. In architectures that are better suited to handle deep learning and neural networks, other drawbacks arise, such as increased complexity, increased processing time, and larger chip area requirements.

一種例示裝置包括複數個計算記憶體排組。所述複數個計算記憶體排組中的每一個計算記憶體排組包括一記憶單元陣列和連接至該記憶單元陣列的複數個處理元件。該裝置進一步包括複數個單一指令多重資料(SIMD)控制器，所述複數個SIMD控制器中的每一個SIMD控制器係包含於所述複數個計算記憶體排組中的至少一個計算記憶體排組內。每一個SIMD控制器對所述至少一個計算記憶體排組提供指令。 An example device includes a plurality of banks of computing memory. Each of the plurality of computing memory banks includes a memory cell array and a plurality of processing elements connected to the memory cell array. The apparatus further includes a plurality of Single Instruction Multiple Data (SIMD) controllers, each of the plurality of SIMD controllers being included in at least one of the plurality of computing memory banks. within the group. Each SIMD controller provides instructions to the at least one computing memory bank.

以下將詳細說明這個與其他示例。 This and other examples are detailed below.

12N:處理元件 12N: Processing element

15N:開關 15N: switch

16:匯流排 16:Bus

17:多工器 17:Multiplexer

17A:位元讀取輸出 17A: Bit read output

17B:位元寫入 17B: Bit writing

17C:多工器 17C:Multiplexer

17D:多工器 17D:Multiplexer

18:暫存器 18: Temporary register

19:暫存器 19: Temporary register

20:暫存器 20: Temporary register

21:全域控制匯流排 21:Global control bus

100:排組 100:Planning

104:記憶體單元陣列 104: Memory cell array

108:SIMD控制器 108:SIMD controller

112:線路 112:Line

116:處理元件 116: Processing components

120:位元線 120:Bit line

124:列匯流排 124: Column bus

128:列連接 128: Column connection

132:列匯流排 132: Column bus

140:單元 140:Unit

142:行 142: OK

144:胞元 144:cell

146:列 146: column

200:處理裝置 200:Processing device

202:行連接 202: Row connection

204:行匯流排 204: Row bus

220:處理裝置 220: Processing device

222:SIMD控制器 222:SIMD controller

240:處理裝置 240: Processing device

242:輸入/輸出電路 242:Input/output circuit

244:控制器匯流排 244:Controller bus

246:列匯流排 246: Column bus

248:列匯流排 248: Column bus

260:內部暫存器 260: Internal temporary register

262:算術邏輯單元(ALU) 262: Arithmetic Logic Unit (ALU)

264:通訊狀態 264: Communication status

266:內部狀態 266: Internal status

268:內部匯流排 268: Internal bus

280:多工器 280:Multiplexer

282:多工器 282:Multiplexer

284:多工器 284:Mux

286:多工器 286:Multiplexer

288:多工器 288:Multiplexer

290:ALU 290:ALU

295:ALU 295:ALU

300:處理元件 300: Processing components

302:多工器 302:Multiplexer

400:處理元件 400: Processing components

402:Σ方塊 402: Σ square

404:進位方塊 404: Carry block

500:列匯流排 500: Column bus

502:開關 502: switch

600:處理元件 600: Processing components

700:處理元件 700: Processing components

702:多工器 702:Mux

704:行 704: OK

800:SIMD控制器 800:SIMD controller

802:指令記憶體 802: Instruction memory

804:行選擇 804: Row selection

806:程式計數器 806: Program counter

808:解碼器 808:Decoder

900:控制器匯流排 900:Controller bus

第一圖是一種先前技術之電腦系統的示意圖，其中處理元件係嵌於記憶體中。 The first figure is a schematic diagram of a prior art computer system in which processing elements are embedded in memory.

第二圖是根據本發明之計算記憶體排組的方塊圖。 The second figure is a block diagram of a computing memory arrangement according to the present invention.

第三圖是根據本發明之具有複數個計算記憶體排組的裝置的方塊圖，其中所述計算記憶體排組具有由行匯流排所連接的處理元件。 Figure 3 is a block diagram of a device having a plurality of computing memory banks with processing elements connected by row buses in accordance with the present invention.

第四圖是根據本發明之具有複數個計算記憶體排組的裝置的方塊圖，所述計算記憶體排組具有在數個排組之間共享的一控制器。 Figure 4 is a block diagram of an apparatus having a plurality of computing memory banks with a controller shared among the banks according to the present invention.

第五圖是根據本發明之具有複數個計算記憶體排組的裝置的方塊圖，所述計算記憶體排組具有一輸入/輸出電路。 Figure 5 is a block diagram of a device having a plurality of computing memory banks with an input/output circuit in accordance with the present invention.

第六圖是根據本發明之處理元件的方塊圖。 Figure 6 is a block diagram of a processing element according to the present invention.

第七A圖為根據本發明之處理元件的算術邏輯單元的方塊圖。 Figure 7A is a block diagram of an arithmetic logic unit of a processing element according to the present invention.

第七B圖是根據本發明之處理元件的另一算術邏輯單元的方塊圖。 Figure 7B is a block diagram of another arithmetic logic unit of the processing element according to the present invention.

第七C圖是根據本發明之處理元件的另一算術邏輯單元的方塊圖。 Figure 7C is a block diagram of another arithmetic logic unit of the processing element according to the present invention.

第八圖是根據本發明之算術邏輯單元的例示算術運算表。 Figure 8 is an exemplary arithmetic operation table of an arithmetic logic unit according to the present invention.

第九圖是根據本發明之計算記憶體排組的分段匯流排的圖式。 Figure 9 is a diagram of a segmented bus of a computing memory array in accordance with the present invention.

第十圖是根據本發明之處理元件的內部匯流排的圖式。 Figure 10 is a diagram of an internal busbar of a processing element according to the present invention.

第十一圖是用於本發明之一位元(one-bit)處理元件的圖式，其適用於一位元數值的通用處理。 Figure 11 is a diagram of a one-bit processing element used in the present invention, which is suitable for general processing of one-bit values.

第十二圖是用於本發明之一位元(one-bit)處理元件的圖式，其在列方向上具有最近的鄰近通訊。 Figure 12 is a diagram of a one-bit processing element used in the present invention, which has nearest neighbor communication in the column direction.

第十三圖是用於本發明之一位元(one-bit)處理元件的圖式，其為每次記憶體讀取執行兩次運作。 Figure 13 is a diagram of a one-bit processing element used in the present invention, which performs two operations for each memory read.

第十四圖是用於本發明之多位元處理元件的圖式，其具有算術之載送產生器強化且可減少記憶體的使用。 Figure 14 is a diagram of a multi-bit processing element used in the present invention with arithmetic load generator enhancement and reduced memory usage.

第十五圖是根據本發明之處理元件的圖式，其中一運算碼多工器係經強化以作為一列匯流排。 Figure 15 is a diagram of a processing element in accordance with the present invention, in which an opcode multiplexer is enhanced to function as a column bus.

第十六圖是根據本發明之處理元件的圖式，其具有專用的加總和載送運算，使該列匯流排可同時用於通訊。 Figure 16 is a diagram of a processing element according to the present invention, which has dedicated summing and carrying operations so that the bus can be used for communication at the same time.

第十七圖是根據本發明之處理元件的圖式，其具有具分段開關的列匯流排。 Figure 17 is a diagram of a processing element according to the present invention having a segmented switch Column bus.

第十八圖是根據本發明之處理元件的圖式，其於行方向上具有最接近的鄰近通訊。 Figure 18 is a diagram of a processing element according to the present invention with closest neighbor communication in the row direction.

第十九圖是根據本發明之處理元件的圖式，其具有連接至一行匯流排的一第二多工器。 Figure 19 is a diagram of a processing element according to the present invention having a second multiplexer connected to a row of busses.

第二十圖是根據本發明之控制器的圖式，其可運作以驅動列位址和運算碼，並可加載和儲存指示於其相關聯列的記憶體中。 Figure 20 is a diagram of a controller in accordance with the present invention, which is operable to drive column addresses and operation codes, and to load and store instructions in the memory of its associated column.

第二十一圖是根據本發明之複數個控制器的圖式，其由一行匯流排互連，每一個控制器可運作以控制一計算記憶體排組，且可一起運作以允許指令記憶體的共享。 Figure 21 is a diagram of a plurality of controllers interconnected by a row of buses in accordance with the present invention. Each controller can operate to control a computing memory array and can operate together to allow instruction memory of sharing.

第二十二圖是根據本發明之複數個控制器的圖式，其各進一步可運作以解碼壓縮係數資料，且可一起運作以允許指令記憶體的共享及重新使用作為係數記憶體。 Figure 22 is a diagram of a plurality of controllers in accordance with the present invention, each further operable to decode compressed coefficient data, and operable together to allow sharing and reuse of instruction memory as coefficient memory.

第二十三圖是根據本發明之影像畫素資料的計算記憶體的例示佈局、以及用於神經網路第一層的相關編碼和核心輸出資料的圖式。 Figure 23 is an exemplary layout of a computational memory for image pixel data according to the present invention, as well as a diagram of related encoding and core output data for the first layer of a neural network.

第二十四圖是根據本發明之彩色畫素資料的計算記憶體的例示佈局、以及用於神經網路迴旋層的資料的圖式。 Figure 24 is an exemplary layout of a computational memory for color pixel data according to the present invention, and a diagram of data for a convolutional layer of a neural network.

第二十五圖是根據本發明之用於在一神經網路中池化之資料的計算記憶體的例示佈局圖式。 Figure 25 is an exemplary layout diagram of a computational memory for data pooling in a neural network in accordance with the present invention.

本文所述技術能夠以彈性的低精確度算術、具功率效率的通訊、以及指令與係數的本地儲存和解碼來處理大量的內積和相關神經網路計算。 The techniques described in this article can handle large amounts of inner products and related neural network calculations with flexible low-precision arithmetic, power-efficient communications, and local storage and decoding of instructions and coefficients.

深度學習中所涉及之計算可被視為是記憶體和處理元件的相互作用。記憶體為輸入資料、加權總和的權重、層之間流通的中間結果、控制與連接資訊、以及其他函數所需。記憶體中的資料是在處理元件(processing elements)或PEs中進行處理，例如通用電腦的CPU、圖靈機(Turning machine)的表、或繪圖處理器的處理器，並且回傳到記憶體。 The computations involved in deep learning can be thought of as the interaction of memory and processing elements. Memory is required for input data, weights for weighted sums, intermediate results passed between layers, control and connection information, and other functions. Data in memory is processed in processing elements or PEs, such as a general-purpose computer's CPU, a Turing machine's table, or a graphics processor's processor, and returned to memory.

深度學習和神經網路可受益於以具能量效率方式執行各種類型計算的低功率設計。低功率的實施鼓勵了在移動或隔離裝置中的使用，其中減少電池功率消耗是重要的，也鼓勵了大規模尺度的使用，其中對於處理和記憶體元件進行冷卻之需求會是限制因子。 Deep learning and neural networks can benefit from low-power designs that perform various types of calculations in an energy-efficient manner. Low-power implementation encourages use in mobile or isolated devices, where reduced battery power consumption is important, as well as mass-scale use, where the need for cooling of processing and memory components can be the limiting factor.

在「計算RAM：一種記憶體-SIMD複合結構(Computational RAM：A Memory-SIMD Hybrid)」一文中，伊萊特(Elliott)說明了「將一位元[處理元件]窄間距匹配到記憶體並限制對一維互連的通訊(pitch-matching narrow 1-bit [processing elements]to the memory and restricting communications to one-dimensional interconnects)」。這種設計旨在將記憶體和處理元件之間的距離減少至微米等級，其中傳統電腦架構所需之晶片對晶片的距離是在毫米或公分等級(大了數千或數萬倍)。伊萊特總結了早期的工作，包括路克(Loucks)、司奈高夫(Snelgove)和札基(Zaky)的早期學術工作，追溯到「VASTOR：一種用於小型應用之基於微處理器的關聯向量處理器(VASTOR：a microprocessor-based associative vector processor for small-scale applications)」，發表於1980年8月之並行處理國際研討會刊(Intl.Conf.on Parallel Processing)第37-46頁。伊萊特將此技術命名為「C*RAM」或「計算隨機存取記憶體(RAM)」。 In "Computational RAM: A Memory-SIMD Hybrid ," Elliott explains "matching a bit [processing element] to a narrow spacing of memory and constraining "Pitch-matching narrow 1-bit [processing elements] to the memory and restricting communications to one-dimensional interconnects ". This design aims to reduce the distance between memory and processing elements to the micron range, where the chip-to-die distance required by traditional computer architecture is in the millimeter or centimeter range (thousands or tens of thousands times larger). Ilett summarizes early work, including the early academic work of Loucks, Snelgove, and Zaky, going back to "VASTOR: A Microprocessor-Based Associative Vector for Small Applications"" VASTOR: a microprocessor-based associative vector processor for small-scale applications ", published in Intl.Conf.on Parallel Processing, August 1980, pages 37-46. Elite named this technology "C*RAM" or "Computational Random Access Memory (RAM)".

伊萊特和極為簡單的處理元件之其他細節可行設計需要間距匹配，包括一維通訊所需之電路。也可能從記憶體行與PEs的一對一對應稍微放寬間距匹配限制，例如，允許每一個PE佔據四個記憶體行的寬度。這減少了PEs 的數量，而且對於非常緊密的記憶體而言會是必須的、或更實際的。 The design of ELITE and other details of extremely simple processing components requires pitch matching, including the circuitry required for one-dimensional communication. It is also possible to slightly relax the spacing matching constraints from the one-to-one correspondence of memory rows to PEs, for example, allowing each PE to occupy the width of four memory rows. This reduces PEs amount, and may be necessary, or more practical, for very compact memories.

在美國專利第5,546,343號中，伊萊特和司奈高夫描述了使用多工器作為一算術和邏輯單元(ALU)，其可運作以執行處理元件的任何三位元狀態的函數。如第一圖所示，在這種設計類型中，係使用單一個晶片外控制器。 In US Patent No. 5,546,343, Ilett and Snigoff describe the use of a multiplexer as an arithmetic and logic unit (ALU) operable to perform a function on any three-dimensional state of a processing element. As shown in the first figure, in this design type a single off-chip controller is used.

在文獻「計算RAM：實施和位元並行架構(Computational RAM：Implementation and Bit-Parallel Architecture)」中，寇卡魯(Cojocaru)描述了將一位元處理元件群組化以允許多位元計算，增加專用硬體以加速二進位算術，以及增加暫存器以減少記憶體存取之需求。 In Computational RAM: Implementation and Bit-Parallel Architecture, Cojocaru describes grouping single-bit processing elements to allow multi-bit computation. Specialized hardware was added to speed up binary arithmetic, and registers were added to reduce memory access requirements.

葉氏(Yeap)在文獻「適合VLSI佈局之VASTOR處理元件的設計」中描述了合適的C*RAM之一位元處理元件(A.H.Yeap,M.A.Sc.,University of Toronto,1984)。 Yeap describes a suitable C*RAM bit processing element in the document "Design of a VASTOR processing element suitable for VLSI layout" (A.H. Yeap, M.A.Sc., University of Toronto, 1984).

在「影像與視頻壓縮之向量量化的計算*RAM實施(Computational*RAM Implementations of Vector Quantization for Image and Video Compression)」一文中，李氏(Le)描述了適合用於以計算RAM進行影像和視頻壓縮的演算法。 In the article " Computational*RAM Implementations of Vector Quantization for Image and Video Compression", Le describes a method suitable for using computational RAM for image and video compression. algorithm.

上述實施方式在幾個方面的低功率深度學習應用上仍為不足。首先，他們的一維通訊會使其難以處理具有許多通道的大的二維影像。此外，他們的複雜運算碼一般都過大，因而對於常見的算術運算也極耗功率。運算碼和通訊匯流排會佔據大量的晶片面積。此外，他們的處理元件無法執行排列或相關映射，也無法執行查找表或因處理器而異之操作。此外，這些方法傾向於仰賴晶片外控制器，其在與計算RAM本身通訊時會消耗大量功率。最後，他們一般為純的單指令串流、多重資料串流裝置，其可良好地處理對大資料集的一致操作，但在需要數個較小任務時則無法共享他們的處理資源。 The above implementations still fall short in several aspects for low-power deep learning applications. First, their one-dimensional communication makes it difficult to process large two-dimensional images with many channels. In addition, their complex operation codes are generally too large and therefore power-hungry for common arithmetic operations. Operation codes and communication buses take up a lot of chip area. In addition, their processing elements cannot perform permutation or correlation mapping, lookup tables, or processor-specific operations. Additionally, these approaches tend to rely on off-chip controllers, which consume significant amounts of power when communicating with the computing RAM itself. Finally, they are typically pure single-instruction streaming, multi-data streaming devices that can handle consistent operations on large data sets well, but cannot share their processing resources when several smaller tasks are required.

有鑑於過去嘗試的這些與其他缺點，本文所述技術目標在於改良計算記憶體，為了以彈性的低精確度算術來處理大量的內積和相關神經網路計算，提供具有功率效率的通訊，以及提供指令和係數的本地儲存和解碼。 In view of these and other shortcomings of past attempts, the techniques described in this article aim to improve Computational memory, designed to handle large numbers of inner products and related neural network calculations with flexible low-precision arithmetic, provide power-efficient communications, and provide local storage and decoding of instructions and coefficients.

第二圖說明根據本發明一具體實施例之計算記憶體排組100，其被稱為C*RAM。計算記憶體排組100包括一記憶體單元陣列104和連接到記憶體單元陣列104的複數個處理元件116。 The second figure illustrates a computing memory array 100, referred to as C*RAM, according to an embodiment of the present invention. The computing memory bank 100 includes a memory cell array 104 and a plurality of processing elements 116 connected to the memory cell array 104 .

計算記憶體排組100進一步包括包含在計算記憶體排組100內的單一指令多重資料(SIMD)控制器108。SIMD控制器108提供指令、視情況是資料到計算記憶體排組100。在這個具體實施例中，SIMD控制器108僅提供至一個計算記憶體排組100。在其他具體實施例中，可於數個計算記憶體排組100之間共享SIMD控制器108。 The computing memory bank 100 further includes a single instruction multiple data (SIMD) controller 108 included within the computing memory bank 100 . SIMD controller 108 provides instructions, and optionally data, to computing memory array 100 . In this particular embodiment, SIMD controller 108 is provided to only one compute memory bank 100 . In other embodiments, the SIMD controller 108 may be shared among several computing memory banks 100 .

此外，在這個具體實施例中，記憶體單元陣列104一般是矩形形狀，且SIMD控制器108是位於靠近陣列的窄端。SIMD控制器108可被設於記憶體單元陣列104的任一側上，亦即在右側上或在左側上，如圖所示。這可提供具空間效率的記憶體單元陣列104、SIMD控制器108和複數個處理元件116的配置，使得複數個排組100可以排列為矩形或方形配置，可於半導體基板或晶片上提供有效率的佈局。 Furthermore, in this particular embodiment, the memory cell array 104 is generally rectangular in shape, and the SIMD controller 108 is located near the narrow end of the array. The SIMD controller 108 may be located on either side of the memory cell array 104, that is, on the right side or on the left side, as shown. This provides a space-efficient arrangement of the memory cell array 104, the SIMD controller 108, and the processing elements 116 such that the arrays 100 can be arranged in a rectangular or square configuration, providing efficient operation on a semiconductor substrate or wafer. layout.

記憶體單元陣列104的每一個單元140都包括一行142記憶體胞元144。胞元144可配置以儲存一個位元資訊。在複數行142中同一相對位置處的胞元144可形成一列146的胞元144。記憶體單元陣列104的單元140也可排列成列，其中一列單元140包括複數列146的胞元144。在這個具體實施例中，每一行142的胞元144是藉由一位元線120連接到一不同處理元件116。在其他具體實施例中，多行142的胞元144是藉由位元線120而連接到每一個不同的處理元件116。 Each cell 140 of memory cell array 104 includes a row 142 of memory cells 144 . Cell 144 can be configured to store one bit of information. Cells 144 at the same relative position in a plurality of rows 142 may form a column 146 of cells 144 . The cells 140 of the memory cell array 104 may also be arranged in columns, where a column of cells 140 includes a plurality of columns 146 of cells 144 . In this particular embodiment, each row 142 of cells 144 is connected to a different processing element 116 by a bit line 120 . In other embodiments, multiple rows 142 of cells 144 are connected to each different processing element 116 via bit lines 120 .

記憶體單元陣列104係藉由一或多條列選擇線112連接至SIMD 控制器108，其也稱為字元線。SIMD控制器108可輸出訊號於選擇線112，以選擇一列146的胞元144。因此，SIMD控制器108可通過列選擇線112來定址一列146的記憶體陣列104，使行142中的被選擇位元為處理元件116通過位元線120可用。 Memory cell array 104 is connected to SIMD via one or more column select lines 112 Controller 108, which is also called word line. The SIMD controller 108 may output a signal on the select line 112 to select a column 146 of cells 144 . Accordingly, the SIMD controller 108 may address a column 146 of the memory array 104 via the column select line 112 so that the selected bits in the row 142 are available to the processing element 116 via the bit line 120 .

SIMD控制器108可包括指令記憶體，其從記憶體單元陣列104加載。 SIMD controller 108 may include instruction memory loaded from memory cell array 104 .

在這個具體實施例中，記憶體單元陣列104為靜態隨機存取記憶體(SRAM)。舉例而言，每一個記憶體胞元144可為六個電晶體(例如金屬-氧化物-半導體場效電晶體(MOSFETs))所形成，且可稱為6T記憶體。 In this specific embodiment, the memory cell array 104 is a static random access memory (SRAM). For example, each memory cell 144 may be formed of six transistors, such as metal-oxide-semiconductor field effect transistors (MOSFETs), and may be referred to as a 6T memory.

在其他具體實施例中，可使用其他類型的記憶體，例如動態RAM、鐵電性RAM、磁性RAM、或不同類型記憶體的組合、1T、2T、5T等。可使用SRAM記憶體胞元。特別可適用於本發明的記憶體是具有同時可致能行上對應位元的列定址的記憶體、以及以與SIMD控制器108和處理元件116間距匹配佈局方式建構的記憶體。 In other embodiments, other types of memory may be used, such as dynamic RAM, ferroelectric RAM, magnetic RAM, or a combination of different types of memory, 1T, 2T, 5T, etc. SRAM memory cells can be used. Memories that are particularly suitable for use in the present invention are memories that have column addressing simultaneously enabling corresponding bits in the rows, and memories that are constructed in a layout that matches the spacing of the SIMD controller 108 and processing elements 116 .

記憶體單元陣列104可分為具有不同存取能量成本的子集。舉例而言，「重的」子集可具有因較長位元線而具較大電容的記憶體胞元，其因而會利用較多功率來進行存取，但具有增加的密度。「輕的」子集可具有較低電容的記憶體胞元，其利用較少功率來存取，但具有較低的密度。因此，當使用重的子集來儲存受到較低存取頻率的資訊(例如係數和程式碼)，而且將輕的子集用於受到較高存取頻率的資訊(例如中間結果)時，功率消耗和空間效率即可被改善。 Memory cell array 104 may be divided into subsets with different access energy costs. For example, a "heavy" subset may have memory cells with greater capacitance due to longer bit lines, which therefore utilize more power to access, but with increased density. The "light" subset may have lower capacitance memory cells that use less power to access, but have lower density. Therefore, when a heavy subset is used to store information that is accessed less frequently (such as coefficients and codes), and a light subset is used for information that is accessed more frequently (such as intermediate results), the power Consumption and space efficiency can be improved.

處理元件116沿著記憶體單元陣列104的寬度排列，並且實際上是位於靠近記憶體單元陣列104處。處理元件116可排列為線性陣列，並被依序指定位址。在這個具體實施例中，每一個處理元件116都連接到一行142的記憶體單元陣列104，並與其對齊。 The processing elements 116 are arranged along the width of the memory cell array 104 and are physically located proximate the memory cell array 104 . The processing elements 116 may be arranged in a linear array and addressed sequentially. In this particular embodiment, each processing element 116 is connected to and aligned with a row 142 of the memory cell array 104 .

處理元件116的定址可以是大尾數端(endian)或小尾數端，且可基於實施偏好從左邊或右邊開始。 Addressing of processing elements 116 may be big-endian or little-endian, and may start from the left or right based on implementation preference.

處理元件116在結構上可彼此相同。大量的相對簡單且結構上實質相同的處理元件116可以有益於神經網路中的應用，因為神經網路通常需要處理大量係數。在這個上下文中，結構上實質相同是表示實施需要的小差異(例如硬體連線位址和最末端處理元件的不同連接)是可以被預期的。重複且簡化的處理元件116的陣列可降低神經網路應用中的設計複雜度並增加空間效率。 Processing elements 116 may be structurally identical to each other. A large number of relatively simple and structurally identical processing elements 116 can be beneficial for applications in neural networks, since neural networks typically need to process a large number of coefficients. In this context, substantial structural similarity means that small differences in implementation requirements (eg, hardware wiring addresses and different connections of the final processing elements) are to be expected. The repetitive and simplified array of processing elements 116 can reduce design complexity and increase space efficiency in neural network applications.

每一個處理元件116可包括暫存器和ALU。暫存器可包括用於以處理元件116執行操作的內部暫存器、以及與其他處理元件116通訊狀態的通訊暫存器。每一個處理元件116可進一步包括由一或多個其他處理元件116提供的通訊狀態。ALU係配置以執行一任意函數，例如由運算碼所定義的一或多個運算數的函數。 Each processing element 116 may include registers and an ALU. The registers may include internal registers for performing operations with the processing element 116 and communication registers for communication status with other processing elements 116 . Each processing element 116 may further include communication status provided by one or more other processing elements 116 . The ALU is configured to execute an arbitrary function, such as a function of one or more operands defined by an opcode.

處理元件116可藉由任何數量和排列的列匯流排124、132連接到該SIMD控制器108。列匯流排124、132可運作以於任何SIMD控制器108和複數個處理元件116之間單向或雙向地傳遞資訊。列匯流排132可提供一分段程度，使得處理元件116之一子集可以經由此一列匯流排132來通訊。列匯流排132的分段可以是永久的，或是可由開關來啟動而由SIMD控制器108或一處理元件116轉為開啟或關閉。列匯流排124、132可設有閂鎖，其可實現資料置換、本地操作、以及類似功能。雖然是以直線型來說明，但列匯流排124、132可包括任何數量的線。列匯流排124、132可連接至處理元件116的ALU，以增進計算，同時從匯流排124、132讀取資料以及將資料寫入匯流排124、132。 Processing elements 116 may be connected to the SIMD controller 108 via any number and arrangement of column buses 124, 132. Column buses 124 , 132 are operable to pass information between any SIMD controller 108 and a plurality of processing elements 116 in one or two directions. Column bus 132 may provide a level of segmentation such that a subset of processing elements 116 may communicate via column bus 132 . Segmentation of the column bus 132 may be permanent, or may be initiated by a switch to be turned on or off by the SIMD controller 108 or a processing element 116 . Column buses 124, 132 may be provided with latches that may enable data replacement, local operations, and similar functions. Although illustrated as linear, column bus bars 124, 132 may include any number of lines. Column buses 124, 132 may be connected to the ALU of processing element 116 to facilitate computing while data is read from and written to buses 124, 132.

舉例而言，複數個列匯流排124、132可包括一運算數匯流排124和一通用列匯流排132。運算數匯流排124可用以將來自SIMD控制器108的運算數選擇傳遞至處理元件116，使得每一個處理元件116對SIMD控制器108所選擇的本地運算數執行相同操作。通用列匯流排132可進位資料和運算碼資訊，以補充運算數匯流排124所進位的運算數選擇。 For example, the plurality of column buses 124 and 132 may include an operand bus 124 and a general column bus 132 . Operand bus 124 may be used to pass operand selections from SIMD controller 108 to processing elements 116 such that each processing element 116 is Selected local operands perform the same operation. Universal column bus 132 may carry carry data and opcode information to supplement operand selection carried by operand bus 124 .

處理元件列連接128可被提供以直接連接處理元件116，使得一給定處理元件116可與排組100中的一鄰近或遠端處理元件116直接通訊。列連接128可允許狀態資訊的共享，例如加總和進位值、以及位址資訊。列連接128可促進列移位，其可為單向(左或右)或雙向(左和右)，且進一步可以是圓形的。列連接128可配置以用作移位至相鄰處理元件116(例如任一/兩個方向中的下一個位元)、以及至末端的處理元件116(例如在任一/兩個方向中相隔八個或某個其他數量位元的處理元件116)。處理元件116的一或多個暫存器可被用以儲存經由列連接128所接收的資訊。 Processing element column connections 128 may be provided to directly connect processing elements 116 such that a given processing element 116 can communicate directly with an adjacent or remote processing element 116 in the array 100 . Column connections 128 may allow sharing of status information, such as sum and carry values, and address information. Column connections 128 can facilitate column shifting, which can be unidirectional (left or right) or bidirectional (left and right), and further can be circular. Column connections 128 may be configured to serve as shifts to adjacent processing elements 116 (eg, the next bit in either/both directions), and to end processing elements 116 (eg, eight bits apart in either/both directions). or some other number of bits of processing elements 116). One or more registers of processing element 116 may be used to store information received via column connection 128 .

處理元件列連接128可提供波紋鏈路供ALU的傳遞、加總或其他輸出。這些值不需要被閂鎖，且是根據處理元件116的本地暫存器的值和自任何匯流排接收的值。可使用動態邏輯(其可被預先充電為高位)，使得當輸入單調遞減時，波紋函數可單調遞減；進位(Carry)為單調遞增函數，可使其為有效低位，使得進位初始都是高位的(即預先充電)且可放電至低位，但將不轉高位進行加總。 Processing element column connection 128 may provide a ripple link for ALU pass-through, summation, or other output. These values do not need to be latched and are based on the values in the local registers of the processing element 116 and the values received from any bus. Dynamic logic (which can be precharged to high bits) can be used so that when the input decreases monotonically, the ripple function can decrease monotonically; Carry is a monotonically increasing function, which can be effectively low, so that the carry is initially high. (i.e. pre-charged) and can be discharged to low level, but will not turn to high level for summing.

在排組100內的處理元件116之間至少有四種類型的通訊。第一，可利用列連接128和相關聯的通訊暫存器執行同步通訊。第二，可經由列連接128執行通過一波紋傳遞類型鏈路的非同步通訊，而且其中兩個這種鏈路可於處理元件116的線性陣列中以相反方向傳送資訊。兩個鏈路可供作多位元算術(例如傳遞左或右行進和相反符號擴展行進)，並且可用於搜尋和最大池化類型操作。第三，處理元件116可對列匯流排132寫入資訊，且此資訊可由SIMD控制器108或由另一處理元件116讀取。舉例而言，一個群組的處理元件116可對一分段列匯流排132寫入資訊，其可接著由SIMD控制器108或由另一群組的處理元件116加以讀取。第四，在相鄰排組100中的處理元件116可同步通訊。在各種具體實施例中，可實施這四種類型的通訊中任一或多者。 There are at least four types of communications between processing elements 116 within array 100 . First, synchronous communications can be performed using column connections 128 and associated communications registers. Second, asynchronous communication over a ripple pass-type link can be performed via column connection 128 , and two such links can carry information in opposite directions in a linear array of processing elements 116 . Both links are available for multi-bit arithmetic (such as passing left or right marches and reverse sign-extended marches), and can be used for search and max-pooling type operations. Third, processing element 116 can write information to column bus 132 and this information can be read by SIMD controller 108 or by another processing element 116 . For example, one group of processing elements 116 may write information to a segmented column bus 132 , which may then be processed by the SIMD controller 108 or by another group. Component 116 to read. Fourth, processing elements 116 in adjacent banks 100 can communicate synchronously. In various embodiments, any or more of these four types of communications may be implemented.

從上述可知，計算記憶體排組100是一種可控制的計算記憶體的具空間效率單元，其適合以相同或實質相同的形式被重製為具空間效率的圖樣。可藉由相鄰的處理元件116來對儲存在記憶體單元陣列104中的資料執行操作，使得操作可以並行方式被執行，同時使於處理器和記憶體之間往返傳送資料所花費的能量降低或達最小。 From the above, it can be seen that the computing memory array 100 is a space-efficient unit of controllable computing memory, which is suitable for being reproduced in the same or substantially the same form into a space-efficient pattern. Operations on data stored in the memory cell array 104 can be performed by adjacent processing elements 116 so that the operations can be performed in parallel while reducing the energy spent on transferring data back and forth between the processor and the memory. Or reach the minimum.

第三圖說明了可從複數個計算記憶體排組(例如排組100)建構而成的處理裝置200的具體實施例。每一個計算記憶體排組100包括記憶體單元陣列104和複數個處理元件116，如本文其他處所描述。 The third figure illustrates a specific embodiment of a processing device 200 that may be constructed from a plurality of computing memory banks (eg, bank 100). Each computing memory bank 100 includes an array of memory cells 104 and a plurality of processing elements 116, as described elsewhere herein.

複數個SIMD控制器108是被提供至計算記憶體排組100以對排組100提供指令(以及視情況提供資料)。在這個具體實施例中，每一個排組100都包括其自身的不同SIMD控制器108。與由所有排組共享單一控制器相比，這可提供更細粒度的控制。可以主/從式方案、中斷/等待方案、或類似者來協調SIMD控制器108的運作。 A plurality of SIMD controllers 108 are provided to the computing memory bank 100 to provide instructions (and optionally data) to the bank 100 . In this particular embodiment, each bank 100 includes its own different SIMD controller 108 . This provides finer-grained control than having a single controller shared by all platoons. The operation of the SIMD controller 108 may be coordinated in a master/slave scheme, an interrupt/wait scheme, or the like.

可對處理裝置200提供任何數量的排組100。每一個排組100的大小和排組100的排列可經選擇，以提供裝置200之寬度(W)和高度(H)維度，以增加或最大化佈局效率，例如矽的有效率使用，而且同時減少或最小化處理和記憶體之間的距離，進以降低或最小化功率需求。排組100可排列為線性陣列，並且可被依序指定位址。 Any number of arrays 100 may be provided for the processing device 200 . The size of each row group 100 and the arrangement of the row groups 100 may be selected to provide the width (W) and height (H) dimensions of the device 200 to increase or maximize layout efficiency, such as efficient use of silicon, while also Reduce or minimize the distance between processing and memory, thereby reducing or minimizing power requirements. The arrays 100 may be arranged as a linear array and may be addressed sequentially.

排組100的定址可以是大尾數端或小尾數端，且可依據實施偏好而從頂部或底部開始。 The addressing of the row group 100 may be big-endian or little-endian, and may start from the top or bottom depending on implementation preference.

裝置200可包括處理元件行連接202以連接不同排組100中的處理元件116，使得一給定的處理元件116可與一鄰近或末端排組100的另一處理元件116直接通訊。行連接202有助於行移位，其可以是單向的(上或下)或雙向的(上和下)，且進一步可為圓形的。處理元件116的一或多個暫存器係可用以儲存經由行連接202所接收的資訊。 Apparatus 200 may include processing element row connections 202 to connect processing elements 116 in different rows 100 such that a given processing element 116 can communicate with another processing element in an adjacent or end row 100 Component 116 communicates directly. Row connections 202 facilitate row shifting, which may be unidirectional (up or down) or bidirectional (up and down), and further may be circular. One or more registers of processing element 116 may be used to store information received via row connection 202 .

裝置200可包括行匯流排204來連接任何數量的計算記憶體排組100的處理元件116。在這個具體實施例中，一行142的記憶體跨越了排組100，而且與同一行142相關聯的每一個處理元件116是藉由行匯流排204而連接。雖是以一條線來描述，但行匯流排204也可包括任何數量的線。任何數量及排列的行匯流排204都可被提供。 The device 200 may include a row bus 204 to connect any number of the processing elements 116 of the computing memory bank 100 . In this particular embodiment, a row 142 of memory spans the bank 100 , and each processing element 116 associated with the same row 142 is connected by a row bus 204 . Although depicted as one line, row bus 204 may include any number of lines. Any number and arrangement of row buses 204 may be provided.

在不同排組100中的處理元件116可透過行匯流排204彼此通訊。行匯流排204可運作以於連接的處理元件116之間單向或雙向地傳遞資訊。行匯流排204可傳遞運算碼資訊，以補充由其他路徑(例如在每一個排組100內的運算數匯流排124)所傳遞的資訊。任何數量及排列的行匯流排204都可被提供。一給定的行匯流排可提供一給定的分段程度，使得在一對應行142中的處理元件116之一子集可經由此一行匯流排來通訊。行匯流排204的分段可以是永久的，或是可以開關來啟動而藉由SIMD控制器108或處理元件116轉為開啟或關閉。 Processing elements 116 in different row groups 100 may communicate with each other via the row bus 204 . Row bus 204 is operable to pass information between connected processing elements 116 in one or two directions. Row bus 204 may pass opcode information to supplement information passed by other paths, such as operand bus 124 within each row group 100 . Any number and arrangement of row buses 204 may be provided. A given row bus may provide a given degree of segmentation such that a subset of the processing elements 116 in a corresponding row 142 may communicate via the row bus. The segmentation of the row bus 204 may be permanent, or may be switch enabled and toggled on or off by the SIMD controller 108 or processing element 116 .

於排組100內連接處理元件116的列匯流排132、以及於排組100之間連接處理元件116的行匯流排204可允許在處理裝置200內進行資料和指令的可受控制二維通訊。這可增進大影像的處理，其可映射至矩形或方形區域，以減少或最小化通訊距離及因此之功率需求。因此，匯流排132、204所提供之可受控制的二維通訊可允許處理影像或類似資訊的神經網路的有效率實施。 Column buses 132 connecting processing elements 116 within bank groups 100 and row buses 204 connecting processing elements 116 between bank groups 100 may allow controlled two-dimensional communication of data and instructions within processing device 200 . This enhances the processing of large images, which can be mapped to rectangular or square areas to reduce or minimize communication distance and therefore power requirements. Thus, the controlled two-dimensional communication provided by buses 132, 204 may allow efficient implementation of neural networks that process images or similar information.

此外，配置SIMD控制器108以匹配排組100的高度(H)可允許以具空間效率方式來放置多個控制排組100，一個在另一個上方地，在行方向中傾斜。這允許產生幾乎方形的陣列，其有利於封裝，即使單獨的排組100相對於其高度(亦即，在行維度或總高度H的主要部分中)是非常寬的(亦即，在列維度或寬度W中)。對於各種實際的RAM電路而言、以及對於具有大量處理器以分攤單一SIMD控制器108的面積與功率成本而言，這會是有用的。 Additionally, configuring the SIMD controller 108 to match the height (H) of the rows 100 may allow multiple control rows 100 to be placed in a space-efficient manner, one above the other, tilted in the row direction. This allows the production of almost square arrays, which facilitates packaging even though the individual row groups 100 are opposite is very wide (i.e., in the column dimension or width W) in its height (i.e., in the main part of the row dimension or total height H). This can be useful for a variety of practical RAM circuits, and for having a large number of processors to amortize the area and power costs of a single SIMD controller 108.

在一種例示實施中，參閱第二圖和第三圖，處理裝置200係包括32個計算記憶體排組100，其各具有包含4096行記憶體的記憶體單元陣列104。在每一個排組100內，每一行142含有連接至一處理元件116的192個記憶體位元。 In an exemplary implementation, referring to Figures 2 and 3, the processing device 200 includes 32 computing memory banks 100, each having a memory cell array 104 containing 4096 rows of memory. Within each bank 100 , each row 142 contains 192 memory bits connected to a processing element 116 .

從上述應可顯知，處理裝置200可包括計算記憶體排組100之一堆疊結構，以增加處理能力並允許大規模的並行運作，同時仍維持排組100的具空間效率的整體佈局，並減少或最小化在排組100之間往返傳送資料時所花費的能量。在行方向上可複製單一排組100的優點，此外，可為排組100提供通訊方式。 It should be apparent from the above that the processing device 200 may include a stacked structure of the computing memory array 100 to increase processing capabilities and allow for large-scale parallel operations while still maintaining a space-efficient overall layout of the array 100, and Reduce or minimize the energy expended in transmitting data back and forth between rows 100. The advantages of a single row group 100 can be replicated in the row direction, and in addition, a communication method can be provided for the row group 100 .

第四圖說明從複數個計算記憶體排組(例如排組100)建構之處理裝置220的具體實施例。處理裝置220與本文所述其他裝置類似，為求清晰係省略了重複的描述。可參照其他具體實施例的相關說明，其中相同的元件符號代表相同的構件。 Figure 4 illustrates a specific embodiment of a processing device 220 constructed from a plurality of compute memory banks (eg, bank 100). The processing device 220 is similar to other devices described herein, and repeated descriptions are omitted for clarity. Reference may be made to the related descriptions of other specific embodiments, in which the same reference numerals represent the same components.

複數個SIMD控制器108、222係被提供至計算記憶體排組100以對排組100提供指令(視情況為資料)。在這個具體實施例中，SIMD控制器222是包含在至少兩個計算記憶體排組100內。亦即，SIMD控制器222可由多個計算記憶體排組100共享。任何數量的其他排組100可包括專用的、或共享的SIMD控制器108、222。 A plurality of SIMD controllers 108, 222 are provided to the computing memory bank 100 to provide instructions (and optionally data) to the bank 100. In this particular embodiment, SIMD controller 222 is included within at least two computing memory banks 100 . That is, the SIMD controller 222 may be shared by multiple computing memory banks 100 . Any number of other banks 100 may include dedicated, or shared SIMD controllers 108, 222.

選擇共享控制器222的排組100對具有其自己的專用控制器108的排組100之比例，可允許實施處理元件的平衡利用，其朝向增加專用控制器108的數量驅動，因此可以良好的面積和功率效率來處理較小的問題，其朝向增加共享的控制器222的數量驅動以限制重複。 Choosing the ratio of banks 100 that share controllers 222 to banks 100 that have their own dedicated controllers 108 may allow for balanced utilization of processing elements to be implemented, which drives toward increasing the number of dedicated controllers 108 and thus enabling good area utilization. and power efficiency to deal with smaller problems, which is towards increasing Increase the number of shared controllers 222 driven to limit duplication.

第五圖說明從複數個計算記憶體排組(例如排組100)所建構的處理裝置240的具體實施例。處理裝置240與本文所述其他裝置類似，為求清晰則省略了重複的描述。可參照其他具體實施例的相關描述，其中相同的元件符號代表相同的構件。 Figure 5 illustrates a specific embodiment of a processing device 240 constructed from a plurality of compute memory banks (eg, bank 100). The processing device 240 is similar to other devices described herein, and repeated descriptions are omitted for clarity. Reference may be made to the related descriptions of other specific embodiments, in which the same reference numerals represent the same components.

在複數個計算記憶體排組100中，至少一個排組100包括一輸入/輸出電路242以供軟體驅動的輸入/輸出。在這個具體實施例中，最下方的排組100包括一輸入/輸出電路242，連接至其SIMD控制器108。軟體驅動的輸入/輸出可由另一裝置提供，例如通用處理器，其與處理裝置240共同位於同一較大裝置中，例如平板電腦、智慧型電話、穿戴式裝置等。輸入/輸出電路242可包括串行周邊介面(SPI)、雙倍資料率(DDR)介面、行動產業處理器介面(MIPI)、快捷周邊元件互連(PCIe)等。可提供任何數量的輸入/輸出電路242已支援任何數量的這類介面。 Among the plurality of computing memory banks 100, at least one bank 100 includes an input/output circuit 242 for software-driven input/output. In this particular embodiment, the lowermost bank 100 includes an input/output circuit 242 connected to its SIMD controller 108 . The software-driven input/output may be provided by another device, such as a general-purpose processor, co-located with the processing device 240 in the same larger device, such as a tablet, smart phone, wearable device, etc. The input/output circuit 242 may include a serial peripheral interface (SPI), a double data rate (DDR) interface, a mobile industrial processor interface (MIPI), a peripheral component interconnect express (PCIe), etc. Any number of input/output circuits 242 may be provided to support any number of such interfaces.

輸入/輸出電路242可被配置以使SIMD控制器108執行運作。SIMD控制器108可被配置以使輸入/輸出電路242執行運作。 Input/output circuitry 242 may be configured to cause SIMD controller 108 to perform operations. SIMD controller 108 may be configured to cause input/output circuitry 242 to perform operations.

輸入/輸出電路242可被配置以重置SIMD控制器108、以及讀取和寫入SIMD控制器108的暫存器。透過SIMD控制器108的暫存器，輸入/輸出電路242可使SIMD控制器108執行運作，包括寫入指令記憶體。因此，起始程序可包括重置SIMD控制器108、對指令記憶體的底部寫入啟動碼、以及釋放重置，在該時點執行啟動碼。 The input/output circuitry 242 may be configured to reset the SIMD controller 108 and to read and write the registers of the SIMD controller 108 . Through the registers of the SIMD controller 108, the input/output circuit 242 allows the SIMD controller 108 to perform operations, including writing instructions to the memory. Therefore, the initialization process may include resetting the SIMD controller 108, writing the startup code to the bottom of the instruction memory, and releasing the reset, at which point the startup code is executed.

此外，複數個SIMD控制器108可被連接至一控制器匯流排244，以於SIMD控制器108之間提供相互通訊。輸入/輸出電路242也可被連接至控制器匯流排244以與SIMD控制器108通訊，而且這類連接可如所述般透過其排組100的SIMD控制器108進行，或是直接地進行。控制器匯流排244可允許資料和指令的共享、以及處理運作的協調。 Additionally, a plurality of SIMD controllers 108 may be connected to a controller bus 244 to provide intercommunication between the SIMD controllers 108 . Input/output circuitry 242 may also be connected to controller bus 244 for communication with SIMD controller 108, and such connection may be made through the SIMD controller 108 of its array 100, as described, or directly. Controller bus 244 allows Sharing of data and instructions, and coordination of processing operations.

可提供任何數量的控制器匯流排244。控制器匯流排244可被分段為任何適當程度。舉例而言，第一控制器匯流排244可為一完整高度的匯流排，其連接所有的SIMD控制器108；第二控制器匯流排244可被分段為兩個一半高度的匯流排，其將SIMD控制器108分為兩個群組；而第三與第四控制器匯流排244可以各自分為四個分段。因此，可定義不同群組的SIMD控制器108來協調運作。一給定的SIMD控制器108可以預訂任何連接的控制器匯流排244。 Any number of controller busses 244 may be provided. Controller bus 244 may be segmented to any suitable extent. For example, the first controller bus 244 can be a full-height bus that connects all SIMD controllers 108; the second controller bus 244 can be segmented into two half-height busses that connect all SIMD controllers 108. The SIMD controller 108 is divided into two groups; and the third and fourth controller busses 244 can each be divided into four segments. Therefore, different groups of SIMD controllers 108 can be defined to coordinate operations. A given SIMD controller 108 can subscribe to any connected controller bus 244.

當SIMD控制器108要以主/從式方案運作時，作為從屬裝置而運作的SIMD控制器108除了從其主/從式控制器244中繼序列到其連接的計算記憶體排組100以外什麼也不做。指標暫存器、迴圈計數器、堆疊和指令記憶體作為從屬裝置而運作的SIMD控制器108會是不活動的。 When the SIMD controller 108 is to operate in a master/slave scheme, the SIMD controller 108 operating as a slave does nothing but relay sequences from its master/slave controller 244 to its connected computing memory bank 100 Don't do it either. The SIMD controller 108 operating as slave devices will be inactive with the pointer register, loop counter, stack and instruction memory.

此外，在這個具體實施例中，複數個通用列匯流排246、248被提供以於每一個排組100中連接處理元件116和SIMD控制器108。列匯流排246、248可包括一主要列匯流排246，其從SIMD控制器108單方向至排組100中的所有處理元件116，以及用於在排組100中處理元件116的群組之間進行本地雙向通訊的一分段列匯流排248。主要列匯流排246將SIMD控制器108連接至每一個排組100的處理元件116，以分配運算碼和資料。分段列匯流排248用於本地操作，例如置換和資訊的管線內處理元件傳送。 Additionally, in this particular embodiment, a plurality of universal column buses 246 , 248 are provided to connect the processing elements 116 and the SIMD controller 108 in each bank group 100 . Column buses 246 , 248 may include a main column bus 246 unidirectionally oriented from the SIMD controller 108 to all processing elements 116 in the bank 100 and between groups of processing elements 116 in the bank 100 A segmented column bus 248 for local bidirectional communication. The main column bus 246 connects the SIMD controller 108 to the processing elements 116 of each bank 100 to distribute operation codes and data. Segmented column bus 248 is used for local operations such as substitutions and in-pipeline processing element transfers of information.

從上述說明應可知，控制器匯流排244提供計算記憶體排組100的運作配置彈性。此外，輸入/輸出電路242允許SIMD控制器108管理和協調裝置240的運作。 It should be understood from the above description that the controller bus 244 provides operational configuration flexibility of the computing memory array 100 . Additionally, input/output circuitry 242 allows SIMD controller 108 to manage and coordinate the operation of device 240 .

第六圖說明可用於計算記憶體排組(例如排組100)的處理元件116的具體實施例。 Figure 6 illustrates a specific embodiment of a processing element 116 that may be used in a computational memory bank (eg, bank 100).

處理元件116包括內部暫存器260、算術邏輯單元(ALU)262、通訊狀態264和內部狀態266。內部暫存器260和通訊狀態264是經由內部匯流排268(其可為一差動匯流排)連接至一行的記憶體142。內部暫存器260可實施為接觸的6T記憶體胞元，其中除了對位元線的標準輸出以外，暫存器的狀態還可以被外部電路直接讀取。 The processing element 116 includes an internal register 260, an arithmetic logic unit (ALU) 262, Communication status 264 and internal status 266. Internal registers 260 and communication status 264 are connected to a row of memory 142 via an internal bus 268 (which may be a differential bus). The internal register 260 can be implemented as a contact 6T memory cell, where in addition to the standard outputs to the bit lines, the state of the register can also be read directly by external circuitry.

內部匯流排268可被寫入記憶體行142、內部暫存器260、ALU 262和通訊狀態264中，且可從其讀取。 Internal bus 268 can be written to and read from memory rows 142, internal registers 260, ALU 262, and communication status 264.

內部暫存器260可包括複數個通用暫存器(例如，R0、R1、R2、R3)、複數個靜態暫存器(例如，X、Y)、對於相鄰處理元件116為可存取的複數個通訊暫存器(例如，Xs、Ys)、以及一遮蔽位元(例如，K)。內部暫存器260可被連接至內部匯流排268以被寫入、或寫入其他暫存器，以及與ALU 262傳遞資訊。SIMD控制器108可控制哪些內部暫存器260要被寫入和讀取、以及遮蔽位元K是否要被覆蓋。 The internal register 260 may include a plurality of general registers (eg, R0, R1, R2, R3), a plurality of static registers (eg, X, Y), accessible to adjacent processing elements 116 A plurality of communication registers (for example, Xs, Ys), and a masking bit (for example, K). Internal register 260 may be connected to internal bus 268 to be written to, or to other registers, and to communicate information with ALU 262 . The SIMD controller 108 can control which internal registers 260 are to be written to and read from, and whether the shadow bit K is to be overwritten.

內部暫存器260可被配置以用於與ALU 262進行算術運算，例如加總和差。一般而言，內部暫存器260可被用以計算任何函數。 Internal register 260 may be configured for performing arithmetic operations with ALU 262, such as sums and differences. Generally speaking, the internal register 260 can be used to calculate any function.

靜態暫存器X、Y可被配置以經由與靜態暫存器X、Y相關聯而且複製靜態暫存器X、Y的值(亦即，Xs、Ys從屬於X、Y)的通訊暫存器Xs、Ys來提供資訊至相同排組或不同排組中的相鄰處理元件116。連接的處理元件116的通訊狀態264從本地通訊暫存器Xs、Ys取值。因此，ALU 262可被配置以於連接的處理元件116之間以同步或管線方式傳遞資料(例如，執行移位)。SIMD控制器108可提供特定於通訊暫存器Xs、Ys的選通脈衝，使得該選通脈衝可被跳過，且可節省其功率。處理元件116中的遮蔽位元K保護同一處理元件116中的靜態暫存器X和Y、而非通訊暫存器Xs和Ys。 Static registers X, Y may be configured via communication buffers that are associated with static registers The processors Xs, Ys are used to provide information to adjacent processing elements 116 in the same bank or in different banks. The communication status 264 of the connected processing element 116 is obtained from the local communication registers Xs, Ys. Accordingly, ALU 262 may be configured to transfer data (eg, perform shifting) in a synchronous or pipeline manner between connected processing elements 116 . The SIMD controller 108 can provide a strobe pulse specific to the communication registers Xs, Ys, so that the strobe pulse can be skipped and its power can be saved. The mask bit K in the processing element 116 protects the static registers X and Y in the same processing element 116 but not the communication registers Xs and Ys.

在這個示例中，通訊暫存器Xs、Ys可由同一排組中的相鄰處理元件116讀取，而且通訊暫存器Ys可由相鄰排組中的同一行中的處理元件116 讀取。亦即，暫存器Xs、Ys可藉由例如列連接128在列方向中傳遞資訊，而暫存器Ys則可藉由例如行連接202在行方向中傳遞資訊。其他示例也是可被預期的，例如將暫存器Ys限制於行通訊，而僅以暫存器Xs用於列通訊。 In this example, communication registers Xs, Ys are readable by adjacent processing elements 116 in the same bank, and communication register Ys is readable by processing elements 116 in the same row in adjacent banks Read. That is, the registers Xs and Ys can transmit information in the column direction through, for example, the column connection 128, and the register Ys can transmit information in the row direction through, for example, the row connection 202. Other examples are contemplated, such as limiting register Ys to row communications and using register Xs only for column communications.

通訊暫存器Xs、Ys可被實施為從屬閂鎖級，因此它們的值可被其他處理元件116使用，而不會產生競爭條件。 The communication registers Xs, Ys can be implemented as slave latch stages so that their values can be used by other processing elements 116 without creating race conditions.

遮蔽位元K可被配置以禁止所有寫入操作(例如寫入至記憶體行142、暫存器260、及/或列匯流排246、248)，除非它被連接的SIMD控制器108覆蓋。遮蔽位元K可被配置以於高位時禁止回寫；這可包括遮蔽位元K禁用其本身，因此除非遮蔽位元K被覆蓋，否則對遮蔽位元K的連續寫入將禁用線性陣列中越來越多的處理元件116。這具有遮蔽位元K可如同其他位元般被精確建立之實施優點、以及遮蔽位元K實現嵌套條件狀態(亦即「若(if)」語句)而不增加複雜度之編程優點。 Shadow bit K may be configured to disable all write operations (eg, writes to memory row 142, register 260, and/or column buses 246, 248) unless it is overridden by the connected SIMD controller 108. Masking bit K can be configured to disable writeback when high; this can include masking bit K disabling itself, so that unless masking bit K is overwritten, subsequent writes to masking bit K will disable writeback in the linear array. More and more processing elements 116 are coming. This has the implementation advantage that the shaded bit K can be established exactly like other bits, and the programming advantage that the shaded bit K implements nested conditional states (i.e., "if" statements) without increasing complexity.

ALU 262可包括多重層級的多工器(例如兩個層級)。ALU 262可被配置以選擇來自內部暫存器260、通訊狀態264和內部狀態266的輸入，並且允許對此輸入計算任意函數。函數可由經匯流排246、248所傳遞的資訊予以定義。 ALU 262 may include multiple levels of multiplexers (eg, two levels). ALU 262 may be configured to select inputs from internal registers 260, communication state 264, and internal state 266, and allow arbitrary functions to be calculated on these inputs. Functions may be defined from information passed via buses 246, 248.

通訊狀態264包括根據其他處理元件116的通訊暫存器(例如Xs、Ys)的資訊。通訊狀態264可被用於移位和類似的運作。 The communication status 264 includes information based on communication registers (eg, Xs, Ys) of other processing elements 116 . Communication status 264 may be used for shifting and similar operations.

通訊狀態264可包括來自相同排組100中相鄰處理元件116的通訊暫存器Xs的X-相鄰狀態Xm、Xp。通訊狀態Xm可為具有較低位址的相鄰處理元件116的暫存器Xs的值(亦即，「m」表示「減去(minus)」)；通訊狀態Xp可為具有較高位址的相鄰處理元件116的暫存器Xs的值(亦即，「p」表示「加上(plus)」)。在處理元件116的線性陣列的每一端部處的X-相鄰狀態Xm、Xp可被設定為特定值，例如0。在其他具體實施例中，在處理元件116的線性陣列的每一端部處的X-相鄰狀態Xm、Xp可以從相對端部的通訊暫存器Xs連線以取得其值，使得值可「滾動(roll)」。 Communication status 264 may include X-adjacent states Xm, Xp from communication registers Xs of adjacent processing elements 116 in the same bank 100 . The communication status Xm may be the value of the register Xs of the adjacent processing element 116 with a lower address (ie, "m" means "minus"); The value of the register Xs of the adjacent processing element 116 (ie, "p" means "plus"). The X-neighbor states Xm, Xp at each end of the linear array of processing elements 116 may be set to a specific value, such as zero. In other embodiments, on processing element 116 The X-adjacent states Xm,

為了增進在排組100內進行列基礎通訊的更大能力，通訊狀態264可包括來自同一排組100的相鄰處理元件116的通訊暫存器Ys的其他X-相鄰狀態Yxm、Yxp。通訊狀態Yxm可為具有較低位址的相鄰處理元件116的暫存器Ys的值；通訊狀態Yxp可為具有較高位址的相鄰處理元件116的暫存器Ys的值。在處理元件116的線性陣列的每一個端部處的其他X-相鄰狀態Yxm、Yxp可被設定為特定值，例如0。在其他具體實施例中，在處理元件116的線性陣列的每一個端部處的其他X-相鄰狀態Yxm、Yxp可以從相對端部的通訊暫存器Ys連線以取得其值，使得值可「滾動(roll)」。 To facilitate greater capability for column-based communications within a bank 100 , communication states 264 may include other X-adjacent states Yxm, Yxp from communication registers Ys of adjacent processing elements 116 of the same bank 100 . The communication status Yxm may be the value of the register Ys of the adjacent processing element 116 with a lower address; the communication status Yxp may be the value of the register Ys of the adjacent processing element 116 with a higher address. The other X-neighbor states Yxm, Yxp at each end of the linear array of processing elements 116 may be set to a specific value, such as zero. In other embodiments, the other X-neighbor states Yxm, Yxp at each end of the linear array of processing elements 116 may be wired to obtain their values from the communication register Ys at the opposite end, such that the values Can "roll".

通訊狀態264可包括來自同一排組100中離一固定位址距離遠(例如8個位元)的處理元件116的通訊暫存器Xs的X-遠端狀態Xm8、Xp8。通訊狀態Xm8可為具有低了8個位元位址的處理元件116的暫存器Xs的值。通訊狀態Xp8可為具有高了8個位元位址的處理元件116的暫存器Xs的值。靠近處理元件116的線性陣列的每一個端部的X-遠端狀態Xm8、Xp8可被設定為特定值，例如0。在其他的具體實施例中，靠近處理元件116的線性陣列的每一個端部的X-遠端狀態Xm8、Xp8可被連線以從靠近相對端部的一對應通訊暫存器Xs取值，使得值可「滾動(roll)」固定位址距離。 Communication status 264 may include X-remote status Xm8, The communication status Xm8 may be the value of the register Xs of the processing element 116 with an 8-bit lower address. The communication state Xp8 may be the value of the register Xs of the processing element 116 with an address that is 8 bits higher. The X-distal states Xm8, Xp8 near each end of the linear array of processing elements 116 may be set to a specific value, such as zero. In other embodiments, the X-remote states Xm8, Allows the value to "roll" a fixed address distance.

通訊狀態264可包括來自於相鄰排組100中同一行中的處理元件116的通訊暫存器Ys的Y-相鄰狀態Ym、Yp。通訊狀態Ym可為相鄰排組100中具有較低位址的對應處理元件116的暫存器Ys的值。通訊狀態Yp可為相鄰排組100中具有較高位址的對應處理元件116的暫存器Ys的值。可實施固定端值或滾動，如上述說明。 The communication status 264 may include Y-adjacent states Ym, Yp from the communication register Ys of the processing element 116 in the same row of the adjacent bank 100 . The communication status Ym may be the value of the register Ys of the corresponding processing element 116 with a lower address in the adjacent bank 100 . The communication status Yp may be the value of the register Ys of the corresponding processing element 116 with a higher address in the adjacent bank 100 . Fixed thresholds or rolling can be implemented, as explained above.

SIMD控制器108可配置以存取X-遠端狀態Xp8、Xm8以及在處理元件線性陣列中最末端的處理元件116的暫存器Xs、Ys，使得最末端的和鄰近的處理元件116的靜態暫存器X、Y值可被讀取。 SIMD controller 108 may be configured to access X-remote states Xp8, The registers Xs and Ys of the last processing element 116 in the linear array of processing elements are configured so that the static registers X and Y values of the last and adjacent processing elements 116 can be read.

通訊狀態264可進一步包括一進位輸入Ci和另一輸入Zi，其可代表符號擴展。 Communication state 264 may further include a carry input Ci and another input Zi, which may represent sign extension.

進位輸入Ci可從一相鄰處理元件116的進位輸出Co非同步地波動。最末端行的進位輸入Ci可由SIMD控制器108提供。當排組100被分為兩半時，排組100的每一半的最末端行的進位輸入Ci可由SIMD控制器108提供。可預期進位輸入Ci會隨時間單調遞減。 The carry input Ci may fluctuate asynchronously from the carry output Co of an adjacent processing element 116 . The carry input Ci of the last row may be provided by the SIMD controller 108 . When the bank 100 is divided into two halves, the carry input Ci of the end-most row of each half of the bank 100 may be provided by the SIMD controller 108 . The carry input Ci can be expected to decrease monotonically with time.

符號擴展輸入Zi可於與進位波紋相反的方向中從相鄰的處理元件116的總和Z非同步波動。與進位輸入Ci的最末端行相反，最末端行的符號擴展輸入Zi可由SIMD控制器108提供。當排組100被分為兩半時，排組100的每一半的最末端行的符號擴展輸入Zi可由SIMD控制器108提供。符號擴展輸入Zi可預期是隨時間單調遞減。輸入Zi也可用以波動一任意函數。 The sign extension input Zi may ripple asynchronously from the sum Z of adjacent processing elements 116 in the opposite direction to the carry ripple. As opposed to the last row of carry input Ci, the last row of sign extension input Zi may be provided by the SIMD controller 108 . When the row group 100 is divided into two halves, the sign extension input Zi for the end-most row of each half of the row group 100 may be provided by the SIMD controller 108 . The sign-extended input Zi can be expected to decrease monotonically with time. The input Zi can also be used to fluctuate an arbitrary function.

SIMD控制器108可被配置以從處理元件116的線性陣列的一端讀取進位輸出Co，並且在處理元件116的線性陣列的相對端處讀取輸出Zo(例如，符號擴展輸出)。 SIMD controller 108 may be configured to read carry output Co from one end of the linear array of processing elements 116 and read output Zo (eg, a sign-extended output) at an opposite end of the linear array of processing elements 116 .

一給定處理元件之通訊狀態264可被實施作為與其他處理元件116的直接連接128的端點。 A given processing element's communication status 264 may be implemented as an endpoint of direct connections 128 to other processing elements 116 .

內部狀態266可包括位址位元An、高位位元HB和低位位元LB。位址位元An、高位位元HB、低位位元LB可用於將一處理元件116置於處理元件116的線性陣列中複數個處理元件116的前後關係中。 Internal state 266 may include address bit An, high-order bit HB, and low-order bit LB. The address bits An, high-order bits HB, and low-order bits LB may be used to place a processing element 116 in the context of a plurality of processing elements 116 in a linear array of processing elements 116 .

位址位元An係經硬編碼，使得每一個處理元件116在排組100內都是可唯一定址的。在每一排組有4096個處理元件的示例中，可使用12個位址位元(A0-A11)。在其他具體實施例中，位址位元An可被儲存在暫存器中，而且可由SIMD控制器108加以組態。 The address bits are hard-coded such that each processing element 116 is uniquely addressable within array 100 . In the example of 4096 processing elements per bank, 12 address bits (A0-A11) can be used. In other embodiments, the address bit An may be stored in a temporary register, And can be configured by SIMD controller 108.

SIMD控制器108可為排組100選擇一精確度等級，而可從所選擇的精確度等級得出高位位元HB和低位位元LB。精確度等級選擇可對處理元件116識別要參照哪一個位址位元An來計算高位位元HB和低位位元LB。SIMD控制器108可藉由對排組100中的全部處理元件116傳遞一精確度訊號來做出精確度等級選擇。精確度訊號可指示哪一個位址位元An要被作為排組100之精確度指示位址位元An。精確度訊號可以是相當於位址位元An的數量的數條線路上的一單熱訊號，或可以是一編碼訊號，例如一4-位元訊號，其唯一地識別一個位址位元An。 The SIMD controller 108 may select a precision level for the array 100 and the upper bits HB and lower bits LB may be derived from the selected accuracy level. The accuracy level selection may identify to the processing element 116 which address bit An is to be referenced for calculating the upper bit HB and the lower bit LB. SIMD controller 108 may make accuracy level selections by passing an accuracy signal to all processing elements 116 in array 100 . The accuracy signal may indicate which address bit An is to be used as the accuracy indicating address bit An of the array 100 . The accuracy signal may be a single hot signal on a number of lines equivalent to the number of address bits An, or it may be a coded signal, such as a 4-bit signal, which uniquely identifies an address bit An. .

高位位元HB和低位位元LB可劃分處理元件116的群組以供多位元算術用。這些群組在尺寸上可被固定而且冪次為2。 The high-order bits HB and low-order bits LB may group processing elements 116 for multi-bit arithmetic. These groups can be fixed in size and are powers of 2.

低位位元LB定義了群組中的最低位元。當處理元件116的精確度指示位址位元An未被設定(例如0)、且低位位元方向中的下一個處理元件116的精確度指示位址位元An被設定(例如1)時，低位位元LB在一特定處理元件116中被設定(例如設定為1)。 The low bit LB defines the lowest bit in the group. When the accuracy indication address bit An of the processing element 116 is not set (eg, 0), and the accuracy indication address bit An of the next processing element 116 in the low-order bit direction is set (eg, 1), The lower bit LB is set (for example, set to 1) in a specific processing element 116 .

高位位元HB定義了群組中的最高位位元。當處理元件116的精確度指示位址位元An被設定(例如1)且在高位位元方向中的下一個處理元件116的精確度指示位址位元An未被設定(例如0)時，高位位元HB在一特定處理元件116中被設定(例如設定為1)。 The high bit HB defines the highest bit in the group. When the accuracy indication address bit An of processing element 116 is set (eg, 1) and the accuracy indication address bit An of the next processing element 116 in the upper bit direction is not set (eg, 0), The high-order bit HB is set (for example, set to 1) in a specific processing element 116 .

高位位元HB和低位位元LB中只有一個需要被計算。若處理元件116具有其高位位元HB設定，則下一個處理元件116的低位位元LB可被設定。相反地，若處理元件116不具其高位位元HB設定，則下一個處理元件116的低位位元LB應不被設定。 Only one of the high-order bit HB and the low-order bit LB needs to be calculated. If processing element 116 has its upper bit HB set, then the lower bit LB of the next processing element 116 can be set. On the contrary, if the processing element 116 does not have its high-order bit HB set, then the low-order bit LB of the next processing element 116 should not be set.

用於經由高位位元HB和低位位元LB來設定位址和設定精確度的技術可適用於大尾數端和小尾數端慣例。 Used to set the address and set the accuracy via the high-order bit HB and the low-order bit LB The technique works with both big-endian and little-endian conventions.

高位位元HB和低位位元LB可用以限制進位輸入Ci和符號擴展輸入Zi的傳播，從而滿足排組100的運作精確度。 The high-order bits HB and low-order bits LB can be used to limit the propagation of the carry input Ci and the sign extension input Zi to meet the operational accuracy of the array 100 .

如第七A圖所示，ALU 262可包括兩個層級的多工器。第一層級可包括多工器280、282、284，而第二層級可包括多工器286、288。為求空間和能量效率，多工器可以動態邏輯實施。SIMD控制器108可提供時鐘以對多工器進行閘控。 As shown in Figure 7A, the ALU 262 may include two levels of multiplexers. The first level may include multiplexers 280, 282, 284, while the second level may include multiplexers 286, 288. For space and energy efficiency, multiplexers can be implemented as dynamic logic. SIMD controller 108 may provide a clock to gate the multiplexers.

第一層級多工器280、282、284可被配置以根據來自運算數匯流排124的輸入為第二層級多工器286、288提供選擇位元(例如三個選擇位元)。第一層級多工器280、282、284可為單熱輸入而配置，使得其中一個輸入可被選擇。對第一層級多工器280、282、284的輸入可包括在處理元件116處可用的各種位元中的任何位元，例如內部暫存器260、通訊狀態264以及內部狀態266。對第二層級多工器286、288的輸出可包括差動訊號。可使用並行的N型金屬氧化物半導體邏輯(NMOS)裝置來實施第一層級多工器280、282、284。 The first level multiplexers 280 , 282 , 284 may be configured to provide selection bits (eg, three selection bits) to the second level multiplexers 286 , 288 based on input from the operand bus 124 . The first level multiplexers 280, 282, 284 can be configured for single hot inputs such that one of the inputs can be selected. Inputs to the first level multiplexers 280, 282, 284 may include any of the various bits available at the processing element 116, such as internal registers 260, communication status 264, and internal status 266. The outputs to the second level multiplexers 286, 288 may include differential signals. The first level multiplexers 280, 282, 284 may be implemented using parallel N-type metal oxide semiconductor logic (NMOS) devices.

內部暫存器260、通訊狀態264和內部狀態266可被提供作為輸入，以允許執行一任意函數。舉例而言，暫存器X、Y和R1-R4以及通訊狀態Xp、Yp、Xm、Ym、Xp8、Xm8、Yxp和Yxm可被用於執行算術、移位等等，位址位元A0-A11可被用於對特定的處理元件指定特定值，以供多位元值的翻轉等。對於可被執行的任意函數則無特定限制。 Internal registers 260, communication status 264, and internal status 266 may be provided as inputs to allow execution of an arbitrary function. For example, registers X, Y and R1-R4 and communication states Xp, Yp, Xm, Ym, Xp8, A11 can be used to assign specific values to specific processing elements, for flipping multi-bit values, etc. There are no specific restrictions on any functions that can be executed.

第二層級多工器286、288可包括一主要匯流排多工器286和一分段匯流排多工器288。主要匯流排多工器286可被配置以經由主要列匯流排246接收輸入，例如來自SIMD控制器108的真值表，其可為8位元。分段匯流排多工器288可被配置以經由分段列匯流排246接收輸入，例如來自SIMD控制器108的真值表，其可為8位元。第二層級多工器286、288計算一任意函數，其可透過匯流排246、248定義。此函數可對第一層級多工器280、282、284所選擇、且被提供至第二層級多工器286、288作為選擇輸入的運算數(例如3位元)運作。由來自第一層級多工器280、282、284的差動訊號所驅動的NMOS開關樹可用以實施第二層級多工器286、288。 Second level multiplexers 286, 288 may include a main bus multiplexer 286 and a segment bus multiplexer 288. Primary bus multiplexer 286 may be configured to receive input via primary column bus 246 , such as a truth table from SIMD controller 108 , which may be 8 bits. Segment bus multiplexer 288 may be configured to receive input via segment column bus 246 , such as a truth table from SIMD controller 108 , which may be 8 bits. The second level multiplexers 286, 288 compute an arbitrary function, It can be defined through buses 246, 248. This function operates on an operand (eg, 3 bits) selected by the first level multiplexer 280, 282, 284 and provided as a selection input to the second level multiplexer 286, 288. NMOS switch trees driven by differential signals from the first level multiplexers 280, 282, 284 can be used to implement the second level multiplexers 286, 288.

含有ALU 262的處理元件116的狀態資訊被提供至第一層級多工器280、282、284，其控制輸入是由相關聯的SIMD控制器108經由運算數匯流排124提供至排組100中的所有處理元件116。因此，基於使用SIMD控制器108所選擇的運算數，經由運算數匯流排124，即可於所有的處理元件116間執行操作，而且此操作可基於整個排組100間經由主要列匯流排246所共享的操作或其他資訊、及/或在分段列匯流排246上本地共享的操作或其他資訊。 Status information from the processing elements 116 including the ALU 262 is provided to the first level multiplexers 280, 282, 284 whose control inputs are provided to the bank 100 by the associated SIMD controller 108 via the operand bus 124. All processing elements 116. Thus, operations can be performed across all processing elements 116 via operand bus 124 based on the operands selected using SIMD controller 108 , and the operations can be performed across the entire array 100 via primary column bus 246 Shared operations or other information, and/or locally shared operations or other information on segmented column bus 246 .

ALU 262可被用於寫入匯流排204、246、248。用於寫入的匯流排線路可由第一層級多工器280、282、284的輸出(亦即3-位元輸出)予以選擇，以選擇八條線路其中一條。 ALU 262 may be used to write to buses 204, 246, 248. The bus lines used for writing are selected by the outputs (ie, 3-bit outputs) of the first level multiplexers 280, 282, 284 to select one of the eight lines.

第七B圖說明根據另一具體實施例的ALU 290。ALU 290類似於ALU 262，除了並不提供固定的位址位元A0-A11作為對第一層級多工器280、282、284的輸入以外。ALU 290是一較簡單的ALU，其並不允許與處理元件116的位址有關的函數。預期有各種其他的ALUs可採用為ALU 262所示之輸入子集。 Figure 7B illustrates ALU 290 according to another specific embodiment. ALU 290 is similar to ALU 262 except that it does not provide fixed address bits A0-A11 as inputs to the first level multiplexers 280, 282, 284. ALU 290 is a simpler ALU that does not allow functions related to the address of processing element 116 . It is contemplated that various other ALUs may be used as the input subset shown for ALU 262.

第七C圖說明根據另一具體實施例之ALU 295。ALU 295與ALU 262類似，除了是提供一可選擇的位址位元An而非提供固定的位址位元A0-A11作為對第一層級多工器280的輸入以外。因此，ALU 295可存取一選擇的位址位元以供其計算用。 Figure 7C illustrates an ALU 295 according to another specific embodiment. ALU 295 is similar to ALU 262, except that it provides a selectable address bit An instead of fixed address bits A0-A11 as input to the first level multiplexer 280. Therefore, ALU 295 can access a selected address bit for its calculation.

第八圖說明ALU 262之一例示算術運算表，其示出了進位輸出Co和總和Z之真值表。例示運算是加法，而且其他的運算可直接被實施。 The eighth figure illustrates one of the exemplary arithmetic operations tables of the ALU 262 showing the truth table for the carry output Co and the sum Z. The example operation is addition, and other operations can be performed directly.

第一層級多工器280、282、284可提供運算數值(例如暫存器R0和R1的值)和對第二層級多工器286、288的進位輸入Ci，其分別經由主要列匯流排246接收總和Z真值表，以及經由分段列匯流排248接收進位輸出Co真值表。因此，第二層級多工器286、288可計算總和Z和進位輸出Co。 The first level multiplexers 280, 282, 284 may provide operand values (eg, the values of registers R0 and R1) and carry inputs Ci to the second level multiplexers 286, 288 via the main column bus 246, respectively. A sum Z truth table is received, and a carry output Co truth table is received via segmented column bus 248 . Therefore, the second level multiplexers 286, 288 can calculate the sum Z and the carry output Co.

進位輸出Co和總和Z真值表可被視為加法之運算碼。在此例中，16進制中的加法運算碼是0x2b 0x69。運算碼部分0x2b是進位輸出Co真值表(即從底部到頂部讀取的Co行的位元0010 1011)，而運算碼部分0x69是總和Z真值表(即從底部到頂部讀取的Z行的位元0110 1001)。進位輸出Co運算碼部分0x2b和總和Z運算碼部分0x69分別被提供至分段的列匯流排248和主要列匯流排246，以使第二層級多工器286、288將第一層級多工器280、282、284提供的運算數與輸出總和Z和進位Co相加。 The truth table of the carry output Co and the sum Z can be regarded as the operation code of addition. In this example, the addition operation code in hexadecimal is 0x2b 0x69. The opcode part 0x2b is the carry output Co truth table (i.e. bits 0010 1011 of the Co row read from bottom to top), while the opcode part 0x69 is the sum Z truth table (i.e. Z read from bottom to top Row bits 0110 1001). The carry output Co opcode portion 0x2b and the sum Z opcode portion 0x69 are provided to the segmented column bus 248 and main column bus 246 respectively, such that the second level multiplexers 286, 288 transfer the first level multiplexer The operands provided by 280, 282, and 284 are added to the output sum Z and carry Co.

進位可作為進位輸入Ci透過處理元件116的群組傳播至進位輸出Co。進位傳播被劃定在SIMD控制器108所選擇的冪次為2的位置處，這種劃定可用於處理元件116為高位位元HB和低位位元LB。 The carry may be propagated through the group of processing elements 116 as carry input Ci to carry output Co. Carry propagation is delineated at positions selected by the SIMD controller 108 as a power of 2, which delimitation is available to the processing element 116 for the upper bit HB and the lower bit LB.

第九圖說明分段匯流排248的一個具體實施例。每一個處理元件116的分段匯流排多工器288的每一個輸入可被連接至分段匯流排248的一對應線路。分段匯流排248可由SIMD控制器108於每一個分段中預設為高位，然後保持浮動，使任何啟用的分段匯流排多工器288可以將線路拉為低位，然後閂鎖。 Figure 9 illustrates a specific embodiment of segmented bus 248. Each input of segment bus multiplexer 288 of each processing element 116 may be connected to a corresponding line of segment bus 248 . The segment bus 248 can be preset high in each segment by the SIMD controller 108 and then left floating so that any enabled segment bus multiplexer 288 can pull the line low and then latch.

SIMD控制器108具有對最末端分段的存取，且可配置以讀取和寫入最末端分段。這可以用於將資料從記憶體單元陣列104泵送至SIMD控制器108，以例如從主要記憶體加載控制器編碼。特定於處理元件116的資料可從SIMD控制器108類似地分配。包括一輸入/輸出電路242之排組100可使用此一機制進行輸入/輸出。 SIMD controller 108 has access to the end-most segment and can be configured to read and write the end-most segment. This can be used to pump data from the memory cell array 104 to the SIMD controller 108, for example to load controller code from main memory. Information specific to processing element 116 may be similarly distributed from SIMD controller 108 . The array 100 including an input/output circuit 242 may use this mechanism for input/output.

分段的匯流排248也可用以執行查找表，其中處理元件116設定它們自身的運算碼，因為分段的匯流排248可被本地寫入和讀取。 Segmented bus 248 can also be used to perform lookup tables, where processing elements 116 set their own opcodes, since segmented bus 248 can be written to and read locally.

第十圖說明一處理元件116的內部匯流排的具體實施例、以及處理元件116的例示實施細節。可使用感測-放大器結構來實施重的和輕的記憶胞元以及內部暫存器，如圖所示。 Figure 10 illustrates a specific embodiment of the internal busbars of the processing element 116, as well as exemplary implementation details of the processing element 116. Heavy and light memory cells and internal registers can be implemented using sense-amplifier structures as shown in the figure.

第十一圖至第十四圖說明可適用於本發明的計算記憶體排組100中使用的習知處理元件。雖然處理元件的某些結構/功能是已知的，但將它們適用至計算記憶體排組100則被視為本發明的一部分。 Figures 11-14 illustrate conventional processing elements suitable for use in the computing memory array 100 of the present invention. While certain structures/functions of the processing elements are known, their application to the computing memory array 100 is considered part of the present invention.

第十一圖說明先前技術的處理元件12N，其可被使用作為本發明之計算記憶體排組100中的處理元件116。處理元件12N含有一ALU，係實施作為一8：1多工器17。多工器17的輸出線路是連接到暫存器18(即靜態暫存器X)、19(即靜態暫存器Y)的資料輸入，以及連接到寫入啟動暫存器20和位元寫入17B(其可被提供至記憶體單元陣列104中的一行142)。位元讀取輸出17A可與暫存器18和19的資料輸出一起被提供至行142，以定址多工器17，並因而選擇要將來自一全域控制匯流排21的其輸入的八條運算碼線路中的哪一條連接至其輸出。在這種方式中，多工器17用於計算17A、18和19處的位元值的任意函數。此一任意函數可由全域控制匯流排21上的8位元值所表示的真值表來定義。全域控制匯流排21可以是列匯流排132，如本文中其他部分之說明。 Figure 11 illustrates a prior art processing element 12N that can be used as the processing element 116 in the computing memory bank 100 of the present invention. Processing element 12N contains an ALU implemented as an 8:1 multiplexer 17. The output lines of the multiplexer 17 are connected to the data inputs of the registers 18 (ie, static register X) and 19 (ie, static register Y), and to the write enable register 20 and the bit write Input 17B (which may be provided to a row 142 in the memory cell array 104). Bit read output 17A may be provided to line 142 along with the data outputs of registers 18 and 19 to address multiplexer 17 and thereby select the eight operations to be input to it from a global control bus 21 Which of the code lines is connected to its output. In this manner, multiplexer 17 is used to calculate an arbitrary function of the bit values at 17A, 18 and 19. This arbitrary function may be defined by a truth table represented by 8-bit values on the global control bus 21. Global control bus 21 may be column bus 132, as described elsewhere herein.

寫入啟動暫存器20可允許條件式執行。舉例而言，藉由在某些處理元件12N中而非其他處理元件12N中禁用寫入，可在所有的處理元件12N中執行相同指令，但選擇性地啟動寫入。因此，可藉由啟動寫入(其藉由計算寫入啟動作為所有處理元件12N的條件)、然後執行「則(THEN)」方塊，然後在所有處理元件12N中反轉寫入啟動、然後執行「否則(ELSE)」方塊來處理導致「則(THEN)」方塊或「否則(ELSE)」方塊的執行的條件(「若(IF)」)。 Writing to startup register 20 allows conditional execution. For example, by disabling writing in some processing elements 12N but not others, the same instructions can be executed in all processing elements 12N, but writing is selectively enabled. Therefore, one can do this by initiating a write (which is done by calculating the write enable as a condition for all processing elements 12N), then executing the THEN block, then inverting the write enable in all processing elements 12N, and then executing The "ELSE" block handles the condition ("IF") that causes the execution of the "THEN" or "ELSE" block.

除了提供ALU之8位元真值表以外，全域控制匯流排21也可提供時鐘訊號「寫入X(Write X)」、「寫入Y(Write Y)」、「寫入W/E(Write W/E)」，以使ALU資料可被計時到暫存器18、19和20中。匯流排21可進一步提供控制訊號「群組寫入(Group Write)」和「寫入「Write」」，其允許外部輸入資料寫入到記憶體中而不使用ALU。此一外部輸入資料可透過開關15N而從例如一16位元資料匯流排16被驅動至線路17B上。資料匯流排16也可用以透過此一路徑來加載暫存器18和19。 In addition to providing the 8-bit truth table of the ALU, the global control bus 21 can also provide clock signals "Write W/E)" so that ALU data can be clocked into registers 18, 19 and 20. The bus 21 may further provide control signals "Group Write" and "Write", which allow external input data to be written into the memory without using the ALU. This external input data may be driven from, for example, a 16-bit data bus 16 onto line 17B through switch 15N. Data bus 16 can also be used to load registers 18 and 19 via this path.

第十二圖說明了伊萊特(Elliott)所提出的一種習知技術之一位元處理元件，其在列方向中具有最接近的鄰近通訊。此處理元件可適用以作為本發明之計算記憶體排組100中的處理元件116。此處理元件將二次輸入和輸出加到X和Y暫存器，使每一個X暫存器可從ALU的輸出被加載到其右方(「左移位」)、或使每一個Y暫存器可從ALU的輸出被加載到其左方(「右移位」)或兩者皆可。 Figure 12 illustrates a bit processing element of a conventional technique proposed by Elliott, which has closest neighbor communication in the column direction. This processing element can be suitably used as the processing element 116 in the computing memory array 100 of the present invention. This processing element adds secondary inputs and outputs to the X and Y registers, allowing each The register can be loaded from the output of the ALU to its left ("right shift") or both.

第十三圖說明了美國專利號5,546,343中的一種習知技術的一位元處理元件，其於每一次記憶體讀取可執行兩個運作。此處理元件可被適用以作為本發明之計算記憶體排組100的處理元件116。全域控制匯流排可被加倍為16位元寬，因此其可進位兩個8位元真值表。多工器17C和17D同時計算三個本地狀態位元X、Y和記憶體的兩個函數。X和Y之值可被同時計算。 Figure 13 illustrates a conventional one-bit processing device disclosed in US Pat. No. 5,546,343, which can perform two operations per memory read. This processing element may be adapted as the processing element 116 of the computing memory array 100 of the present invention. The global control bus can be doubled to 16 bits wide, so it can carry two 8-bit truth tables. Multiplexers 17C and 17D simultaneously compute three local status bits X, Y and two functions of memory. X and Y values can be calculated simultaneously.

第十四圖說明了寇卡魯(Cojocaru)所提的一種習知技術之多位元處理元件。該處理元件包括用於算術的進位產生器增強，且具有較少的記憶體使用。此處理元件可適用以作為本發明之計算記憶體排組100中的處理元件116。一個明顯的特徵在於，X和Y暫存器已經被通用化而成為暫存器排組，在這個情況中，各有兩個暫存器(例如X和AX)，且記憶體已經被類似處理為暫存器排組的一種類型，其中一個暫存器(「M」)被來自記憶體的位元讀取所取代。唯讀位元也可被處理作為暫存器排組中的暫存器。對於低功率應用而言，會需要在低功率暫存器中快取資料，而不是重複地去參考高功率記憶體。注意在本文中其他處所描述的左右式最接近鄰近通訊也可用於此結構。 Figure 14 illustrates a conventional multi-bit processing element proposed by Cojocaru. This processing element includes carry generator enhancements for arithmetic and has less memory usage. This processing element can be suitably used as the processing element 116 in the computing memory array 100 of the present invention. An obvious feature is that the X and Y registers have been generalized into register banks, in this case there are two registers each (such as X and AX), and the memory has been treated similarly A type of register arrangement in which one register ("M") is replaced by a bit read from memory. Read-only bits can also be processed as registers in a register array. For low-power applications, there may be a need to cache data in low-power registers rather than repeatedly referencing high-power memory. Note that the left-right nearest neighbor communication described elsewhere in this article can also be used with this structure.

此處進一步增強的是「進位(Carry)」方塊的加入，其具有來自一相鄰處理元件的輸入「Carry-in」，其可與來自X和Y暫存器排組的資料結合，且其產生視情況會被傳到相反方向中下一個處理元件的「Carry Out」。暫存器S和B可被用以抑制進位傳播(「S」)和以一給定位元「B」來予以替換。舉例而言，若暫存器S被設定以抑制每四個處理元件中的進位傳播，並且以「0」來替換進位，則效果為從具有N個單一位元處理元件的計算記憶體排組100中產生具有N/4位元處理元件的系統。若需要在四個處理元件的群組中一次執行四個位元的8位元計算，則可加入用以於本地處理元件中儲存「Carry-Out」的路徑。 A further enhancement here is the addition of a "Carry" block, which has an input "Carry-in" from an adjacent processing element, which can be combined with data from the X and Y register arrays, and which The resulting "Carry Out" will be passed to the next processing element in the opposite direction as appropriate. Registers S and B can be used to suppress carry propagation ("S") and replace it with a given bit "B". For example, if register S is set up to suppress carry propagation in every four processing elements, and the carry is replaced with a "0", the effect is to queue from a compute memory with N single-bit processing elements A system with N/4-bit processing elements is produced in 100. If you need to perform 8-bit calculations four bits at a time in a group of four processing elements, you can add a path for storing "Carry-Out" in the local processing element.

第十四圖也說明了一種習知技術的分段匯流排，其中暫存器T可被用以啟動或禁用連接相鄰匯流排分段(被標示為「匯流排繫屬分段(Bus-tie segment)」)的開關。這允許一單一匯流排被切分為任意數量的較小本地匯流排。 Figure 14 also illustrates a conventional segmented bus in which a register T can be used to enable or disable connections between adjacent bus segments (labeled "Bus-Official Segments"). tie segment)") switch. This allows a single bus to be split into any number of smaller local busses.

第十五圖說明根據本發明的一種處理元件300。處理元件300可被使用作為本發明之計算記憶體排組100中的處理元件116。處理元件300類似於本文中所述的其他裝置，且為求清晰即省略了重複的說明。也可參照其他具體實施例的相關說明，其中相同的元件符號是表示相同的構件。 Figure 15 illustrates a processing element 300 according to the present invention. The processing element 300 may be used as the processing element 116 in the computing memory array 100 of the present invention. Processing element 300 is similar to other devices described herein, and repeated description is omitted for clarity. Reference may also be made to the relevant descriptions of other specific embodiments, in which the same reference numerals represent the same components.

處理元件300包括一運算碼多工器302，其係配置以作為列方向匯流排。多工器302被使用於雙向通訊。具有面積效率的多工器可利用開關樹來實施，這不需要增加複雜度。X和Y暫存器(R0和R1)被提供，且也在連接至多工器302的多工側之埠口上也是雙向的。暫存器的三態和感測放大器類型可被用於X和Y暫存器。在本發明的各種其他具體實施例中，雙向多工器302與本文所述其他特徵結合，例如暫存器排組、雙運算數或進位增強的處理元件、進位抑制等。 Processing element 300 includes an opcode multiplexer 302 configured as a column-wise bus. Multiplexer 302 is used for bidirectional communication. Area-efficient multiplexers can be implemented using switch trees without adding complexity. X and Y registers (R0 and R1) are provided and are also bidirectional on the port connected to the multiplex side of multiplexer 302. Register tri-state and sense amplifier types are available Used for X and Y registers. In various other embodiments of the present invention, bidirectional multiplexer 302 is combined with other features described herein, such as register banks, double-operand or carry-enhanced processing elements, carry suppression, and the like.

若空間非常珍貴，或是在要增加通訊頻寬時要補充空間，則使多工器302是雙向的即可允許列匯流排132被排除。 If space is at a premium, or if space needs to be supplemented when increasing communication bandwidth, making the multiplexer 302 bidirectional allows the column bus 132 to be eliminated.

第十六圖說明根據本發明的一種處理元件400，其具有專用的加總和進位運作，使列方向匯流排可同時用於通訊。處理元件400可被使用作為本發明之計算記憶體排組100中的一處理元件116。處理元件400類似於本文所述其他裝置，且為求清晰即省略了重複的描述。可參照其他具體實施例的相關說明，其中相同的元件符號代表相同的構件。 Figure 16 illustrates a processing element 400 according to the present invention, which has dedicated sum and carry operations so that the column-oriented bus can be used for communication simultaneously. The processing element 400 may be used as a processing element 116 in the computing memory array 100 of the present invention. Processing element 400 is similar to other devices described herein, and repeated description is omitted for clarity. Reference may be made to the related descriptions of other specific embodiments, in which the same reference numerals represent the same components.

Σ(sigma)方塊402可運作以計算其三個輸入X、Y和M的總和位元。進位方塊404可運作以同時計算進位位元。加總和進位兩者都會被回寫至X、Y、M(記憶體)和W(寫入啟動)暫存器的任何組合，其可實施作為記憶體排組。在同時，列匯流排132可被讀入X、Y、M或W中，或是可從X、Y、M或W驅動由三個X、Y、M所選擇的單一列匯流排線路。任何暫存器可實施作為暫存器匯流排。此外，算術方塊可被驅動，而且多工器可由來自這些暫存器檔案的不同暫存器定址。此外，多工器位址或算術輸入的閂鎖可被提供。列匯流排位元可獨立於算術運算而被定址。 Σ(sigma) block 402 operates to calculate the bit sum of its three inputs X, Y, and M. Carry block 404 may operate to simultaneously calculate carry bits. Both sums and carries are written back to any combination of X, Y, M (memory) and W (write enable) registers, which can be implemented as memory banks. At the same time, column bus 132 can be read into X, Y, M, or W, or a single column bus line selected by three X, Y, M can be driven from X, Y, M, or W. Any register can be implemented as a register bus. In addition, arithmetic blocks can be driven and multiplexers can be addressed by different registers from these register files. In addition, latching of multiplexer address or arithmetic inputs can be provided. Column bus bits can be addressed independently of arithmetic operations.

第十七圖說明具有具分段開關502的列匯流排500的處理元件400。在一些具體實施例中，開關502可由相關聯的處理元件400中的暫存器控制。在其他具體實施例中，開關502是直接由計算記憶體排組100的SIMD控制器108控制。 Figure 17 illustrates a processing element 400 having a column bus 500 with segmented switches 502. In some embodiments, switch 502 may be controlled by a register in an associated processing element 400 . In other embodiments, switch 502 is controlled directly by SIMD controller 108 of computing memory bank 100 .

第十八圖說明根據本發明之處理元件600，其於行方向中具有最接近的鄰近通訊。處理元件600可被使用作為本發明之計算記憶體排組100中的處理元件116。處理元件600類似於本文中所述的其他裝置，且為求清晰即省略了重複的說明。也可參照其他具體實施例的相關說明，其中相同的元件符號是表示相同的構件。 Figure 18 illustrates a processing element 600 with closest neighbor communication in the row direction in accordance with the present invention. The processing element 600 may be used as the computing memory array 100 of the present invention. processing element 116. Processing element 600 is similar to other devices described herein, and repeated description is omitted for clarity. Reference may also be made to the relevant descriptions of other specific embodiments, in which the same reference numerals represent the same components.

在行方向中的最接近的鄰近通訊可與列方向上最接近的鄰近通訊結合。在一些具體實施例中，X和Y是單一暫存器，且一2：1多工器選擇暫存器X和Y是否在列或行方向中傳遞資料。在其他具體實施例中，X和Y是暫存器排組，而暫存器排組X和Y內的不同暫存器可由列和行方向中的鄰近處理元件600加以設定。 Closest neighbor communication in the row direction may be combined with closest neighbor communication in the column direction. In some embodiments, X and Y are single registers, and a 2:1 multiplexer selects whether registers X and Y pass data in the column or row direction. In other embodiments, X and Y are register banks, and different registers within register banks X and Y can be configured by adjacent processing elements 600 in the column and row directions.

第十九圖說明了一種處理元件700，其具有連接至行匯流排704的第二多工器702。處理元件700可被使用作為本發明之計算記憶體排組100中的處理元件116。處理元件700類似於本文中所述的其他裝置，且為求清晰即省略了重複的說明。也可參照其他具體實施例的相關說明，其中相同的元件符號是表示相同的構件。 Figure 19 illustrates a processing element 700 having a second multiplexer 702 connected to a row bus 704. The processing element 700 may be used as the processing element 116 in the computing memory array 100 of the present invention. Processing element 700 is similar to other devices described herein, and repeated description is omitted for clarity. Reference may also be made to the relevant descriptions of other specific embodiments, in which the same reference numerals represent the same components.

第二十圖說明可運作以驅動列位址和運算碼、且可於一相關聯的記憶體單元陣列104中加載和儲存指令的SIMD控制器800。SIMD控制器800可被使用作為本發明之計算記憶體排組100中的SIMD控制器108。SIMD控制器800類似於本文中所述的其他裝置，且為求清晰即省略了重複的說明。也可參照其他具體實施例的相關說明，其中相同的元件符號是表示相同的構件。 Figure 20 illustrates a SIMD controller 800 operable to drive column addresses and operation codes, and to load and store instructions in an associated array of memory cells 104. The SIMD controller 800 may be used as the SIMD controller 108 in the computing memory array 100 of the present invention. SIMD controller 800 is similar to other devices described herein, and repeated description is omitted for clarity. Reference may also be made to the relevant descriptions of other specific embodiments, in which the same reference numerals represent the same components.

SIMD控制器800包括指令記憶體802、行選擇804、程式計數器806和解碼器808。解碼器808解碼指令，並且可進一步包括一解壓縮器以解壓縮指令及/或資料，其可以壓縮形式儲存以節省記憶體。 SIMD controller 800 includes instruction memory 802, row selector 804, program counter 806, and decoder 808. Decoder 808 decodes instructions and may further include a decompressor to decompress instructions and/or data, which may be stored in compressed form to save memory.

SIMD控制器800係配置以根據需要而從排組的記憶體單元陣列104獲取指令。所獲取的指令可以被儲存在指令記憶體802中。指令可指示處理元件和它們的相關聯匯流排所需要的控制線路、以及為選擇處理元件之記憶體資料所需之列位址。 The SIMD controller 800 is configured to obtain instructions from the array of memory cells 104 as needed. The retrieved instructions may be stored in instruction memory 802 . Instructions may instruct control circuits required for processing elements and their associated buses, as well as memory for selecting processing elements List of addresses required for the data.

在執行從記憶體單元陣列104以外的記憶體獲取指令期間，會需要實施「哈佛架構(Harvard architecture)」，其中可從記憶體單元陣列104得到的指令和(視情況)資料會被並行獲取。相反地，因為有些計算是屬資料重度的，而其他是指令重度的，因此從排組的記憶體單元陣列104加載指令會是有利的。 During the execution of fetching instructions from memory other than memory cell array 104, a "Harvard architecture" may be implemented, in which instructions and (optionally) data available from memory cell array 104 are fetched in parallel. Conversely, since some calculations are data-heavy and others are instruction-heavy, it may be advantageous to load instructions from arrays of arrays of memory cells 104 .

指令解碼器808可位於指令記憶體802和記憶體單元陣列104與處理元件116之間。 Instruction decoder 808 may be located between instruction memory 802 and memory cell array 104 and processing element 116 .

SIMD控制器800可透過程式計數器806來定址其指令記憶體802，對其以解碼器808讀取的內容進行解碼，並且使用此資訊來驅動記憶體單元陣列104和處理元件116。可使用管線來避免必須在執行之前等待指令讀取和解碼。指令集可包括「OP」指令，其驅動處理元件116的運算碼和加載暫存器；操縱程式計數器806的跳轉指令(例如JMP和JSR)；可允許間接和指標定址的位址暫存器；迴圈結構(例如固定長度的迴圈)和條件式跳轉。 SIMD controller 800 may address its instruction memory 802 through program counter 806, decode its contents read by decoder 808, and use this information to drive memory cell array 104 and processing element 116. Pipelining can be used to avoid having to wait for instructions to be fetched and decoded before executing. The instruction set may include "OP" instructions, which drive the opcodes and load registers of processing element 116; jump instructions (such as JMP and JSR) that manipulate program counter 806; address registers that may allow indirect and pointer addressing; Loop structures (such as fixed-length loops) and conditional jumps.

第二十一圖說明由一控制器匯流排900互連的複數個SIMD控制器800。每一個SIMD控制器800可運作以控制計算記憶體排組100，而且SIMD控制器800一起運作以允許指令記憶體的共享。 Figure 21 illustrates a plurality of SIMD controllers 800 interconnected by a controller bus 900. Each SIMD controller 800 operates to control the computing memory array 100, and the SIMD controllers 800 operate together to allow sharing of instruction memory.

第二十二圖說明了複數個SIMD控制器800，其各進一步可運作以解碼經壓縮的係數資料，並可一起運作以允許指令記憶體的共享、以及重新使用指令記憶體作為係數記憶體。 Figure 22 illustrates a plurality of SIMD controllers 800, each of which is further operable to decode compressed coefficient data, and which can operate together to allow sharing of instruction memory and reuse of instruction memory as coefficient memory.

神經網路通常需要儲存大量係數，例如就習知的識別演算法AlexNet而言，約是2.5億的等級。可預期要以壓縮形式來儲存係數(例如，以單一「0」位元來儲存零係數之共同特殊情況)。解壓縮可由一計算記憶體排組100藉由處理元件116和記憶體單元陣列104來執行，或利用對SIMD控制器 800提供的一獨立構件來執行，例如解壓縮引擎，以讀取和解壓縮一串可變長度的已壓縮數字。 Neural networks usually need to store a large number of coefficients. For example, for the commonly known recognition algorithm AlexNet, it is about 250 million levels. It is expected that coefficients will be stored in a compressed form (e.g., the common special case of storing zero coefficients with a single "0" bit). Decompression may be performed by a computing memory array 100 through the processing element 116 and the array of memory cells 104, or using a SIMD controller. The 800 provides a separate component to execute, such as a decompression engine, to read and decompress a variable-length sequence of compressed numbers.

係數壓縮不僅僅是可用於節省空間。舉例而言，若係數為零，則可簡單略過內積的相關聯乘-加步驟，節省時間和功率兩者。除了解壓縮數字以外、或取而代之，解壓縮還可配置以返送編碼。舉例而言，解壓縮可被配置以返送子常式的位址，其中子常式係有效率地處理給定係數(例如零，如上述說明，或一純位元移位)的特殊情況，連同作為對此子常式的參數的一暫存器值(例如移位的位元數)。 Coefficient compression can be used for more than just space savings. For example, if the coefficients are zero, the associated multiply-add step of the inner product can simply be skipped, saving both time and power. In addition to, or instead of, decompressing numbers, decompression can be configured to fallback encoding. For example, decompression can be configured to return the address of a subroutine that efficiently handles the special case of a given coefficient (such as zero, as explained above, or a pure bit shift), Along with a register value (such as the number of bits to shift) as an argument to this subroutine.

解壓縮可與一指令解碼器共享指令記憶體，或可被供以一獨立記憶體。在大向量場景中，其中多個SIMD控制器800運行相同係數的相同編碼，一個控制器可執行解壓縮，而另一個則作為主控制器。 Decompression may share instruction memory with an instruction decoder, or may be provided in a separate memory. In a large vector scenario, where multiple SIMD controllers 800 run the same encoding with the same coefficients, one controller can perform decompression while the other acts as the master controller.

第二十三圖說明根據本發明之各種計算記憶體排組100中用於一影像之畫素資料和用於一神經網路第一層的相關編碼和核心輸出資料的例示佈局。 Figure 23 illustrates an exemplary layout of pixel data for an image and associated encoding and core output data for the first layer of a neural network in various computing memory arrays 100 in accordance with the present invention.

第二十四圖詳細說明根據本發明之一計算記憶體排組100中彩色畫素資料和用於一神經網路迴旋層之資料的例示佈局。 Figure 24 details an exemplary layout of color pixel data in a computational memory array 100 and data for a neural network convolutional layer in accordance with the present invention.

第二十五圖說明在一計算記憶體排組100中用於在一神經網路中池化之資料的例示佈局。 Figure 25 illustrates an example layout of data in a computing memory array 100 for pooling in a neural network.

上述中的影像資料是由代表畫素座標的值組(tuples)表示。示例的影像大小為256x256個畫素。 The above image data is represented by tuples representing pixel coordinates. The example image size is 256x256 pixels.

當要處理的資料的向量大於單一計算記憶體排組100時，多個SIMD控制器會發出相同的運算碼和控制。這可通過複製所有相關SIMD控制器的記憶體中的指令、並且使用上述同步化來使其保持鎖定在一起而完成。一給定的SIMD控制器可被配置以作為主控制器，而其他的SIMD控制器則從屬於該給定的控制器。控制器匯流排可增進此一運作模式，且控制器匯流排可被分段，使得控制器的多個群組可以此方式獨立運作。在一群組中的控制器可被編程以切換主控制，使較大的程式可適合指令記憶體，因為它是共享的而不是複製的。 When the vectors of data to be processed are larger than 100 in a single compute memory array, multiple SIMD controllers issue the same opcodes and controls. This can be done by copying the instructions in the memory of all relevant SIMD controllers and using the synchronization described above to keep them locked together. A given SIMD controller can be configured to act as the master controller, while other SIMD controllers are slaves to it. given controller. Controller busses can facilitate this mode of operation, and controller busses can be segmented so that multiple groups of controllers can operate independently in this manner. Controllers in a group can be programmed to switch master controls so that larger programs can fit into the command memory since it is shared rather than replicated.

鑑於上述，應理解計算記憶體排組、SIMD控制器、處理元件和它們的互連匯流排允許以彈性的低精確度架構、具功率效率的通訊和指令與係數的本地儲存與解碼來進行大量內積和相關神經網路計算的處理。 In view of the above, it should be understood that computing memory arrays, SIMD controllers, processing elements, and their interconnection buses allow for large amounts of processing with flexible low-precision architectures, power-efficient communications, and local storage and decoding of instructions and coefficients. Processing of inner product and correlation neural network computations.

應當理解，上述各種示例的特徵和構想也可結合為其他示例，其亦屬於本發明之範疇。此外，圖式並未依比例繪製，基於說明目的，其可能具有誇大的尺寸和形狀。 It should be understood that the features and concepts of the above various examples can also be combined into other examples, which also fall within the scope of the present invention. Furthermore, the drawings are not to scale and may have exaggerated size and shape for illustrative purposes.

100:排組 100:Planning

104:記憶體單元陣列 104: Memory cell array

108:SIMD控制器 108:SIMD controller

112:線路 112:Line

116:處理元件 116: Processing components

120:位元線 120:Bit line

124:列匯流排 124: Column bus

128:列連接 128: Column connection

132:列匯流排 132: Column bus

140:單元 140:Unit

142:行 142: OK

144:胞元 144:cell

146:列 146: column

Claims

A computing memory device, comprising: a plurality of computing memory banks, each of the plurality of computing memory banks including a memory unit array and a plurality of processes connected to the memory unit array component; and a plurality of Single Instruction Multiple Data (SIMD) controllers, each SIMD controller of the plurality of SIMD controllers being included in at least one computing memory bank of the plurality of computing memory banks. Within; wherein each of the SIMD controllers will provide instructions to the at least one computing memory bank and control the execution of the instructions through the at least one computing memory bank; wherein among the plurality of processing elements each of the processing elements includes a static register and an arithmetic logic unit (ALU) to perform operations with the static register; and wherein each of the processing elements is configured to receive a static register from another processing element The communication state, the ALU is used to perform operations with the static register and the communication state.

The device of claim 1 further includes a bus connecting the plurality of processing elements in one of the plurality of computing memory banks.

The device of claim 2, wherein the bus is connected to a SIMD controller of the computing memory array, and wherein the bus is configured to carry operational code to the plurality of processing elements.

For example, in the device of Item 2 of the patent application scope, the busbar is segmented.

For example, the device of Item 1 of the patent application further includes a plurality of buses, each of which is operable to transmit information in one or two directions between the SIMD controller and the plurality of processing elements, wherein the At least one of the busbars is segmented, and one of the busbars At least the other one is unsegmented.

The device of claim 1 further includes a bus connecting a processing element of one of the plurality of computing memory banks to the plurality of computing memory banks. Another array of computing memory processing elements.

For example, in the device of Item 6 of the patent application scope, the busbar is segmented.

The device of claim 1 further includes a plurality of buses, each of which is operable to transmit information in one or two directions between any of the computing memory banks, wherein among the buses At least one of the busbars is segmented, and at least another of the busbars is unsegmented.

As in the device of claim 1, each of the SIMD controllers is included in a different computing memory bank among the plurality of computing memory banks.

As claimed in the device of claim 1, wherein one SIMD controller of the plurality of SIMD controllers is included in at least two of the computing memory banks of the plurality of computing memory banks.

For example, the device of Item 1 of the patent application further includes a bus connected to the plurality of SIMD controllers.

The device of claim 1 further includes an input/output circuit connected to the plurality of SIMD controllers.

For example, in the device of Item 1 of the patent application, the ALU includes multiple levels of multiplexers.

For example, the device of Item 1 of the patent application further includes a bus connecting the plurality of processing elements in a computing memory bank and the SIMD controller, and the bus is used to connect the data from the SIM controller. Operand selections are passed to the ALU for each of the processing elements.

For example, the device in item 1 of the patent application further includes a bus connected to a Compute the plurality of processing elements and the SIMD controller in the memory bank, and the bus is used to transfer a function to the ALU of each of the processing elements.

For example, the device of Item 1 of the patent application further includes a communication register subordinate to the static register, and the communication register is used to provide communication status to another processing element.

The device of claim 1 further includes at least one direct connection between each of the plurality of processing elements and at least one other processing element.

For example, in the device of claim 17, the at least one direct connection is used to provide the communication status.

The device of claim 17, wherein the at least one direct connection allows sharing of status information, the status information including shipping and signature information.

A SIMD controller having an instruction memory configured to load from memory in one or more banks of computational memory, each of said banks including the memory and configured to utilize the memory The memory performs a plurality of processing elements operating in parallel; wherein the SIMD controller provides instructions to the at least one computing memory bank and controls execution of the instructions through the at least one computing memory bank.

A plurality of SIMD controllers, each of said SIMD controllers including instruction memory configured to load from memory in one or more banks of computational memory, each of said banks of computational memory including the memory and a plurality of processing elements for performing parallel operations using the memory, wherein at least one of the SIMD controllers includes a decompressor operable to stream variable lengths from its instruction memory data and generates decompression information, the decompression information including one or both of the coefficients and a fixed-length representation of the instruction, the plurality of SIMD controller devices connected by a bus to share the decompression information; each of which The SIMD controller will provide instructions to the at least one computing memory bank and control execution of the instructions through the at least one computing memory bank.

A computing memory device, comprising: a plurality of computing memory banks, each of the plurality of computing memory banks including a memory unit array and a plurality of processes connected to the memory unit array component; and a plurality of Single Instruction Multiple Data (SIMD) controllers, each SIMD controller of the plurality of SIMD controllers being included in at least one computing memory bank of the plurality of computing memory banks. Within; wherein each of the SIMD controllers will provide instructions to the at least one computing memory bank and control the execution of the instructions through the at least one computing memory bank; wherein among the plurality of processing elements Each of the processing elements includes a static register and an arithmetic logic unit (ALU) to perform operations with the static register; and a communication register subordinate to the static register, the communication register The register is used to provide communication status to another processing element.

A computing memory device, comprising: a plurality of computing memory banks, each of the plurality of computing memory banks including a memory unit array and a plurality of processes connected to the memory unit array component; and a plurality of Single Instruction Multiple Data (SIMD) controllers, each SIMD controller of the plurality of SIMD controllers being included in at least one computing memory bank of the plurality of computing memory banks. wherein each of the SIMD controllers will provide instructions to the at least one computing memory bank and control the execution of the instructions by the at least one computing memory bank; and a bus that will all A processing element of a computing memory bank of the plurality of computing memory banks is connected to a processing element of another computing memory bank of the plurality of computing memory banks. pieces.

A computing memory device, comprising: a plurality of computing memory banks, each of the plurality of computing memory banks including a memory unit array and a plurality of processes connected to the memory unit array component; and a plurality of Single Instruction Multiple Data (SIMD) controllers, each SIMD controller of the plurality of SIMD controllers being included in at least one computing memory bank of the plurality of computing memory banks. Within, each of the SIMD controllers will provide instructions to the at least one computing memory bank and control the execution of the instructions through the at least one computing memory bank; and a plurality of buses, each The busses are operable to pass information in one or two directions between any of the computing memory banks, wherein at least one of the busses is segmented and at least one other of the busses is not segmented.

A computing memory device, comprising: a plurality of computing memory banks, each of the plurality of computing memory banks including a memory unit array and a plurality of processes connected to the memory unit array component; and a plurality of Single Instruction Multiple Data (SIMD) controllers, each SIMD controller of the plurality of SIMD controllers being included in at least one computing memory bank of the plurality of computing memory banks. Within; wherein each of the SIMD controllers will provide instructions to the at least one computing memory bank and control the execution of the instructions through the at least one computing memory bank; a bus connecting the Multiple SIMD controllers.