TWI714806B - DPU architecture
- Publication number: TWI714806B
- Application number: TW106131867A
- Authority: Taiwan (TW)
Classifications
- G06F13/1668: Details of memory controller
- G11C11/40622: Partial refresh of memory arrays
- G06F9/3885: Concurrent instruction execution, e.g. pipeline or look ahead, using a plurality of independent parallel functional units
- G06F12/00: Accessing, addressing or allocating within memory systems or architectures
- G06F15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7821: Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
- G06F15/80: Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G11C11/40: Digital stores using semiconductor devices using transistors
- G11C11/405: Dynamic cells with three charge-transfer gates, e.g. MOS transistors, per cell
- G11C11/4076: Timing circuits
- G11C11/4091: Sense or sense/refresh amplifiers, or associated sense circuitry, e.g. for coupled bit-line precharging, equalising or isolating
- G11C11/4096: Input/output [I/O] data management or control circuits, e.g. reading or writing circuits, I/O drivers or bit-line switches
- G11C7/1006: Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
- G11C7/1012: Data reordering during input/output, e.g. crossbars, layers of multiplexers, shifting or rotating
Description
This patent application claims priority to U.S. Provisional Patent Application No. 62/413,977, filed on October 27, 2016, the disclosure of which is incorporated herein by reference in its entirety.
The present invention relates to memory systems and, more particularly, to a dynamic random access memory (DRAM) processing unit (DPU).
Traditionally, graphics processing units (GPUs) and tensor processing units (TPUs) have been used for deep learning processing. Deep learning processing includes highly parallelized processing that may not be executed efficiently by GPUs or TPUs.
An example embodiment provides a dynamic random access memory (DRAM) processing unit (DPU) that may include: at least one computing cell array having a plurality of DRAM-based computing cells arranged in an array having at least one column, in which the at least one column may include at least three rows of DRAM-based computing cells configured to provide a logic function that operates on a first row and a second row of the at least three rows and configured to store a result of the logic function in a third row of the at least three rows; and a controller that may be coupled to the at least one computing cell array to configure the at least one computing cell array to perform a DPU operation.
An example embodiment provides a DPU that may include: at least one computing cell array that may include a plurality of DRAM-based computing cells arranged in an array having at least one column, in which the at least one column may include at least three rows of DRAM-based computing cells configured to provide a logic function that operates on a first row and a second row of the at least three rows and configured to store a result of the logic function in a third row of the at least three rows; at least one data cell array that may include at least one DRAM-based memory cell arranged in at least one column; and a controller coupled to the at least one computing cell array to configure the at least one computing cell array to perform a DPU operation, the controller also being coupled to the at least one data cell array to perform a memory operation. In one embodiment, the DRAM-based computing cells of the at least one column may each include a three-transistor, one-capacitor (3T1C) DRAM memory cell, and the DRAM-based computing cells of the at least one column may provide a NOR logic function. In another embodiment, the DRAM-based computing cells of the at least one column may each include a one-transistor, one-capacitor (1T1C) DRAM memory cell, and each of the DRAM-based computing cells may further include an arithmetic logic unit (ALU) coupled to a bit line of the DRAM-based computing cell, in which the ALU may provide the logic function.
An example embodiment provides a DPU that may include: at least one computing cell array that may include a plurality of DRAM-based computing cells arranged in an array having at least one column, in which the at least one column may include at least three rows of DRAM-based computing cells configured to provide a logic function that operates on a first row and a second row of the at least three rows and configured to store a result of the logic function in a third row of the at least three rows; at least one stochastic computing cell array that may include a plurality of DRAM-based stochastic computing cells arranged in an array having at least one column, in which the at least one column may include at least three rows of DRAM-based stochastic computing cells configured to provide a logic function that operates on a first row and a second row of the at least three rows and configured to store a result of the logic function in a third row of the at least three rows; and a controller coupled to the at least one computing cell array to configure the at least one computing cell array to perform a DPU operation, the controller also being coupled to the at least one stochastic computing cell array to perform stochastic logic operations.
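The row-wise logic function described in these embodiments can be pictured with a small software model. The sketch below is only an illustration under stated assumptions: it models a computing cell array as a list of rows of bits and applies NOR (the function named for the 3T1C embodiment) to two source rows, writing the result into a third row for every column at once. The names `nor`, `column_nor`, and `cells` are hypothetical and do not come from the patent, and no circuit-level behavior (word lines, sense amplifiers, timing) is modeled.

```python
def nor(a: int, b: int) -> int:
    """NOR of two single bits (0 or 1)."""
    return 1 - (a | b)


def column_nor(array, x_row, y_row, r_row):
    """For every column, store NOR(row x, row y) into row r, in place.

    Each column is processed independently, mirroring the idea that all
    columns of the array operate in parallel.
    """
    for col in range(len(array[0])):
        array[r_row][col] = nor(array[x_row][col], array[y_row][col])
    return array


# Three rows (X, Y, R) by four columns; the result row starts cleared.
cells = [
    [0, 0, 1, 1],  # first row (operand X)
    [0, 1, 0, 1],  # second row (operand Y)
    [0, 0, 0, 0],  # third row (result R)
]
column_nor(cells, x_row=0, y_row=1, r_row=2)
print(cells[2])  # [1, 0, 0, 0]
```

The four columns cover the full NOR truth table, so the result row reads 1 only where both operand bits are 0.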
100, 700: DPU
101a, 101b: bank
102a, 102b: sub-array
103: buffer
104: system bus
105, 105n: mat
105a, 105b: mat / computing-cell column
106: data cell array
107: computing cell array / computing cell
107a, 107b, 107c, 107d: computing cell
108: shift array / intra-mat shift array
109: dashed line
110: decoder / data cell array decoder
111: decoder / computing cell array decoder
112: shift array / inter-mat shift array
112e, 112f: data shift lines
113: inter-mat forwarding array
113g: first data forwarding line
113h: second data forwarding line
114: controller / sub-array controller
201: 3T1C DRAM computing-cell topography / 3T1C computing-cell topography
202: 1T1C DRAM computing-cell topography / 1T1C computing-cell topography
715: stochastic data region / stochastic data array
716: converter-to-stochastic array
900: system architecture
910: hardware layer
911: Peripheral Component Interconnect Express (PCIe) device
912: dual in-line memory module (DIMM)
920: library and driver layer / DPU library and driver layer
921: DPU library
922: DPU driver
923: DPU compiler
930: framework layer
940: application layer
addr: address
BL: bit line
C1, C2: capacitors
CNTL1, CNTL2, CNTL3, CNTL4: control signals
FCL: forwarding control line
FDL: forwarding data line
FSL: forwarding section line
ISLcL: inter-mat shift control line
SA: sense amplifier
SL: shift line
SLcL: control line / shift-left control line
SML: shift mask line
SRcL: control line / shift-right control line
T1: transistor / first transistor
T2: transistor / second transistor
T3: transistor / third transistor
T4, T112a, T112b, T112c, T112d, T113a, T113b, T113c, T113d, T113e, T113f: transistors
T5: fifth transistor
T6: transistor / sixth transistor
WENX, WENY, WENR: write enable lines
WL, WLX, WLY, WLR: word lines
In the following section, aspects of the subject matter disclosed herein will be described with reference to the example embodiments illustrated in the figures, in which:
FIG. 1 depicts a block diagram of an example embodiment of a dynamic random access memory (DRAM) based processing unit (DPU) according to the subject matter disclosed herein.
FIG. 2A depicts an example embodiment of a three-transistor, one-capacitor DRAM computing-cell topography that may be used for a computing cell in a computing cell array.
FIG. 2B depicts an alternative example embodiment of a one-transistor, one-capacitor DRAM computing-cell topography that may be used for a computing cell in a computing cell array.
FIG. 3 depicts an example embodiment of an intra-mat shift array according to the subject matter disclosed herein.
FIG. 4A depicts an embodiment of an inter-mat shift array according to the subject matter disclosed herein.
FIG. 4B conceptually depicts an inter-mat shift interconnection configuration between two identically positioned computing cells in adjacent computing-cell columns for a left inter-mat shift according to the subject matter disclosed herein.
FIG. 4C conceptually depicts an inter-mat shift interconnection configuration between two non-identically positioned computing cells in adjacent computing-cell columns for a left inter-mat shift according to the subject matter disclosed herein.
FIG. 5 depicts an embodiment of an inter-mat forwarding array according to the subject matter disclosed herein.
FIGS. 6A to 6G depict NOR-logic-based operations that may be provided by a DPU according to the subject matter disclosed herein.
FIG. 7 depicts a block diagram of an example embodiment of a DPU that includes a stochastic data region according to the subject matter disclosed herein.
FIGS. 8A and 8B respectively depict stochastic computing operations for an addition operation that may be converted to a multiplexing operation and for a multiplication operation that may be converted to an AND logic operation.
FIG. 9 depicts a system architecture that includes DPUs according to the subject matter disclosed herein.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be understood by those skilled in the art, however, that the disclosed aspects may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the subject matter disclosed herein.
Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment disclosed herein. Thus, the appearances of the phrases "in one embodiment," "in an embodiment," or "according to one embodiment" (or other phrases of similar import) in various places throughout this specification do not necessarily all refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In this regard, the word "exemplary" as used herein means "serving as an example, instance, or illustration." Any embodiment described herein as "exemplary" is not to be construed as necessarily preferred or advantageous over other embodiments. Also, depending on the context of the discussion herein, a singular term may include the corresponding plural forms and a plural term may include the corresponding singular form. It is further noted that the various figures (including component diagrams) shown and discussed herein are for illustrative purposes only and are not drawn to scale. Similarly, various waveforms and timing diagrams are shown for illustrative purposes only. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where appropriate, reference numerals have been repeated among the figures to indicate corresponding and/or analogous elements.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to limit the claimed subject matter. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "first," "second," and so on, as used herein, are used as labels for the nouns that they precede and do not imply any type of ordering (e.g., spatial, temporal, logical) unless explicitly stated. Furthermore, the same reference numerals may be used across two or more figures to refer to parts, components, blocks, circuits, units, or modules having the same or similar functionality. Such usage is, however, for simplicity of illustration and ease of discussion only; it does not imply that the construction or architectural details of such components or units are the same across all embodiments, or that such commonly referenced parts/modules are the only way to implement the teachings of the particular embodiments disclosed herein.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this subject matter belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and should not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The subject matter disclosed herein provides a dynamic random access memory (DRAM) based processing unit (DPU) that is programmable and reconfigurable for different operations, such as, but not limited to, addition, multiplication, shifting, max/min, and comparison. In one embodiment, a DPU is based on a three-transistor, one-capacitor (3T1C) DRAM process and structure. In another embodiment, a DPU is based on a one-transistor, one-capacitor (1T1C) DRAM process and structure with minor modifications. Accordingly, a DPU does not contain dedicated computing logic circuitry (such as an adder), but instead provides computations using memory cells with highly parallel operations. In one embodiment, a DPU may include a stochastic computing array in which addition may be converted to a multiplexing operation and multiplication may be converted to an AND logic operation.
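The stochastic conversion mentioned above can be illustrated in software. The sketch below assumes the usual unipolar stochastic-computing encoding, in which a value p in [0, 1] is represented by a bit stream whose bits are 1 with probability p. Under that assumption, a bitwise AND of two independent streams encodes the product, and a multiplexer driven by a select stream with probability 0.5 encodes the scaled sum (a + b) / 2. The encoding, the scaling by one half, and all function names are assumptions made for illustration; this section of the patent states only that addition maps to multiplexing and multiplication maps to AND.

```python
import random


def to_stream(p, n, rng):
    """Encode probability p in [0, 1] as n bits with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def from_stream(bits):
    """Decode a bit stream back to a value: the fraction of 1 bits."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """Multiplication as a bitwise AND of two independent streams."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def sc_scaled_add(a_bits, b_bits, select_bits):
    """Scaled addition as a multiplexer: take a when select is 1, else b.

    With P(select = 1) = 0.5 the output stream encodes (a + b) / 2.
    """
    return [a if s else b for a, b, s in zip(a_bits, b_bits, select_bits)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000
a = to_stream(0.6, n, rng)
b = to_stream(0.5, n, rng)
sel = to_stream(0.5, n, rng)

product = from_stream(sc_multiply(a, b))          # close to 0.6 * 0.5 = 0.30
half_sum = from_stream(sc_scaled_add(a, b, sel))  # close to (0.6 + 0.5) / 2 = 0.55
print(round(product, 2), round(half_sum, 2))
```

The results are approximate by construction: accuracy improves with stream length, which is the usual trade-off of stochastic computing against the very simple per-bit logic.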
The subject matter disclosed herein also provides a system architecture that includes an environment (ecosystem) having framework extensions, a library, a driver, a compiler, and an instruction set architecture (ISA) for programming and reconfiguring DPUs.
Additionally, the subject matter disclosed herein provides a system architecture that is suitable for data center and/or mobile applications and that provides a Processor-in-Memory (PIM) solution for machine-learning applications for both binary and fixed-point calculations, as an alternative to graphics processing unit (GPU) / Application Specific Integrated Circuit (ASIC) (TPU) / Field Programmable Gate Array (FPGA) machine-learning applications. In one embodiment, the subject matter disclosed herein provides a high-performance, energy-efficient, and low-cost system that provides accelerated deep learning for, for example, a Binary Weight Neural Network.
The subject matter disclosed herein relates to a DRAM-based processing unit (DPU) that may be formed using dynamic random access memory (DRAM) technology and that is reconfigurable and programmable. In one embodiment, a DPU may include a DRAM-based memory cell array and a DRAM-based computing cell array that may be configured to perform different operations, such as addition, multiplication, and sorting.
The internal architecture of a DPU may include a system bus that is connected to multiple banks of sub-arrays. In one embodiment, the system bus may be configured to provide H-tree-connected banks of sub-arrays. Each sub-array may include a local controller, and each individual sub-array may be activated separately or simultaneously. In one embodiment, the DRAM-based cells may be divided into two arrays: a data cell array and a computing cell array. In one embodiment, the computing cell array may be implemented by DRAM-based memory cells. In another embodiment, the computing cell array may be implemented by DRAM-based memory cells having logic circuitry. The DPU internal architecture may also include data-shifting and data-movement circuits. In some embodiments, there may be a third DRAM-based cell array that may be configured for stochastic data computations.
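FIG. 1 depicts a block diagram of an exemplary embodiment of a DPU 100 according to the subject matter disclosed herein. The DPU 100 may include one or more banks 101a to 101m, of which only banks 101a and 101b are depicted in FIG. 1. Each bank 101 may include one or more sub-arrays 102a to 102n, of which only sub-arrays 102a and 102b are depicted in FIG. 1. Each bank 101 may also include a buffer 103. The buffer 103 may be coupled to the individual sub-arrays 102 and to a system bus 104. The buffer 103 may read an entire row in a bank 102 and then write the row back to either the same bank or another bank. The buffer 103 may also broadcast a copy of the row data to a plurality of mats 105a to 105n in a sub-array 102. In one embodiment, the banks 101 and the system bus 104 may be configured to provide H-tree-connected banks.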
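Each sub-array 102 may include one or more mats (or lanes) 105, of which only mats 105a to 105n of sub-array 102a are depicted in FIG. 1. Each mat 105 is a region of the DPU 100 that may include a data cell array 106, a computing cell array 107, and an intra-mat shift array 108. An exemplary mat 105 is indicated in FIG. 1 as being enclosed by a dashed line 109. Each mat 105 may share with a neighboring mat a data cell array decoder 110, a computing cell array decoder 111, an inter-mat shift array 112, and an inter-mat forwarding array 113. In one embodiment, the data cell array decoder 110, the computing cell array decoder 111, and the inter-mat shift array 112 may be physically interleaved with a sub-array controller 114 between neighboring mats 105. In one embodiment, the decoders 110 and 111 may operate as conventional DRAM-type memory decoders.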
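In one embodiment, each mat 105 is communicatively coupled to a sub-array controller 114. Each sub-array controller 114 may be configured to be independent of the other sub-array controllers 114. A sub-array controller 114 may receive instructions as addresses (addr) from a DRAM address bus. In response to the addresses (i.e., address signals), the sub-array controller 114 may provide a decoded address as an output to either or both of the data cell array 106 and the computing cell array 107. That is, a sub-array controller 114 may output source/destination (src/dst) addresses that are decoded by the data cell array decoder 110 of an associated data cell array 106 and, in the case of the computing cell array 107, may output operation/calculation (op/calc) addresses that are decoded by the computing cell array decoder 111. A sub-array controller 114 may also receive, as addresses from the DRAM bus, instructions that cause two or more sub-array controllers 114 to operate in a coordinated manner. The sub-array controller 114 may also control the data-movement circuits, such as the intra-mat shift array 108, the inter-mat shift array 112, and the inter-mat forwarding array 113.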
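Each data cell array 106 may include one or more dynamic random access memory (DRAM) cells arranged in at least one column and at least one row. In one embodiment, a data cell array 106 may be configured as a conventional DRAM cell array. In one embodiment, a data cell array 106 may include 2K columns and 16 rows. In another embodiment, a data cell array 106 may include fewer or more than 2K columns and/or fewer or more than 16 rows.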
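Each computing cell array 107 may include one or more computing cells arranged in at least one column and at least one row. The number of columns in the computing cell array 107 is the same as the number of columns in the data cell array 106. In one embodiment, a computing cell array 107 may include 2K columns and 16 rows. In another embodiment, a computing cell array 107 may include fewer or more than 2K columns and/or fewer or more than 16 rows.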
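FIG. 2A depicts an exemplary embodiment of a three-transistor, one-capacitor (3T1C) DRAM computing-cell topography 201 that may be used for a computing cell in the computing cell array 107. As depicted in FIG. 2A, a 3T1C computing cell in row X includes a first transistor T1 that has a source terminal electrically coupled to a write bit line (Write BL), a drain terminal electrically coupled to both a first terminal of a capacitor C1 and a gate terminal of a second transistor T2, and a gate terminal electrically coupled to a write enable line (WENX). The second terminal of the capacitor C1 is electrically coupled to a ground line. The second transistor T2 includes a source terminal electrically coupled to the ground line and a drain terminal electrically coupled to a source terminal of a third transistor T3. The third transistor T3 includes a gate terminal electrically coupled to a word line (WLX) and a drain terminal electrically coupled to a read bit line (Read BL). The 3T1C computing-cell topography 201 includes a sense amplifier SA that has an input electrically coupled to the read bit line and an output electrically coupled to the write bit line.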
A computing cell in row Y and a computing cell in row R may each also include three transistors T1 to T3 and a capacitor C1 arranged in a 3T1C DRAM configuration similar to the arrangement of the computing cell in row X. The three exemplary computing cells and the sense amplifier SA depicted in FIG. 2A are configured to provide a NOR logic operation (i.e., an X NOR Y logic operation), with the result stored in row R. Although only one column of 3T1C DRAM computing cells is explicitly depicted in FIG. 2A, it should be understood that in another embodiment the 3T1C computing cells may be configured into multiple columns (i.e., 2K columns). It should also be understood that in another embodiment more than three rows may be provided. Moreover, although the 3T1C DRAM computing-cell configuration depicted in FIG. 2A provides a NOR logic operation, it should be understood that the NOR logic operation of the 3T1C DRAM computing-cell topography 201 may be used to provide a variety of functional operations, such as, but not limited to, exclusive NOR (XNOR), addition (ADD), select (SET), maximum (MAX), sign (SIGN), multiplex (MUX), conditional-sum addition logic (CSA), multiplication, popcount, and comparison. The shift arrays 108 and 112 also provide a shifting function.
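FIG. 2B depicts an alternative exemplary embodiment of a one-transistor, one-capacitor (1T1C) DRAM computing-cell topography 202 that may be used for a computing cell in the computing cell array 107 of FIG. 1. As depicted in FIG. 2B, a 1T1C DRAM computing cell includes a transistor T4 that has a source terminal electrically connected to a first terminal of a capacitor C2, a drain terminal electrically connected to a bit line (BL), and a gate terminal electrically connected to a word line (WL). The second terminal of the capacitor C2 is electrically coupled to a ground line. The bit line BL is electrically coupled to an input of a sense amplifier SA. An output of the sense amplifier SA is electrically coupled to a first input of a multiplexer (MUX), a drain terminal of a fifth transistor T5, and an input of an arithmetic logic unit (ALU). An output of the multiplexer is electrically coupled to an input of a latch (LATCH). A source terminal of the fifth transistor T5 is electrically coupled to the output of the latch. An output of the ALU is electrically coupled to a second input of the multiplexer. The fifth transistor T5, the multiplexer, the latch, and the ALU in FIG. 2B each respectively receive control signals CNTL1 to CNTL4 from the controller 114. In one embodiment, the ALU may be configured to provide a NOR function. Although the logic circuitry that is electrically coupled to the bit line BL in FIG. 2B provides a NOR logic operation, it should be understood that this logic circuitry (i.e., the ALU) may provide other functional operations, such as, but not limited to, exclusive NOR (XNOR), addition (ADD), select (SET), maximum, sign, multiplex (MUX), conditional-sum addition logic (CSA), multiplication, popcount, and comparison. The shift arrays 108 and 112 also provide a shifting function. It should be understood that only one 1T1C computing cell is depicted in FIG. 2B and that multiple columns and rows of 1T1C computing cells may be provided.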
As can be seen in FIGS. 2A and 2B, the computing cells of a DPU do not include dedicated, complex computing logic. Instead, they use a relatively simple topography that is reprogrammable and can therefore perform many different types of computations. Additionally, the topography of a DPU may be arranged to take advantage of the massive parallelism inherent in a memory structure to perform more computations faster and more efficiently.
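FIG. 3 depicts an exemplary embodiment of an intra-mat shift array 108 according to the subject matter disclosed herein. To simplify the description of the intra-mat shift array 108, consider a mat 105 that is four columns of computing cells 107 wide, such as that depicted in FIG. 3. The intra-mat shift array 108 includes a plurality of sixth transistors T6 arranged in an array (of which only one transistor T6 is indicated in FIG. 3), 2n shift lines SL (where n is the number of columns of computing cells in the mat 105), n+2 shift-left control lines SLcL, 2 shift-right control lines SRcL, and n shift mask lines SML. Some of the sixth transistors T6 of the intra-mat shift array 108 are electrically connected between the write bit lines and the 2n shift lines SL, and other sixth transistors T6 are connected between the read bit lines and the 2n shift lines SL. The gates of these sixth transistors T6 are electrically coupled to the n+2 shift-left control lines SLcL and the 2 shift-right control lines SRcL. Still other sixth transistors T6 of the intra-mat shift array 108 are electrically connected between the n shift mask lines SML and the 2n shift lines SL. The control lines of the intra-mat shift array 108 are electrically coupled to the sub-array controller 114 associated with the mat 105.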
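The intra-mat shift array 108 may shift data left and right within a mat 105 by appropriate signals on the control lines SLcL and SRcL. For a left shift, data may be filled with a sign bit and shifted by either 1 bit or (n-1) bits per operation, where n is the number of columns per mat 105. For a right shift, data may be filled with either 0 or 1, as controlled by the instruction, and shifted by 2^0, 2^1, ..., 2^(k-1), 2^k, up to the number of columns per mat, where 2^k is the number of columns.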
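FIG. 4A depicts an embodiment of an inter-mat shift array 112 according to the subject matter disclosed herein. To simplify the description of the inter-mat shift array 112, consider a configuration in which the mats 105 are two columns of computing cells 107 wide, such as that depicted in FIGS. 4A to 4C. That is, each mat 105 includes a first column of computing cells 107a and a second column of computing cells 107b. The inter-mat shift array 112 includes transistors T112a and T112b, transistors T112c and T112d, data shift lines 112e and 112f, and inter-mat shift control lines ISLcL. Within a mat, transistor T112a includes a source terminal electrically coupled to the read bit line of the first column of computing cells 107a and a drain terminal electrically coupled to data shift line 112e. Transistor T112b includes a source terminal electrically coupled to the read bit line of the second column of computing cells 107b and a drain terminal electrically coupled to data shift line 112f. The data shift lines 112e and 112f are electrically coupled to the buffer 103 (not shown in FIG. 4A). Between different mats, transistor T112c includes a source terminal and a drain terminal that are respectively electrically coupled to the data shift lines 112e of adjacent mats. Transistor T112d includes a source terminal and a drain terminal that are respectively electrically coupled to the data shift lines 112f of adjacent mats. The gates of transistors T112c and T112d are respectively electrically coupled to different inter-mat shift control lines ISLcL. The inter-mat shift array 112 may shift data left and right between different mats by appropriate signals on the inter-mat shift control lines ISLcL. The control lines of the inter-mat shift array 112 are electrically coupled to the sub-array controller 114 associated with the mat 105.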
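FIG. 4B conceptually depicts an inter-mat shift interconnection configuration between two identically positioned computing cells in adjacent computing-cell columns 105a and 105b for a left inter-mat shift according to the subject matter disclosed herein. The interconnection configuration of FIG. 4B is conceptually depicted by the operative interconnection nodes that are emphasized. For example, transistors T112c and T112d are activated so that each transistor provides a conductive path, thereby connecting the data shift lines 112e and 112f between computing-cell columns 105a (on the left) and 105b (on the right). The gate terminals of transistors T112c and T112d are electrically connected to an active inter-mat shift control line ISLcL. Transistors T112a and T112b in mat 105b are activated so that the read bit line of computing cell 107a in mat 105b is electrically connected to the write bit line of computing cell 107a in mat 105a, which is to the left of mat 105b, and so that the read bit line of computing cell 107b in mat 105b is electrically connected to the write bit line of computing cell 107b in mat 105a.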
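FIG. 4C conceptually depicts an inter-mat shift interconnection configuration between two non-identically positioned computing cells in adjacent computing-cell columns 105a and 105b for a left inter-mat shift according to the subject matter disclosed herein. The interconnection configuration of FIG. 4C is conceptually depicted by the operative interconnection nodes that are emphasized. For example, transistors T112c and T112d are activated so that each transistor provides a conductive path, thereby connecting the data shift lines 112e and 112f between computing-cell columns 105a (on the left) and 105b (on the right). The gate terminals of transistors T112c and T112d are electrically connected to an active inter-mat shift control line ISLcL. Transistors T112a and T112b in mat 105a are activated so that the read bit line of computing cell 107a in mat 105a is electrically connected to the write bit line of computing cell 107a in mat 105b, which is to the right of mat 105a, and so that the read bit line of computing cell 107b in mat 105a is electrically connected to the write bit line of computing cell 107b in mat 105b.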
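FIG. 5 depicts an embodiment of an inter-mat forwarding array 113 according to the subject matter disclosed herein. To simplify the description of the inter-mat forwarding array 113, consider a configuration in which the mats 105 are two columns of computing cells 107 wide, such as that depicted in FIG. 5. That is, each mat 105 includes a first column of computing cells 107a and a second column of computing cells 107b. For a mat 105, the inter-mat forwarding array 113 includes transistors T113a and T113b, transistors T113c and T113d, transistors T113e and T113f, 2n forwarding data lines FDL (where n is the number of computing-cell columns in a mat), forwarding control lines FCL, and 2m forwarding section lines FSL (where m is the number of sections). The source terminals of transistors T113a and T113b are respectively electrically connected to the write bit line and the read bit line of the first column of computing cells 107a. The drain terminals of transistors T113a and T113b are electrically coupled to a first forwarding data line FDL 113g. The source terminals of transistors T113c and T113d are respectively electrically connected to the write bit line and the read bit line of the second column of computing cells 107b. The drain terminals of transistors T113c and T113d are electrically coupled to a second forwarding data line FDL 113h. The source terminals of transistors T113e and T113f are respectively electrically coupled to the gate terminals of transistors T113a and T113b. The drain terminals of transistors T113e and T113f are both coupled to the same forwarding section line FSL. The gate terminals of transistors T113e and T113f are respectively coupled to different forwarding control lines FCL. The inter-mat forwarding array 113 may forward data between mats by appropriate signals on the forwarding control lines FCL. The control lines of the inter-mat forwarding array 113 are electrically coupled to the sub-array controllers 114 associated with the mats 105 between which data is being forwarded.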
FIGS. 6A to 6G depict NOR-logic-based operations that may be provided by a DPU according to the subject matter disclosed herein. In FIGS. 6A to 6G, a first operand may be stored in row X and a second operand may be stored in row Y or row W. The arrows in FIGS. 6A to 6G represent the input and output flows of the NOR logic operation for an entire row of computing cells. For example, row X in FIG. 6A may represent an entire row of operands stored in the computing cells of row X. The result of the NOR logic operation on the operands stored in row X and the operands stored in row Y is stored in result row R. In one embodiment, the operands in row X and row Y may include, for example, 100 columns (i.e., x_1, x_2, ..., x_100 and y_1, y_2, ..., y_100), and the result may be stored in row R (i.e., r_1, r_2, ..., r_100). That is, x_i NOR y_i = r_i, where i is a column index. In another embodiment, row X may represent only a selected group of the computing cells in a row.
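The row-wide relation x_i NOR y_i = r_i can be modeled in software as follows. This is only an illustrative sketch, not the patent's implementation: each row is represented as a Python integer whose bit i stands for column i, and the names `nor_rows` and `WIDTH` are hypothetical. Obtaining an inverse as X NOR X is likewise an assumption; the text does not say how the inverse rows are produced.

```python
def nor_rows(x: int, y: int, width: int) -> int:
    """Column-wise NOR of two rows; bit i of each int models column i."""
    mask = (1 << width) - 1              # keep only `width` columns
    return ~(x | y) & mask


WIDTH = 8                                # illustrative; the text mentions 100 or 2K columns
row_x = 0b11001010
row_y = 0b10100110
row_r = nor_rows(row_x, row_y, WIDTH)    # r_i = x_i NOR y_i for every column i
not_x = nor_rows(row_x, row_x, WIDTH)    # assumed: an inverse obtained as X NOR X
print(f"{row_r:0{WIDTH}b} {not_x:0{WIDTH}b}")
```

Every column is processed in the same step, which is the parallelism the row-wide arrows in FIGS. 6A to 6G are meant to convey.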
FIG. 6B depicts an exemplary full-adder operation on N-bit numbers that is based on a prefix Kogge-Stone adder. In FIG. 6B, a first N-bit operand is stored in row X and a second N-bit operand is stored in row Y. For the exemplary addition operation depicted in FIG. 6B, intermediate terms G_0, P_0, G_1, P_1, G_2, P_2, ..., G_{logN+1}, and P_{logN+1} are calculated. The uppermost block of FIG. 6B represents five separate operations that determine G_0 and P_0 using the input operands from row X and row Y. In the first operation, the uppermost block determines the inverse of row X (i.e., ~X), which is stored in row 1. The second operation determines the inverse of row Y (i.e., ~Y), which is stored in row 2. The third operation determines row X NOR row Y, which is stored in row 3. The fourth operation determines G_0 = row 1 NOR row 2, which is stored in row 4. The fifth operation determines P_0 = row 3 NOR row 4, which is stored in row 5.
In the middle block of FIG. 6B, the intermediate results G_0 and P_0 from the uppermost block are used to determine the intermediate results G_{i+1} and P_{i+1}, where i is a column index. That is, the intermediate results G_0 and P_0 determined in the uppermost block of FIG. 6B are used to determine the intermediate results G_1 and P_1. The intermediate results G_1 and P_1 are used to determine the intermediate results G_2 and P_2, and so on, to determine the intermediate results G_{logN+1} and P_{logN+1}. In the bottom block of FIG. 6B, result rows R1 and R2 respectively store the carry result and the sum result of the full-adder operation.
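The FIG. 6B flow can be sketched in software as below. The first five steps follow the NOR sequence given in the text (row 4 works out to X AND Y, and row 5 to X XOR Y). The prefix recurrences in the loop are the textbook Kogge-Stone ones and are an assumption here, since the text does not spell them out; they are written with ordinary AND/OR operators rather than decomposed into NOR steps. The function names are hypothetical, and NOT is assumed to be computed as X NOR X.

```python
def nor(a: int, b: int, mask: int) -> int:
    """Column-wise NOR restricted to the columns selected by `mask`."""
    return ~(a | b) & mask


def kogge_stone_add(x: int, y: int, n: int):
    """Add two n-bit unsigned values; returns (sum, carry_out)."""
    mask = (1 << n) - 1
    row1 = nor(x, x, mask)               # ~X (assumed: NOT as X NOR X)
    row2 = nor(y, y, mask)               # ~Y
    row3 = nor(x, y, mask)               # X NOR Y
    g = nor(row1, row2, mask)            # row 4: G_0 = X AND Y (generate)
    p0 = nor(row3, g, mask)              # row 5: P_0 = X XOR Y (propagate)
    p = p0
    shift = 1
    while shift < n:                     # roughly log2(n) prefix stages
        g = g | (p & ((g << shift) & mask))
        p = p & ((p << shift) & mask)
        shift <<= 1
    carry_in = (g << 1) & mask           # carry entering each column
    total = p0 ^ carry_in                # sum bits
    carry_out = (g >> (n - 1)) & 1       # carry leaving the top column
    return total, carry_out


print(kogge_stone_add(7, 1, 4))
```

Each loop iteration corresponds to one G_i, P_i to G_{i+1}, P_{i+1} step, and every column is updated at once, which is why the scheme maps naturally onto row-wide operations.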
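FIG. 6C depicts an exemplary selector operation that may be provided by the 3T1C DRAM computing-cell topography 201. Row 1 stores the intermediate result of the inverse of row X (i.e., ~X). Row 2 stores the intermediate result of the inverse of row Y (i.e., ~Y). Row 3 stores the intermediate result of the inverse of row S (i.e., ~S). Row 4 stores the intermediate result of row 1 NOR row 3. Row 5 stores the intermediate result of row 2 NOR row S. Row 6 stores the intermediate result of row 4 NOR row 5. Row R stores the result of the inverse of row 6 (i.e., S ? X : Y).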
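The seven-step selector sequence of FIG. 6C can be checked with a short model. This is a sketch under the same assumptions as before: rows are Python integers, the function names are hypothetical, and each inverse is assumed to be obtained as a NOR of a row with itself.

```python
def nor(a: int, b: int, mask: int) -> int:
    """Column-wise NOR restricted to the columns selected by `mask`."""
    return ~(a | b) & mask


def select(s: int, x: int, y: int, width: int) -> int:
    """Per-column S ? X : Y built only from NOR steps, following FIG. 6C."""
    mask = (1 << width) - 1
    row1 = nor(x, x, mask)               # ~X
    row2 = nor(y, y, mask)               # ~Y
    row3 = nor(s, s, mask)               # ~S
    row4 = nor(row1, row3, mask)         # X AND S
    row5 = nor(row2, s, mask)            # Y AND ~S
    row6 = nor(row4, row5, mask)         # ~(S ? X : Y)
    return nor(row6, row6, mask)         # row R: S ? X : Y


print(f"{select(0b1100, 0b1010, 0b0101, 4):04b}")
```

Columns where S is 1 take their bit from X and the remaining columns take it from Y.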
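FIG. 6D depicts an alternative exemplary selector operation that may be provided by the 3T1C DRAM computing-cell topography 201. Row 1 stores the intermediate result of the inverse of row X (i.e., ~X). Row 2 stores the intermediate result of the inverse of row S (i.e., ~S). Row 3 stores the intermediate result of row 1 NOR row S. Row 4 stores the intermediate result of the inverse of row X (i.e., ~X). Row R stores the result of row 3 NOR row 4 (i.e., S ? X : ~X).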
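FIG. 6E depicts an exemplary maximum/minimum operation that may be provided by the 3T1C DRAM computing-cell topography 201. Row 1 stores the intermediate result of the inverse of row Y (i.e., ~Y). Row 2 stores the intermediate result of row X + (~Y + 1). Row 3 stores the intermediate result of Cout >> n. Row 4 stores the intermediate result of Cout ? X : Y. Row R stores the result of MAX(X:Y).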
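The reasoning behind FIG. 6E can be illustrated for a single pair of operands. The sketch below assumes unsigned n-bit values and reads "Cout >> n" as extracting the carry-out of the (n+1)-bit sum; both are interpretations rather than details stated in the text, and `row_max` is a hypothetical name. Because X + (~Y + 1) is the two's-complement form of X - Y, the carry-out is 1 exactly when X >= Y.

```python
def row_max(x: int, y: int, n: int) -> int:
    """Maximum of two unsigned n-bit values via the carry-out of X + (~Y + 1)."""
    mask = (1 << n) - 1
    row1 = ~y & mask                     # row 1: ~Y
    row2 = x + row1 + 1                  # row 2: X + (~Y + 1), up to n + 1 bits
    cout = row2 >> n                     # row 3: carry-out, 1 when X >= Y
    return x if cout else y              # row 4 / row R: Cout ? X : Y


print(row_max(9, 5, 4), row_max(3, 12, 4))
```

In a DPU the final step would be a row-wide selector such as the one in FIG. 6C; the Python conditional stands in for it here. Swapping the two branches of the selection yields the minimum.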
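FIG. 6F depicts an exemplary 1-bit multiplication operation that may be provided by the 3T1C DRAM computing-cell topography 201. Row 1 stores the intermediate result of row X NOR row W. Row 2 stores the intermediate result of row X NOR row 1. Row 3 stores the intermediate result of row W NOR row 1. Result row R stores the result of row 2 NOR row 3, that is, the result of row X XNOR row W.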
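A brief model of the four NOR steps in FIG. 6F is given below. Why XNOR acts as a 1-bit multiplication is not explained in the text; the sketch assumes the common binary-network encoding in which bit 1 stands for +1 and bit 0 for -1, and the helper names `xnor_via_nor` and `decode` are hypothetical.

```python
def nor(a: int, b: int, mask: int) -> int:
    """Column-wise NOR restricted to the columns selected by `mask`."""
    return ~(a | b) & mask


def xnor_via_nor(x: int, w: int, width: int) -> int:
    """Per-column X XNOR W built from four NOR steps, following FIG. 6F."""
    mask = (1 << width) - 1
    row1 = nor(x, w, mask)               # X NOR W
    row2 = nor(x, row1, mask)            # ~X AND W
    row3 = nor(w, row1, mask)            # X AND ~W
    return nor(row2, row3, mask)         # row R: X XNOR W


def decode(bit: int) -> int:
    """Assumed encoding: bit 1 means +1, bit 0 means -1."""
    return 2 * bit - 1


print(f"{xnor_via_nor(0b1100, 0b1010, 4):04b}")
```

Under that encoding, the XNOR of two bits decodes to the product of their decoded values, which is why one row-wide XNOR can multiply a whole row of binary activations by a row of binary weights.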
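FIG. 6G depicts an exemplary multi-bit multiplication operation that may be provided by the 3T1C DRAM computing-cell topography 201. In the upper block of FIG. 6G, row 1 stores the intermediate result of the inverse of row W (i.e., ~W). Row 2 stores the intermediate result of the inverse of row X shifted left 2i times (i.e., ~X<<2i), where i is an index. Row 3 stores the intermediate result of row 1 NOR row 2, that is, PP_i = ~W NOR ~X<<2i. In the lower block of FIG. 6G, row 1 stores the intermediate result of summing rows PP_0 through PP_i (i.e., ΣPP_i). Row 2 stores the intermediate result of row 1 XNOR row W_sign. Row R stores the result of X*W.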
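FIG. 7 depicts a block diagram of an exemplary embodiment of a DPU 700 that includes a stochastic data region 715 according to the subject matter disclosed herein. The various components of DPU 700 that have the same reference indicators as the components of DPU 100 depicted in FIG. 1 are similar, and a description of such similar components is omitted here. A sub-array 102 of the DPU 700 includes a stochastic data array 715 and a converter-to-stochastic array 716, along with a (real) data cell array 106, a computing cell array 107, and an intra-mat shift array 108.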
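Each stochastic data array 715 may include one or more stochastic computing cells arranged in at least one column and at least one row. The number of columns in the stochastic data array 715 is the same as the number of columns in the data cell array 106 and the computing cell array 107. In one embodiment, a stochastic data array 715 may include 2K columns and 16 rows. In another embodiment, a stochastic data array 715 may include fewer or more than 2K columns and/or fewer or more than 16 rows. In the stochastic data array 715, the probability of the existence of a "1" is used, and 2^n bits are used to represent an n-bit value. A random number generator in the converter-to-stochastic array 716 may be used to convert a real number to a stochastic number. A popcount operation may be used to convert a stochastic number back to a real number.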
With a stochastic computing approach, addition may be converted to a multiplexing operation and multiplication may be converted to an AND logic operation. For example, FIG. 8A depicts a circuit that provides a stochastic addition operation as a multiplexing operation, and FIG. 8B depicts a circuit that provides a stochastic multiplication operation as an AND logic operation. Conventional techniques for stochastic computing require an enormous memory capacity; however, the subject matter disclosed herein may be used to provide highly efficient stochastic computing because DRAM-based DPUs are able to perform large parallel AND and MUX operations. Stochastic computing using the DPUs disclosed herein also makes it possible to accelerate complex operations, of which deep learning is a typical application.
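The idea that multiplication becomes an AND and addition becomes a MUX can be illustrated with a small software model. This is a sketch of generic unipolar stochastic computing, not the circuits of FIGS. 8A and 8B: values in [0, 1] are encoded as bitstreams whose fraction of ones equals the value, a MUX with a 50% select stream yields the scaled sum (a + b) / 2, and decoding is a popcount divided by the stream length. All function names, the stream length, and the seed are assumptions chosen for illustration.

```python
import random


def to_stochastic(p: float, length: int, rng: random.Random) -> list:
    """Encode a value p in [0, 1] as a bitstream with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def from_stochastic(bits: list) -> float:
    """Decode by popcount: the fraction of ones estimates the value."""
    return sum(bits) / len(bits)


def sc_multiply(a: list, b: list) -> list:
    """Bitwise AND of two independent streams estimates a * b."""
    return [x & y for x, y in zip(a, b)]


def sc_scaled_add(a: list, b: list, select: list) -> list:
    """A MUX driven by a p = 0.5 select stream estimates (a + b) / 2."""
    return [x if s else y for x, y, s in zip(a, b, select)]


rng = random.Random(0)                   # fixed seed so the run is repeatable
LENGTH = 1 << 14                         # longer streams give finer precision
a = to_stochastic(0.5, LENGTH, rng)
b = to_stochastic(0.25, LENGTH, rng)
sel = to_stochastic(0.5, LENGTH, rng)
product = from_stochastic(sc_multiply(a, b))           # close to 0.125
scaled_sum = from_stochastic(sc_scaled_add(a, b, sel)) # close to 0.375
print(round(product, 3), round(scaled_sum, 3))
```

The results are estimates whose error shrinks as the streams get longer, which is the memory cost the paragraph above refers to; the per-bit AND and MUX steps are independent across positions and therefore suit row-wide parallel execution.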
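FIG. 9 depicts a system architecture 900 that includes DPUs according to the subject matter disclosed herein. The system architecture 900 may include a hardware layer 910, a library and driver layer 920, a framework layer 930, and an application layer 940.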
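The hardware layer 910 may include hardware devices and/or components that have embedded DPUs, such as the DPUs described herein. One embodiment of a device and/or component may be a Peripheral Component Interconnect Express (PCIe) device 911 that may include one or more embedded DPUs. Another embodiment of a device and/or component may be a Dual In-line Memory Module (DIMM) 912 that may include one or more embedded DPUs. It should be understood that the hardware layer 910 of the system architecture 900 is not limited to PCIe devices and/or DIMMs, but may include System on a Chip (SOC) devices or other memory-type devices that may contain DPUs. The DPUs that may be embedded in the devices and/or components at the hardware layer 910 may be configured to be similar to DPU 100 in FIG. 1 and/or DPU 700 in FIG. 7. In any embodiment, a particular computing cell array of a DPU may be configured to include the 3T1C computing-cell topography 201 (FIG. 2A) or the 1T1C computing-cell topography 202 (FIG. 2B).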
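The library and driver layer 920 of the system architecture 900 may include a DPU library 921, a DPU driver 922, and a DPU compiler 923. The DPU library 921 may be configured to provide optimal mapping functionality, resource allocation functionality, and scheduling functionality for each sub-array of a DPU in the hardware layer 910 for the different applications that may operate at the application layer 940.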
In one embodiment, the DPU library 921 may provide the framework layer 930 with a high-level application programming interface (API) that may include operations such as move, add, and multiply. The DPU library 921 may also include implementations of standard-type routines, such as, but not limited to, forward and backward convolution layers, pooling layers, normalization layers, and activation layers that may be applicable to an accelerated deep-learning process. In one embodiment, the DPU library 921 may include an API-like function that maps a calculation for a whole convolution layer of a convolutional neural network (CNN). Additionally, the DPU library 921 may include API-like functions for optimizing the mapping of a convolution-layer calculation onto a DPU.
The DPU library 921 may also include API-like functions for optimizing resource allocation by mapping any individual or multiple parallelisms within a task (batch, output channel, pixel, input channel, convolution kernel) into the corresponding DPU parallelism at the chip, bank, sub-array, and/or mat level. Additionally, the DPU library 921 may include API-like functions that provide an optimal DPU configuration at initialization and/or runtime that trades off performance (i.e., data-movement flow) against power consumption. Other API-like functions provided by the DPU library 921 may include design-knob-type functions, such as setting the number of active sub-arrays per bank, the number of input feature maps per active sub-array, the partitioning of a feature map, and/or the reuse scheme of the convolution kernel. Still other API-like functions may provide additional resource-allocation optimization by allocating a specific task to each sub-array, such as convolution computing, channel sum up, and/or data dispatching. If operands are to be converted between integer and stochastic numbers, the DPU library 921 includes API-like functions that minimize the overhead while meeting precision constraints. In the event that the precision is lower than expected, the DPU library 921 may include API-like functions that either compute the value again using additional bits for the stochastic representation or offload the task to other hardware, such as a central processing unit (CPU).
The DPU library 921 may also include API-like functions that simultaneously schedule multiple activated sub-arrays in a DPU and that schedule data movement so that the data movement is hidden by computing operations.
Another aspect of the DPU library 921 includes an extension interface for further DPU development. In one embodiment, the DPU library 921 may provide an interface for directly programming functionality using NOR and shift logic, so that operations other than the standard-type operations (i.e., addition, multiplication, maximum/minimum, etc.) may be provided. The extension interface may also provide an interface through which an operation that is not specifically supported by the DPU library 921 may be offloaded at the library and driver layer 920 to an SoC controller (not shown), a central processing unit/graphics processing unit (CPU/GPU) component, and/or a central processing unit/tensor processing unit (CPU/TPU) component. Yet another aspect of the DPU library 921 provides an API-like function that uses the memory of a DPU as a memory extension when the DPU memory is not being used for computing.
The DPU driver 922 may be configured to provide an interface connection between a DPU at the hardware layer 910, the DPU library 921, and an operating system (OS) at a higher layer, in order to integrate the DPU hardware layer into a system. That is, the DPU driver 922 exposes a DPU to the system OS and to the DPU library 921. In one embodiment, the DPU driver 922 may provide DPU control at initialization. In one embodiment, the DPU driver 922 may send instructions to a DPU in the form of DRAM-type addresses or sequences of DRAM-type addresses, and may control data movement into and out of a DPU. The DPU driver 922 may provide multi-DPU communication and may handle DPU-CPU and/or DPU-GPU communications.
The DPU compiler 923 may compile the DPU code from the DPU library 921 into DPU instructions in the form of memory addresses that are used by the DPU driver 922 to control a DPU. The DPU instructions generated by the DPU compiler 923 may be single instructions that operate on one and/or two rows in a DPU, vector instructions, and/or gathered-vector, read-on-operation instructions.
The framework layer 930 may be configured to provide a user-friendly interface to the library and driver layer 920 and the hardware layer 910. In one embodiment, the framework layer 930 may provide a user-friendly interface that is compatible with a wide range of applications at the application layer 940 and that makes the DPU hardware layer 910 transparent to a user. In another embodiment, the framework layer 930 may include framework extensions that add quantitation functions to existing, conventional methods, such as, but not limited to, Torch7-type applications and TensorFlow-type applications. In one embodiment, the framework layer 930 may include adding quantitation functions to a training algorithm. In another embodiment, the framework layer 930 may override the existing batch-normalization methods of division, multiplication, and square root with shift-approximated methods of division, multiplication, and square root. In yet another embodiment, the framework layer 930 may provide an extension that allows a user to set the number of bits used for a calculation. In still another embodiment, the framework layer 930 provides the capability to wrap a multi-DPU API from the DPU library and driver layer 920 into the framework layer 930, so that a user may use multiple DPUs at the hardware layer in a manner similar to the use of multiple graphics processing units. Yet another feature of the framework layer 930 allows a user to assign functions to either a DPU or a graphics processing unit at the hardware layer 910.
The application layer 940 may include a wide range of applications, such as, but not limited to, image tag processing, self-driving/piloting vehicles, AlphaGo-type deep-mind applications, and/or speech research.
As will be recognized by those skilled in the art, the innovative concepts described herein can be modified and varied over a wide range of applications. Accordingly, the scope of the claimed subject matter should not be limited to any of the specific exemplary teachings discussed above, but is instead defined by the following claims.
100‧‧‧DPU
101a, 101b‧‧‧bank
102a, 102b‧‧‧sub-array
103‧‧‧buffer
104‧‧‧system bus
105a, 105b‧‧‧mat / computing-cell column
105n‧‧‧mat
106‧‧‧data cell array
107‧‧‧computing cell array / computing cell
108‧‧‧shift array / intra-mat shift array
109‧‧‧dashed line
110‧‧‧decoder / data cell array decoder
111‧‧‧decoder / computing cell array decoder
112‧‧‧shift array / inter-mat shift array
113‧‧‧inter-mat forwarding array
114‧‧‧controller / sub-array controller
addr‧‧‧address
Claims (20)
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662413977P | 2016-10-27 | 2016-10-27 | |
US62/413,977 | 2016-10-27 | ||
US201662418155P | 2016-11-04 | 2016-11-04 | |
US62/418,155 | 2016-11-04 | ||
US15/426,033 US10242728B2 (en) | 2016-10-27 | 2017-02-06 | DPU architecture |
US15/426,033 | 2017-02-06 |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201816592A TW201816592A (en) | 2018-05-01 |
TWI714806B true TWI714806B (en) | 2021-01-01 |
Family
ID=62022501
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW106131867A TWI714806B (en) | 2016-10-27 | 2017-09-18 | Dpu architecture |
Country Status (5)
Country | Link |
---|---|
US (1) | US10242728B2 (en) |
JP (1) | JP6799520B2 (en) |
KR (1) | KR102139213B1 (en) |
CN (1) | CN108008974B (en) |
TW (1) | TWI714806B (en) |
JPWO2021053453A1 (en) * | 2019-09-20 | 2021-03-25 | | |
US11226816B2 (en) * | 2020-02-12 | 2022-01-18 | Samsung Electronics Co., Ltd. | Systems and methods for data placement for in-memory-compute |
US20220058471A1 (en) * | 2020-08-19 | 2022-02-24 | Micron Technology, Inc. | Neuron using posits |
CN116136835B (en) * | 2023-04-19 | 2023-07-18 | 中国人民解放军国防科技大学 | Three-in two-out numerical value acquisition method, device and medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7602214B2 (en) * | 2002-09-06 | 2009-10-13 | Pact Xpp Technologies Ag | Reconfigurable sequencer structure |
US20110013442A1 (en) * | 2009-07-16 | 2011-01-20 | Avidan Akerib | Using storage cells to perform computation |
US20110026323A1 (en) * | 2009-07-30 | 2011-02-03 | International Business Machines Corporation | Gated Diode Memory Cells |
US20120063202A1 (en) * | 2010-09-15 | 2012-03-15 | Texas Instruments Incorporated | 3t dram cell with added capacitance on storage node |
US20160232951A1 (en) * | 2015-02-05 | 2016-08-11 | The Board Of Trustees Of The University Of Illinois | Compute memory |
US9455022B2 (en) * | 2013-07-25 | 2016-09-27 | Renesas Electronics Corporation | Semiconductor integrated circuit device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5838165A (en) * | 1996-08-21 | 1998-11-17 | Chatter; Mukesh | High performance self modifying on-the-fly alterable logic FPGA, architecture and method |
US9455020B2 (en) * | 2014-06-05 | 2016-09-27 | Micron Technology, Inc. | Apparatuses and methods for performing an exclusive or operation using sensing circuitry |
DE102015214138A1 (en) | 2014-07-28 | 2016-01-28 | Victor Equipment Co. | Automated gas cutting system with auxiliary burner |
US9954533B2 (en) * | 2014-12-16 | 2018-04-24 | Samsung Electronics Co., Ltd. | DRAM-based reconfigurable logic |
2017
- 2017-02-06 US US15/426,033 patent/US10242728B2/en active Active
- 2017-05-12 KR KR1020170059482A patent/KR102139213B1/en active IP Right Grant
- 2017-09-13 CN CN201710823568.7A patent/CN108008974B/en active Active
- 2017-09-18 TW TW106131867A patent/TWI714806B/en active
- 2017-10-17 JP JP2017201264A patent/JP6799520B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
KR102139213B1 (en) | 2020-07-29 |
US20180122456A1 (en) | 2018-05-03 |
JP6799520B2 (en) | 2020-12-16 |
KR20180046345A (en) | 2018-05-08 |
CN108008974B (en) | 2023-05-26 |
CN108008974A (en) | 2018-05-08 |
TW201816592A (en) | 2018-05-01 |
US10242728B2 (en) | 2019-03-26 |
JP2018073402A (en) | 2018-05-10 |
Similar Documents
Publication | Title |
---|---|
TWI714806B (en) | Dpu architecture |
TWI713047B (en) | Circuits and micro-architecture for a dram-based processing unit |
TWI718336B (en) | System for dpu operations |
TWI640003B (en) | Apparatuses and methods for logic/memory devices |
CN107408404B (en) | Apparatus and methods for memory devices as storage of program instructions |
TWI622991B (en) | Apparatuses and methods for cache operations |
Talati et al. | mmpu—a real processing-in-memory architecture to combat the von neumann bottleneck |
JP6791522B2 (en) | Equipment and methods for in-data path calculation operation |
TWI671744B (en) | Apparatuses and methods for in-memory data switching networks |
JP6218294B2 (en) | Composite marching memory, computer system and marching memory cell array |
US20220366968A1 (en) | Sram-based in-memory computing macro using analog computation scheme |
US11355181B2 (en) | High bandwidth memory and system having the same |
CN117636956A (en) | In-memory computing (IMC) circuit and device, and neural network device |
Kim et al. | DRAM-Based Processing-in-Memory |