TWI601062B - Apparatus employing user-specified binary point fixed point arithmetic

Info

Publication number: TWI601062B
Application number: TW105132062A
Authority: TW
Inventors: Ｇ葛蘭亨利; 泰瑞派克斯
Original assignee: 上海兆芯集成電路有限公司
Priority date: 2015-10-08
Filing date: 2016-10-04
Publication date: 2017-10-01
Also published as: TWI579694B; TW201714080A; TWI616825B; CN106598545A; TW201714079A; TW201714120A; CN106599991A; CN106599992A; TWI626587B; TW201714091A; CN106599992B; CN106599989A; CN106598545B; TW201714119A; CN106599989B; CN106650923B; TWI608429B; CN106599990B; CN106599990A; TW201714078A

Description

Apparatus employing user-specified binary point fixed point arithmetic

This application claims international priority to the following U.S. provisional applications, each of which is incorporated herein by reference in its entirety.

This application is related to the following concurrently filed U.S. provisional applications, each of which is incorporated herein by reference in its entirety.

In recent years, artificial neural networks (ANNs) have attracted renewed attention. This research is commonly referred to as deep learning, computer learning, and similar terms. Advances in the computing power of general-purpose processors have helped revive interest in ANNs after several decades. Recent applications of ANNs include language and image recognition. The demand for improved performance and efficiency in the computations associated with ANNs appears to be growing.

In view of the above, the present invention provides an apparatus comprising an array of N processing units, a first memory, and a second memory. Each processing unit of the array includes an accumulator, an arithmetic unit, a weight input, and a multiplexed register. The accumulator has an output. The arithmetic unit has first, second, and third inputs and performs an operation on them to generate a result for storage in the accumulator; the first input receives the output of the accumulator. The weight input is received by the second input of the arithmetic unit. The multiplexed register has first and second data inputs, an output, and a control input; its output is received by the third input of the arithmetic unit, and the control input controls the selection between the first and second data inputs. The output of the multiplexed register is also received by the second data input of the multiplexed register of an adjacent processing unit, so that when the control input selects the second data input, the multiplexed registers of the N processing units collectively operate as an N-word rotator. The first memory holds W rows of N weight words and provides the N weight words of one of the W rows to the corresponding weight inputs of the N processing units of the array. The second memory holds D rows of N data words and provides the N data words of one of the D rows to the corresponding first data inputs of the multiplexed registers of the N processing units of the array.
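The data flow described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the claimed hardware: the arithmetic operation is assumed to be a multiply-accumulate, the rotation direction (unit j receives from unit j-1) is assumed, and the name `rotate_mac` is hypothetical.

```python
def rotate_mac(weight_rows, data_row, steps):
    """Model N processing units whose multiplexed (mux) registers form an N-word rotator.

    weight_rows: rows of N weight words (stands in for the first memory).
    data_row:    one row of N data words (stands in for the second memory).
    steps:       number of accumulate steps; step k uses weight_rows[k].
    Assumption: the arithmetic unit computes acc + weight * data.
    """
    n = len(data_row)
    acc = [0] * n             # one accumulator per processing unit
    mux_reg = list(data_row)  # control input selects the first data input: load the row
    for k in range(steps):
        # Each arithmetic unit combines its accumulator output, its weight
        # input, and the output of its own mux register.
        acc = [acc[j] + weight_rows[k][j] * mux_reg[j] for j in range(n)]
        # Control input selects the second data input: unit j latches the
        # output of adjacent unit j-1, rotating all N words by one position.
        mux_reg = [mux_reg[(j - 1) % n] for j in range(n)]
    return acc


if __name__ == "__main__":
    weights = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    print(rotate_mac(weights, [1, 2, 3], steps=3))
```

With all-ones weights and N rotation steps, every accumulator ends up holding the sum of the whole data row, which shows how the rotator lets each unit see every data word without a full crossbar.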

The present invention also provides a processor. The processor includes an instruction set, an array of N processing units, a first memory, and a second memory. The instruction set has architectural instructions that direct the operation of the processor. Each processing unit of the array includes an accumulator, an arithmetic unit, a weight input, and a multiplexed register. The accumulator has an output. The arithmetic unit has first, second, and third inputs and performs an operation on them to generate a result for storage in the accumulator; the first input receives the output of the accumulator. The weight input is received by the second input of the arithmetic unit. The multiplexed register has first and second data inputs, an output, and a control input; its output is received by the third input of the arithmetic unit, and the control input controls the selection between the first and second data inputs. The output of the multiplexed register is also received by the second data input of the multiplexed register of an adjacent processing unit, so that when the control input selects the second data input, the multiplexed registers of the N processing units collectively operate as an N-word rotator. The first memory holds W rows of N weight words and provides the N weight words of one of the W rows to the corresponding weight inputs of the N processing units of the array. The second memory holds D rows of N data words and provides the N data words of one of the D rows to the corresponding first data inputs of the multiplexed registers of the N processing units of the array.

The present invention also provides a computer program product encoded in at least one non-transitory computer usable medium for use with a computing device. The computer program product includes computer usable program code embodied in the medium for describing an apparatus; the program code includes first, second, and third program code. The first program code specifies an array of N processing units. Each processing unit of the array includes an accumulator, an arithmetic unit, a weight input, and a multiplexed register. The accumulator has an output. The arithmetic unit has first, second, and third inputs and performs an operation on them to generate a result for storage in the accumulator; the first input receives the output of the accumulator. The weight input is received by the second input of the arithmetic unit. The multiplexed register has first and second data inputs, an output, and a control input; its output is received by the third input of the arithmetic unit, and the control input controls the selection between the first and second data inputs. The output of the multiplexed register is also received by the second data input of the multiplexed register of an adjacent processing unit, so that when the control input selects the second data input, the multiplexed registers of the N processing units collectively operate as an N-word rotator. The second program code specifies a first memory that holds W rows of N weight words and provides the N weight words of one of the W rows to the corresponding weight inputs of the N processing units of the array. The third program code specifies a second memory that holds D rows of N data words and provides the N data words of one of the D rows to the corresponding first data inputs of the multiplexed registers of the N processing units of the array.

Specific embodiments of the present invention are further described by way of the following embodiments and drawings.

100‧‧‧processor

101‧‧‧instruction fetch unit

102‧‧‧instruction cache

103‧‧‧architectural instructions

104‧‧‧instruction translator

105‧‧‧microinstructions

106‧‧‧rename unit

108‧‧‧reservation stations

112‧‧‧other execution units

114‧‧‧memory subsystem

116‧‧‧general purpose registers

118‧‧‧media registers

121‧‧‧neural network unit

122‧‧‧data random access memory

124‧‧‧weight random access memory

126‧‧‧neural processing unit

127‧‧‧control and status register

128‧‧‧sequencer

129‧‧‧program memory

123, 125, 131‧‧‧memory addresses

133, 215, 133A, 133B, 215A, 215B‧‧‧results

202‧‧‧accumulator

204‧‧‧arithmetic logic unit

203, 209, 217, 203A, 203B, 209A, 209B, 217A, 217B‧‧‧outputs

205, 205A, 205B‧‧‧registers

206, 207, 211, 1811, 206A, 206B, 207A, 207B, 211A, 211B, 1811A, 1811B, 711‧‧‧inputs

208, 705, 208A, 208B‧‧‧multiplexed registers

213, 713, 803‧‧‧control inputs

212‧‧‧activation function unit

242‧‧‧multiplier

244‧‧‧adder

246, 246A, 246B‧‧‧products

802‧‧‧multiplexer

1104‧‧‧row buffer

1112‧‧‧activation function unit

1400‧‧‧MTNN instruction

1500‧‧‧MFNN instruction

1432, 1532‧‧‧functions

1402, 1502‧‧‧opcode fields

1404‧‧‧src1 field

1406‧‧‧src2 field

1408, 1508‧‧‧gpr fields

1412, 1512‧‧‧immediate fields

1422, 1522‧‧‧addresses

1424, 1426, 1524‧‧‧data blocks

1428, 1528‧‧‧selected rows

1434‧‧‧control logic

1504‧‧‧dst field

1602‧‧‧read port

1604‧‧‧write port

1606‧‧‧memory array

1702‧‧‧port

1704‧‧‧buffer

1898‧‧‧operand selection logic

1896A‧‧‧wide multiplexer

1896B‧‧‧narrow multiplexer

242A‧‧‧wide multiplier

242B‧‧‧narrow multiplier

244A‧‧‧wide adder

244B‧‧‧narrow adder

204A‧‧‧wide arithmetic logic unit

204B‧‧‧narrow arithmetic logic unit

202A‧‧‧wide accumulator

202B‧‧‧narrow accumulator

212A‧‧‧wide activation function unit

212B‧‧‧narrow activation function unit

2402‧‧‧convolution kernel

2404‧‧‧data array

2406A, 2406B‧‧‧data matrices

2602, 2604, 2606, 2608, 2902, 2912, 2914, 2922, 2924, 2926, 2932, 2934, 2942, 2944, 2952, 2954, 2956, 2923, 2962, 2964‧‧‧fields

3003‧‧‧random bit source

3005‧‧‧random bits

3002‧‧‧positive form converter and output binary point aligner

3004‧‧‧rounder

3006‧‧‧multiplexer

3008‧‧‧canonical size compressor and saturator

3012‧‧‧bit selector and saturator

3018‧‧‧rectifier

3014‧‧‧reciprocal multiplier

3016‧‧‧right shifter

3028‧‧‧canonical size pass-through value

3022‧‧‧hyperbolic tangent module

3024‧‧‧sigmoid module

3026‧‧‧softplus module

3032‧‧‧multiplexer

3034‧‧‧sign restorer

3036‧‧‧size converter and saturator

3037‧‧‧multiplexer

3038‧‧‧output register

3402‧‧‧multiplexer

3401‧‧‧neural processing unit pipeline stage

3404‧‧‧decoder

3412, 3414, 3418‧‧‧micro-operations

3416‧‧‧microinstruction

3422‧‧‧mode indicator

3502‧‧‧clock generation logic

3504‧‧‧clock reduction logic

3514‧‧‧interface logic

3522‧‧‧data random access memory buffer

3524‧‧‧weight random access memory buffer

3512‧‧‧relax indicator

3802‧‧‧program counter

3804‧‧‧loop counter

3806‧‧‧iteration counter

3912, 3914, 3916‧‧‧fields

4901‧‧‧neural processing unit group

4903‧‧‧mask

4905, 4907, 5599‧‧‧inputs

Figure 1 is a block diagram illustrating a processor that includes a neural network unit (NNU).

Figure 2 is a block diagram illustrating a neural processing unit (NPU) of Figure 1.

Figure 3 is a block diagram illustrating the N multiplexed registers of the N neural processing units of the neural network unit of Figure 1 operating as an N-word rotator, or circular shifter, on a row of data words received from the data random access memory of Figure 1.

Figure 4 is a table illustrating a program stored in the program memory of, and executed by, the neural network unit of Figure 1.

Figure 5 is a timing diagram illustrating the execution of the program of Figure 4 by the neural network unit.

Figure 6A is a block diagram illustrating the neural network unit of Figure 1 executing the program of Figure 4.

Figure 6B is a flowchart illustrating the operation of the processor of Figure 1 executing an architectural program that uses the neural network unit to perform the multiply-accumulate-activation function computations typically associated with neurons of a hidden layer of an artificial neural network, such as those performed by the program of Figure 4.

Figure 7 is a block diagram illustrating another embodiment of the neural processing unit of Figure 1.

Figure 8 is a block diagram illustrating yet another embodiment of the neural processing unit of Figure 1.

Figure 9 is a table illustrating a program stored in the program memory of, and executed by, the neural network unit of Figure 1.

Figure 10 is a timing diagram illustrating the execution of the program of Figure 9 by the neural network unit.

Figure 11 is a block diagram illustrating an embodiment of the neural network unit of Figure 1. In the embodiment of Figure 11, a neuron is split into two portions, an activation function unit portion and an arithmetic logic unit portion (which also includes the shift register portion), and each activation function unit portion is shared by multiple arithmetic logic unit portions.

Figure 12 is a timing diagram illustrating the execution of the program of Figure 4 by the neural network unit of Figure 11.

Figure 13 is a timing diagram illustrating the execution of the program of Figure 4 by the neural network unit of Figure 11.

Figure 14 is a block diagram illustrating a move to neural network (MTNN) architectural instruction and its operation with respect to portions of the neural network unit of Figure 1.

Figure 15 is a block diagram illustrating a move from neural network (MFNN) architectural instruction and its operation with respect to portions of the neural network unit of Figure 1.

Figure 16 is a block diagram illustrating an embodiment of the data random access memory of Figure 1.

Figure 17 is a block diagram illustrating an embodiment of the weight random access memory of Figure 1 and a buffer.

Figure 18 is a block diagram illustrating a dynamically configurable neural processing unit of Figure 1.

Figure 19 is a block diagram illustrating, according to the embodiment of Figure 18, the 2N multiplexed registers of the N neural processing units of the neural network unit of Figure 1 operating as a rotator on a row of data words received from the data random access memory of Figure 1.

Figure 20 is a table illustrating a program stored in the program memory of, and executed by, the neural network unit of Figure 1, where the neural network unit has neural processing units according to the embodiment of Figure 18.

Figure 21 is a timing diagram illustrating the execution of the program of Figure 20 by a neural network unit whose neural processing units, as shown in Figure 18, operate in a narrow configuration.

Figure 22 is a block diagram illustrating the neural network unit of Figure 1 having the neural processing units of Figure 18 to execute the program of Figure 20.

Figure 23 is a block diagram illustrating another embodiment of a dynamically configurable neural processing unit of Figure 1.

Figure 24 is a block diagram illustrating an example of data structures used by the neural network unit of Figure 1 to perform a convolution operation.

Figure 25 is a flowchart illustrating the operation of the processor of Figure 1 executing an architectural program that uses the neural network unit to convolve the convolution kernel with the data array of Figure 24.

Figure 26A is a program listing of a neural network unit program that convolves a data matrix with the convolution kernel of Figure 24 and writes the result back to the weight random access memory.

Figure 26B is a block diagram illustrating an embodiment of certain fields of the control register of the neural network unit of Figure 1.

Figure 27 is a block diagram illustrating an example of the weight random access memory of Figure 1 populated with input data on which the neural network unit of Figure 1 performs a pooling operation.

Figure 28 is a program listing of a neural network unit program that performs a pooling operation on the input data matrix of Figure 27 and writes the result back to the weight random access memory.

Figure 29A is a block diagram illustrating an embodiment of the control register of Figure 1.

Figure 29B is a block diagram illustrating another embodiment of the control register of Figure 1.

Figure 29C is a block diagram illustrating an embodiment in which the reciprocal of Figure 29A is stored in two parts.

Figure 30 is a block diagram illustrating an embodiment of the activation function unit (AFU) of Figure 2.

Figure 31 illustrates an example of the operation of the activation function unit of Figure 30.

Figure 32 illustrates a second example of the operation of the activation function unit of Figure 30.

Figure 33 illustrates a third example of the operation of the activation function unit of Figure 30.

Figure 34 is a block diagram illustrating the processor of Figure 1 and portions of the neural network unit in more detail.

Figure 35 is a block diagram illustrating a processor with a variable rate neural network unit.

Figure 36A is a timing diagram illustrating an example of the operation of a processor with a neural network unit in normal mode, that is, operating at the primary clock rate.

Figure 36B is a timing diagram illustrating an example of the operation of a processor with a neural network unit in relaxed mode, in which the clock rate is lower than the primary clock rate.

Figure 37 is a flowchart illustrating the operation of the processor of Figure 35.

Figure 38 is a block diagram illustrating the sequencer of the neural network unit in more detail.

Figure 39 is a block diagram illustrating certain fields of the control and status register of the neural network unit.

Figure 40 is a block diagram illustrating an example of an Elman recurrent neural network (RNN).

Figure 41 is a block diagram illustrating an example of the layout of data within the data random access memory and weight random access memory of the neural network unit as it performs the calculations associated with the Elman recurrent neural network of Figure 40.

Figure 42 is a table illustrating a program stored in the program memory of the neural network unit; the program is executed by the neural network unit and uses data and weights according to the layout of Figure 41 to accomplish an Elman recurrent neural network.

Figure 43 is a block diagram illustrating an example of a Jordan recurrent neural network.

Figure 44 is a block diagram illustrating an example of the layout of data within the data random access memory and weight random access memory of the neural network unit as it performs the calculations associated with the Jordan recurrent neural network of Figure 43.

Figure 45 is a table illustrating a program stored in the program memory of the neural network unit; the program is executed by the neural network unit and uses data and weights according to the layout of Figure 44 to accomplish a Jordan recurrent neural network.

Figure 46 is a block diagram illustrating an embodiment of a long short term memory (LSTM) cell.

Figure 47 is a block diagram illustrating an example of the layout of data within the data random access memory and weight random access memory of the neural network unit as it performs the calculations associated with a layer of the LSTM cells of Figure 46.

Figure 48 is a table illustrating a program stored in the program memory of the neural network unit; the program is executed by the neural network unit and uses data and weights according to the layout of Figure 47 to accomplish the calculations associated with an LSTM cell layer.

Figure 49 is a block diagram illustrating an embodiment of a neural network unit with output buffer masking and feedback capability within neural processing unit groups.

Figure 50 is a block diagram illustrating an example of the layout of data within the data random access memory, weight random access memory, and output buffer of the neural network unit of Figure 49 as it performs the calculations associated with a layer of the LSTM cells of Figure 46.

Figure 51 is a table illustrating a program stored in the program memory of the neural network unit; the program is executed by the neural network unit of Figure 49 and uses data and weights according to the layout of Figure 50 to accomplish the calculations associated with an LSTM cell layer.

Figure 52 is a block diagram illustrating an embodiment of a neural network unit with output buffer masking and feedback capability within neural processing unit groups, in which the activation function units are shared.

Figure 53 is a block diagram illustrating another embodiment of the layout of data within the data random access memory, weight random access memory, and output buffer of the neural network unit of Figure 49 as it performs the calculations associated with a layer of the LSTM cells of Figure 46.

Figure 54 is a table illustrating a program stored in the program memory of the neural network unit; the program is executed by the neural network unit of Figure 49 and uses data and weights according to the layout of Figure 53 to accomplish the calculations associated with an LSTM cell layer.

Figure 55 is a block diagram illustrating portions of a neural processing unit according to another embodiment of the present invention.

Figure 56 is a block diagram illustrating an example of the layout of data within the data random access memory and weight random access memory of the neural network unit as it performs the calculations associated with the Jordan recurrent neural network of Figure 43 using the embodiment of Figure 55.

Figure 57 is a table illustrating a program stored in the program memory of the neural network unit; the program is executed by the neural network unit and uses data and weights according to the layout of Figure 56 to accomplish a Jordan recurrent neural network.

Processor with Architectural Neural Network Unit

Figure 1 is a block diagram illustrating a processor 100 that includes a neural network unit (NNU) 121. As shown, the processor 100 includes an instruction fetch unit 101, an instruction cache 102, an instruction translator 104, a rename unit 106, a plurality of reservation stations 108, a plurality of media registers 118, a plurality of general purpose registers 116, a plurality of execution units 112 other than the neural network unit 121, and a memory subsystem 114.

The processor 100 is an electronic device that functions as a central processing unit (CPU) on an integrated circuit. The processor 100 receives digital data as input, processes the data according to instructions fetched from a memory, and generates as output the results of the operations prescribed by the instructions. The processor 100 may be used in a desktop computer, mobile device, or tablet computer for applications such as computation, word processing, multimedia display, and web browsing. The processor 100 may also be placed in an embedded system to control a wide variety of devices, including appliances, mobile phones, smartphones, vehicles, and industrial controllers. A central processing unit is the electronic circuitry (that is, hardware) that executes the instructions of a computer program (also known as a computer application or application) by performing operations on data, including arithmetic, logical, and input/output operations. An integrated circuit is a set of electronic circuits fabricated on a small piece of semiconductor material, typically silicon. The term integrated circuit is also commonly used to refer to a chip, microchip, or die.

The instruction fetch unit 101 controls the fetching of architectural instructions 103 from system memory (not shown) into the instruction cache 102. The instruction fetch unit 101 provides a fetch address to the instruction cache 102 that specifies the memory address of a cache line of architectural instruction bytes that the processor 100 fetches into the instruction cache 102. The fetch address is selected based on the current value of the instruction pointer (not shown), or program counter, of the processor 100. Normally, the program counter is incremented sequentially by the size of each instruction until a control instruction such as a branch, call, or return appears in the instruction stream, or an exception condition such as an interrupt, trap, exception, or fault occurs, in which case the program counter is updated with a non-sequential address such as a branch target address, return address, or exception vector. Generally speaking, the program counter is updated in response to the execution of instructions by the execution units 112/121. The program counter may also be updated when an exception condition is detected, for example when the instruction translator 104 encounters an instruction 103 that is not defined by the instruction set architecture of the processor 100.
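The program counter behaviour described in the preceding paragraph can be summarized in a short model. This is a minimal sketch, not the patent's logic: the `Event` type, its field names, and the `next_pc` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """Hypothetical description of what happened to one instruction."""
    kind: str                     # "sequential", "control", or "exception"
    size: int = 0                 # instruction length in bytes (sequential case)
    target: Optional[int] = None  # branch target, return address, or exception vector


def next_pc(pc: int, event: Event) -> int:
    """Return the updated program counter for one event."""
    if event.kind == "sequential":
        # Normal case: advance by the size of the instruction.
        return pc + event.size
    if event.kind in ("control", "exception"):
        # Branch/call/return or interrupt/trap/fault: load a non-sequential address.
        if event.target is None:
            raise ValueError("non-sequential update needs a target address")
        return event.target
    raise ValueError(f"unknown event kind: {event.kind}")


if __name__ == "__main__":
    pc = 0x1000
    pc = next_pc(pc, Event("sequential", size=3))
    pc = next_pc(pc, Event("control", target=0x2000))
    print(hex(pc))
```

Variable-length instruction sets such as x86 are the reason the sequential step takes a per-instruction `size` rather than a fixed increment.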

The instruction cache 102 stores architectural instructions 103 fetched from a system memory that is coupled to the processor 100. The architectural instructions 103 include a move to neural network (MTNN) instruction and a move from neural network (MFNN) instruction, which are described in more detail below. In one embodiment, the architectural instructions 103 are instructions of the x86 instruction set architecture, with the addition of the MTNN and MFNN instructions. In the context of the present disclosure, an x86 instruction set architecture processor is understood to be a processor that, when executing the same machine language instructions, generates the same results at the instruction set architecture level as an Intel® 80386® processor. However, other instruction set architectures, such as ARM (Advanced RISC Machines), Sun's SPARC (Scalable Processor Architecture), or PowerPC (Performance Optimization With Enhanced RISC, Performance Computing), may be employed in other embodiments of the invention. The instruction cache 102 provides the architectural instructions 103 to the instruction translator 104, which translates the architectural instructions 103 into microinstructions 105.

The microinstructions 105 are provided to the rename unit 106 and are eventually executed by the execution units 112/121. The microinstructions 105 implement the architectural instructions. In a preferred embodiment, the instruction translator 104 includes a first portion that translates frequently executed and/or relatively less complex architectural instructions 103 into microinstructions 105. The instruction translator 104 also includes a second portion that has a microcode unit (not shown). The microcode unit has a microcode memory that holds microcode instructions, which implement the complex and/or infrequently used instructions of the architectural instruction set. The microcode unit also includes a microsequencer that provides a non-architectural micro-program counter (micro-PC) to the microcode memory. In a preferred embodiment, the microcode instructions are translated by a microtranslator (not shown) into the microinstructions 105. A selector selects the microinstructions 105 from either the first portion or the second portion for provision to the rename unit 106, depending on whether the microcode unit currently has control.

The rename unit 106 renames the architectural registers specified by the architectural instructions 103 to physical registers of the processor 100. In a preferred embodiment, the processor 100 includes a reorder buffer (not shown). The rename unit 106 allocates, in program order, an entry in the reorder buffer for each microinstruction 105. This enables the processor 100 to retire the microinstructions 105, and their corresponding architectural instructions 103, in program order. In one embodiment, the media registers 118 are 256 bits wide and the general purpose registers 116 are 64 bits wide. In one embodiment, the media registers 118 are x86 media registers, such as Advanced Vector Extensions (AVX) registers.

In one embodiment, each entry of the reorder buffer includes storage for the result of the microinstruction 105. Additionally, the processor 100 includes an architectural register file that has a physical register for each of the architectural registers, e.g., the media registers 118, the general purpose registers 116, and the other architectural registers. (In a preferred embodiment, separate register files are used for the media registers 118 and the general purpose registers 116, for example, because they are different sizes.) For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with the reorder buffer index of the newest of the older microinstructions 105 that write to the architectural register. When the execution unit 112/121 completes execution of the microinstruction 105, it writes the result to the reorder buffer entry of the microinstruction 105. When the microinstruction 105 retires, a retire unit (not shown) writes the result from the reorder buffer entry of the microinstruction to the register of the physical register file that is associated with the architectural destination register specified by the retiring microinstruction 105.

In another embodiment, the processor 100 includes a physical register file that has more physical registers than the number of architectural registers, but the processor 100 does not include an architectural register file, and the reorder buffer entries do not include result storage. (In a preferred embodiment, separate register files are used for the media registers 118 and the general purpose registers 116 because they are different sizes.) The processor 100 also includes a pointer table that has an associated pointer for each architectural register. For each destination operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the destination operand field of the microinstruction 105 with a pointer to a free register in the physical register file. If no register is free in the physical register file, the rename unit 106 stalls the pipeline. For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with a pointer to the register in the physical register file that was assigned to the newest of the older microinstructions 105 that write to the architectural register. When the execution unit 112/121 completes execution of the microinstruction 105, it writes the result to the register of the physical register file pointed to by the destination operand field of the microinstruction 105. When the microinstruction 105 retires, the retire unit copies the destination operand field value of the microinstruction 105 to the pointer in the pointer table that is associated with the architectural destination register specified by the retiring microinstruction 105.
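
By way of illustration only, the pointer-table renaming of this second embodiment may be sketched as follows. The class name, method names, and register names are hypothetical, and the step that returns the previously mapped physical register to the free list at retirement is an assumption that the description does not state.

```python
class Renamer:
    """Toy model of renaming with a physical register file and a pointer table.

    All identifiers are illustrative; they do not come from the description.
    """

    def __init__(self, arch_regs, num_phys):
        assert num_phys > len(arch_regs)
        # Retired mapping: architectural register -> physical register.
        self.pointer_table = {r: i for i, r in enumerate(arch_regs)}
        # Mapping to the newest older writer, used to rename source operands.
        self.newest_writer = dict(self.pointer_table)
        self.free = list(range(len(arch_regs), num_phys))

    def rename(self, dest, sources):
        """Return (dest_phys, source_phys), or None if the pipeline must stall."""
        if not self.free:
            return None  # no free physical register
        source_phys = [self.newest_writer[s] for s in sources]
        dest_phys = self.free.pop(0)
        self.newest_writer[dest] = dest_phys
        return dest_phys, source_phys

    def retire(self, dest, dest_phys):
        """Copy the destination pointer into the pointer table."""
        old = self.pointer_table[dest]
        self.pointer_table[dest] = dest_phys
        self.free.append(old)  # assumption: the old register becomes free


renamer = Renamer(["rax", "rbx"], num_phys=3)
first = renamer.rename("rax", ["rbx"])    # takes the only free register
second = renamer.rename("rbx", ["rax"])   # nothing free: stall
renamer.retire("rax", first[0])
third = renamer.rename("rbx", ["rax"])    # succeeds after retirement
```

In this sketch the second rename stalls because the single spare physical register is still held by the first microinstruction, and it succeeds once that microinstruction retires.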

The reservation stations 108 hold the microinstructions 105 until they are ready to be issued to an execution unit 112/121 for execution. A microinstruction 105 is ready to be issued when all of its source operands are available and an execution unit 112/121 is available to execute it. The execution units 112/121 receive register source operands from the reorder buffer or the architectural register file in the first embodiment described above, or from the physical register file in the second embodiment described above. Additionally, the execution units 112/121 may receive register source operands directly via result forwarding buses (not shown). Additionally, the execution units 112/121 may receive from the reservation stations 108 immediate operands specified by the microinstructions 105. The MTNN and MFNN architectural instructions 103 include an immediate operand that specifies the function to be performed by the neural network unit 121, and this function is provided in the one or more microinstructions 105 into which the MTNN and MFNN architectural instructions 103 are translated, as described in more detail below.

The execution units 112 include one or more load/store units (not shown) that load data from the memory subsystem 114 and store data to the memory subsystem 114. In a preferred embodiment, the memory subsystem 114 includes a memory management unit (not shown), which may include, for example, translation lookaside buffers and a tablewalk unit, a level-1 data cache (and the instruction cache 102), a level-2 unified cache, and a bus interface unit that interfaces the processor 100 to system memory. In one embodiment, the processor 100 of FIG. 1 is representative of one of multiple processing cores of a multi-core processor that share a last-level cache memory. The execution units 112 may also include integer units, media units, floating-point units, and a branch unit.

The neural network unit 121 includes a weight random access memory (RAM) 124, a data RAM 122, N neural processing units (NPUs) 126, a program memory 129, a sequencer 128, and control and status registers 127. The NPUs 126 function conceptually as neurons in a neural network. The weight RAM 124, the data RAM 122, and the program memory 129 are all writable and readable via the MTNN and MFNN architectural instructions 103, respectively. The weight RAM 124 is arranged as W rows of N weight words each, and the data RAM 122 is arranged as D rows of N data words each. Each data word and each weight word is a plurality of bits; in a preferred embodiment, 8 bits, 9 bits, 12 bits, or 16 bits. Each data word functions as the output value (sometimes referred to as an activation) of a neuron of the previous layer in the network, and each weight word functions as the weight associated with a connection coming into a neuron of the current layer of the network. Although in many uses of the neural network unit 121 the words, or operands, held in the weight RAM 124 are in fact weights associated with a connection coming into a neuron, it should be understood that in some uses of the neural network unit 121 the words held in the weight RAM 124 are not weights, but are nevertheless referred to as "weight words" because they are stored in the weight RAM 124. For example, in some uses of the neural network unit 121, such as the convolution example of FIGS. 24 through 26A or the pooling example of FIGS. 27 through 28, the weight RAM 124 holds objects other than weights, such as elements of a data matrix (e.g., image pixel data). Similarly, although in many uses of the neural network unit 121 the words, or operands, held in the data RAM 122 are in fact the output values, or activations, of neurons, it should be understood that in some uses of the neural network unit 121 the words held in the data RAM 122 are not, but are nevertheless referred to as "data words" because they are stored in the data RAM 122. For example, in some uses of the neural network unit 121, such as the convolution example of FIGS. 24 through 26A, the data RAM 122 holds values that are not neuron outputs, such as elements of a convolution kernel.

In one embodiment, the NPUs 126 and the sequencer 128 comprise combinatorial logic, sequential logic, state machines, or a combination thereof. An architectural instruction (e.g., the MFNN instruction 1500) loads the contents of the status register 127 into one of the general purpose registers 116 to determine the status of the neural network unit 121, for example, that the neural network unit 121 has completed a command or has completed a program it was running from the program memory 129, or that the neural network unit 121 is free to receive a new command or to start a new neural network unit program.

The number of NPUs 126 may be increased as needed, and the weight RAM 124 and the data RAM 122 may be extended in both width and depth accordingly. In a preferred embodiment, the weight RAM 124 is larger than the data RAM 122, because a typical neural network layer has many connections, and therefore requires more storage to hold the weights associated with each neuron. Various embodiments are described herein regarding the size of the data and weight words, the sizes of the weight RAM 124 and the data RAM 122, and the number of NPUs 126. In one embodiment, the neural network unit 121 has a 64 KB (8192 bits x 64 rows) data RAM 122, a 2 MB (8192 bits x 2048 rows) weight RAM 124, and 512 NPUs 126. This neural network unit 121 is fabricated in a Taiwan Semiconductor Manufacturing Company (TSMC) 16 nm process and occupies an area of approximately 3.3 square millimeters.
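
The capacities quoted for this embodiment follow directly from the row width and the row counts. The short check below reproduces the arithmetic; the constant and function names are illustrative only.

```python
# Figures taken from the embodiment described above.
ROW_BITS = 8192          # width of one RAM row
DATA_RAM_ROWS = 64       # D
WEIGHT_RAM_ROWS = 2048   # W
NUM_NPUS = 512           # N


def ram_bytes(rows: int, row_bits: int = ROW_BITS) -> int:
    """Capacity in bytes of a RAM with `rows` rows of `row_bits` bits each."""
    return rows * row_bits // 8


data_ram_kib = ram_bytes(DATA_RAM_ROWS) // 1024
weight_ram_mib = ram_bytes(WEIGHT_RAM_ROWS) // (1024 * 1024)
bits_per_npu = ROW_BITS // NUM_NPUS

print(data_ram_kib, weight_ram_mib, bits_per_npu)
```

With 512 NPUs sharing an 8192-bit row, each NPU receives one 16-bit word per row, which matches the largest of the word sizes listed earlier.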

The sequencer 128 fetches instructions from the program memory 129 and executes them, which includes, among other things, generating address and control signals for provision to the data RAM 122, the weight RAM 124, and the NPUs 126. The sequencer 128 generates a memory address 123 and a read command for provision to the data RAM 122 to select one of the D rows of N data words for provision to the N NPUs 126. The sequencer 128 also generates a memory address 125 and a read command for provision to the weight RAM 124 to select one of the W rows of N weight words for provision to the N NPUs 126. The sequence of the addresses 123 and 125 that the sequencer 128 generates for provision to the NPUs 126 determines the "connections" between neurons. The sequencer 128 also generates a memory address 123 and a write command for provision to the data RAM 122 to select one of the D rows of N data words to be written by the N NPUs 126. The sequencer 128 also generates a memory address 125 and a write command for provision to the weight RAM 124 to select one of the W rows of N weight words to be written by the N NPUs 126. The sequencer 128 also generates a memory address 131 to the program memory 129 to select a neural network unit instruction that is provided to the sequencer 128, as described below. The memory address 131 corresponds to a program counter (not shown) that the sequencer 128 generally increments through sequential locations of the program memory 129, unless the sequencer 128 encounters a control instruction, such as a loop instruction (see, for example, FIG. 26A), in which case the sequencer 128 updates the program counter to the target address of the control instruction. The sequencer 128 also generates control signals to the NPUs 126 to instruct them to perform various operations or functions, such as initialization, arithmetic/logical operations, rotate and shift operations, activation functions, and write-back operations, examples of which are described in more detail below (see, for example, the micro-operations 3418 of FIG. 34).
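
As a minimal sketch of the program counter behavior described above, the following model steps through a small program and records the order in which program memory locations are fetched. The instruction encoding (tuples named "init_loop", "mac", "loop", and "write") and the loop counter are hypothetical stand-ins and are not the neural network unit instruction format.

```python
def run_sequencer(program, max_steps=1000):
    """Return the order in which program-memory addresses are fetched."""
    pc = 0               # program counter (memory address 131 in the text)
    loop_count = None    # hypothetical loop counter kept by the sequencer
    trace = []
    while pc < len(program) and len(trace) < max_steps:
        trace.append(pc)
        instr = program[pc]
        if instr[0] == "init_loop":
            loop_count = instr[1]
            pc += 1
        elif instr[0] == "loop":
            loop_count -= 1
            if loop_count > 0:
                pc = instr[1]   # control instruction: jump to the target
            else:
                pc += 1         # loop exhausted: fall through
        else:
            pc += 1             # ordinary instruction: sequential increment
    return trace


program = [
    ("init_loop", 3),   # address 0
    ("mac",),           # address 1: loop body
    ("loop", 1),        # address 2: branch back to address 1
    ("write",),         # address 3
]
trace = run_sequencer(program)
```

The trace shows sequential fetches interrupted only by the loop instruction, so the body at address 1 is fetched three times before the program falls through to address 3.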

The N NPUs 126 generate N result words 133 that may be written back to a row of the weight RAM 124 or of the data RAM 122. In a preferred embodiment, the weight RAM 124 and the data RAM 122 are directly coupled to the N NPUs 126. More specifically, the weight RAM 124 and the data RAM 122 are dedicated to the NPUs 126 and are not shared with the other execution units 112 of the processor 100, and the NPUs 126 are capable of consuming a row from one or both of the weight RAM 124 and the data RAM 122 on every clock cycle in a sustained manner, preferably in a pipelined fashion. In one embodiment, each of the data RAM 122 and the weight RAM 124 is capable of providing 8192 bits to the NPUs 126 each clock cycle. The 8192 bits may be processed as either 512 16-bit words or 1024 8-bit words, as described in more detail below.
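
The two interpretations of an 8192-bit row can be illustrated as follows. This is a sketch only; the function name is hypothetical and the little-endian byte order within a 16-bit word is an assumption, since the description does not specify one.

```python
def split_row(row: bytes, word_bits: int) -> list[int]:
    """Split one 8192-bit RAM row into unsigned words of 8 or 16 bits."""
    assert len(row) * 8 == 8192
    assert word_bits in (8, 16)
    step = word_bits // 8
    # Assumption: little-endian byte order inside each word.
    return [int.from_bytes(row[i:i + step], "little")
            for i in range(0, len(row), step)]


row = bytes(range(256)) * 4          # 1024 bytes = 8192 bits
words16 = split_row(row, 16)         # 512 words of 16 bits
words8 = split_row(row, 8)           # 1024 words of 8 bits
```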

The size of the data set that may be processed by the neural network unit 121 is not limited by the sizes of the weight RAM 124 and the data RAM 122, but only by the size of system memory, because data and weights may be moved between system memory and the weight RAM 124 and data RAM 122 using the MTNN and MFNN instructions (e.g., through the media registers 118). In one embodiment, the data RAM 122 is dual-ported so that data words can be written to the data RAM 122 while data words are concurrently being read from or written to the data RAM 122. Furthermore, the large memory hierarchy of the memory subsystem 114, including the cache memories, provides very high data bandwidth for transfers between the system memory and the neural network unit 121. Still further, in a preferred embodiment, the memory subsystem 114 includes hardware data prefetchers that track memory access patterns, such as loads of neural data and weights from system memory, and perform data prefetches into the cache hierarchy to facilitate high-bandwidth, low-latency transfers to the weight RAM 124 and the data RAM 122.

Although in the embodiments described herein one of the operands provided to each NPU 126 is provided from a weight memory and is denoted a weight, a term commonly used in neural networks, it should be understood that the operands may be other types of data associated with calculations whose speed may be improved by the apparatuses described.

FIG. 2 is a block diagram illustrating an NPU 126 of FIG. 1. As shown, the NPU 126 operates to perform many functions, or operations. In particular, the NPU 126 is configured to operate as a neuron, or node, in an artificial neural network to perform a classic multiply-accumulate function, or operation. That is, generally speaking, the NPU 126 (neuron) is configured to: (1) receive an input value from each neuron having a connection to it, typically but not necessarily from the immediately previous layer of the artificial neural network; (2) multiply each input value by a corresponding weight value associated with the connection to generate a product; (3) add all the products to generate a sum; and (4) perform an activation function on the sum to generate the output of the neuron. However, rather than performing all the multiplies associated with all the connection inputs and then adding all the products together, as in a conventional approach, each neuron of the present invention performs, in a given clock cycle, the weight multiply operation associated with one of the connection inputs and then adds (accumulates) the product to the accumulated value of the products associated with the connection inputs processed in the clock cycles up to that point. Assuming there are M connections to the neuron, after all M products have been accumulated (which takes approximately M clock cycles), the neuron performs the activation function on the accumulated value to generate the output, or result. This approach has the advantage of requiring fewer multipliers and only a smaller, simpler, and faster adder circuit in the neuron (e.g., a 2-input adder), rather than an adder capable of summing all, or even a subset, of the products of the connection inputs. This, in turn, facilitates a very large number (N) of neurons (NPUs 126) in the neural network unit 121, so that after approximately M clock cycles the neural network unit 121 has generated the outputs of all of the large number (N) of neurons. Finally, the neural network unit 121 constructed of such neurons performs efficiently as an artificial neural network layer for a wide range of numbers of connection inputs. That is, as M increases or decreases for different layers, the number of clock cycles required to generate the neuron outputs increases or decreases correspondingly, and the resources (e.g., multipliers and accumulators) are fully utilized. By contrast, in a more conventional design, some of the multipliers and a portion of the adder would go unused for smaller values of M. Thus, the embodiments described herein are both flexible and efficient with respect to the number of connection inputs to the neurons of the neural network unit 121, and provide extremely high performance.
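
The time-multiplexed accumulation described above may be sketched as follows. The function names and the use of a rectifier as the activation function are assumptions made for illustration; the inner loop over neurons stands in for work that the N NPUs 126 perform in parallel in hardware.

```python
def relu(x):
    """Hypothetical activation function used only for this illustration."""
    return max(0, x)


def layer_forward(inputs, weights, activation=relu):
    """Compute one layer with one multiply-accumulate per neuron per clock.

    inputs:  list of M input values
    weights: weights[j][k] is the weight of connection k into neuron j
    Returns (outputs, clocks), where clocks counts the accumulate cycles.
    """
    n = len(weights)
    acc = [0] * n                  # one accumulator per neuron
    clocks = 0
    for k, x in enumerate(inputs):         # one connection input per clock
        for j in range(n):                 # parallel across neurons in hardware
            acc[j] += x * weights[j][k]    # 2-input add into the accumulator
        clocks += 1
    return [activation(a) for a in acc], clocks


outputs, clocks = layer_forward([1, 2, 3], [[1, 0, -1], [2, 2, 2]])
```

In this sketch the clock count depends only on M, the number of connection inputs, and not on N, which reflects the property that a layer with fewer connections simply finishes in fewer cycles.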

The NPU 126 includes a register 205, a 2-input multiplexed register (mux-reg) 208, an arithmetic logic unit (ALU) 204, an accumulator 202, and an activation function unit (AFU) 212. The register 205 receives a weight word 206 from the weight RAM 124 and provides it on its output 203 on a subsequent clock cycle. The mux-reg 208 selects one of its two inputs 207 and 211 to store in its register and then provides it on its output 209 on a subsequent clock cycle. The input 207 receives a data word from the data RAM 122. The other input 211 receives the output 209 of the adjacent NPU 126. The NPU 126 shown in FIG. 2 is denoted NPU J from among the N NPUs 126 of FIG. 1. That is, NPU J is a representative instance of the N NPUs 126. In a preferred embodiment, the input 211 of the mux-reg 208 of instance J of the NPU 126 receives the output 209 of the mux-reg 208 of instance J-1 of the NPU 126, and the output 209 of the mux-reg 208 of NPU J is provided to the input 211 of the mux-reg 208 of instance J+1 of the NPU 126. In this manner, the mux-regs 208 of the N NPUs 126 collectively operate as an N-word rotator, or circular shifter, as described in more detail below with respect to FIG. 3. A control input 213 controls which of the two inputs the mux-reg 208 selects to store in its register and subsequently provide on the output 209.
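
The collective behavior of the mux-regs can be modeled in a few lines. The function name and the boolean control argument are illustrative; the wrap-around from the last NPU to NPU 0 is inferred from the description of the registers as a rotator.

```python
def clock_mux_regs(regs, data_row, select_neighbor):
    """Advance all N mux-regs by one clock and return their new contents.

    regs:            current contents of the N mux-regs (outputs 209)
    data_row:        N data words read from the data RAM (inputs 207)
    select_neighbor: models control input 213; True selects input 211
    """
    n = len(regs)
    if select_neighbor:
        # NPU J latches the output of NPU J-1; NPU 0 wraps around to NPU N-1.
        return [regs[(j - 1) % n] for j in range(n)]
    return list(data_row)


regs = clock_mux_regs([0, 0, 0, 0], [10, 11, 12, 13], select_neighbor=False)
loaded = list(regs)
regs = clock_mux_regs(regs, None, select_neighbor=True)
rotated_once = list(regs)
for _ in range(3):
    regs = clock_mux_regs(regs, None, select_neighbor=True)
```

After one load followed by N rotations, every NPU has seen every data word of the row exactly once, which is what allows a single row read to feed N different connection inputs.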

The ALU 204 has three inputs. One input receives the weight word 203 from the register 205. Another input receives the output 209 of the mux-reg 208. The third input receives the output 217 of the accumulator 202. The ALU 204 performs arithmetic and/or logical operations on its inputs to generate a result provided on its output. In a preferred embodiment, the arithmetic and/or logical operations performed by the ALU 204 are specified by instructions stored in the program memory 129. For example, the multiply-accumulate instruction of FIG. 4 specifies a multiply-accumulate operation, i.e., the result 215 is the sum of the accumulator 202 value 217 and the product of the weight word 203 and the data word on the mux-reg 208 output 209. Other operations that may be specified include, but are not limited to: the result 215 is the passed-through value of the mux-reg output 209; the result 215 is the passed-through value of the weight word 203; the result 215 is zero; the result 215 is the sum of the accumulator 202 value 217 and the weight word 203; the result 215 is the sum of the accumulator 202 value 217 and the mux-reg output 209; the result 215 is the maximum of the accumulator 202 value 217 and the weight word 203; the result 215 is the maximum of the accumulator 202 value 217 and the mux-reg output 209.
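
The operations listed above can be summarized as a table of functions of the three ALU inputs. The operation names below are hypothetical labels chosen for this sketch and are not the instruction mnemonics of the neural network unit.

```python
# Each entry maps (accumulator value 217, weight word 203, data word 209)
# to the result 215. The keys are illustrative names only.
ALU_OPS = {
    "mult_accum":  lambda acc, w, d: acc + w * d,
    "pass_data":   lambda acc, w, d: d,
    "pass_weight": lambda acc, w, d: w,
    "zero":        lambda acc, w, d: 0,
    "add_weight":  lambda acc, w, d: acc + w,
    "add_data":    lambda acc, w, d: acc + d,
    "max_weight":  lambda acc, w, d: max(acc, w),
    "max_data":    lambda acc, w, d: max(acc, d),
}


def alu(op, acc, weight, data):
    """Return the result 215 for the selected operation."""
    return ALU_OPS[op](acc, weight, data)


result = alu("mult_accum", acc=10, weight=3, data=4)
```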

The arithmetic logic unit 204 provides its output 215 to the accumulator 202 for storage. The arithmetic logic unit 204 includes a multiplier 242 that multiplies the weight word 203 by the data word on the multiplex register 208 output 209 to generate a product 246. In one embodiment, the multiplier 242 multiplies two 16-bit operands to generate a 32-bit result. The arithmetic logic unit 204 also includes an adder 244 that adds the product 246 to the accumulator 202 output 217 to generate a sum, which is the result 215 of the accumulation that is stored in the accumulator 202. In one embodiment, the adder 244 adds the 32-bit result of the multiplier 242 to a 41-bit value 217 of the accumulator 202 to generate a 41-bit result. In this manner, by using the rotator property of the multiplex register 208 over the course of multiple clock cycles, the neural processing unit 126 accomplishes the sum of products that a neural network requires of a neuron. The arithmetic logic unit 204 may also include other circuit elements to perform the other arithmetic/logical operations described above. In one embodiment, a second adder subtracts the weight word 203 from the data word on the multiplex register 208 output 209 to generate a difference, which the adder 244 then adds to the accumulator 202 output 217 to generate a result 215 that is accumulated in the accumulator 202. In this manner, over the course of multiple clock cycles, the neural processing unit 126 accomplishes a sum of differences. Preferably, although the weight word 203 and the data word 209 are the same size (in bits), they may have different binary point locations, as described in more detail below. Preferably, the multiplier 242 and the adder 244 are an integer multiplier and an integer adder, which makes the arithmetic logic unit 204 less complex, smaller, faster and lower in power consumption than an arithmetic logic unit that uses floating-point arithmetic. However, in other embodiments of the invention, the arithmetic logic unit 204 may also perform floating-point operations.
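As a quick check of the widths quoted above, the following Python sketch computes the worst-case sum when products of two 16-bit operands are accumulated. The count of 512 products is an assumption taken from the example layer discussed elsewhere in this description, and both a signed and an unsigned reading of the operands are checked because the text here does not say which applies.

```python
N_PRODUCTS = 512   # assumed number of products accumulated per neuron
ACC_BITS = 41      # accumulator width in the described embodiment

# Signed reading: the largest-magnitude product is (-2**15) * (-2**15) = 2**30.
max_signed_product = (-2**15) * (-2**15)
worst_signed_sum = N_PRODUCTS * max_signed_product           # 2**39

# Unsigned reading: the largest product is (2**16 - 1) squared.
max_unsigned_product = (2**16 - 1) ** 2
worst_unsigned_sum = N_PRODUCTS * max_unsigned_product

# A 41-bit two's-complement value holds up to 2**40 - 1;
# a 41-bit unsigned value holds up to 2**41 - 1.
signed_fits = worst_signed_sum <= 2**(ACC_BITS - 1) - 1
unsigned_fits = worst_unsigned_sum <= 2**ACC_BITS - 1

print(signed_fits, unsigned_fits)
```

Under either reading the worst-case total fits, which is consistent with 32 product bits plus log2(512) = 9 extra bits giving 41.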

Although FIG. 2 shows only a multiplier 242 and an adder 244 in the arithmetic logic unit 204, preferably the arithmetic logic unit 204 includes other elements to perform the other operations described above. For example, the arithmetic logic unit 204 may include a comparator (not shown) that compares the accumulator 202 with a data/weight word, and a multiplexer (not shown) that selects the larger (maximum) of the two values, as indicated by the comparator, for storage in the accumulator 202. As another example, the arithmetic logic unit 204 includes selection logic (not shown) that bypasses the multiplier 242 with a data/weight word, so that the adder 244 adds the data/weight word to the accumulator 202 value 217 to generate a sum for storage in the accumulator 202. These additional operations are described in more detail below, e.g., with respect to FIGS. 18 through 29A, and are useful for performing operations such as convolution and pooling.

The activation function unit 212 receives the output 217 of the accumulator 202. The activation function unit 212 performs an activation function on the accumulator 202 output to generate the result 133 of FIG. 1. Generally speaking, the activation function in a neuron of an intermediate layer of an artificial neural network serves to normalize the accumulated sum of products, and may do so in a non-linear fashion. To "normalize" the accumulated sum, the activation function of the current neuron produces a resulting value within the range of values that the neurons connected to the current neuron expect to receive as input. (The normalized result is sometimes referred to as an "activation"; as used herein, the activation is the output of the current node, which a receiving node multiplies by a weight associated with the connection between the outputting node and the receiving node to generate a product, and that product is accumulated with the products associated with the other input connections to the receiving node.) For example, where the receiving/connected neurons expect to receive as input a value between 0 and 1, the outputting neuron may need to non-linearly squash and/or adjust (e.g., shift upward to transform negative values into positive values) an accumulated sum that lies outside the range of 0 to 1 so that it falls within the expected range. Thus, the operation that the activation function unit 212 performs on the accumulator 202 value 217 brings the result 133 within a known range. The results 133 of all N neural processing units 126 may be written back concurrently to either the data random access memory 122 or the weight random access memory 124. Preferably, the activation function unit 212 is configured to perform multiple activation functions, and an input, e.g., from the control register 127, selects one of them to perform on the accumulator 202 output 217. The activation functions may include, but are not limited to, a step function, a rectify function, a sigmoid function, a hyperbolic tangent function and a softplus function (also referred to as smooth rectify). The softplus function is the analytic function f(x) = ln(1 + e^x), i.e., the natural logarithm of the sum of one and e^x, where "e" is Euler's number and x is the input 217 to the function. Preferably, the activation functions may also include a pass-through function that passes through the accumulator 202 value 217, or a portion thereof, as described in more detail below. In one embodiment, circuitry of the activation function unit 212 performs the activation function in a single clock cycle. In one embodiment, the activation function unit 212 comprises tables that receive the accumulated value and output a value that, for some of the activation functions, e.g., sigmoid, hyperbolic tangent and softplus, closely approximates the value that the true activation function would provide.
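The activation functions named above, together with a table-based approximation, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: floating-point arithmetic replaces the fixed-point hardware, the step threshold at zero is an assumed convention, and the table range, table size and nearest-entry lookup are hypothetical choices rather than details from this description.

```python
import math

def step(x):
    return 1.0 if x >= 0 else 0.0        # threshold at zero is an assumed convention

def rectify(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))    # fine for moderate |x|; exp overflows for huge inputs

def softplus(x):
    return math.log1p(math.exp(x))       # f(x) = ln(1 + e^x)

# A selector value (playing the role of the control input) picks one function.
ACTIVATIONS = {
    "step": step,
    "rectify": rectify,
    "sigmoid": sigmoid,
    "tanh": math.tanh,
    "softplus": softplus,
    "pass": lambda x: x,
}

def build_table(fn, lo, hi, entries):
    """Precompute fn at evenly spaced points in [lo, hi]."""
    spacing = (hi - lo) / (entries - 1)
    return [fn(lo + i * spacing) for i in range(entries)]

def table_lookup(table, lo, hi, x):
    """Approximate fn(x) by the nearest precomputed entry, clamping x to [lo, hi]."""
    x = min(max(x, lo), hi)
    index = round((x - lo) / (hi - lo) * (len(table) - 1))
    return table[index]
```

For example, `table_lookup(build_table(sigmoid, -8.0, 8.0, 1025), -8.0, 8.0, 0.3)` returns a value close to `sigmoid(0.3)`; a denser table trades memory for accuracy.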

Preferably, the width (in bits) of the accumulator 202 is greater than the width of the output 133 of the activation function unit 212. For example, in one embodiment, the accumulator is 41 bits wide, to avoid loss of precision when accumulating up to 512 32-bit products (as described in more detail below, e.g., with respect to FIG. 30), and the result 133 is 16 bits wide. In one embodiment, during subsequent clock cycles, the activation function unit 212 passes through other, unprocessed portions of the accumulator 202 output 217, and these portions are written back to the data random access memory 122 or the weight random access memory 124, as described in more detail below with respect to FIG. 8. This allows the unprocessed accumulator 202 values to be loaded back into the media registers 118 via the MFNN instruction, so that instructions executing on the other execution units 112 of the processor 100 can perform complex activation functions that the activation function unit 212 cannot perform, such as the well-known softmax function, also referred to as the normalized exponential function. In one embodiment, the instruction set architecture of the processor 100 includes an instruction that performs the exponential function, commonly written e^x or exp(x), which the other execution units 112 of the processor 100 may use to speed up the softmax activation function.
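For reference, the softmax (normalized exponential) function mentioned above can be written as the following Python sketch. It assumes the raw accumulator values have already been read back and converted to real numbers; subtracting the maximum before exponentiating is a common numerical-stability step and is not taken from this description.

```python
import math

def softmax(values):
    """Return exp(v_i) / sum_j exp(v_j) for each v_i in values."""
    peak = max(values)                           # shift so the largest exponent is 0
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are positive and sum to 1, and each call to `math.exp` corresponds to one use of an exponential instruction of the kind described.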

In one embodiment, the neural processing unit 126 is pipelined. For example, the neural processing unit 126 may include registers within the arithmetic logic unit 204, such as a register between the multiplier and the adder and/or between other circuits of the arithmetic logic unit 204, and may also include a register that holds the output of the activation function unit 212. Other embodiments of the neural processing unit 126 are described below.

FIG. 3 is a block diagram illustrating the N multiplex registers 208 of the N neural processing units 126 of the neural network unit 121 of FIG. 1 operating as an N-word rotator, or circular shifter, on a row of data words 207 received from the data random access memory 122 of FIG. 1. In the embodiment of FIG. 3, N is 512; hence, the neural network unit 121 has 512 multiplex registers 208, denoted 0 through 511, corresponding to the 512 neural processing units 126. Each multiplex register 208 receives its corresponding data word 207 from one of the D rows of the data random access memory 122. That is, multiplex register 0 receives data word 0 of the data random access memory 122 row, multiplex register 1 receives data word 1, multiplex register 2 receives data word 2, and so forth, up to multiplex register 511, which receives data word 511. Additionally, multiplex register 1 receives the output 209 of multiplex register 0 on its other input 211, multiplex register 2 receives the output 209 of multiplex register 1 on its other input 211, multiplex register 3 receives the output 209 of multiplex register 2 on its other input 211, and so forth, up to multiplex register 511, which receives the output 209 of multiplex register 510 on its other input 211; multiplex register 0 receives the output 209 of multiplex register 511 on its other input 211. Each multiplex register 208 receives a control input 213 that controls whether it selects the data word 207 or the rotated input 211. In one mode of operation, during a first clock cycle the control input 213 controls each multiplex register 208 to select the data word 207 for storage in the register and subsequent provision to the arithmetic logic unit 204, and during subsequent clock cycles (e.g., the M-1 clock cycles described above) the control input 213 controls each multiplex register 208 to select the rotated input 211 for storage in the register and subsequent provision to the arithmetic logic unit 204.
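The load-then-rotate behavior described above can be modeled with a few lines of Python. This is a behavioral sketch only; the function name and the boolean standing in for the control input 213 are hypothetical.

```python
def mux_reg_step(regs, data_row, select_rotate):
    """Advance all N multiplex registers by one clock cycle.

    regs          -- current register contents, one entry per neural processing unit
    data_row      -- row of data words from the data memory (ignored when rotating)
    select_rotate -- False: latch the data word; True: latch the neighbor's output
    """
    n = len(regs)
    if not select_rotate:
        return list(data_row)                     # register j latches data word j
    # Register j latches the output of register j-1; register 0 wraps to register n-1.
    return [regs[(j - 1) % n] for j in range(n)]
```

Loading `[10, 11, 12, 13]` and then rotating once yields `[13, 10, 11, 12]`; after N rotations every word has visited every position and the row is back where it started.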

Although in the embodiment described with respect to FIG. 3 (and FIGS. 7 and 19 below) the neural processing units 126 are configured to rotate the values of the multiplex registers 208/705 to the right, i.e., from neural processing unit J to neural processing unit J+1, the invention is not limited in this respect. In other embodiments (e.g., the embodiments of FIGS. 24 through 26), the neural processing units 126 are configured to rotate the values of the multiplex registers 208/705 to the left, i.e., from neural processing unit J to neural processing unit J-1. Furthermore, in other embodiments of the invention, the neural processing units 126 may rotate the values of the multiplex registers 208/705 selectively to the left or to the right, e.g., as specified by the neural network unit instructions.

FIG. 4 is a table illustrating a program stored in the program memory 129 of the neural network unit 121 of FIG. 1 and executed by that neural network unit 121. As described above, the example program performs the calculations associated with one layer of an artificial neural network. The table of FIG. 4 has five rows (addresses 0 through 4) and three columns. Each row corresponds to an address of the program memory 129, which is denoted in the first column. The second column specifies the instruction, and the third column indicates the number of clock cycles associated with the instruction. Preferably, the number of clock cycles indicates the effective number of clock cycles in a clocks-per-instruction sense in a pipelined embodiment, rather than the latency of the instruction. As shown, because of the pipelined nature of the neural network unit 121, each instruction has one associated clock cycle, with the exception of the instruction at address 2, which effectively repeats itself 511 times and therefore requires 511 clock cycles, as described in more detail below.
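The program table can be pictured as a small data structure, as in the Python sketch below. The instruction strings are informal labels based on the per-address descriptions in this section, not the actual mnemonics of FIG. 4, and the clock counts are the effective counts stated in the text.

```python
# (address, informal instruction label, effective clock cycles)
PROGRAM = [
    (0, "initialize neural processing unit", 1),
    (1, "multiply-accumulate", 1),
    (2, "multiply-accumulate rotate, count = 511", 511),
    (3, "activation function", 1),
    (4, "write activation function unit output", 1),
]

# Effective clock cycles for one pass through the program.
total_clocks = sum(clocks for _address, _label, clocks in PROGRAM)
print(total_clocks)
```

Nearly all of the effective clock cycles are spent in the repeated instruction at address 2; the remaining instructions contribute one clock cycle each.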

All of the neural processing units 126 process each instruction of the program in parallel. That is, all N neural processing units 126 execute the instruction in the first row during the same clock cycle(s), all N neural processing units 126 execute the instruction in the second row during the same clock cycle(s), and so forth. However, the invention is not limited in this respect; in other embodiments described below, some instructions are executed in a partially parallel and partially sequential fashion. For example, the activation function and output instructions at addresses 3 and 4 are executed in this fashion in an embodiment in which multiple neural processing units 126 share an activation function unit, such as the embodiment of FIG. 11. The example of FIG. 4 assumes a layer of 512 neurons (neural processing units 126), each having 512 connection inputs from the 512 neurons of the previous layer, for a total of 256K connections. Each neuron receives a 16-bit data value from each connection input and multiplies that 16-bit data value by an appropriate 16-bit weight value.

The first row, at address 0 (although other addresses may be specified), specifies an initialize neural processing unit instruction. The initialize instruction clears the accumulator 202 value to zero. In one embodiment, the initialize instruction can instead load the accumulator 202 with the corresponding word of a row of the data random access memory 122 or the weight random access memory 124 specified by the instruction. The initialize instruction also loads configuration values into the control register 127, as described in more detail below with respect to FIGS. 29A and 29B. For example, the widths of the data word 207 and the weight word 206 may be loaded; the arithmetic logic unit 204 uses these widths to determine the sizes of the operations performed by its circuits, and they may also affect the result 215 stored in the accumulator 202. In one embodiment, the neural processing unit 126 includes a circuit that saturates the arithmetic logic unit 204 output 215 before it is stored in the accumulator 202, and the initialize instruction loads into that circuit a configuration value that affects the saturation. In one embodiment, the accumulator 202 may also be cleared to zero by so specifying in an arithmetic logic unit function instruction (e.g., the multiply-accumulate instruction at address 1) or in an output instruction (e.g., the write activation function unit output instruction at address 4).

The second row, at address 1, specifies a multiply-accumulate instruction that instructs the 512 neural processing units 126 to load a respective data word from a row of the data random access memory 122 and a respective weight word from a row of the weight random access memory 124, and to perform a first multiply-accumulate operation on the data word input 207 and the weight word input 206, the product being accumulated with the initialized zero value of the accumulator 202. More specifically, the instruction instructs the sequencer 128 to generate a value on the control input 213 that selects the data word input 207. In the example of FIG. 4, the specified row of the data random access memory 122 is row 17 and the specified row of the weight random access memory 124 is row 0, so the sequencer is instructed to output the value 17 as the data random access memory address 123 and the value 0 as the weight random access memory address 125. Consequently, the 512 data words from row 17 of the data random access memory 122 are provided to the corresponding data inputs 207 of the 512 neural processing units 126, and the 512 weight words from row 0 of the weight random access memory 124 are provided to the corresponding weight inputs 206 of the 512 neural processing units 126.

The third row, at address 2, specifies a multiply-accumulate rotate instruction with a count of 511, which instructs each of the 512 neural processing units 126 to perform 511 multiply-accumulate operations. The instruction instructs the 512 neural processing units 126 that, for each of the 511 multiply-accumulate operations, the data word 209 input to the arithmetic logic unit 204 is to be the rotated value 211 from the adjacent neural processing unit 126. That is, the instruction instructs the sequencer 128 to generate a value on the control input 213 that selects the rotated value 211. Additionally, the instruction instructs the 512 neural processing units 126 to load the respective weight word for each of the 511 multiply-accumulate operations from the "next" row of the weight random access memory 124. That is, the instruction instructs the sequencer 128 to increment the weight random access memory address 125 by one relative to its value in the previous clock cycle; in the example, this is row 1 on the first clock cycle of the instruction, row 2 on the next clock cycle, row 3 on the clock cycle after that, and so forth, up to row 511 on the 511th clock cycle. For each of the 511 multiply-accumulate operations, the product of the rotated input 211 and the weight word input 206 is accumulated with the previous value of the accumulator 202. The 512 neural processing units 126 perform the 511 multiply-accumulate operations in 511 clock cycles. In each of them, each neural processing unit 126 operates on a different data word from row 17 of the data random access memory 122, namely the data word that the adjacent neural processing unit 126 operated on in the previous clock cycle, and on a different weight word associated with that data word; conceptually, each such pair is a different connection input to the neuron. The example assumes that each neural processing unit 126 (neuron) has 512 connection inputs, thus involving the processing of 512 data words and 512 weight words. Once the last iteration of the multiply-accumulate rotate instruction at address 2 has been performed, the accumulator 202 holds the sum of the products for all 512 connection inputs. In one embodiment, rather than having a separate instruction for each type of arithmetic logic unit operation (e.g., multiply-accumulate, maximum of accumulator and weight word, and so forth, as described above), the instruction set of the neural processing unit 126 includes an "execute" instruction that instructs the arithmetic logic unit 204 to perform an arithmetic logic unit operation specified by the initialize neural processing unit instruction, such as the one specified in the arithmetic logic unit function 2926 of FIG. 29A.
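The combined effect of the multiply-accumulate instruction and the repeated multiply-accumulate rotate instruction can be checked with a small behavioral model. The Python sketch below is an illustration, not the hardware: it uses unbounded integers, a small N in place of 512, and hypothetical function names. It also assumes the rotation direction described for FIG. 3, in which values move from unit J to unit J+1.

```python
def run_layer(data_row, weight_rows):
    """Model N units performing one multiply-accumulate followed by N-1
    multiply-accumulate rotate steps.

    data_row    -- the N data words of one data memory row
    weight_rows -- N rows of N weight words; weight_rows[r][j] is word j of row r
    """
    n = len(data_row)
    acc = [0] * n                       # initialize: clear every accumulator
    regs = list(data_row)               # first step: registers latch the data row
    for r in range(n):                  # r also serves as the weight row index
        if r > 0:
            # Rotate: unit j takes the word held by unit j-1 (with wrap-around).
            regs = [regs[(j - 1) % n] for j in range(n)]
        for j in range(n):              # all units operate in the same clock cycle
            acc[j] += regs[j] * weight_rows[r][j]
    return acc

def reference(data_row, weight_rows):
    """Direct formula: unit j sums data[(j - r) mod N] * weight_rows[r][j] over r."""
    n = len(data_row)
    return [sum(data_row[(j - r) % n] * weight_rows[r][j] for r in range(n))
            for j in range(n)]
```

With all weights equal to 1, every accumulator ends up holding the sum of the data row, which confirms that each unit sees every data word exactly once. The direct formula also shows how the weights must be laid out: the weight for the connection from data word (j - r) mod N to neuron j sits in word j of weight row r.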

The fourth row, at address 3, specifies an activation function instruction. The activation function instruction instructs the activation function unit 212 to perform the specified activation function on the accumulator 202 value to generate the result 133. Embodiments of the activation functions are described in more detail below.

The fifth row, at address 4, specifies a write activation function unit output instruction that instructs the 512 neural processing units 126 to write back their activation function unit 212 outputs as results 133 to a row of the data random access memory 122, which is row 16 in the example. That is, the instruction instructs the sequencer 128 to output the value 16 as the data random access memory address 123 together with a write command (as opposed to the read command specified by the multiply-accumulate instruction at address 1). Preferably, because of the pipelined nature of execution, the write activation function unit output instruction may be overlapped with the execution of other instructions, so that it effectively executes in a single clock cycle.

Preferably, each neural processing unit 126 is configured as a pipeline that includes the various functional elements, e.g., the multiplex register 208 (and the multiplex register 705 of FIG. 7), the arithmetic logic unit 204, the accumulator 202, the activation function unit 212, the multiplexer 802 (see FIG. 8), the row buffer 1104 and the activation function units 1112 (see FIG. 11), some of which may themselves be pipelined. In addition to the data words 207 and the weight words 206, the pipeline receives the instructions from the program memory 129. The instructions flow down the pipeline and control the various functional units. In an alternate embodiment, the program does not include the activation function instruction. Instead, the initialize neural processing unit instruction specifies the activation function to be performed on the accumulator 202 value 217, and a value indicating the specified activation function is stored in a configuration register for later use by the activation function unit 212 portion of the pipeline once the final accumulator 202 value 217 has been generated, i.e., once the last iteration of the multiply-accumulate rotate instruction at address 2 has completed. Preferably, to save power, the activation function unit 212 portion of the pipeline remains inactive until the write activation function unit output instruction reaches it, at which time the activation function unit 212 is powered up and performs, on the accumulator 202 output 217, the activation function specified by the initialize instruction.

FIG. 5 is a timing diagram illustrating the execution of the program of FIG. 4 by the neural network unit 121. Each row of the timing diagram corresponds to a successive clock cycle, which is indicated in the first column. Each of the other columns corresponds to a different one of the 512 neural processing units 126 and indicates its operation. To simplify the illustration, only the operations of neural processing units 0, 1 and 511 are shown.

At clock cycle 0, each of the 512 neural processing units 126 performs the initialize instruction of FIG. 4, which is shown in FIG. 5 as the assignment of a zero value to the accumulator 202.

At clock cycle 1, each of the 512 neural processing units 126 performs the multiply-accumulate instruction at address 1 of FIG. 4. As shown, neural processing unit 0 adds to the accumulator 202 value (which is zero) the product of word 0 of row 17 of the data random access memory 122 and word 0 of row 0 of the weight random access memory 124; neural processing unit 1 adds to the accumulator 202 value (which is zero) the product of word 1 of row 17 of the data random access memory 122 and word 1 of row 0 of the weight random access memory 124; and so forth, up to neural processing unit 511, which adds to the accumulator 202 value (which is zero) the product of word 511 of row 17 of the data random access memory 122 and word 511 of row 0 of the weight random access memory 124.

At clock cycle 2, each of the 512 neural processing units 126 performs the first iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 4. As shown, neural processing unit 0 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplex register 208 output 209 of neural processing unit 511 (which is data word 511 received from the data random access memory 122) and word 0 of row 1 of the weight random access memory 124; neural processing unit 1 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplex register 208 output 209 of neural processing unit 0 (which is data word 0 received from the data random access memory 122) and word 1 of row 1 of the weight random access memory 124; and so forth, up to neural processing unit 511, which adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplex register 208 output 209 of neural processing unit 510 (which is data word 510 received from the data random access memory 122) and word 511 of row 1 of the weight random access memory 124.

At clock cycle 3, each of the 512 neural processing units 126 performs the second iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 4. As shown, neural processing unit 0 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplex register 208 output 209 of neural processing unit 511 (which is data word 510 received from the data random access memory 122) and word 0 of row 2 of the weight random access memory 124; neural processing unit 1 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplex register 208 output 209 of neural processing unit 0 (which is data word 511 received from the data random access memory 122) and word 1 of row 2 of the weight random access memory 124; and so forth, up to neural processing unit 511, which adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplex register 208 output 209 of neural processing unit 510 (which is data word 509 received from the data random access memory 122) and word 511 of row 2 of the weight random access memory 124. As indicated by the ellipsis in FIG. 5, this continues for each of the following 509 clock cycles, through clock cycle 512.

In clock cycle 512, each of the 512 neural processing units (NPUs) 126 performs the 511th iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 4. As shown, NPU 0 adds to its accumulator 202 value the product of the rotated data word 211 received from the output 209 of the multiplexed register (mux-reg) 208 of NPU 511 (which is data word 1 received from the data RAM 122) and word 0 of row 511 of the weight RAM 124; NPU 1 adds to its accumulator 202 value the product of the rotated data word 211 received from the output 209 of the mux-reg 208 of NPU 0 (which is data word 2 received from the data RAM 122) and word 1 of row 511 of the weight RAM 124; and so forth, until NPU 511 adds to its accumulator 202 value the product of the rotated data word 211 received from the output 209 of the mux-reg 208 of NPU 510 (which is data word 0 received from the data RAM 122) and word 511 of row 511 of the weight RAM 124. In one embodiment, multiple clock cycles are required to read the data words and weight words from the data RAM 122 and weight RAM 124 in order to perform the multiply-accumulate instruction at address 1 of FIG. 4; however, the data RAM 122, the weight RAM 124 and the NPUs 126 are pipelined, so that once the first multiply-accumulate operation has begun (as shown in clock cycle 1 of FIG. 5), the subsequent multiply-accumulate operations (as shown in clock cycles 2 through 512 of FIG. 5) begin in successive clock cycles. Preferably, the NPUs 126 briefly stall in response to an access of the data RAM 122 and/or weight RAM 124 by an architectural instruction, such as an MTNN or MFNN instruction (described below with respect to FIGS. 14 and 15), or by a microinstruction into which the architectural instruction is translated.
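By way of illustration only, the rotation pattern walked through above for clock cycles 2, 3 and 512 can be condensed into two index formulas. The following Python sketch is not part of the disclosed apparatus; the function names `data_word_index` and `weight_row` are hypothetical, and the formulas are inferred from the stated examples, assuming that in clock cycle 1 each NPU operates on its own data word and on weight RAM row 0.

```python
N = 512  # number of NPUs, and of words per RAM row, in the example


def data_word_index(npu: int, clock: int, n: int = N) -> int:
    """Index of the data word that NPU `npu` multiplies during `clock` (1..n).

    Assumption: at clock 1 each NPU uses its own word; every later clock
    rotates the row by one position toward the next-higher NPU.
    """
    return (npu - (clock - 1)) % n


def weight_row(clock: int) -> int:
    """Weight RAM row read during `clock` (1..n): rows 0..n-1 in order."""
    return clock - 1


# Values stated in the walkthrough above.
assert data_word_index(0, 2) == 511 and data_word_index(1, 2) == 0
assert data_word_index(511, 3) == 509 and weight_row(3) == 2
assert data_word_index(0, 512) == 1 and weight_row(512) == 511
```

Under these assumptions every NPU sees each of the 512 data words exactly once over clock cycles 1 through 512.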

In clock cycle 513, the activation function unit (AFU) 212 of each of the 512 neural processing units (NPUs) 126 performs the activation function instruction at address 3 of FIG. 4. Finally, in clock cycle 514, each of the 512 NPUs 126 performs the write AFU output instruction at address 4 of FIG. 4 by writing its result 133 back to its corresponding word of row 16 of the data RAM 122; that is, the result 133 of NPU 0 is written to word 0 of the data RAM 122, the result 133 of NPU 1 is written to word 1 of the data RAM 122, and so forth, until the result 133 of NPU 511 is written to word 511 of the data RAM 122. A block diagram corresponding to the operation described above with respect to FIG. 5 is shown in FIG. 6A.

FIG. 6A is a block diagram illustrating the neural network unit (NNU) 121 of FIG. 1 executing the program of FIG. 4. The NNU 121 includes the 512 neural processing units (NPUs) 126, the data RAM 122, which receives its address input 123, and the weight RAM 124, which receives its address input 125. In clock cycle 0, the 512 NPUs 126 perform the initialization instruction (not shown). As shown, in clock cycle 1, the 512 16-bit data words of row 17 are read out of the data RAM 122 and provided to the 512 NPUs 126. During clock cycles 1 through 512, the 512 16-bit weight words of rows 0 through 511, respectively, are read out of the weight RAM 124 and provided to the 512 NPUs 126. In clock cycle 1, the 512 NPUs 126 perform their respective multiply-accumulate operations on the loaded data words and weight words (not shown). During clock cycles 2 through 512, the multiplexed registers (mux-regs) 208 of the 512 NPUs 126 operate as a rotator of 512 16-bit words, rotating the data words previously loaded from row 17 of the data RAM 122 to the adjacent NPU 126, and the NPUs 126 perform the multiply-accumulate operation on the respective rotated data word and the respective weight word loaded from the weight RAM 124. In clock cycle 513, the 512 activation function units (AFUs) 212 perform the activation instruction (not shown). In clock cycle 514, the 512 NPUs 126 write their respective 512 16-bit results 133 back to row 16 of the data RAM 122.
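For illustration, the sequence of FIG. 6A can be modeled in software as follows. This is a minimal behavioral sketch under stated assumptions, not the hardware design: the function name `run_layer` is hypothetical, the initialization instruction is assumed to clear the accumulators, ordinary Python integers stand in for 16-bit words, and the activation function is supplied by the caller.

```python
def run_layer(data_row, weight_rows, activation):
    """Model N NPUs computing one fully connected layer with a rotator.

    data_row:    N input words (the data RAM row loaded at clock 1).
    weight_rows: N rows of N weight words (weight RAM rows 0..N-1).
    activation:  function applied to each accumulator at the end.
    """
    n = len(data_row)
    acc = [0] * n                # clock 0: initialize (assumed: clear) accumulators
    mux_reg = list(data_row)     # clock 1: each mux-reg loads its data word
    for row in weight_rows:      # clocks 1..N: one weight row per clock
        for j in range(n):
            acc[j] += mux_reg[j] * row[j]
        # Rotate: NPU j takes the word held by NPU j-1 (NPU 0 from NPU N-1).
        mux_reg = [mux_reg[-1]] + mux_reg[:-1]
    return [activation(a) for a in acc]   # clocks N+1, N+2: activate, write back


def relu(x):
    """Stand-in activation, chosen only for illustration."""
    return max(x, 0)


print(run_layer([1, 2, 3], [[1, 1, 1], [1, 0, -1], [2, 2, 2]], relu))
# prints [8, 8, 3]
```

Note that in this layout word j of weight row r is the weight that NPU j applies to data word (j - r) mod N, so the weights must be stored in the weight RAM in that rotated order.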

As may be observed, the number of clock cycles required to generate the result words (neuron outputs) and write them back to the data RAM 122 or weight RAM 124 is approximately the square root of the number of data inputs (connections) received by the current layer of the neural network. For example, if the current layer has 512 neurons and each neuron has 512 connections from the previous layer, the total number of connections is 256K, and the number of clock cycles required to generate the results for the current layer is slightly more than 512. Thus, the neural network unit 121 provides extremely high performance for neural network computations.
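The arithmetic behind this observation can be checked with a short sketch. The helper name `layer_clock_cycles` and the `overhead` parameter are hypothetical; the value of three overhead clocks is taken from the timing described above (clock cycle 0 for initialization, 513 for the activation function and 514 for the write-back).

```python
import math


def layer_clock_cycles(connections_per_neuron: int, overhead: int = 3) -> int:
    """Clocks for one layer: one multiply-accumulate clock per connection,
    plus the initialize, activation and write-back clocks."""
    return connections_per_neuron + overhead


neurons = connections = 512
total_connections = neurons * connections        # 262,144 = 256K
assert total_connections == 256 * 1024
assert math.isqrt(total_connections) == 512      # square root of the connection count
assert layer_clock_cycles(connections) == 515    # clock cycles 0 through 514
```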

FIG. 6B is a flowchart illustrating operation of the processor 100 of FIG. 1 to perform an architectural program that uses the neural network unit (NNU) 121 to perform the multiply-accumulate-activation function computations classically associated with neurons of hidden layers of an artificial neural network, such as those performed by the program of FIG. 4. The example of FIG. 6B assumes four hidden layers (denoted by the variable NUM_LAYERS initialized at step 602), each having 512 neurons, each of which is fully connected to the 512 neurons of the previous layer (by use of the program of FIG. 4). However, it should be understood that these numbers of layers and neurons are chosen for illustration; the NNU 121 may be employed to perform similar computations for embodiments with different numbers of hidden layers, different numbers of neurons per layer, or neurons that are not fully connected. In one embodiment, the weight values are set to zero for neurons that do not exist in a layer or for connections to a neuron that do not exist. Preferably, the architectural program writes a first set of weights to the weight RAM 124 and starts the NNU 121, and while the NNU 121 is performing the computations associated with the first layer, the architectural program writes a second set of weights to the weight RAM 124, so that as soon as the NNU 121 completes the computations for the first hidden layer, it can start the computations for the second layer. In this manner, the architectural program alternates between two regions of the weight RAM 124 in order to keep the NNU 121 fully utilized. Flow begins at step 602.

At step 602, the processor 100, i.e., the architectural program running on the processor 100, writes the input values for the current hidden layer of neurons to the data RAM 122, e.g., into row 17 of the data RAM 122, as described with respect to FIG. 6A. Alternatively, the values may already be in row 17 of the data RAM 122 as results 133 of the operation of the neural network unit (NNU) 121 for a previous layer (e.g., a convolution, pooling or input layer). Additionally, the architectural program initializes a variable N to a value of 1. The variable N denotes the current hidden layer being processed by the NNU 121. Additionally, the architectural program initializes a variable NUM_LAYERS to a value of 4, since there are four hidden layers in the example. Flow proceeds to step 604.

At step 604, the processor 100 writes the weight words for layer 1 to the weight RAM 124, e.g., to rows 0 through 511, as shown in FIG. 6A. Flow proceeds to step 606.

At step 606, the processor 100 writes a multiply-accumulate-activation function program (e.g., that of FIG. 4) to the program memory 129 of the neural network unit (NNU) 121, using MTNN instructions 1400 that specify a function 1432 to write the program memory 129. The processor 100 then starts the NNU program using an MTNN instruction 1400 that specifies a function 1432 to start execution of the program. Flow proceeds to step 608.

At decision step 608, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to step 612; otherwise, flow proceeds to step 614.

At step 612, the processor 100 writes the weight words for layer N+1 to the weight RAM 124, e.g., to rows 512 through 1023. Thus, the architectural program writes the weight words for the next layer to the weight RAM 124 while the neural network unit (NNU) 121 is performing the hidden layer computations for the current layer, so that the NNU 121 can immediately start the hidden layer computations for the next layer once the computations for the current layer are complete, i.e., written to the data RAM 122. Flow proceeds to step 614.

At step 614, the processor 100 determines whether the currently running neural network unit (NNU) program (started at step 606 in the case of layer 1, and at step 618 in the case of layers 2 through 4) has completed. Preferably, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the NNU 121. In an alternate embodiment, the NNU 121 generates an interrupt to indicate that it has completed the multiply-accumulate-activation function layer program. Flow proceeds to decision step 616.

At decision step 616, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to step 618; otherwise, flow proceeds to step 622.

At step 618, the processor 100 updates the multiply-accumulate-activation function program so that it can perform the hidden layer computations for layer N+1. More specifically, the processor 100 updates the data RAM 122 row value of the multiply-accumulate instruction at address 1 of FIG. 4 to the row of the data RAM 122 to which the previous layer wrote its results (e.g., to row 16) and also updates the output row (e.g., to row 15). The processor 100 then starts the updated neural network unit (NNU) program. In an alternate embodiment, the program of FIG. 4 specifies, in the output instruction at address 4, the same row as the row specified in the multiply-accumulate instruction at address 1 (i.e., the row read from the data RAM 122). In this embodiment, the current row of input data words is overwritten, which is acceptable as long as the row of data words is not needed for some other purpose, because the row has already been read into the multiplexed registers (mux-regs) 208 and is being rotated among the NPUs 126 via the N-word rotator. In this case, the NNU program need not be updated at step 618, but only restarted. Flow proceeds to step 622.

At step 622, the processor 100 reads the results of the neural network unit (NNU) program for layer N from the data RAM 122. However, if the results are simply to be used by the next layer, the architectural program need not read them from the data RAM 122, but may instead leave them in the data RAM 122 for the next hidden layer computations. Flow proceeds to decision step 624.

At decision step 624, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to step 626; otherwise, flow ends.

At step 626, the architectural program increments N by one. Flow returns to decision step 608.
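The flow of steps 602 through 626 can be summarized by the following Python sketch, offered only as an illustration. The class `FakeNNU` and its methods are hypothetical stand-ins that merely log the requests an architectural program would issue through MTNN and MFNN instructions; the `region` numbering (0 for rows 0 through 511, 1 for rows 512 through 1023) is likewise an assumption used to show the alternation between two weight RAM regions.

```python
NUM_LAYERS = 4


class FakeNNU:
    """Stand-in for the NNU 121 that only records what the program asks of it."""

    def __init__(self):
        self.log = []

    def write_weights(self, layer, region):   # writes to the weight RAM 124
        self.log.append(f"weights L{layer} -> region {region}")

    def start(self, layer):                   # starts the NNU program
        self.log.append(f"start L{layer}")

    def wait_done(self, layer):               # e.g. poll the status register
        self.log.append(f"done L{layer}")

    def read_results(self, layer):            # optional read of the data RAM 122
        self.log.append(f"read L{layer}")


def run_hidden_layers(nnu, num_layers=NUM_LAYERS):
    n = 1                                     # step 602
    nnu.write_weights(1, region=0)            # step 604
    nnu.start(1)                              # step 606
    while True:
        if n < num_layers:                    # step 608
            # Step 612: load the next layer's weights while layer n computes.
            nnu.write_weights(n + 1, region=n % 2)
        nnu.wait_done(n)                      # step 614
        if n < num_layers:                    # step 616
            nnu.start(n + 1)                  # step 618: update and restart
        nnu.read_results(n)                   # step 622
        if n < num_layers:                    # step 624
            n += 1                            # step 626
        else:
            return nnu.log


for entry in run_hidden_layers(FakeNNU()):
    print(entry)
```

In the resulting log, the weights for layer N+1 are always written before layer N is reported complete, which is the overlap that keeps the unit busy.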

As may be determined from the example of FIG. 6B, approximately every 512 clock cycles the neural processing units (NPUs) 126 read once from and write once to the data RAM 122 (by virtue of the operation of the NNU program of FIG. 4). Additionally, the NPUs 126 read the weight RAM 124 approximately every clock cycle to read a row of weight words. Thus, the entire bandwidth of the weight RAM 124 is consumed by the hybrid manner in which the neural network unit (NNU) 121 performs the hidden layer operation. Additionally, assuming an embodiment that includes a write and read buffer, such as the buffer 1704 of FIG. 17, the processor 100 writes to the weight RAM 124 concurrently with the NPU 126 reads, such that the buffer 1704 performs one write to the weight RAM 124 approximately every 16 clock cycles to write the weight words. Thus, in an embodiment in which the weight RAM 124 is single-ported (as described with respect to FIG. 17), approximately every 16 clock cycles the NPUs 126 must be briefly stalled from reading the weight RAM 124 so that the buffer 1704 can write to the weight RAM 124. However, in an embodiment in which the weight RAM 124 is dual-ported, the NPUs 126 need not be stalled.

FIG. 7 is a block diagram illustrating an alternate embodiment of the neural processing unit (NPU) 126 of FIG. 1. The NPU 126 of FIG. 7 is similar to the NPU 126 of FIG. 2. However, the NPU 126 of FIG. 7 additionally includes a second, two-input multiplexed register (mux-reg) 705. The mux-reg 705 selects one of its inputs 206 or 711 to store in its register and then provides it on its output 203 in a subsequent clock cycle. Input 206 receives the weight word from the weight RAM 124. The other input 711 receives the output 203 of the second mux-reg 705 of the adjacent NPU 126. Preferably, the input 711 of NPU J receives the output 203 of the mux-reg 705 of NPU J-1, and the output 203 of NPU J is provided to the input 711 of the mux-reg 705 of NPU J+1. In this manner, the mux-regs 705 of the N NPUs 126 collectively operate as an N-word rotator, similar to the manner described above with respect to FIG. 3, but for the weight words rather than the data words. A control input 213 controls which of the two inputs the mux-reg 705 selects to store in its register and subsequently provide on the output 203.

Using the multiplexed registers (mux-regs) 208 and/or mux-regs 705 (as well as the mux-regs of other embodiments, such as those of FIGS. 18 and 23) to effectively form a large rotator that rotates a row of data/weights received from the data RAM 122 and/or weight RAM 124 has the advantage that the neural network unit 121 does not require the extremely large multiplexer that would otherwise be needed between the data RAM 122 and/or weight RAM 124 in order to provide the necessary data/weight words to the appropriate neural network unit.
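The two-input multiplexed register and the rotator it forms with its neighbors can be illustrated with a small behavioral model. The names `MuxReg` and `step` below are hypothetical and the sketch ignores word widths; it only shows how a per-register select between a memory word and a neighbor's output yields an N-word rotation.

```python
class MuxReg:
    """Two-input multiplexed register: latch either the memory word or the
    neighbouring unit's output, depending on the control input."""

    def __init__(self):
        self.out = 0

    def clock(self, from_memory, from_neighbor, select_neighbor):
        self.out = from_neighbor if select_neighbor else from_memory


def step(regs, memory_row, rotate):
    """Advance all mux-regs by one clock; unit J's neighbour is unit J-1."""
    prev = [r.out for r in regs]          # sample outputs before any update
    for j, r in enumerate(regs):
        # prev[j - 1] wraps around for j == 0, closing the ring.
        r.clock(memory_row[j], prev[j - 1], rotate)


regs = [MuxReg() for _ in range(4)]
step(regs, [10, 11, 12, 13], rotate=False)   # load a row from memory
step(regs, [0, 0, 0, 0], rotate=True)        # rotate by one position
print([r.out for r in regs])                 # prints [13, 10, 11, 12]
```

Each register needs only a 2:1 selection, so the wiring cost grows linearly with N instead of requiring a multiplexer able to route any memory word to any unit.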

Writing Back Accumulator Values in Addition to Activation Function Results

In some applications, it is useful for the processor 100 to receive back (e.g., to the media registers 118 via the MFNN instruction of FIG. 15) the raw accumulator 202 value 217, upon which instructions executing on other execution units 112 can perform computations. For example, in one embodiment, in order to reduce the complexity of the activation function unit (AFU) 212, the AFU 212 is not configured to perform the softmax activation function. Consequently, the neural network unit 121 may output the raw accumulator 202 value 217, or a subset thereof, to the data RAM 122 or weight RAM 124, from which the architectural program subsequently reads the raw values and performs computations on them. However, use of the raw accumulator 202 value 217 is not limited to performance of softmax, and other uses are contemplated by the present invention.
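As an illustration of the kind of computation an architectural program might perform on raw accumulator values, the following Python sketch evaluates softmax in software. It is only a sketch under stated assumptions: the function name `softmax_from_accumulators` is hypothetical, and the fixed-point interpretation (`frac_bits` fractional bits) is an assumed format, not one specified in this section.

```python
import math


def softmax_from_accumulators(raw_values, frac_bits=8):
    """Softmax over raw accumulator values read back from the unit.

    frac_bits is an assumed fixed-point format (number of fractional bits);
    the actual format depends on how the hardware was configured.
    """
    xs = [v / (1 << frac_bits) for v in raw_values]   # fixed point -> float
    m = max(xs)                                       # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_from_accumulators([512, 256, 0])      # i.e. 2.0, 1.0, 0.0
assert abs(sum(probs) - 1.0) < 1e-9
assert probs[0] > probs[1] > probs[2]
```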

FIG. 8 is a block diagram illustrating a further alternate embodiment of the neural processing unit (NPU) 126 of FIG. 1. The NPU 126 of FIG. 8 is similar to the NPU 126 of FIG. 2. However, the NPU 126 of FIG. 8 includes a multiplexer (mux) 802 within the activation function unit (AFU) 212, and the AFU 212 has a control input 803. The width (in bits) of the accumulator 202 is greater than the width of a data word. The mux 802 has multiple inputs that receive data-word-width portions of the accumulator 202 output 217. In one embodiment, the width of the accumulator 202 is 41 bits and the NPU 126 is configured to output a result word 133 that is 16 bits wide; thus, for example, the mux 802 (or the mux 3032 and/or mux 3037 of FIG. 30) has three inputs that receive bits [15:0], bits [31:16], and bits [47:32] of the accumulator 202 output 217, respectively. Preferably, output bits not provided by the accumulator 202 (e.g., bits [47:41]) are forced to zero.
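The word selection described above can be expressed as simple bit arithmetic. The following Python sketch is illustrative only; the function name `accumulator_word` is hypothetical, and the accumulator is treated as a raw 41-bit pattern without regard to how its sign is represented.

```python
ACC_BITS = 41     # accumulator width in the embodiment described above
WORD_BITS = 16    # result word width


def accumulator_word(acc: int, select: int) -> int:
    """Return 16-bit slice `select` (0, 1 or 2) of a 41-bit accumulator value.

    `acc` is treated as a raw bit pattern. Bits above bit 40 do not exist
    in the accumulator, so they read as zero in slice 2.
    """
    acc &= (1 << ACC_BITS) - 1                 # keep only bits [40:0]
    return (acc >> (select * WORD_BITS)) & 0xFFFF


acc = 0x1FF_1234_ABCD                          # all nine upper bits set
assert accumulator_word(acc, 0) == 0xABCD      # bits [15:0]
assert accumulator_word(acc, 1) == 0x1234      # bits [31:16]
assert accumulator_word(acc, 2) == 0x01FF      # bits [40:32]; [47:41] are zero
```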

The sequencer 128 generates a value on the control input 803 to control the multiplexer (mux) 802 to select one of the words (e.g., 16 bits) of the accumulator 202 in response to a write accumulator instruction, such as the write accumulator instructions at addresses 3 through 5 of FIG. 9 described below. Preferably, the mux 802 also has one or more inputs that receive the outputs of activation function circuits (e.g., elements 3022, 3024, 3026, 3018, 3014 and 3016 of FIG. 30), which generate outputs that are the width of a data word. The sequencer 128 generates a value on the control input 803 to control the mux 802 to select one of the activation function circuit outputs, rather than one of the words of the accumulator 202, in response to an instruction such as the write activation function unit output instruction at address 4 of FIG. 4.

FIG. 9 is a table illustrating a program for storage in the program memory 129 of, and execution by, the neural network unit (NNU) 121 of FIG. 1. The example program of FIG. 9 is similar to the program of FIG. 4. Specifically, the instructions at addresses 0 through 2 are identical. However, the instructions at addresses 3 and 4 of FIG. 4 are replaced in FIG. 9 by write accumulator instructions that instruct the 512 neural processing units (NPUs) 126 to write back their accumulator 202 output 217 as results 133 to three rows of the data RAM 122, which are rows 16 through 18 in the example. That is, the write accumulator instruction instructs the sequencer 128 to output a data RAM address 123 value of 16 and a write command in a first clock cycle, a data RAM address 123 value of 17 and a write command in a second clock cycle, and a data RAM address 123 value of 18 and a write command in a third clock cycle. Preferably, the execution of the write accumulator instruction may be overlapped with the execution of other instructions, such that the write accumulator instruction effectively executes in these three clock cycles, one for each row written to the data RAM 122. In one embodiment, the user specifies the values of the activation function 2934 and output command 2956 fields of the control register 127 (of FIG. 29A) to write the desired portions of the accumulator 202 to the data RAM 122 or weight RAM 124. Alternatively, rather than writing back the entire contents of the accumulator 202, the write accumulator instruction may optionally write back a subset of the accumulator 202. In one embodiment, a canonical form of the accumulator 202 may be written back, as described in more detail below with respect to FIGS. 29 through 31.

FIG. 10 is a timing diagram illustrating the execution of the program of FIG. 9 by the neural network unit 121. The timing diagram of FIG. 10 is similar to the timing diagram of FIG. 5, and clock cycles 0 through 512 are the same. However, during clock cycles 513 through 515, the activation function unit (AFU) 212 of each of the 512 neural processing units (NPUs) 126 performs one of the write accumulator instructions at addresses 3 through 5 of FIG. 9. Specifically, in clock cycle 513, each of the 512 NPUs 126 writes back bits [15:0] of the accumulator 202 output 217 as its result 133 to its corresponding word of row 16 of the data RAM 122; in clock cycle 514, each of the 512 NPUs 126 writes back bits [31:16] of the accumulator 202 output 217 as its result 133 to its corresponding word of row 17 of the data RAM 122; and in clock cycle 515, each of the 512 NPUs 126 writes back bits [40:32] of the accumulator 202 output 217 as its result 133 to its corresponding word of row 18 of the data RAM 122. Preferably, bits [47:41] are forced to zero.
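The three-row write-back, and the way an architectural program could reassemble the values afterwards, can be sketched as follows. The function names `write_accumulators` and `read_accumulators` are hypothetical, a Python dictionary stands in for the data RAM, and the values are handled as raw 41-bit patterns; any sign interpretation is left to the reader because it is not specified in this passage.

```python
MASK41 = (1 << 41) - 1


def write_accumulators(accs, data_ram, first_row=16):
    """Model clock cycles 513-515: one data RAM row per 16-bit slice."""
    for k in range(3):                                   # rows 16, 17, 18
        data_ram[first_row + k] = [((a & MASK41) >> (16 * k)) & 0xFFFF
                                   for a in accs]


def read_accumulators(data_ram, first_row=16):
    """Rebuild the raw 41-bit values, as an architectural program might."""
    lo, mid, hi = (data_ram[first_row + k] for k in range(3))
    return [l | (m << 16) | (h << 32) for l, m, h in zip(lo, mid, hi)]


ram = {}
accs = [0, 1, 0x1FF_FFFF_FFFF, 0x123_4567_89AB]
write_accumulators(accs, ram)
assert ram[18][2] == 0x01FF          # only bits [40:32] populate row 18
assert read_accumulators(ram) == accs
```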

Shared Activation Function Units

FIG. 11 is a block diagram illustrating an embodiment of the neural network unit (NNU) 121 of FIG. 1. In the embodiment of FIG. 11, a neuron is split into two portions, the activation function unit portion and the arithmetic logic unit (ALU) portion (which also includes the shift register portion), and each activation function unit portion is shared by multiple ALU portions. In FIG. 11, the ALU portions are referred to as neural processing units (NPUs) 126 and the shared activation function unit portions are referred to as activation function units (AFUs) 1112. This is in contrast to the embodiment of FIG. 2, for example, in which each neuron includes its own AFU 212. Hence, in one example of the embodiment of FIG. 11, the NPUs 126 (ALU portions) include the accumulator 202, ALU 204, multiplexed register (mux-reg) 208 and register 205 of FIG. 2, but not the AFU 212. In the embodiment of FIG. 11, the NNU 121 includes 512 NPUs 126; however, the invention is not limited thereto. In the example of FIG. 11, the 512 NPUs 126 are grouped into 64 groups of eight NPUs 126 each, referred to as groups 0 through 63 in FIG. 11.

The neural network unit 121 also includes a row buffer 1104 and a plurality of shared activation function units 1112 coupled between the neural processing units 126 and the row buffer 1104. The row buffer 1104 has the same width (in bits) as a row of the data random access memory 122 or the weight random access memory 124, e.g., 512 words. There is one activation function unit 1112 per group of neural processing units 126, i.e., each activation function unit 1112 corresponds to one group of neural processing units 126; thus, in the embodiment of FIG. 11 there are 64 activation function units 1112 corresponding to the 64 groups of neural processing units 126. The eight neural processing units 126 of a group share the activation function unit 1112 corresponding to that group. The invention also applies to embodiments with different numbers of activation function units and different numbers of neural processing units per group. For example, the invention also applies to embodiments in which two, four, or sixteen neural processing units 126 in each group share the same activation function unit 1112.

Sharing the activation function units 1112 helps reduce the size of the neural network unit 121, at some cost in performance. That is, depending on the sharing ratio, additional clock cycles may be needed to generate the results 133 for the entire array of neural processing units 126; for example, as shown in FIG. 12 below, seven additional clock cycles are needed with an 8:1 sharing ratio. In general, however, the number of additional clock cycles (e.g., 7) is small compared with the number of clock cycles needed to generate the accumulated sum (e.g., 512 clock cycles for a layer with 512 connections per neuron). Hence, the performance impact of sharing the activation function units is very small (e.g., an increase of approximately one percent in computation time), which may be a worthwhile price for the reduction in the size of the neural network unit 121.
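The trade-off can be checked with a short calculation. The Python sketch below assumes, as in the 8:1 example, that a shared unit handles one neural processing unit per clock cycle, so a sharing ratio of R costs R - 1 extra cycles; the function name is illustrative only.

```python
def afu_sharing_overhead(npus_per_afu: int, accumulate_cycles: int) -> tuple[int, float]:
    """Return (extra clock cycles, extra cycles as a fraction of the accumulate time).

    Assumption: the shared unit processes one neural processing unit per
    clock cycle, so only the first result is "free" relative to the
    unshared design.
    """
    extra = npus_per_afu - 1
    return extra, extra / accumulate_cycles


extra, fraction = afu_sharing_overhead(npus_per_afu=8, accumulate_cycles=512)
print(f"{extra} extra cycles, {fraction:.2%} of the accumulate time")
```

For the 8:1 case with 512 accumulate cycles this gives 7 extra cycles, on the order of the one percent quoted in the text.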

In one embodiment, each neural processing unit 126 includes an activation function unit 212 that performs relatively simple activation functions; these simple activation function units 212 are small enough to be included in each neural processing unit 126. The shared complex activation function units 1112, by contrast, perform relatively complex activation functions and are significantly larger than the simple activation function units 212. In this embodiment, the additional clock cycles are needed only when a complex activation function is specified that must be performed by a shared complex activation function unit 1112; they are not needed when the specified activation function can be performed by the simple activation function unit 212.

FIGS. 12 and 13 are timing diagrams illustrating the execution of the program of FIG. 4 by the neural network unit 121 of FIG. 11. The timing diagram of FIG. 12 is similar to that of FIG. 5, and clock cycles 0 through 512 are the same in both. However, the operations at clock cycle 513 differ because the neural processing units 126 of FIG. 11 share the activation function units 1112; that is, the neural processing units 126 of a group share the activation function unit 1112 associated with that group, and FIG. 11 illustrates this sharing arrangement.

Each row of the timing diagram of FIG. 13 corresponds to a successive clock cycle indicated in the first column. Each of the other columns corresponds to a different one of the 64 activation function units 1112 and indicates its operation. For simplicity, only the operations of activation function units 0, 1, and 63 are shown. The clock cycles of FIG. 13 correspond to those of FIG. 12, but FIG. 13 shows the sharing of the activation function units 1112 by the neural processing units 126 in a different manner. As shown in FIG. 13, during clock cycles 0 through 512 the 64 activation function units 1112 are all inactive while the neural processing units 126 execute the initialize neural processing unit instruction, the multiply-accumulate instruction, and the multiply-accumulate rotate instruction.

As shown in FIGS. 12 and 13, at clock cycle 513, activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of neural processing unit 0, which is the first neural processing unit 126 in group 0, and the output of the activation function unit 1112 will be stored to word 0 of the row buffer 1104. Also at clock cycle 513, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the first neural processing unit 126 in its corresponding group. Thus, as shown in FIG. 13, at clock cycle 513, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of neural processing unit 0 to generate a result that will be stored to word 0 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of neural processing unit 8 to generate a result that will be stored to word 8 of the row buffer 1104; and so forth, to activation function unit 63, which begins to perform the specified activation function on the accumulator 202 of neural processing unit 504 to generate a result that will be stored to word 504 of the row buffer 1104.

At clock cycle 514, activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of neural processing unit 1, which is the second neural processing unit 126 in group 0, and the output of the activation function unit 1112 will be stored to word 1 of the row buffer 1104. Also at clock cycle 514, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the second neural processing unit 126 in its corresponding group. Thus, as shown in FIG. 13, at clock cycle 514, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of neural processing unit 1 to generate a result that will be stored to word 1 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of neural processing unit 9 to generate a result that will be stored to word 9 of the row buffer 1104; and so forth, to activation function unit 63, which begins to perform the specified activation function on the accumulator 202 of neural processing unit 505 to generate a result that will be stored to word 505 of the row buffer 1104. This pattern continues until clock cycle 520, when activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of neural processing unit 7, which is the eighth (last) neural processing unit 126 in group 0, and the output of the activation function unit 1112 will be stored to word 7 of the row buffer 1104. Also at clock cycle 520, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the eighth neural processing unit 126 in its corresponding group. Thus, as shown in FIG. 13, at clock cycle 520, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of neural processing unit 7 to generate a result that will be stored to word 7 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of neural processing unit 15 to generate a result that will be stored to word 15 of the row buffer 1104; and so forth, to activation function unit 63, which begins to perform the specified activation function on the accumulator 202 of neural processing unit 511 to generate a result that will be stored to word 511 of the row buffer 1104.
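The schedule walked through for clock cycles 513 through 520 follows a simple index rule: at step k, shared unit g serves neural processing unit 8g + k and targets word 8g + k of buffer 1104. The Python sketch below restates that rule; the constant and function names are illustrative and are not taken from the patent.

```python
NUM_NPUS = 512          # neural processing units 126 in the FIG. 11 example
GROUP_SIZE = 8          # neural processing units per shared unit 1112
FIRST_AFU_CYCLE = 513   # first clock cycle in which the shared units are active


def afu_schedule(cycle: int) -> dict[int, int]:
    """Map each shared unit index to the NPU (and buffer word) it serves in `cycle`.

    Returns an empty dict for cycles in which the shared units are idle.
    """
    step = cycle - FIRST_AFU_CYCLE
    if not 0 <= step < GROUP_SIZE:
        return {}
    return {afu: afu * GROUP_SIZE + step for afu in range(NUM_NPUS // GROUP_SIZE)}


# Reproduce the three columns discussed for the first and last active cycles.
for cycle in (513, 520):
    sched = afu_schedule(cycle)
    print(cycle, sched[0], sched[1], sched[63])
```

Over the eight active cycles every one of the 512 words is produced exactly once, which is why the buffer can be written out starting at clock cycle 521.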

At clock cycle 521, once all 512 results of the 512 neural processing units 126 have been generated and written to the row buffer 1104, the row buffer 1104 begins to write its contents to the data random access memory 122 or the weight random access memory 124. In this manner, the activation function unit 1112 of each group of neural processing units 126 performs a portion of the activation function instruction at address 3 of FIG. 4.

Embodiments such as that of FIG. 11, in which activation function units 1112 are shared among groups of arithmetic logic units 204, are particularly advantageous in conjunction with integer arithmetic logic units 204. This is described in later sections, for example with respect to FIGS. 29A through 33.

MTNN and MFNN Architectural Instructions

FIG. 14 is a block diagram illustrating a move to neural network (MTNN) architectural instruction 1400 and its operation with respect to portions of the neural network unit 121 of FIG. 1. The MTNN instruction 1400 includes an opcode field 1402, a src1 field 1404, a src2 field 1406, a gpr field 1408, and an immediate field 1412. The MTNN instruction 1400 is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. Preferably, the instruction set architecture uses a predetermined value of the opcode field 1402 to distinguish the MTNN instruction 1400 from other instructions in the instruction set architecture. The opcode 1402 of the MTNN instruction 1400 may or may not include prefixes, such as are common in the x86 architecture.

The immediate field 1412 provides a value that specifies a function 1432 to the control logic 1434 of the neural network unit 121. Preferably, the function 1432 is provided as an immediate operand of a microinstruction 105 of FIG. 1. The functions 1432 that may be performed by the neural network unit 121 include, but are not limited to, writing to the data random access memory 122, writing to the weight random access memory 124, writing to the program memory 129, writing to the control register 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, requesting notification (e.g., an interrupt) upon completion of a program in the program memory 129, and resetting the neural network unit 121. Preferably, the neural network unit instruction set includes an instruction whose result indicates that the neural network unit program has completed. In addition, the neural network unit instruction set includes an explicit generate-interrupt instruction. Preferably, resetting the neural network unit 121 effectively forces it back to a reset state (e.g., internal state machines are cleared and set to an idle state), except that the contents of the data random access memory 122, the weight random access memory 124, and the program memory 129 remain intact. Furthermore, internal registers such as the accumulator 202 are not affected by the reset function and must be cleared explicitly, e.g., by the initialize neural processing unit instruction at address 0 of FIG. 4. In one embodiment, the function 1432 may include a direct execution function in which the first source register contains a micro-operation (see, e.g., micro-operation 3418 of FIG. 34). The direct execution function instructs the neural network unit 121 to directly perform the specified micro-operation. In this way, an architectural program may directly control the neural network unit 121 to perform operations, rather than writing instructions to the program memory 129 and subsequently instructing the neural network unit 121 to execute the instructions in the program memory 129, or doing so through execution of an MTNN instruction 1400 (or the MFNN instruction 1500 of FIG. 15). FIG. 14 shows an example of the function of writing to the data random access memory 122.

The gpr field 1408 specifies one of the general purpose registers in the general purpose register file 116. In one embodiment, each general purpose register is 64 bits. The general purpose register file 116 provides the value of the selected general purpose register to the neural network unit 121, as shown, and the neural network unit 121 uses the value as an address 1422. The address 1422 selects a row of the memory specified in the function 1432. In the case of the data random access memory 122 or the weight random access memory 124, the address 1422 additionally selects a chunk within the selected row that is twice the size of a media register (e.g., 512 bits). Preferably, the chunk is located on a 512-bit boundary. In one embodiment, a multiplexer selects either the address 1422 (or the address 1522 in the case of the MFNN instruction 1500 described below) or the address 123/125/131 from the sequencer 128 for provision to the data random access memory 122, the weight random access memory 124, or the program memory 129, respectively. In one embodiment, the data random access memory 122 is dual-ported so that the neural processing units 126 can read and write the data random access memory 122 concurrently with the media registers 118 reading and writing it. In one embodiment, the weight random access memory 124 is also dual-ported for a similar purpose.

The src1 field 1404 and the src2 field 1406 each specify a media register in the media register file 118. In one embodiment, each media register 118 is 256 bits. The media register file 118 provides the concatenated data (e.g., 512 bits) from the selected media registers to the data random access memory 122 (or the weight random access memory 124 or the program memory 129) to be written into the selected row 1428 specified by the address 1422, at the location within the selected row 1428 specified by the address 1422, as shown. By executing a series of MTNN instructions 1400 (and MFNN instructions 1500, described below), an architectural program executing on the processor 100 can populate rows of the data random access memory 122 and rows of the weight random access memory 124 and write a program to the program memory 129, such as the programs described herein (e.g., those of FIGS. 4 and 9), causing the neural network unit 121 to perform operations on the data and weights at very high speed to accomplish an artificial neural network. In one embodiment, the architectural program controls the neural network unit 121 directly rather than writing a program into the program memory 129.

In one embodiment, rather than specifying two source registers (as specified by fields 1404 and 1406), the MTNN instruction 1400 specifies a starting source register and a number of source registers, Q. This form of the MTNN instruction 1400 instructs the processor 100 to write the media register 118 specified as the starting source register, together with the next Q-1 sequential media registers 118, to the neural network unit 121, i.e., to the specified data random access memory 122 or weight random access memory 124. Preferably, the instruction translator 104 translates the MTNN instruction 1400 into as many microinstructions as are needed to write all Q specified media registers 118. For example, in one embodiment, when the MTNN instruction 1400 specifies register MR4 as the starting source register and Q is 8, the instruction translator 104 translates the MTNN instruction 1400 into four microinstructions, the first of which writes registers MR4 and MR5, the second of which writes registers MR6 and MR7, the third of which writes registers MR8 and MR9, and the fourth of which writes registers MR10 and MR11. In an alternate embodiment, the data path from the media registers 118 to the neural network unit 121 is 1024 bits rather than 512 bits, in which case the instruction translator 104 translates the MTNN instruction 1400 into two microinstructions, the first of which writes registers MR4 through MR7 and the second of which writes registers MR8 through MR11. The invention also applies to embodiments in which the MFNN instruction 1500 specifies a starting destination register and a number of destination registers, so that each MFNN instruction 1500 can read a chunk of a row of the data random access memory 122 or the weight random access memory 124 that is larger than a single media register 118.
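The expansion of one such instruction into microinstructions is a matter of grouping Q consecutive registers by how many fit in the data path. The Python sketch below models only that grouping; `translate_mtnn` and its parameters are hypothetical names, and the real translator 104 is hardware whose internals the text does not describe.

```python
def translate_mtnn(start_reg: int, q: int,
                   datapath_bits: int = 512, reg_bits: int = 256) -> list[list[str]]:
    """Group Q consecutive media registers into per-microinstruction batches.

    Assumption: each microinstruction carries exactly as many 256-bit media
    registers as fit in the data path to the neural network unit.
    """
    regs_per_uop = datapath_bits // reg_bits
    regs = [f"MR{start_reg + i}" for i in range(q)]
    return [regs[i:i + regs_per_uop] for i in range(0, q, regs_per_uop)]


print(translate_mtnn(4, 8))                      # 512-bit data path: four microinstructions
print(translate_mtnn(4, 8, datapath_bits=1024))  # 1024-bit data path: two microinstructions
```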

FIG. 15 is a block diagram illustrating a move from neural network (MFNN) architectural instruction 1500 and its operation with respect to portions of the neural network unit 121 of FIG. 1. The MFNN instruction 1500 includes an opcode field 1502, a dst field 1504, a gpr field 1508, and an immediate field 1512. The MFNN instruction 1500 is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. Preferably, the instruction set architecture uses a predetermined value of the opcode field 1502 to distinguish the MFNN instruction 1500 from other instructions in the instruction set architecture. The opcode 1502 of the MFNN instruction 1500 may or may not include prefixes, such as are common in the x86 architecture.

The immediate field 1512 provides a value that specifies a function 1532 to the control logic 1434 of the neural network unit 121. Preferably, the function 1532 is provided as an immediate operand of a microinstruction 105 of FIG. 1. The functions 1532 that may be performed by the neural network unit 121 include, but are not limited to, reading from the data random access memory 122, reading from the weight random access memory 124, reading from the program memory 129, and reading from the status register 127. The example of FIG. 15 shows the function 1532 of reading from the data random access memory 122.

The gpr field 1508 specifies one of the general purpose registers in the general purpose register file 116. The general purpose register file 116 provides the value of the selected general purpose register to the neural network unit 121, as shown, and the neural network unit 121 uses the value as an address 1522, which operates in a manner similar to the address 1422 of FIG. 14 to select a row of the memory specified in the function 1532. In the case of the data random access memory 122 or the weight random access memory 124, the address 1522 additionally selects a chunk within the selected row that is the size of a media register (e.g., 256 bits). Preferably, the chunk is located on a 256-bit boundary.

The dst field 1504 specifies a media register in the media register file 118. As shown, the media register file 118 receives data (e.g., 256 bits) into the selected media register from the data random access memory 122 (or the weight random access memory 124 or the program memory 129); the data is read from the selected row 1528 specified by the address 1522, at the location within the selected row 1528 specified by the address 1522.

Port Configurations of the Neural Network Unit Internal Random Access Memories

FIG. 16 is a block diagram illustrating an embodiment of the data random access memory 122 of FIG. 1. The data random access memory 122 includes a memory array 1606, a read port 1602, and a write port 1604. The memory array 1606 holds the data words, which are preferably arranged as D rows of N words, as described above. In one embodiment, the memory array 1606 comprises an array of 64 horizontally arranged static random access memory cells, each of which is 128 bits wide and 64 bits tall, to provide a 64 KB data random access memory 122 that is 8192 bits wide and has 64 rows, and the data random access memory 122 occupies approximately 0.2 square millimeters of die area. However, the invention is not limited thereto.
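The stated capacity follows directly from the cell dimensions. The Python sketch below is a sanity check of that arithmetic; `ram_geometry` is an illustrative helper, and it assumes the cells sit side by side so that cell height equals the number of rows.

```python
def ram_geometry(cells_across: int, cell_width_bits: int,
                 cell_height_bits: int) -> tuple[int, int, int]:
    """Return (row width in bits, number of rows, capacity in bytes).

    Assumption: the cells are placed side by side, so the row width is the
    sum of the cell widths and the row count equals the cell height.
    """
    row_bits = cells_across * cell_width_bits
    rows = cell_height_bits
    return row_bits, rows, row_bits * rows // 8


row_bits, rows, total_bytes = ram_geometry(64, 128, 64)
print(row_bits, rows, total_bytes // 1024, "KB")
```

With 64 cells of 128 by 64 bits this gives an 8192-bit row, 64 rows, and 65,536 bytes (64 KB). An 8192-bit row holding 512 words implies 16 bits per word in this configuration.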

Preferably, the read port 1602 is coupled, in a multiplexed fashion, to the neural processing units 126 and to the media registers 118. More specifically, the media registers 118 may be coupled to the read port 1602 via result busses that also provide data to the reorder buffer and/or result forwarding busses to the other execution units 112. The neural processing units 126 and the media registers 118 share the read port 1602 to read the data random access memory 122. Also preferably, the write port 1604 is coupled, in a multiplexed fashion, to the neural processing units 126 and to the media registers 118. The neural processing units 126 and the media registers 118 share the write port 1604 to write the data random access memory 122. Thus, the media registers 118 can write to the data random access memory 122 while the neural processing units 126 are reading from it, and the neural processing units 126 can write to the data random access memory 122 while the media registers 118 are reading from it. This can improve performance. For example, the neural processing units 126 can read the data random access memory 122 (e.g., to continue performing calculations) while the media registers 118 write more data words to the data random access memory 122. As another example, the neural processing units 126 can write calculation results to the data random access memory 122 while the media registers 118 read calculation results from the data random access memory 122. In one embodiment, the neural processing units 126 can write a row of calculation results to the data random access memory 122 while also reading a row of data words from the data random access memory 122. In one embodiment, the memory array 1606 is configured in banks. When the neural processing units 126 access the data random access memory 122, all of the banks are activated to access an entire row of the memory array 1606; when the media registers 118 access the data random access memory 122, however, only the specified banks are activated. In one embodiment, each bank is 128 bits wide and the media registers 118 are 256 bits wide; hence, for example, two banks are activated per media register 118 access. In one embodiment, one of the ports 1602/1604 is a read/write port. In one embodiment, both of the ports 1602/1604 are read/write ports.
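The bank activation behavior can be pictured with a small model. The Python sketch below assumes the 128-bit banks are laid out contiguously across the 8192-bit row and that a media register access is aligned to its own width; the text gives only the widths, so the layout, names, and alignment here are illustrative assumptions.

```python
BANK_BITS = 128   # width of one bank (from the embodiment described above)
ROW_BITS = 8192   # width of one full row of the memory array


def banks_for_access(bit_offset: int, width_bits: int) -> list[int]:
    """Return the indices of the banks touched by an access.

    Assumption: bank i covers bits [i * 128, (i + 1) * 128) of the row.
    """
    if bit_offset < 0 or bit_offset + width_bits > ROW_BITS:
        raise ValueError("access falls outside the row")
    first = bit_offset // BANK_BITS
    last = (bit_offset + width_bits - 1) // BANK_BITS
    return list(range(first, last + 1))


print(banks_for_access(512, 256))          # a 256-bit media register access
print(len(banks_for_access(0, ROW_BITS)))  # a full-row access
```

Under these assumptions a 256-bit access touches two banks, while a full-row access touches all 64.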

An advantage of the rotator capability of the neural processing units 126 described herein is that it allows the memory array 1606 of the data random access memory 122 to have fewer rows, and therefore to be smaller, than the memory array that would otherwise be needed to keep the neural processing units 126 fully utilized, in which case the architectural program (via the media registers 118) would have to keep supplying data to the data random access memory 122 and retrieving results from it while the neural processing units 126 perform calculations.

Internal Random Access Memory Buffer

FIG. 17 is a block diagram illustrating an embodiment of the weight random access memory 124 of FIG. 1 and a buffer 1704. The weight random access memory 124 includes a memory array 1706 and a port 1702. The memory array 1706 holds the weight words, which are preferably arranged as W rows of N words, as described above. In one embodiment, the memory array 1706 comprises an array of 128 horizontally arranged static random access memory cells, each of which is 64 bits wide and 2048 bits tall, to provide a 2 MB weight random access memory 124 that is 8192 bits wide and has 2048 rows, and the weight random access memory 124 occupies approximately 2.4 square millimeters of die area. However, the invention is not limited thereto.

Preferably, the port 1702 is coupled, in a multiplexed fashion, to the neural processing units 126 and to the buffer 1704. The neural processing units 126 and the buffer 1704 read and write the weight random access memory 124 through the port 1702. The buffer 1704 is also coupled to the media registers 118 of Figure 1, so that the media registers 118 read and write the weight random access memory 124 through the buffer 1704. An advantage of this arrangement is that, while the neural processing units 126 are reading or writing the weight random access memory 124, the media registers 118 can concurrently write to or read from the buffer 1704 (although, if the neural processing units 126 are executing, they are preferably stalled while the buffer 1704 accesses the weight random access memory 124, so that they do not access it at the same time). This can improve performance, particularly because the reads and writes of the weight random access memory 124 by the media registers 118 are much smaller than those by the neural processing units 126. For example, in one embodiment the neural processing units 126 read or write 8192 bits (one row) at a time, whereas a media register 118 is only 256 bits wide and each MTNN instruction 1400 writes only two media registers 118, that is, 512 bits. Thus, when the architectural program executes sixteen MTNN instructions 1400 to fill the buffer 1704, the neural processing units 126 and the architectural program conflict over access to the weight random access memory 124 less than approximately six percent of the time. In another embodiment, the instruction translator 104 translates one MTNN instruction 1400 into two microinstructions 105, each of which writes a single media register 118 to the buffer 1704, so that conflicts between the neural processing units 126 and the architectural program over access to the weight random access memory 124 become even less frequent.
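The instruction count in this example follows directly from the widths given in the text. The sketch below reproduces it; the final step is only one possible reading of the "less than approximately six percent" figure, assuming that a single additional MTNN instruction commits the buffer to the memory and is the only one that occupies the port 1702. That assumption is not stated in the text.

```python
ROW_BITS = 8192        # one weight RAM row, as read or written by the NPUs
MEDIA_REG_BITS = 256   # width of one media register 118
REGS_PER_MTNN = 2      # each MTNN instruction 1400 writes two media registers

bits_per_mtnn = MEDIA_REG_BITS * REGS_PER_MTNN   # 512 bits per instruction
mtnn_to_fill = ROW_BITS // bits_per_mtnn         # instructions needed to fill the buffer

# Assumption (not from the text): one further MTNN instruction writes the
# buffer to the RAM, and only that instruction competes for port 1702.
conflict_share = 1 / (mtnn_to_fill + 1)

print(mtnn_to_fill, round(conflict_share * 100, 2))
```

With sixteen fill instructions plus one commit, roughly one access in seventeen touches the memory itself, which is just under six percent.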

In embodiments that include the buffer 1704, writing to the weight random access memory 124 from an architectural program requires multiple MTNN instructions 1400. One or more MTNN instructions 1400 specify a function 1432 that writes specified data blocks of the buffer 1704, and a subsequent MTNN instruction 1400 specifies a function 1432 that instructs the neural network unit 121 to write the contents of the buffer 1704 to a selected row of the weight random access memory 124. A data block is twice the width of a media register 118 in bits, and the data blocks are naturally aligned within the buffer 1704. In one embodiment, each MTNN instruction 1400 that specifies a function 1432 to write specified data blocks of the buffer 1704 includes a bitmask with one bit for each data block of the buffer 1704. The data from the two specified source registers 118 is written to every data block of the buffer 1704 whose corresponding bit in the bitmask is set. This embodiment is useful when a row of the weight random access memory 124 contains repeated data values. For example, to zero out the buffer 1704 (and subsequently a row of the weight random access memory 124), the programmer can load the source registers with zero and set all bits of the bitmask. The bitmask also allows the programmer to write only selected data blocks of the buffer 1704 while the other data blocks retain their previous contents.
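The bitmask behavior can be illustrated with a small software model. This is a minimal sketch rather than the hardware: the function name `masked_write`, the representation of each data block as one Python integer, and the mapping of mask bit i to data block i are all assumptions made for illustration.

```python
CHUNK_BITS = 512                         # two 256-bit media registers per data block
BUFFER_BITS = 8192                       # one weight RAM row
NUM_CHUNKS = BUFFER_BITS // CHUNK_BITS   # 16 data blocks in the buffer

def masked_write(buffer, chunk_value, bitmask):
    """Return a copy of buffer with chunk_value written to every data block
    whose bit in bitmask is set (assumed mapping: bit i <-> block i)."""
    return [chunk_value if (bitmask >> i) & 1 else old
            for i, old in enumerate(buffer)]

buffer = list(range(100, 100 + NUM_CHUNKS))               # arbitrary previous contents
zeroed = masked_write(buffer, 0, (1 << NUM_CHUNKS) - 1)   # all bits set: clear the buffer
partial = masked_write(buffer, 0xABCD, 0b0101)            # only blocks 0 and 2 change
```

A single masked write with all bits set clears the whole buffer, whereas sixteen separate unmasked writes would otherwise be needed; a sparse mask leaves the unselected blocks untouched.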

In embodiments that include the buffer 1704, reading the weight random access memory 124 from an architectural program requires multiple MFNN instructions 1500. An initial MFNN instruction 1500 specifies a function 1532 that loads a specified row of the weight random access memory 124 into the buffer 1704, and one or more subsequent MFNN instructions 1500 specify a function 1532 that reads a specified data block of the buffer 1704 into the destination register. A data block is the width of a media register 118 in bits, and the data blocks are naturally aligned within the buffer 1704. The technical features of the invention also apply to other embodiments, for example one in which the weight random access memory 124 has multiple buffers 1704; this increases the number of accesses the architectural program can make while the neural processing units 126 are executing, and thereby further reduces conflicts between the neural processing units 126 and the architectural program over access to the weight random access memory 124, because it becomes more likely that the buffer 1704 accesses can be performed during clock cycles in which the neural processing units 126 do not need to access the weight random access memory 124.
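The two-phase read sequence (one load of a row into the buffer, then several block reads) can be sketched as follows. The class `WeightRamModel`, its method names, and the placement of block 0 in the least significant bits are hypothetical choices for illustration; the point is that only the initial load touches the memory array.

```python
READ_CHUNK_BITS = 256   # one media register 118 per MFNN data block

class WeightRamModel:
    """Toy model: each row is one Python integer of row_bits bits."""
    def __init__(self, rows, row_bits=8192):
        self.rows = rows
        self.row_bits = row_bits
        self.buffer = 0
        self.ram_accesses = 0      # counts accesses to the memory array itself

    def load_row_to_buffer(self, row_index):
        """Models the initial MFNN function: copy a row into the buffer."""
        self.buffer = self.rows[row_index]
        self.ram_accesses += 1

    def read_chunk(self, chunk_index):
        """Models a later MFNN function: read one block from the buffer only.
        Assumption: block 0 occupies the least significant bits."""
        mask = (1 << READ_CHUNK_BITS) - 1
        return (self.buffer >> (chunk_index * READ_CHUNK_BITS)) & mask

chunks_per_row = 8192 // READ_CHUNK_BITS
row = sum((i + 1) << (i * READ_CHUNK_BITS) for i in range(chunks_per_row))
ram = WeightRamModel([row])
ram.load_row_to_buffer(0)
values = [ram.read_chunk(i) for i in range(chunks_per_row)]
```

Reading all 32 blocks of a row costs a single access to the memory array in this model, which is why the buffer reduces contention with the neural processing units.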

Figure 16 describes a dual-ported data random access memory 122; however, the invention is not limited thereto. The technical features of the invention also apply to other embodiments in which the weight random access memory 124 is dual-ported as well. Furthermore, Figure 17 describes a buffer used with the weight random access memory 124; however, the invention is not limited thereto. The technical features of the invention also apply to embodiments in which the data random access memory 122 has a corresponding buffer similar to the buffer 1704.

Dynamically configurable neural processing unit

Figure 18 is a block diagram showing a dynamically configurable neural processing unit 126 of Figure 1. The neural processing unit 126 of Figure 18 is similar to the neural processing unit 126 of Figure 2, but it can be dynamically configured to operate in one of two different configurations. In the first configuration, the neural processing unit 126 of Figure 18 operates similarly to the neural processing unit 126 of Figure 2. That is, in the first configuration, referred to herein as the "wide" or "single" configuration, the arithmetic logic unit 204 of the neural processing unit 126 operates on a single wide data word and a single wide weight word (e.g., 16 bits) to produce a single wide result. In contrast, in the second configuration, referred to herein as the "narrow" or "dual" configuration, the neural processing unit 126 operates on two narrow data words and two narrow weight words (e.g., 8 bits) to produce two respective narrow results. In one embodiment, the configuration (wide or narrow) of the neural processing unit 126 is set by an initialize neural processing unit instruction (e.g., the instruction at address 0 in Figure 20). Alternatively, the configuration can be set by an MTNN instruction whose function 1432 specifies the configuration (wide or narrow) of the neural processing unit. Preferably, the program memory 129 instruction or the MTNN instruction that determines the configuration (wide or narrow) populates configuration registers. For example, the outputs of the configuration registers are provided to the arithmetic logic unit 204, to the activation function unit 212, and to the logic that generates the multiplexed register control signal 213. In general, elements of the neural processing unit 126 of Figure 18 perform functions similar to those of the like-numbered elements of Figure 2, to which reference may be made for an understanding of the embodiment of Figure 18. The embodiment of Figure 18, including its differences from Figure 2, is described below.

The neural processing unit 126 of Figure 18 includes two registers 205A and 205B, two three-input multiplexed registers 208A and 208B, an arithmetic logic unit 204, two accumulators 202A and 202B, and two activation function units 212A and 212B. Each of the registers 205A/205B is half the width (e.g., 8 bits) of the register 205 of Figure 2. Each of the registers 205A/205B receives a respective narrow weight word 206A/206B (e.g., 8 bits) from the weight random access memory 124 and provides its output 203A/203B on a subsequent clock cycle to the operand selection logic 1898 of the arithmetic logic unit 204. When the neural processing unit 126 is in the wide configuration, the registers 205A/205B operate together to receive a wide weight word 206A/206B (e.g., 16 bits) from the weight random access memory 124, similarly to the register 205 of the embodiment of Figure 2; when the neural processing unit 126 is in the narrow configuration, the registers 205A/205B effectively operate independently, each receiving a narrow weight word 206A/206B (e.g., 8 bits) from the weight random access memory 124, so that the neural processing unit 126 is effectively two independent narrow neural processing units. In either configuration, however, the same output bits of the weight random access memory 124 are coupled and provided to the registers 205A/205B. For example, the register 205A of neural processing unit 0 receives byte 0, the register 205B of neural processing unit 0 receives byte 1, the register 205A of neural processing unit 1 receives byte 2, the register 205B of neural processing unit 1 receives byte 3, and so on, up to the register 205B of neural processing unit 511, which receives byte 1023.
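The byte assignment in this example follows a simple rule: unit J takes bytes 2J and 2J+1 of the row, with the A half taking the even byte. The helper below is a hypothetical illustration of that rule; the function name and the "A"/"B" argument are not from the patent.

```python
def weight_byte_index(npu, half):
    """Byte of the memory row wired to register 205A ('A') or 205B ('B')
    of neural processing unit number npu, per the example in the text."""
    if half not in ("A", "B"):
        raise ValueError("half must be 'A' or 'B'")
    return 2 * npu + (0 if half == "A" else 1)

# Reproduce the cases listed in the text.
examples = [weight_byte_index(0, "A"), weight_byte_index(0, "B"),
            weight_byte_index(1, "A"), weight_byte_index(1, "B"),
            weight_byte_index(511, "B")]
```

Because the wiring is fixed, switching between the wide and narrow configurations changes only how the two bytes are interpreted, not which bytes arrive.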

Each of the multiplexed registers 208A/208B is half the width (e.g., 8 bits) of the multiplexed register 208 of Figure 2. The multiplexed register 208A selects one of its inputs 207A, 211A and 1811A to store in its register and provide on its output 209A on a subsequent clock cycle, and the multiplexed register 208B selects one of its inputs 207B, 211B and 1811B to store in its register and provide on its output 209B on a subsequent clock cycle to the operand selection logic 1898. Input 207A receives a narrow data word (e.g., 8 bits) from the data random access memory 122, and input 207B likewise receives a narrow data word from the data random access memory 122. When the neural processing unit 126 is in the wide configuration, the multiplexed registers 208A/208B effectively operate together to receive a wide data word 207A/207B (e.g., 16 bits) from the data random access memory 122, similarly to the multiplexed register 208 of the embodiment of Figure 2; when the neural processing unit 126 is in the narrow configuration, the multiplexed registers 208A/208B effectively operate independently, each receiving a narrow data word 207A/207B (e.g., 8 bits) from the data random access memory 122, so that the neural processing unit 126 is effectively two independent narrow neural processing units. In either configuration, however, the same output bits of the data random access memory 122 are coupled and provided to the multiplexed registers 208A/208B. For example, the multiplexed register 208A of neural processing unit 0 receives byte 0, the multiplexed register 208B of neural processing unit 0 receives byte 1, the multiplexed register 208A of neural processing unit 1 receives byte 2, the multiplexed register 208B of neural processing unit 1 receives byte 3, and so on, up to the multiplexed register 208B of neural processing unit 511, which receives byte 1023.

Input 211A receives the output 209A of the multiplexed register 208A of the adjacent neural processing unit 126, and input 211B receives the output 209B of the multiplexed register 208B of the adjacent neural processing unit 126. Input 1811A receives the output 209B of the multiplexed register 208B of the adjacent neural processing unit 126, and input 1811B receives the output 209A of the neighboring multiplexed register 208A, which, as detailed below, belongs to the same neural processing unit 126. The neural processing unit 126 shown in Figure 18 is one of the N neural processing units 126 of Figure 1 and is denoted neural processing unit J; that is, neural processing unit J is a representative instance of the N neural processing units. Preferably, input 211A of the multiplexed register 208A of neural processing unit J receives the output 209A of the multiplexed register 208A of neural processing unit 126 instance J-1, input 1811A of the multiplexed register 208A of neural processing unit J receives the output 209B of the multiplexed register 208B of instance J-1, and the output 209A of the multiplexed register 208A of neural processing unit J is provided both to input 211A of the multiplexed register 208A of instance J+1 and to input 1811B of the multiplexed register 208B of instance J. Likewise, input 211B of the multiplexed register 208B of neural processing unit J receives the output 209B of the multiplexed register 208B of instance J-1, input 1811B of the multiplexed register 208B of neural processing unit J receives the output 209A of the multiplexed register 208A of instance J, and the output 209B of the multiplexed register 208B of neural processing unit J is provided both to input 1811A of the multiplexed register 208A of instance J+1 and to input 211B of the multiplexed register 208B of instance J+1.

The control input 213 controls each of the multiplexed registers 208A/208B to select one of its three inputs to store in its register and subsequently provide on its output 209A/209B. When the neural processing unit 126 is instructed to load a row from the data random access memory 122 (e.g., by the multiply-accumulate instruction at address 1 of Figure 20, described below), then regardless of whether the neural processing unit 126 is in the wide or the narrow configuration, the control input 213 controls each of the multiplexed registers 208A/208B to select its respective narrow data word 207A/207B (e.g., 8 bits) from the corresponding narrow word of the selected row of the data random access memory 122.

When the neural processing unit 126 is instructed to rotate the previously received data row values (e.g., by the multiply-accumulate rotate instruction at address 2 of Figure 20, described below) and the neural processing unit 126 is in the narrow configuration, the control input 213 controls each of the multiplexed registers 208A/208B to select its respective input 1811A/1811B. In this case, the multiplexed registers 208A/208B effectively operate independently, so that the neural processing unit 126 is effectively two separate narrow neural processing units. In this way, the multiplexed registers 208A and 208B of the N neural processing units 126 collectively operate as a rotator of 2N narrow words, as described in more detail below with respect to Figure 19.

When the neural processing unit 126 is instructed to rotate the previously received data row values and the neural processing unit 126 is in the wide configuration, the control input 213 controls each of the multiplexed registers 208A/208B to select its respective input 211A/211B. In this case, the multiplexed registers 208A/208B operate together, effectively as if the neural processing unit 126 were a single wide neural processing unit 126. In this way, the multiplexed registers 208A and 208B of the N neural processing units 126 collectively operate as a rotator of N wide words, similarly to the manner described with respect to Figure 3.
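The two rotation modes can be modeled in software to see why the same wiring yields a rotator of 2N narrow words in one configuration and of N wide words in the other. This is a minimal sketch under stated assumptions: each unit is represented as a pair (209A, 209B), the function names are hypothetical, and the wrap-around from the last unit to unit 0 is assumed from the word "rotator".

```python
def rotate_step(regs, narrow):
    """One rotate step over a list of (a, b) pairs, one pair per unit.

    narrow=True : 208A takes input 1811A (209B of unit J-1),
                  208B takes input 1811B (209A of the same unit J).
    narrow=False: 208A takes input 211A (209A of unit J-1),
                  208B takes input 211B (209B of unit J-1).
    """
    out = []
    for j in range(len(regs)):
        prev_a, prev_b = regs[j - 1]   # index -1 wraps around to the last unit
        cur_a, _ = regs[j]
        out.append((prev_b, cur_a) if narrow else (prev_a, prev_b))
    return out

def flatten(regs):
    """List the narrow words in unit order: A half first, then B half."""
    return [word for pair in regs for word in pair]

regs = [(0, 1), (2, 3), (4, 5), (6, 7)]                # N = 4 units, narrow words 0..7
narrow_once = flatten(rotate_step(regs, narrow=True))
wide_once = flatten(rotate_step(regs, narrow=False))
```

In this model one narrow step moves every narrow word by one position, giving `[7, 0, 1, 2, 3, 4, 5, 6]`, while one wide step moves each pair intact by one unit, giving `[6, 7, 0, 1, 2, 3, 4, 5]`.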

The arithmetic logic unit 204 includes the operand selection logic 1898, a wide multiplier 242A, a narrow multiplier 242B, a wide two-input multiplexer 1896A, a narrow two-input multiplexer 1896B, a wide adder 244A and a narrow adder 244B. Effectively, the arithmetic logic unit 204 can be understood as comprising the operand selection logic 1898, a wide arithmetic logic unit 204A (comprising the wide multiplier 242A, the wide multiplexer 1896A and the wide adder 244A) and a narrow arithmetic logic unit 204B (comprising the narrow multiplier 242B, the narrow multiplexer 1896B and the narrow adder 244B). Preferably, the wide multiplier 242A multiplies two wide words, similarly to the multiplier 242 of Figure 2; it is, for example, a 16-bit by 16-bit multiplier. The narrow multiplier 242B multiplies two narrow words; it is, for example, an 8-bit by 8-bit multiplier that produces a 16-bit result. When the neural processing unit 126 is in the narrow configuration, the wide multiplier 242A is put to full use, with the help of the operand selection logic 1898, as a narrow multiplier that multiplies two narrow words, so that the neural processing unit 126 effectively functions as two narrow neural processing units. Preferably, the wide adder 244A adds the output of the wide multiplexer 1896A to the output 217A of the wide accumulator 202A to produce a sum 215A for the wide accumulator 202A, similarly to the adder 244 of Figure 2. The narrow adder 244B adds the output of the narrow multiplexer 1896B to the output 217B of the narrow accumulator 202B to produce a sum 215B for the narrow accumulator 202B. In one embodiment, the narrow accumulator 202B is 28 bits wide, to avoid loss of precision when accumulating up to 1024 16-bit products. When the neural processing unit 126 is in the wide configuration, the narrow multiplier 242B, the narrow accumulator 202B and the narrow activation function unit 212B are preferably inactive to reduce power consumption.
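The 28-bit accumulator width can be sanity-checked against the worst case named in the text. The sketch below assumes signed 8-bit operands in two's complement, which the text does not state explicitly for this sentence, and the helper name is hypothetical.

```python
def bits_for_signed_sum(count, operand_bits):
    """Two's-complement bits needed to hold the sum of `count` products of
    two signed `operand_bits`-bit values without overflow."""
    max_product = (1 << (operand_bits - 1)) ** 2   # (-2^(b-1))^2, largest magnitude
    worst_case = count * max_product               # all products at the maximum
    return worst_case.bit_length() + 1             # one extra bit for the sign

needed = bits_for_signed_sum(1024, 8)
headroom = 28 - needed
```

Under this assumption 26 bits suffice for 1024 products, so a 28-bit narrow accumulator leaves two bits of headroom.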

The operand selection logic 1898 selects operands from 209A, 209B, 203A and 203B to provide to the other elements of the arithmetic logic unit 204, as described below. Preferably, the operand selection logic 1898 also performs other functions, such as sign extension of signed data words and weight words. For example, if the neural processing unit 126 is in the narrow configuration, the operand selection logic 1898 sign-extends the narrow data word and weight word to the width of a wide word before providing them to the wide multiplier 242A. Similarly, if the arithmetic logic unit 204 is instructed to pass through a narrow data/weight word (bypassing the wide multiplier 242A via the wide multiplexer 1896A), the operand selection logic 1898 sign-extends the narrow data word or weight word to the width of a wide word before providing it to the wide adder 244A. Preferably, logic that performs this sign extension is also present in the arithmetic logic unit 204 of the neural processing unit 126 of Figure 2.

The wide multiplexer 1896A receives the output of the wide multiplier 242A and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the wide adder 244A; the narrow multiplexer 1896B receives the output of the narrow multiplier 242B and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the narrow adder 244B.

The operand selection logic 1898 provides operands according to the configuration of the neural processing unit 126 and to the arithmetic and/or logical operation to be performed by the arithmetic logic unit 204, which is determined by the function specified by the instruction that the neural processing unit 126 is executing. For example, if the instruction instructs the arithmetic logic unit 204 to perform a multiply-accumulate and the neural processing unit 126 is in the wide configuration, the operand selection logic 1898 provides the wide word formed by concatenating outputs 209A and 209B to one input of the wide multiplier 242A and the wide word formed by concatenating outputs 203A and 203B to the other input, and the narrow multiplier 242B is inactive, so that the neural processing unit 126 functions as a single wide neural processing unit similar to the neural processing unit 126 of Figure 2. If, however, the instruction instructs the arithmetic logic unit 204 to perform a multiply-accumulate and the neural processing unit 126 is in the narrow configuration, the operand selection logic 1898 provides an extended, or widened, version of the narrow data word 209A to one input of the wide multiplier 242A and an extended version of the narrow weight word 203A to the other input; in addition, it provides the narrow data word 209B to one input of the narrow multiplier 242B and the narrow weight word 203B to the other input. To extend, or widen, a narrow word in this way, the operand selection logic 1898 sign-extends the narrow word if it is signed and pads it with upper zero bits if it is unsigned.
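The difference between sign extension and zero extension of a narrow word can be shown on bit patterns. The function below is an illustrative sketch; its name, the default widths of 8 and 16 bits, and the representation of words as unsigned Python integers are assumptions chosen to match the example widths in the text.

```python
def extend(narrow, signed, narrow_bits=8, wide_bits=16):
    """Widen a narrow_bits-wide bit pattern to wide_bits.

    signed=True  replicates the sign bit into the upper bits (sign extension);
    signed=False fills the upper bits with zeros (zero extension)."""
    narrow &= (1 << narrow_bits) - 1
    if signed and (narrow >> (narrow_bits - 1)) & 1:
        upper_ones = ((1 << wide_bits) - 1) ^ ((1 << narrow_bits) - 1)
        narrow |= upper_ones
    return narrow

as_signed = extend(0x80, signed=True)      # pattern for -128
as_unsigned = extend(0x80, signed=False)   # pattern for +128
positive = extend(0x7F, signed=True)       # sign bit clear: upper bits stay zero
```

The same 8-bit pattern 0x80 becomes 0xFF80 when treated as signed and 0x0080 when treated as unsigned, which is why the logic needs to know whether the word is signed before feeding it to the wide multiplier.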

In another example, if the neural processing unit 126 is in the wide configuration and the instruction instructs the arithmetic logic unit 204 to accumulate a weight word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides the concatenation of outputs 203A and 203B to the wide multiplexer 1896A for provision to the wide adder 244A. If, however, the neural processing unit 126 is in the narrow configuration and the instruction instructs the arithmetic logic unit 204 to accumulate a weight word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides an extended version of output 203A to the wide multiplexer 1896A for provision to the wide adder 244A; in addition, the narrow multiplier 242B is bypassed and the operand selection logic 1898 provides an extended version of output 203B to the narrow multiplexer 1896B for provision to the narrow adder 244B.

In another example, if the neural processing unit 126 is in the wide configuration and the instruction instructs the arithmetic logic unit 204 to accumulate a data word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides the concatenation of outputs 209A and 209B to the wide multiplexer 1896A for provision to the wide adder 244A. If, however, the neural processing unit 126 is in the narrow configuration and the instruction instructs the arithmetic logic unit 204 to accumulate a data word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides an extended version of output 209A to the wide multiplexer 1896A for provision to the wide adder 244A; in addition, the narrow multiplier 242B is bypassed and the operand selection logic 1898 provides an extended version of output 209B to the narrow multiplexer 1896B for provision to the narrow adder 244B. Accumulating weight or data words is useful for averaging operations, which are used in the pooling layers of some artificial neural network applications, such as image processing.

Preferably, the neural processing unit 126 also includes a second wide multiplexer (not shown) for bypassing the wide adder 244A, so that a wide data/weight word in the wide configuration, or an extended narrow data/weight word in the narrow configuration, can be loaded into the wide accumulator 202A, and a second narrow multiplexer (not shown) for bypassing the narrow adder 244B, so that a narrow data/weight word in the narrow configuration can be loaded into the narrow accumulator 202B. Preferably, the arithmetic logic unit 204 also includes wide and narrow comparator/multiplexer combinations (not shown) that receive the respective accumulator value 217A/217B and the respective multiplexer 1896A/1896B output and select the maximum of the accumulator value 217A/217B and a data/weight word 209A/209B/203A/203B, an operation used in the pooling layers of some artificial neural network applications, as described in more detail below, for example with respect to Figures 27 and 28. In addition, the operand selection logic 1898 is configured to provide operands of value zero (for addition with zero or for clearing the accumulators) and operands of value one (for multiplication by one).
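The accumulate and maximum operations mentioned for pooling layers can be expressed as simple accumulator loops. The sketch below is a software analogy only, with hypothetical function names: it shows what one accumulator would compute over a sequence of words, not how the comparator/multiplexer hardware is built, and the division that turns a sum into an average is left outside the sketch.

```python
def pool_max(words):
    """Maximum over a window: the first word is loaded into the accumulator
    directly (adder bypassed), then the larger of accumulator and word is kept."""
    acc = words[0]
    for word in words[1:]:
        acc = word if word > acc else acc
    return acc

def pool_sum(words):
    """Sum over a window: the accumulator is cleared with a zero operand and
    each word is added with the multiplier bypassed."""
    acc = 0
    for word in words:
        acc += word
    return acc

window = [3, -1, 7, 2]
window_max = pool_max(window)
window_sum = pool_sum(window)
```

For the window above, the maximum path yields 7 and the accumulate path yields 11; an average pooling layer would then scale the sum by the window size.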

The narrow activation function unit 212B receives the output 217B of the narrow accumulator 202B and performs an activation function on it to generate a narrow result 133B, and the wide activation function unit 212A receives the output 217A of the wide accumulator 202A and performs an activation function on it to generate a wide result 133A. When the neural processing unit 126 is in the narrow configuration, the wide activation function unit 212A interprets the output 217A of the accumulator 202A accordingly and performs an activation function on it to generate a narrow result (e.g., 8 bits), as described in more detail below, e.g., with respect to FIGS. 29A through 30.

As described above, a single neural processing unit 126 in the narrow configuration effectively operates as two narrow neural processing units, and therefore provides, for smaller words, up to approximately twice the throughput of the wide configuration. For example, assume a neural network layer having 1024 neurons, each of which receives 1024 narrow inputs from the previous layer (and has narrow weight words), resulting in one million connections. A neural network unit 121 having 512 neural processing units 126 in the narrow configuration (equivalent to 1024 narrow neural processing units) can process four times as many connections as the same unit in the wide configuration (one million connections versus 256K connections) in approximately twice the time (about 1026 clock cycles versus 514 clock cycles), albeit for narrow words rather than wide words.
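The connection and clock cycle counts quoted above can be reproduced with simple arithmetic. The sketch below assumes one neuron per (wide or narrow) processing unit, one clock cycle per connection input, and two further clock cycles for the activation function and the write-back; the function name `layer_cost` and this overhead model are assumptions made for illustration.

```python
NUM_NPUS = 512  # physical neural processing units 126 in this example


def layer_cost(units):
    """Return (connections, approximate clock cycles) for a units-by-units layer."""
    connections = units * units
    clocks = units + 2  # assumed: 2 extra cycles for activation and write-back
    return connections, clocks


wide = layer_cost(NUM_NPUS)        # 512 wide neurons, 512 inputs each
narrow = layer_cost(2 * NUM_NPUS)  # 1024 narrow neurons, 1024 inputs each

print(wide)                              # (262144, 514)
print(narrow)                            # (1048576, 1026)
print(narrow[0] / wide[0])               # 4.0
print(round(narrow[1] / wide[1], 2))     # 2.0
```

Under these assumptions the narrow configuration handles four times the connections in about twice the clock cycles, which is the source of the "up to twice the throughput" figure.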

In one embodiment, the dynamically configurable neural processing unit 126 of FIG. 18 includes three-input multiplex registers, similar to multiplex registers 208A and 208B, in place of registers 205A and 205B, in order to form a rotator for a row of weight words received from the weight random access memory 124. The rotator operates in a manner somewhat similar to that described for the embodiment of FIG. 7, but in the dynamically configurable fashion described with respect to FIG. 18.

FIG. 19 is a block diagram illustrating, according to the embodiment of FIG. 18, the arrangement of the 2N multiplex registers 208A/208B of the N neural processing units 126 of the neural network unit 121 of FIG. 1, showing their operation as a rotator for a row of data words 207 received from the data random access memory 122 of FIG. 1. In the embodiment of FIG. 19, N is 512, so the neural network unit 121 has 1024 multiplex registers 208A/208B, denoted 0 through 511, corresponding to the 512 neural processing units 126 and, effectively, 1024 narrow neural processing units. The two narrow neural processing units within a neural processing unit 126 are denoted A and B, and the designation of the corresponding narrow neural processing unit is shown within each multiplex register 208. Specifically, multiplex register 208A of neural processing unit 126 number 0 is denoted 0-A, multiplex register 208B of neural processing unit 126 number 0 is denoted 0-B, multiplex register 208A of neural processing unit 126 number 1 is denoted 1-A, multiplex register 208B of neural processing unit 126 number 1 is denoted 1-B, multiplex register 208A of neural processing unit 126 number 511 is denoted 511-A, and multiplex register 208B of neural processing unit 126 number 511 is denoted 511-B. These designations also correspond to the narrow neural processing units of FIG. 21, described below.

Each multiplex register 208A receives its corresponding narrow data word 207A of one of the D rows of the data random access memory 122, and each multiplex register 208B receives its corresponding narrow data word 207B of that row. That is, multiplex register 0-A receives narrow data word 0 of the row, multiplex register 0-B receives narrow data word 1, multiplex register 1-A receives narrow data word 2, multiplex register 1-B receives narrow data word 3, and so on, until multiplex register 511-A receives narrow data word 1022 and multiplex register 511-B receives narrow data word 1023.

In addition, multiplex register 1-A receives on its input 211A the output 209A of multiplex register 0-A, multiplex register 1-B receives on its input 211B the output 209B of multiplex register 0-B, and so on, until multiplex register 511-A receives on its input 211A the output 209A of multiplex register 510-A and multiplex register 511-B receives on its input 211B the output 209B of multiplex register 510-B. Multiplex register 0-A receives on its input 211A the output 209A of multiplex register 511-A, and multiplex register 0-B receives on its input 211B the output 209B of multiplex register 511-B.

Finally, multiplex register 1-A receives on its input 1811A the output 209B of multiplex register 0-B, multiplex register 1-B receives on its input 1811B the output 209A of multiplex register 1-A, and so on, until multiplex register 511-A receives on its input 1811A the output 209B of multiplex register 510-B and multiplex register 511-B receives on its input 1811B the output 209A of multiplex register 511-A. Multiplex register 0-A receives on its input 1811A the output 209B of multiplex register 511-B, and multiplex register 0-B receives on its input 1811B the output 209A of multiplex register 0-A.

Each multiplex register 208A/208B receives the control input 213, which controls whether it selects the data word 207A/207B, the rotated input 211A/211B, or the rotated input 1811A/1811B. In one mode of operation, described in more detail below, during a first clock cycle the control input 213 controls each multiplex register 208A/208B to select the data word 207A/207B for storage in the register and subsequent provision to the arithmetic logic unit 204; during subsequent clock cycles (e.g., the M-1 clock cycles described above), the control input 213 controls each multiplex register 208A/208B to select the rotated input 1811A/1811B for storage in the register and subsequent provision to the arithmetic logic unit 204.
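The wiring described above may be sketched as follows, under the assumption that register X-A holds narrow word 2X of the row and register X-B holds narrow word 2X+1. The helper names (`reg_index`, `source_211`, `source_1811`, `rotate`) are hypothetical and serve only to show that inputs 1811A/1811B rotate the row by one narrow word, whereas inputs 211A/211B rotate it by two narrow words (one wide word).

```python
N = 512  # neural processing units 126; there are 2 * N narrow multiplex registers


def reg_index(npu, half):
    """Flatten (unit number, 'A' or 'B') to a position in the 1024-word row."""
    return 2 * npu + (0 if half == "A" else 1)


def source_211(npu, half):
    """Input 211A/211B: the same-lettered register of the previous unit."""
    return ((npu - 1) % N, half)


def source_1811(npu, half):
    """Input 1811A/1811B: the immediately preceding narrow register."""
    if half == "A":
        return ((npu - 1) % N, "B")
    return (npu, "A")


def rotate(regs, source):
    """One clock cycle: every register latches the output of its source register."""
    out = [None] * len(regs)
    for npu in range(N):
        for half in "AB":
            src_npu, src_half = source(npu, half)
            out[reg_index(npu, half)] = regs[reg_index(src_npu, src_half)]
    return out


row = list(range(2 * N))             # narrow data words 0..1023 loaded via 207A/207B
by_one = rotate(row, source_1811)    # narrow-configuration rotation
by_two = rotate(row, source_211)     # wide-style rotation

print(by_one[:3])  # [1023, 0, 1]
print(by_two[:3])  # [1022, 1023, 0]
```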

FIG. 20 is a table illustrating a program that is stored in the program memory 129 of, and executed by, the neural network unit 121 of FIG. 1 having neural processing units 126 according to the embodiment of FIG. 18. The example program of FIG. 20 is similar to the program of FIG. 4; the differences are described here. The initialize neural processing unit instruction at address 0 specifies that the neural processing units 126 are to enter the narrow configuration. In addition, as shown, the multiply-accumulate rotate instruction at address 2 specifies a count of 1023 and requires 1023 clock cycles. This is because the example of FIG. 20 assumes a layer of effectively 1024 narrow (e.g., 8-bit) neurons (i.e., narrow neural processing units), each having 1024 connection inputs from the 1024 neurons of the previous layer, for a total of 1024K connections. Each neuron receives an 8-bit data value from each connection input and multiplies that 8-bit data value by an appropriate 8-bit weight value.

FIG. 21 is a timing diagram illustrating the execution of the program of FIG. 20 by a neural network unit 121 whose neural processing units 126 of FIG. 18 operate in the narrow configuration. The timing diagram of FIG. 21 is similar to the timing diagram of FIG. 5; the differences are described here.

In the timing diagram of FIG. 21, the neural processing units 126 are in the narrow configuration because the initialize neural processing unit instruction at address 0 initializes them to the narrow configuration. Consequently, the 512 neural processing units 126 effectively operate as 1024 narrow neural processing units (or neurons), which are designated in the columns as neural processing unit 0-A and neural processing unit 0-B (the two narrow neural processing units of neural processing unit 126 number 0), neural processing unit 1-A and neural processing unit 1-B (the two narrow neural processing units of neural processing unit 126 number 1), and so on through neural processing unit 511-A and neural processing unit 511-B (the two narrow neural processing units of neural processing unit 126 number 511). For simplicity, only the operations of narrow neural processing units 0-A, 0-B, and 511-B are shown. Because the multiply-accumulate rotate instruction at address 2 specifies a count of 1023 and therefore requires 1023 clock cycles, the rows of the timing diagram of FIG. 21 extend through clock cycle 1026.

At clock cycle 0, each of the 1024 narrow neural processing units performs the initialization instruction of FIG. 4, which is shown in FIG. 5 as the assignment of a zero value to the accumulator 202.

At clock cycle 1, each of the 1024 narrow neural processing units performs the multiply-accumulate instruction at address 1 of FIG. 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value (which is zero) the product of narrow word 0 of row 17 of the data random access memory 122 and narrow word 0 of row 0 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value (which is zero) the product of narrow word 1 of row 17 of the data random access memory 122 and narrow word 1 of row 0 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value (which is zero) the product of narrow word 1023 of row 17 of the data random access memory 122 and narrow word 1023 of row 0 of the weight random access memory 124.

At clock cycle 2, each of the 1024 narrow neural processing units performs the first iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the output 209B of multiplex register 208B of narrow neural processing unit 511-B (which is narrow data word 1023 received from the data random access memory 122) and narrow word 0 of row 1 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of multiplex register 208A of narrow neural processing unit 0-A (which is narrow data word 0 received from the data random access memory 122) and narrow word 1 of row 1 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of multiplex register 208A of narrow neural processing unit 511-A (which is narrow data word 1022 received from the data random access memory 122) and narrow word 1023 of row 1 of the weight random access memory 124.

At clock cycle 3, each of the 1024 narrow neural processing units performs the second iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the output 209B of multiplex register 208B of narrow neural processing unit 511-B (which is narrow data word 1022 received from the data random access memory 122) and narrow word 0 of row 2 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of multiplex register 208A of narrow neural processing unit 0-A (which is narrow data word 1023 received from the data random access memory 122) and narrow word 1 of row 2 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of multiplex register 208A of narrow neural processing unit 511-A (which is narrow data word 1021 received from the data random access memory 122) and narrow word 1023 of row 2 of the weight random access memory 124. As shown in FIG. 21, this continues for each of the following 1021 clock cycles, through clock cycle 1024, described next.

At clock cycle 1024, each of the 1024 narrow neural processing units performs the 1023rd iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the output 209B of multiplex register 208B of narrow neural processing unit 511-B (which is narrow data word 1 received from the data random access memory 122) and narrow word 0 of row 1023 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of multiplex register 208A of narrow neural processing unit 0-A (which is narrow data word 2 received from the data random access memory 122) and narrow word 1 of row 1023 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of multiplex register 208A of narrow neural processing unit 511-A (which is narrow data word 0 received from the data random access memory 122) and narrow word 1023 of row 1023 of the weight random access memory 124.

At clock cycle 1025, the activation function unit 212A/212B of each of the 1024 narrow neural processing units performs the activation function instruction at address 3 of FIG. 20. Finally, at clock cycle 1026, each of the 1024 narrow neural processing units performs the write activation function unit output instruction at address 4 of FIG. 20 by writing its narrow result 133A/133B back to the corresponding narrow word of row 16 of the data random access memory 122. That is, the narrow result 133A of neural processing unit 0-A is written to narrow word 0 of the data random access memory 122, the narrow result 133B of neural processing unit 0-B is written to narrow word 1 of the data random access memory 122, and so on, until the narrow result 133B of neural processing unit 511-B is written to narrow word 1023 of the data random access memory 122. FIG. 22 shows the operation described above with respect to FIG. 21 in block diagram form.
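The multiply-accumulate steps at clock cycles 1 through 1024 walked through above follow a regular pattern, which the following sketch makes explicit. It assumes the 1024 narrow neural processing units are numbered 0 through 1023 in the order 0-A, 0-B, 1-A, and so on; the function name `operands` is hypothetical.

```python
NARROW = 1024  # narrow neural processing units, numbered 0-A=0, 0-B=1, ..., 511-B=1023


def operands(narrow_npu, clock):
    """Return (data word index, weight row) used by a narrow unit at clock 1..1024.

    The weight word index within the row always equals narrow_npu.
    """
    if not 1 <= clock <= NARROW:
        raise ValueError("multiply-accumulate runs at clock cycles 1 through 1024")
    rotations = clock - 1  # the data row is rotated once per clock cycle after clock 1
    return (narrow_npu - rotations) % NARROW, clock - 1


print(operands(0, 1))        # (0, 0)
print(operands(0, 2))        # (1023, 1)
print(operands(1023, 3))     # (1021, 2)
print(operands(0, 1024))     # (1, 1023)
print(operands(1023, 1024))  # (0, 1023)
```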

FIG. 22 is a block diagram illustrating the neural network unit 121 of FIG. 1 having the neural processing units 126 of FIG. 18 to execute the program of FIG. 20. The neural network unit 121 includes the 512 neural processing units 126 (i.e., 1024 narrow neural processing units), the data random access memory 122, which receives its address input 123, and the weight random access memory 124, which receives its address input 125. Although not shown, at clock cycle 0 the 1024 narrow neural processing units perform the initialization instruction of FIG. 20. As shown, at clock cycle 1 the 1024 8-bit data words of row 17 are read out of the data random access memory 122 and provided to the 1024 narrow neural processing units. At clock cycles 1 through 1024, the 1024 8-bit weight words of rows 0 through 1023, respectively, are read out of the weight random access memory 124 and provided to the 1024 narrow neural processing units. Although not shown, at clock cycle 1 the 1024 narrow neural processing units perform their respective multiply-accumulate operations on the loaded data words and weight words. At clock cycles 2 through 1024, the multiplex registers 208A/208B of the 1024 narrow neural processing units operate as a rotator of 1024 8-bit words, rotating the previously loaded data words of row 17 of the data random access memory 122 to the adjacent narrow neural processing units, and the narrow neural processing units perform the multiply-accumulate operation on the respective rotated data words and the respective narrow weight words loaded from the weight random access memory 124. Although not shown, at clock cycle 1025 the 1024 narrow activation function units 212A/212B perform the activation function instruction. At clock cycle 1026, the 1024 narrow neural processing units write their respective 1024 8-bit results 133A/133B back to row 16 of the data random access memory 122.
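The overall flow of the program (initialize, load, rotate and accumulate, activate, write back) may be modeled as shown below. This is a behavioral sketch only: it ignores 8-bit word widths and saturation, uses an identity function as a stand-in for the activation function, and works for any row length n so that it can be tried on small inputs. The name `run_layer` is hypothetical.

```python
def run_layer(data_row, weight_rows, activation=lambda acc: acc):
    """Model the narrow-configuration program for n narrow processing units.

    data_row:    n data words (the row read at clock cycle 1, e.g. row 17)
    weight_rows: n rows of n weight words (rows 0..n-1 of the weight memory)
    activation:  stand-in for the activation function; identity by default
    """
    n = len(data_row)
    acc = [0] * n                          # clock cycle 0: clear the accumulators
    regs = list(data_row)                  # clock cycle 1: load the data words
    for step in range(n):                  # 1 multiply-accumulate + (n - 1) rotate iterations
        if step > 0:
            regs = regs[-1:] + regs[:-1]   # rotate by one narrow word
        weights = weight_rows[step]        # one weight row is read per clock cycle
        for j in range(n):
            acc[j] += regs[j] * weights[j]
    return [activation(a) for a in acc]    # activation, then write-back (e.g. to row 16)


data = [1, 2, 3, 4]
ones = [[1] * 4 for _ in range(4)]
print(run_layer(data, ones))  # [10, 10, 10, 10]
```

One consequence of this model is that the weight applied by neuron j to data word i is the one stored in weight row (j - i) mod n at word j, so the weight rows must be laid out accordingly.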

As may be observed, compared with the embodiment of FIG. 2, the embodiment of FIG. 18 gives the programmer the flexibility to perform computations using either wide data and weight words (e.g., 16 bits) or narrow data and weight words (e.g., 8 bits), according to the precision required by the particular application. From one perspective, for narrow data applications the embodiment of FIG. 18 provides twice the throughput of the embodiment of FIG. 2, at the cost of additional narrow elements (e.g., multiplex register 208B, register 205B, narrow arithmetic logic unit 204B, narrow accumulator 202B, narrow activation function unit 212B), which increase the area of the neural processing unit 126 by approximately 50%.

Tri-mode neural processing unit

FIG. 23 is a block diagram illustrating an alternate embodiment of a dynamically configurable neural processing unit 126 of FIG. 1. The neural processing unit 126 of FIG. 23 is configurable not only in the wide and narrow configurations but also in a third configuration, referred to herein as the "funnel" configuration. The neural processing unit 126 of FIG. 23 is similar to the neural processing unit 126 of FIG. 18. However, the wide adder 244A of FIG. 18 is replaced in the neural processing unit 126 of FIG. 23 by a three-input wide adder 2344A, which receives a third addend 2399 that is an extended version of the output of the narrow multiplexer 1896B. A program executed by a neural network unit having the neural processing units of FIG. 23 is similar to the program of FIG. 20. However, the initialize neural processing unit instruction at address 0 initializes the neural processing units 126 to the funnel configuration rather than the narrow configuration. In addition, the count of the multiply-accumulate rotate instruction at address 2 is 511 rather than 1023.

In the funnel configuration, the neural processing unit 126 operates similarly to the narrow configuration when it executes a multiply-accumulate instruction such as the one at address 1 of FIG. 20: it receives two narrow data words 207A/207B and two narrow weight words 206A/206B; the wide multiplier 242A multiplies data word 209A by weight word 203A to generate product 246A, which the wide multiplexer 1896A selects; and the narrow multiplier 242B multiplies data word 209B by weight word 203B to generate product 246B, which the narrow multiplexer 1896B selects. However, the wide adder 2344A adds both the product 246A (selected by the wide multiplexer 1896A) and the product 246B/2399 (selected by the narrow multiplexer 1896B) to the wide accumulator 202A output 217A, and the narrow adder 244B and narrow accumulator 202B are inactive.

Furthermore, when the neural processing unit 126 in the funnel configuration executes a multiply-accumulate rotate instruction such as the one at address 2 of FIG. 20, the control input 213 causes the multiplex registers 208A/208B to rotate by two narrow words (e.g., 16 bits); that is, the multiplex registers 208A/208B select their respective inputs 211A/211B, as in the wide configuration. The multipliers and the adder then operate as for the multiply-accumulate instruction: the wide multiplier 242A multiplies data word 209A by weight word 203A to generate product 246A, which the wide multiplexer 1896A selects; the narrow multiplier 242B multiplies data word 209B by weight word 203B to generate product 246B, which the narrow multiplexer 1896B selects; and the wide adder 2344A adds both the product 246A and the product 246B/2399 to the wide accumulator 202A output 217A, while the narrow adder 244B and narrow accumulator 202B remain inactive.

Finally, when the neural processing unit 126 in the funnel configuration executes an activation function instruction such as the one at address 3 of FIG. 20, the wide activation function unit 212A performs the activation function on the resulting sum 215A to generate a narrow result 133A, and the narrow activation function unit 212B is inactive. Hence, only the narrow neural processing units denoted A generate a narrow result 133A; the narrow results 133B generated by the narrow neural processing units denoted B are invalid. Consequently, the row of results written back (e.g., row 16, as specified by the instruction at address 4 of FIG. 20) contains holes, because only the narrow results 133A are valid and the narrow results 133B are invalid. Conceptually, then, in each clock cycle each neuron (neural processing unit of FIG. 23) processes two connection data inputs, i.e., it multiplies two narrow data words by their respective weights and adds the two products, whereas the embodiments of FIG. 2 and FIG. 18 each process only one connection data input per clock cycle.
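A behavioral sketch of the funnel configuration is given below for illustration. It makes the same simplifications as any software model (no word widths, no activation function) and marks the invalid B results with `None`; the name `run_funnel_layer` is hypothetical. For a row of n = 1024 narrow data words it performs one multiply-accumulate step followed by 511 rotate iterations, matching the count of 511 mentioned above.

```python
def run_funnel_layer(data_row, weight_rows):
    """Model the funnel configuration for n narrow data words and n // 2 neurons.

    data_row:    n narrow data words (n even)
    weight_rows: n // 2 rows of n narrow weight words
    Returns a row of n result slots; the odd (B) slots are None to mark the holes.
    """
    n = len(data_row)
    pairs = n // 2
    acc = [0] * pairs                      # only the wide accumulators are active
    regs = list(data_row)
    for step in range(pairs):              # 1 multiply-accumulate + (pairs - 1) rotate iterations
        if step > 0:
            regs = regs[-2:] + regs[:-2]   # rotate by two narrow words
        weights = weight_rows[step]
        for p in range(pairs):
            a, b = 2 * p, 2 * p + 1
            # Three-input adder: accumulator + product A + product B.
            acc[p] += regs[a] * weights[a] + regs[b] * weights[b]
    row = [None] * n
    for p in range(pairs):
        row[2 * p] = acc[p]                # valid A result; slot 2p + 1 stays a hole
    return row


data = [1, 2, 3, 4, 5, 6]
ones = [[1] * 6 for _ in range(3)]
print(run_funnel_layer(data, ones))  # [21, None, 21, None, 21, None]
```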

As may be observed with respect to the embodiment of FIG. 23, the number of result words (neuron outputs) generated and written back to the data random access memory 122 or the weight random access memory 124 is half the square root of the number of data inputs (connections) received, and the written-back row of results has holes; that is, every other narrow word result is invalid. More precisely, the results of the narrow neural processing units denoted B are not meaningful. The embodiment of FIG. 23 is therefore particularly efficient for neural networks having two successive layers in which, for example, the first layer has twice as many neurons as the second layer (e.g., a first layer of 1024 neurons fully connected to a second layer of 512 neurons). In addition, the other execution units 112 (e.g., media units, such as x86 Advanced Vector Extensions units) may, if necessary, perform a pack operation on a dispersed row of results (i.e., one having holes) to make it compact (i.e., without holes). The packed row can then be used in subsequent computations while the neural network unit 121 performs other computations associated with other rows of the data random access memory 122 and/or the weight random access memory 124.
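The effect of such a pack operation can be illustrated with a short sketch. It does not model any particular x86 instruction; the function name `pack` and the use of `None` to mark invalid B results are assumptions made for illustration.

```python
def pack(row):
    """Keep the valid A results (even positions) and drop the B holes (odd positions)."""
    return row[0::2]


dispersed = [10, None, 20, None, 30, None]  # assumed example: A results with B holes
compact = pack(dispersed)

print(compact)  # [10, 20, 30]
```

The packed row is half the length of the dispersed row, so in this model two packed rows of results would fit in the space of one dispersed row.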

Hybrid neural network unit operation: convolution and pooling capabilities

An advantage of the neural network unit 121 according to the embodiments described herein is that it can operate concurrently in two ways: like a coprocessor, in that it executes its own internal program, and like an execution unit of a processor, in that it executes architectural instructions (or microinstructions translated from them) issued to it. The architectural instructions belong to an architectural program executed by the processor that includes the neural network unit 121. In this way the neural network unit 121 operates in a hybrid fashion, which makes it possible to sustain high utilization of the neural network unit 121. For example, Figures 24 through 26 illustrate the neural network unit 121 performing a convolution operation in which the neural network unit is highly utilized, and Figures 27 through 28 illustrate the neural network unit 121 performing a pooling operation. These operations are required by convolution layers, pooling layers and other digital data computing applications, such as image processing (e.g., edge detection, sharpening, blurring, recognition/classification). However, the hybrid operation of the neural network unit 121 is not limited to convolution or pooling; the hybrid feature may also be used for other operations, such as the classic neural network multiply-accumulate and activation function operations described with respect to Figures 4 through 13. That is, the processor 100 (more precisely, the reservation stations 108) issues MTNN instructions 1400 and MFNN instructions 1500 to the neural network unit 121, in response to which the neural network unit 121 writes data to the memories 122/124/129 and reads results from the memories 122/124 that were written there by the neural network unit 121. Concurrently, the neural network unit 121 reads and writes the memories 122/124/129 in order to execute the program that the processor 100 wrote to the program memory 129 (via MTNN instructions 1400).

Figure 24 is a block diagram showing an example of the data structures used by the neural network unit 121 of Figure 1 to perform a convolution operation. The block diagram includes a convolution kernel 2402, a data array 2404, and the data random access memory 122 and weight random access memory 124 of Figure 1. Preferably, the data array 2404 (e.g., of image pixels) is held in system memory (not shown) attached to the processor 100 and is loaded into the weight random access memory 124 of the neural network unit 121 by the processor 100 executing MTNN instructions 1400. A convolution operation convolves a first array with a second array; the second array is the convolution kernel referred to herein. As used herein, a convolution kernel is a matrix of coefficients, which may also be referred to as weights, parameters, elements or values. Preferably, the convolution kernel 2402 is static data of the architectural program executed by the processor 100.

The data array 2404 is a two-dimensional array of data values, and each data value (e.g., an image pixel value) is the size of a word of the data random access memory 122 or the weight random access memory 124 (e.g., 16 bits or 8 bits). In this example, the data values are 16-bit words and the neural network unit 121 is configured as 512 wide-configuration neural processing units 126. In addition, in this embodiment the neural processing units 126 include multiplex registers, such as the multiplex register 705 of Figure 7, that receive the weight words 206 from the weight random access memory 124 so that the collective rotator operation can be performed on a row of data values received from the weight random access memory 124, as described in more detail below. In this example, the data array 2404 is a pixel array of 2560 columns by 1600 rows. As shown, when the architectural program convolves the data array 2404 with the convolution kernel 2402, it divides the data array 2404 into 20 chunks, each of which is a 512x400 data matrix 2406.
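As a quick check of the chunk count, the following sketch enumerates the top-left corner of each 512-column by 400-row chunk of the 2560-column by 1600-row array. The row-by-row scanning order and the name `chunk_origins` are assumptions made for illustration; the text does not state the order in which the chunks are visited.

```python
ARRAY_COLS, ARRAY_ROWS = 2560, 1600   # data array 2404
CHUNK_COLS, CHUNK_ROWS = 512, 400     # one data matrix 2406


def chunk_origins():
    """Return the (top_row, left_col) corner of every chunk, scanning row by row."""
    return [(r, c)
            for r in range(0, ARRAY_ROWS, CHUNK_ROWS)
            for c in range(0, ARRAY_COLS, CHUNK_COLS)]


origins = chunk_origins()
print(len(origins))  # 20
```

Five chunks fit across the width and four down the height, which gives the 20 chunks used in the example.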

In this example, the convolution kernel 2402 is a 3x3 matrix of coefficients (also called weights, parameters or elements). The first row of coefficients is denoted C0,0; C0,1; and C0,2; the second row is denoted C1,0; C1,1; and C1,2; and the third row is denoted C2,0; C2,1; and C2,2. For example, a convolution kernel with the following coefficients may be used for edge detection: 0, 1, 0, 1, -4, 1, 0, 1, 0. As another example, a convolution kernel with the following coefficients may be used for a Gaussian blur: 1, 2, 1, 2, 4, 2, 1, 2, 1. In that case a division is typically performed on the final accumulated value, where the divisor is the sum of the absolute values of the elements of the convolution kernel 2402, which is 16 in this example. In another example, the divisor is the number of elements of the convolution kernel 2402. In yet another example, the divisor is a value that compresses the convolution results into a target range of values; this divisor is determined from the element values of the convolution kernel 2402, the target range, and the range of the input values of the array being convolved.
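The two example kernels and the first two divisor choices can be written out as a small sketch. The function names are hypothetical; only the coefficient values and the divisor of 16 for the Gaussian blur kernel come from the text.

```python
EDGE_DETECT = [[0, 1, 0],
               [1, -4, 1],
               [0, 1, 0]]

GAUSSIAN_BLUR = [[1, 2, 1],
                 [2, 4, 2],
                 [1, 2, 1]]


def abs_sum_divisor(kernel):
    """Divisor taken as the sum of the absolute values of the kernel elements."""
    return sum(abs(c) for row in kernel for c in row)


def element_count_divisor(kernel):
    """Divisor taken as the number of kernel elements."""
    return sum(len(row) for row in kernel)


print(abs_sum_divisor(GAUSSIAN_BLUR))        # 16
print(element_count_divisor(GAUSSIAN_BLUR))  # 9
```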

As shown in Figure 24 and described in more detail with respect to Figure 25, the architectural program writes the coefficients of the convolution kernel 2402 to the data random access memory 122. Preferably, all the words of each of nine consecutive rows of the data random access memory 122 (nine being the number of elements in the convolution kernel 2402) are written with a different element of the convolution kernel 2402, in row-major order. That is, as shown, every word of one row is written with the first coefficient C0,0; the next row is written with the second coefficient C0,1; the next row with the third coefficient C0,2; the next row with the fourth coefficient C1,0; and so on, until every word of the ninth row is written with the ninth coefficient C2,2. To convolve a data matrix 2406 of a chunk of the data array 2404, the neural processing units 126 repeatedly read, in order, the nine rows of the data random access memory 122 that hold the coefficients of the convolution kernel 2402, as described in more detail below, particularly with respect to Figure 26A.
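The row-major broadcast layout described above can be sketched as follows. The helper name `kernel_to_data_ram_rows` and the list-of-lists model of the memory are illustrative assumptions; the row width of 512 words matches the 512 wide-configuration units of the example.

```python
def kernel_to_data_ram_rows(kernel, n_words=512):
    """Build nine rows; every word of row k holds the k-th coefficient in row-major order."""
    flat = [c for row in kernel for c in row]
    return [[c] * n_words for c in flat]


rows = kernel_to_data_ram_rows([[1, 2, 1], [2, 4, 2], [1, 2, 1]])
print(len(rows), len(rows[0]))  # 9 512
```

Because each row holds a single coefficient repeated across all words, every processing unit reads the same coefficient in a given step while each holds a different data word.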

As shown in Figure 24 and described in more detail with respect to Figure 25, the architectural program writes the values of a data matrix 2406 to the weight random access memory 124. As the neural network unit program performs the convolution, it writes the result matrix back to the weight random access memory 124. Preferably, the architectural program writes a first data matrix 2406 to the weight random access memory 124 and starts the neural network unit 121; while the neural network unit 121 is convolving the first data matrix 2406 with the convolution kernel 2402, the architectural program writes a second data matrix 2406 to the weight random access memory 124, so that as soon as the neural network unit 121 completes the convolution of the first data matrix 2406 it can start convolving the second data matrix 2406, as described in more detail with respect to Figure 25. In this manner, the architectural program alternates back and forth between two regions of the weight random access memory 124 in order to keep the neural network unit 121 fully utilized. Thus, the example of Figure 24 shows a first data matrix 2406A and a second data matrix 2406B: the first data matrix 2406A corresponds to a first chunk occupying rows 0 through 399 of the weight random access memory 124, and the second data matrix 2406B corresponds to a second chunk occupying rows 500 through 899 of the weight random access memory 124. Furthermore, as shown, the neural network unit 121 writes the results of the convolutions back to rows 900-1299 and rows 1300-1699 of the weight random access memory 124, from which the architectural program subsequently reads them. The data values of the data matrix 2406 held in the weight random access memory 124 are denoted "Dx,y", where "x" is the row number of the weight random access memory 124 and "y" is the word, or column, number of the weight random access memory 124. For example, data word 511 in row 399 is denoted D399,511 in Figure 24, and it is received by the multiplex register 705 of neural processing unit 511.

Figure 25 is a flowchart showing the processor 100 of Figure 1 executing an architectural program that uses the neural network unit 121 to convolve the data array 2404 of Figure 24 with the convolution kernel 2402. Flow begins at step 2502.

At step 2502, the processor 100, i.e., the architectural program running on the processor 100, writes the convolution kernel 2402 of Figure 24 to the data random access memory 122 in the manner shown and described with respect to Figure 24. In addition, the architectural program initializes a variable N to 1. The variable N denotes the chunk of the data array 2404 that the neural network unit 121 is currently processing. The architectural program also initializes a variable NUM_CHUNKS to 20. Flow proceeds to step 2504.

At step 2504, as shown in Figure 24, the processor 100 writes the data matrix 2406 of chunk 1 to the weight random access memory 124 (e.g., data matrix 2406A of chunk 1). Flow proceeds to step 2506.

At step 2506, the processor 100 writes a convolution program to the program memory 129 of the neural network unit 121, using MTNN instructions 1400 that specify a function 1432 to write the program memory 129. The processor 100 then starts the neural network unit convolution program using an MTNN instruction 1400 that specifies a function 1432 to start execution of the program. An example of the neural network unit convolution program is described in more detail with respect to Figure 26A. Flow proceeds to decision step 2508.

At decision step 2508, the architectural program determines whether the value of the variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2512; otherwise, flow proceeds to step 2514.

At step 2512, as shown in Figure 24, the processor 100 writes the data matrix 2406 of chunk N+1 to the weight random access memory 124 (e.g., data matrix 2406B of chunk 2). Thus, while the neural network unit 121 is performing the convolution on the current chunk, the architectural program can write the data matrix 2406 of the next chunk to the weight random access memory 124, so that the neural network unit 121 can immediately start convolving the next chunk once the convolution of the current chunk is complete, i.e., once its results have been written to the weight random access memory 124.

At step 2514, the processor 100 determines whether the currently running neural network unit program (started at step 2506 in the case of chunk 1, and at step 2518 in the case of chunks 2 through 20) has completed. Preferably, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the neural network unit 121. In an alternate embodiment, the neural network unit 121 generates an interrupt to indicate that it has completed the convolution program. Flow proceeds to decision step 2516.

At decision step 2516, the architectural program determines whether the value of the variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2518; otherwise, flow proceeds to step 2522.

At step 2518, the processor 100 updates the convolution program so that it can convolve chunk N+1. More precisely, the processor 100 updates the weight random access memory 124 row value of the initialize neural processing unit instruction at address 0 to the first row of the data matrix 2406 (e.g., to row 0 for data matrix 2406A or to row 500 for data matrix 2406B) and updates the output row (e.g., to row 900 or 1300). The processor 100 then starts the updated neural network unit convolution program. Flow proceeds to step 2522.

At step 2522, the processor 100 reads the results of the neural network unit convolution program for chunk N from the weight random access memory 124. Flow proceeds to decision step 2524.

At decision step 2524, the architectural program determines whether the value of the variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2526; otherwise, flow ends.

At step 2526, the architectural program increments the value of N by one. Flow returns to decision step 2508.
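The flow of steps 2502 through 2526 can be summarized as a host-side loop. This is only a sketch under stated assumptions: `FakeNNU` is a hypothetical stand-in that logs requests instead of issuing MTNN/MFNN instructions, and the sketch assumes that flow continues from step 2512 to step 2514. The row bases 0/500 and 900/1300 are the two input and two output regions of the weight random access memory given in the example of Figure 24.

```python
INPUT_BASE = (0, 500)       # first row of data matrix 2406A / 2406B
OUTPUT_BASE = (900, 1300)   # first row of each result region


class FakeNNU:
    """Hypothetical stand-in that only records what the host asks for."""

    def __init__(self):
        self.log = []

    def write_kernel(self):
        self.log.append(("kernel",))

    def write_chunk(self, chunk, row):
        self.log.append(("write", chunk, row))

    def start(self, in_row, out_row):
        self.log.append(("start", in_row, out_row))

    def wait_done(self):
        self.log.append(("wait",))

    def read_results(self, out_row):
        self.log.append(("read", out_row))
        return out_row


def run_convolution(nnu, chunks):
    num_chunks = len(chunks)                                   # NUM_CHUNKS
    nnu.write_kernel()                                         # step 2502
    n = 1
    nnu.write_chunk(chunks[0], INPUT_BASE[0])                  # step 2504
    nnu.start(INPUT_BASE[0], OUTPUT_BASE[0])                   # step 2506
    results = []
    while True:
        if n < num_chunks:                                     # decision 2508
            nnu.write_chunk(chunks[n], INPUT_BASE[n % 2])      # step 2512: chunk N+1
        nnu.wait_done()                                        # step 2514
        if n < num_chunks:                                     # decision 2516
            nnu.start(INPUT_BASE[n % 2], OUTPUT_BASE[n % 2])   # step 2518
        results.append(nnu.read_results(OUTPUT_BASE[(n - 1) % 2]))  # step 2522: chunk N
        if n >= num_chunks:                                    # decision 2524
            return results
        n += 1                                                 # step 2526


nnu = FakeNNU()
out = run_convolution(nnu, list(range(1, 21)))
print(out[:4])  # [900, 1300, 900, 1300]
```

The point of the ordering is that the write of chunk N+1 and the read of chunk N's results both overlap with the unit's work on a neighboring chunk, which is what keeps the unit busy.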

Figure 26A is a program listing of a neural network unit program that convolves a data matrix 2406 with the convolution kernel 2402 of Figure 24 and writes the result back to the weight random access memory 124. The program loops a number of times through a loop body consisting of the instructions at addresses 1 through 9. The initialize neural processing unit instruction at address 0 specifies the number of times each neural processing unit 126 executes the loop body; in the example of Figure 26A the loop count value is 400, corresponding to the number of rows in a data matrix 2406 of Figure 24. The loop instruction at the end of the loop (at address 10) decrements the current loop count value and, if the result is non-zero, returns control to the top of the loop body (i.e., to the instruction at address 1). The initialize neural processing unit instruction also clears the accumulator 202 to zero. Preferably, the loop instruction at address 10 also clears the accumulator 202 to zero. Alternatively, as described above, the multiply-accumulate instruction at address 1 may specify that the accumulator 202 be cleared to zero.

For each execution of the loop body of the program, the 512 neural processing units 126 concurrently perform 512 convolutions of the 3x3 convolution kernel 2402 with 512 respective 3x3 sub-matrices of the data matrix 2406. Each convolution is the sum of the nine products of an element of the convolution kernel 2402 and the corresponding element of the respective sub-matrix. In the embodiment of Figure 26A, the origin (center element) of each of the 512 respective 3x3 sub-matrices is the data word Dx+1,y+1 of Figure 24, where y (the column number) is the number of the neural processing unit 126 and x (the row number) is the current weight random access memory 124 row number that is read by the multiply-accumulate instruction at address 1 of the program of Figure 26A. (This row number is initialized by the initialize neural processing unit instruction at address 0, incremented by each of the multiply-accumulate instructions at addresses 3 and 5, and updated by the decrement instruction at address 9.) Thus, for each pass through the loop, the 512 neural processing units 126 compute 512 convolutions and write the 512 convolution results back to a specified row of the weight random access memory 124. Edge handling is omitted herein to simplify the description. It should be noted, however, that using the collective rotating feature of the neural processing units 126 causes two of the columns of the data matrix 2406 (the image data matrix, in the case of image processing) to wrap from one vertical edge to the other (e.g., from the left edge to the right edge or vice versa). The loop body will now be described.

At address 1 is a multiply-accumulate instruction that specifies row 0 of the data random access memory 122 and implicitly uses the current weight random access memory 124 row, which is preferably held in the sequencer 128 (and which the instruction at address 0 initializes to zero for the first pass through the loop body). That is, the instruction at address 1 causes each neural processing unit 126 to read its corresponding word from row 0 of the data random access memory 122, read its corresponding word from the current weight random access memory 124 row, and perform a multiply-accumulate operation on the two words. Thus, for example, neural processing unit 5 multiplies C0,0 by Dx,5 (where "x" is the current weight random access memory 124 row), adds the result to the accumulator 202 value 217, and writes the sum back to the accumulator 202.

At address 2 is a multiply-accumulate instruction that specifies that the data random access memory 122 row be incremented (i.e., to row 1) and that the row then be read from the data random access memory 122 at the incremented address. The instruction also specifies that the value in the multiplex register 705 of each neural processing unit 126 be rotated to the adjacent neural processing unit 126; in this case the values are the row of data matrix 2406 values just read from the weight random access memory 124 in response to the instruction at address 1. In the embodiment of Figures 24 through 26, the neural processing units 126 are configured to rotate the multiplex register 705 values to the left, i.e., from neural processing unit J to neural processing unit J-1, rather than from neural processing unit J to neural processing unit J+1 as described above with respect to Figures 3, 7 and 19. It should be noted that in an embodiment in which the neural processing units 126 rotate to the right, the architectural program may write the convolution kernel 2402 coefficient values to the data random access memory 122 in a different order (e.g., rotated around its central column) to achieve a similar convolution result. Furthermore, the architectural program may perform additional pre-processing of the convolution kernel (e.g., transposition) as needed. In addition, the instruction specifies a count value of 2. Thus, the instruction at address 2 causes each neural processing unit 126 to read its corresponding word from row 1 of the data random access memory 122, receive the rotated word into its multiplex register 705, and perform a multiply-accumulate operation on the two words. Because the count value is 2, the instruction also causes each neural processing unit 126 to repeat the operation just described. That is, the sequencer 128 increments the data random access memory 122 row address 123 (i.e., to row 2), and each neural processing unit 126 reads its corresponding word from row 2 of the data random access memory 122, receives the rotated word into its multiplex register 705, and performs a multiply-accumulate operation on the two words. Thus, for example, assuming the current weight random access memory 124 row is 27, after executing the instruction at address 2, neural processing unit 5 will have accumulated into its accumulator 202 the product of C0,1 and D27,6 and the product of C0,2 and D27,7. Therefore, after the instructions at addresses 1 and 2 complete, the product of C0,0 and D27,5, the product of C0,1 and D27,6, and the product of C0,2 and D27,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body.

The instructions at addresses 3 and 4 perform operations similar to those of the instructions at addresses 1 and 2. By virtue of the weight random access memory 124 row increment indicator, however, they operate on the next row of the weight random access memory 124, and they operate on the next three rows of the data random access memory 122, i.e., rows 3 through 5. That is, taking neural processing unit 5 as an example, after the instructions at addresses 1 through 4 complete, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, and the product of C1,2 and D28,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body.

The instructions at addresses 5 and 6 perform operations similar to those of the instructions at addresses 3 and 4, but on the next row of the weight random access memory 124 and on the next three rows of the data random access memory 122, i.e., rows 6 through 8. That is, taking neural processing unit 5 as an example, after the instructions at addresses 1 through 6 complete, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, the product of C1,2 and D28,7, the product of C2,0 and D29,5, the product of C2,1 and D29,6, and the product of C2,2 and D29,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body. In other words, after the instructions at addresses 1 through 6 complete, and assuming the weight random access memory 124 row was 27 at the start of the loop body, neural processing unit 5 will have used the convolution kernel 2402 to convolve the following 3x3 sub-matrix:

    D27,5   D27,6   D27,7
    D28,5   D28,6   D28,7
    D29,5   D29,6   D29,7

More generally, after the instructions at addresses 1 through 6 complete, each of the 512 neural processing units 126 will have used the convolution kernel 2402 to convolve the following 3x3 sub-matrix:

    Dr,n     Dr,n+1     Dr,n+2
    Dr+1,n   Dr+1,n+1   Dr+1,n+2
    Dr+2,n   Dr+2,n+1   Dr+2,n+2

where r is the weight random access memory 124 row address value at the start of the loop body and n is the number of the neural processing unit 126.
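The effect of the instructions at addresses 1 through 6 can be checked with a small software model. This is a sketch, not the hardware: it models the multiplex registers as a Python list, uses 8 units instead of 512, and the names `rotate_left`, `loop_body_pass` and `direct` are hypothetical. It assumes left rotation (the word of unit J moves to unit J-1) as described for Figures 24 through 26.

```python
def rotate_left(words):
    """Word held by unit J moves to unit J-1; unit 0's word wraps to the last unit."""
    return words[1:] + words[:1]


def loop_body_pass(kernel, data, r):
    """Accumulate one 3x3 result per unit for the pass that starts at data row r."""
    n_units = len(data[0])
    acc = [0] * n_units
    for i in range(3):
        mux = list(data[r + i])            # addresses 1, 3, 5: load a fresh data row
        for j in range(3):
            if j > 0:
                mux = rotate_left(mux)     # addresses 2, 4, 6: rotate (count of 2)
            for n in range(n_units):
                acc[n] += kernel[i][j] * mux[n]
    return acc


def direct(kernel, data, r, n):
    """Reference: sum of C[i][j] * D[r+i][n+j], with column wrap-around."""
    n_units = len(data[0])
    return sum(kernel[i][j] * data[r + i][(n + j) % n_units]
               for i in range(3) for j in range(3))


kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
data = [[10 * row + col for col in range(8)] for row in range(4)]
acc = loop_body_pass(kernel, data, 1)
print(acc == [direct(kernel, data, 1, n) for n in range(8)])  # True
```

The modulo in `direct` is the wrap-around mentioned earlier: the last two units pick up words from the opposite vertical edge, which is why edge handling needs separate treatment.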

The instruction at address 7 passes the accumulator 202 value 217 through the activation function unit 212. The pass-through function passes a word whose size (in bits) equals that of the words read from the data random access memory 122 and the weight random access memory 124 (16 bits in this example). Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below. Alternatively, rather than a pass-through activation function, the instruction may specify a divide activation function that divides the accumulator 202 value 217 by a divisor, as described herein with respect to Figures 29A and 30, e.g., using one of the "dividers" 3014/3016 of Figure 30. For example, for a convolution kernel 2402 that carries a scaling coefficient, such as the one-sixteenth coefficient of the Gaussian blur kernel described above, the instruction at address 7 may specify a divide activation function (e.g., divide by 16) rather than a pass-through function. Alternatively, the architectural program may perform the division by 16 on the convolution kernel 2402 coefficients before writing them to the data random access memory 122 and adjust the position of the binary point of the convolution kernel 2402 values accordingly, e.g., using the data binary point 2922 of Figure 29, described below.
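The equivalence between dividing by 16 and moving the binary point can be illustrated with exact arithmetic. This is a sketch of the arithmetic only; `to_real`, the sample accumulator value and the use of Python's `Fraction` are illustrative assumptions and say nothing about how the dividers or the binary point fields are implemented.

```python
from fractions import Fraction


def to_real(raw, fractional_bits):
    """Interpret an integer bit pattern as a fixed-point value."""
    return Fraction(raw, 1 << fractional_bits)


acc_raw = 1000                       # hypothetical accumulated sum of products
divided = Fraction(acc_raw, 16)      # divide activation function with divisor 16
shifted = to_real(acc_raw, 4)        # same bits, binary point moved 4 places
print(divided == shifted)            # True

# Pre-scaling the kernel instead: the same integer coefficients read with
# 4 fractional bits are 1/16, 2/16, 4/16, and they sum to 1.
gaussian = [1, 2, 1, 2, 4, 2, 1, 2, 1]
scaled_sum = sum(to_real(c, 4) for c in gaussian)
print(scaled_sum)                    # 1
```

Because 16 is a power of two, no coefficient bits change in the second approach; only the declared position of the binary point does.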

The instruction at address 8 writes the output of the activation function unit 212 to the row of the weight random access memory 124 specified by the current value of the output row register. This current value is initialized by the instruction at address 0 and is incremented on each pass through the loop by virtue of the increment indicator in the instruction.

As may be determined from the example of Figures 24 through 26 with a 3x3 convolution kernel 2402, the neural processing units 126 read the weight random access memory 124 approximately every three clock cycles to read a row of the data matrix 2406, and write the convolution result matrix to the weight random access memory 124 approximately every twelve clock cycles. Additionally, assume an embodiment that includes a write and read buffer such as the buffer 1704 of Figure 17. Concurrently with the reads and writes by the neural processing units 126, the processor 100 reads and writes the weight random access memory 124 such that the buffer 1704 performs one write and one read of the weight random access memory 124 approximately every sixteen clock cycles, to write the data matrices and read the convolution result matrices, respectively. Thus, approximately half the bandwidth of the weight random access memory 124 is consumed by the convolution operation that the neural network unit 121 performs in this hybrid manner. Although this example uses a 3x3 convolution kernel 2402, the invention is not limited to it; convolution kernels of other sizes, such as 2x2, 4x4, 5x5, 6x6, 7x7, or 8x8, may be employed with different neural network unit programs. With a larger convolution kernel, the rotating versions of the multiply-accumulate instruction (such as the instructions at addresses 2, 4, and 6 of Figure 26A, which a larger convolution kernel would also use) have a larger count value, so the neural processing units 126 spend a smaller fraction of the time reading the weight random access memory 124, and the fraction of the weight random access memory 124 bandwidth consumed drops accordingly.

Alternatively, rather than writing the convolution results back to different rows of the weight random access memory 124 (e.g., rows 900-1299 and 1300-1699), the architectural program may configure the neural network unit program to overwrite rows of the input data matrix 2406 that are no longer needed. For example, for a 3x3 convolution kernel, the architectural program writes the data matrix 2406 to rows 2-401 of the weight random access memory 124 rather than rows 0-399, and the neural network unit program writes the convolution results beginning at row 0 of the weight random access memory 124, incrementing the row number on each pass through the instruction loop. In this way, the neural network unit program overwrites only rows that are no longer needed. For example, after the first pass through the instruction loop (or, more precisely, after execution of the instruction at address 1, which loads row 0 of the weight random access memory 124), the data in row 0 may be overwritten, although the data in rows 1-3 is needed for the second pass through the instruction loop and must not be overwritten; likewise, after the second pass, the data in row 1 may be overwritten, although the data in rows 2-4 is needed for the third pass and must not be overwritten; and so forth. In this embodiment, the height of each data matrix 2406 (data block) may be made larger (e.g., 800 rows), so that fewer data blocks are needed.

Alternatively, rather than writing the convolution results back to the weight random access memory 124, the architectural program may configure the neural network unit program to write them to rows of the data random access memory 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program reads the results from the data random access memory 122 as the neural network unit 121 writes them (e.g., using the address of the most recently written data random access memory 122 row 2606 of Figure 26B). This arrangement suits an embodiment in which the weight random access memory 124 is single-ported and the data random access memory 122 is dual-ported.

As may be observed from the operation of the neural network unit 121 according to the embodiment of Figures 24 through 26A, each execution of the program of Figure 26A takes approximately 5000 clock cycles, so the convolution of the entire 2560x1600 data array 2404 of Figure 24 takes approximately 100,000 clock cycles, considerably fewer than the number of clock cycles required to perform the same task by conventional methods.

Figure 26B is a block diagram illustrating certain fields of the control register 127 of the neural network unit 121 of Figure 1 according to one embodiment. The status register 127 includes a field 2602 that indicates the address of the row of the weight random access memory 124 most recently written by the neural processing units 126; a field 2606 that indicates the address of the row of the data random access memory 122 most recently written by the neural processing units 126; a field 2604 that indicates the address of the row of the weight random access memory 124 most recently read by the neural processing units 126; and a field 2608 that indicates the address of the row of the data random access memory 122 most recently read by the neural processing units 126. This enables the architectural program executing on the processor 100 to determine the progress of the neural network unit 121 as it reads and/or writes the data random access memory 122 and/or the weight random access memory 124. Using this capability, together with the choice to overwrite the input data matrix as described above (or to write the results to the data random access memory 122 as described above), the data array 2404 of Figure 24 may be processed as 5 data blocks of 512x1600 rather than 20 data blocks of 512x400, as in the following example. The processor 100 writes the first 512x1600 data block into the weight random access memory 124 starting at row 2 and starts the neural network unit program (which has a loop count of 1600 and an initialized weight random access memory 124 output row of 0). As the neural network unit 121 executes the neural network unit program, the processor 100 monitors the output location/address of the weight random access memory 124 in order to (1) read (using MFNN instructions 1500) the rows of the weight random access memory 124 that hold valid convolution results written by the neural network unit 121 (beginning at row 0), and (2) write the second 512x1600 data matrix 2406 (beginning at row 2) over the valid convolution results once they have been read. Thus, when the neural network unit 121 completes the neural network unit program on the first 512x1600 data block, the processor 100 can immediately update the neural network unit program as needed and start it again to process the second 512x1600 data block. This process is repeated three more times for the remaining three 512x1600 data blocks to keep the neural network unit 121 fully utilized.
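The block counts and the cycle estimate quoted above follow from simple tiling arithmetic. The sketch below is a back-of-the-envelope check in Python; `block_count` and `CYCLES_PER_RUN` are hypothetical names, and the 5000-cycle figure is the approximate value the text gives for one run of the Figure 26A program on a 512x400 data block.

```python
import math

def block_count(array_cols, array_rows, npus=512, block_rows=400):
    """Number of data blocks needed to tile the array when each block is
    `npus` words wide and `block_rows` rows tall."""
    return math.ceil(array_cols / npus) * math.ceil(array_rows / block_rows)

CYCLES_PER_RUN = 5000  # approximate cycles per 512x400 run, as quoted in the text

small_blocks = block_count(2560, 1600, block_rows=400)    # 5 * 4 blocks
tall_blocks = block_count(2560, 1600, block_rows=1600)    # 5 * 1 blocks
total_cycles = small_blocks * CYCLES_PER_RUN
```

Making each block 1600 rows tall leaves the amount of computation unchanged but reduces the number of times the program must be restarted from 20 to 5.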

In one embodiment, the activation function unit 212 is capable of efficiently performing an effective division of the accumulator 202 value 217, as described in more detail in later sections, particularly with respect to Figures 29A, 29B, and 30. For example, an activation function neural network unit instruction that divides the accumulator 202 value by 16 may be used for the Gaussian blur matrix described below.

The convolution kernel 2402 used in the example of Figure 24 is a small static convolution kernel applied to the entire data array 2404. However, the invention is not limited to this; the convolution kernel may instead be a large matrix with specific weights associated with the different data values of the data array 2404, as is common in convolutional neural networks. When the neural network unit 121 is used in this manner, the architectural program may swap the locations of the data matrix and the convolution kernel, that is, place the data matrix in the data random access memory 122 and the convolution kernel in the weight random access memory 124, and the number of rows the neural network unit program must process may be relatively smaller.

Figure 27 is a block diagram illustrating an example of the weight random access memory 124 of Figure 1 populated with input data on which the neural network unit 121 of Figure 1 performs a pooling operation. A pooling operation, performed by a pooling layer of an artificial neural network, reduces the dimensions of an input data matrix (e.g., an image or a convolved image) by taking sub-regions, or sub-matrices, of the input matrix and computing the maximum or average value of each sub-matrix; these maximum or average values form a result matrix, the pooled matrix. In the example of Figures 27 and 28, the pooling operation computes the maximum value of each sub-matrix. Pooling operations are particularly useful in artificial neural networks that perform, for example, object classification or detection. Generally, a pooling operation reduces the size of the input matrix by a factor equal to the number of elements in the sub-matrix examined; in particular, it reduces the input matrix in each dimension by the number of elements of the sub-matrix in the corresponding dimension. In the example of Figure 27, the input data is a 512x1600 matrix of wide words (e.g., 16 bits) stored in rows 0 through 1599 of the weight random access memory 124. In Figure 27, the words are denoted by their row and column location: the word in row 0, column 0 is denoted D0,0; the word in row 0, column 1 is denoted D0,1; the word in row 0, column 2 is denoted D0,2; and so forth through the word in row 0, column 511, denoted D0,511. Similarly, the words in row 1 are denoted D1,0, D1,1, D1,2, and so forth through D1,511; and so on through row 1599, whose words are denoted D1599,0, D1599,1, D1599,2, and so forth through D1599,511.
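As a plain software reference for what the pooling operation computes, the following Python sketch performs non-overlapping max pooling on a list-of-rows matrix. The function name `max_pool` and the 8x8 stand-in matrix are assumptions made for illustration; element `[r][c]` plays the role of word Dr,c in Figure 27.

```python
def max_pool(matrix, k=4):
    """Non-overlapping k x k max pooling over a list-of-rows matrix.
    Assumes the row and column counts are multiples of k."""
    rows, cols = len(matrix), len(matrix[0])
    return [[max(matrix[r + i][c + j] for i in range(k) for j in range(k))
             for c in range(0, cols, k)]
            for r in range(0, rows, k)]

# Small 8-row by 8-column stand-in for the 1600-row by 512-column input.
small = [[r * 8 + c for c in range(8)] for r in range(8)]
pooled = max_pool(small)   # 2 rows by 2 columns: each dimension shrinks by 4
```

With 4x4 sub-matrices the full-size input of 1600 rows by 512 columns would shrink to 400 rows by 128 columns, a reduction by a factor of sixteen in element count.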

Figure 28 is a program listing of a neural network unit program that performs a pooling operation on the input data matrix of Figure 27 and writes the result back to the weight random access memory 124. In the example of Figure 28, the pooling operation computes the maximum value of each 4x4 sub-matrix of the input data matrix. The program executes an instruction loop made up of the instructions at addresses 1 through 10 multiple times. The initialize neural processing unit instruction at address 0 specifies the number of times each neural processing unit 126 executes the instruction loop, which in the example of Figure 28 is a loop count of 400, and the loop instruction at the end of the loop (at address 11) decrements the current loop count and, if the result is non-zero, returns control to the top of the instruction loop (i.e., to the instruction at address 1). The neural network unit program effectively treats the input data matrix in the weight random access memory 124 as 400 mutually exclusive groups of four adjacent rows, namely rows 0-3, rows 4-7, rows 8-11, and so forth through rows 1596-1599. Each group of four adjacent rows includes 128 4x4 sub-matrices, each formed by the elements at the intersection of the four rows of the group and four adjacent columns, namely columns 0-3, columns 4-7, columns 8-11, and so forth through columns 508-511. Of the 512 neural processing units 126, every fourth one (128 in total) performs a pooling operation on a respective 4x4 sub-matrix, and the other three of every four are unused. More specifically, neural processing units 0, 4, 8, and so forth through 508 each perform a pooling operation on their respective 4x4 sub-matrix, whose left-most column number corresponds to the neural processing unit number and whose lower row corresponds to the current weight random access memory 124 row value, which is initialized to zero by the initialize instruction at address 0 and is incremented by four on each iteration of the instruction loop, as described in more detail below. The 400 iterations of the instruction loop correspond to the number of groups of 4x4 sub-matrices in the input data matrix of Figure 27 (i.e., the 1600 rows of the input data matrix divided by 4). The initialize neural processing unit instruction also clears the accumulator 202 to zero. In a preferred embodiment, the loop instruction at address 11 also clears the accumulator 202 to zero. Alternatively, the maxwacc instruction at address 1 specifies that the accumulator 202 be cleared to zero.

On each iteration of the instruction loop, the 128 used neural processing units 126 concurrently perform 128 pooling operations on the 128 respective 4x4 sub-matrices of the current four-row group of the input data matrix. More specifically, the pooling operation determines the maximum-valued element of the 16 elements of the 4x4 sub-matrix. In the embodiment of Figure 28, for each neural processing unit y of the 128 used neural processing units 126, the lower left element of the 4x4 sub-matrix is element Dx,y of Figure 27, where x is the current weight random access memory 124 row number at the beginning of the instruction loop, which is the row read by the maxwacc instruction at address 1 of the program of Figure 28 (the row number is initialized by the initialize neural processing unit instruction at address 0 and is incremented upon each execution of the maxwacc instructions at addresses 3, 5, and 7). Thus, on each iteration of the loop, the 128 used neural processing units 126 write the maximum-valued elements of the respective 128 4x4 sub-matrices of the current row group back to a specified row of the weight random access memory 124. The instruction loop is described below.

The maxwacc instruction at address 1 implicitly uses the current weight random access memory 124 row, which is preferably held in the sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the instruction loop). The instruction at address 1 causes each neural processing unit 126 to read its corresponding word from the current row of the weight random access memory 124, compare the word with the accumulator 202 value 217, and store the larger of the two values in the accumulator 202. Thus, for example, neural processing unit 8 determines the maximum of the accumulator 202 value 217 and data word Dx,8 (where "x" is the current weight random access memory 124 row) and writes it back to the accumulator 202.

The instruction at address 2 is a maxwacc instruction that specifies rotating the value in the multiplex register 705 of each neural processing unit 126 to the adjacent neural processing unit 126; here that value is the row of input data matrix values just read from the weight random access memory 124 in response to the instruction at address 1. In the embodiment of Figures 27 and 28, the neural processing units 126 are configured to rotate the multiplex register 705 values to the left, that is, from neural processing unit J to neural processing unit J-1, as described above with respect to Figures 24 through 26. Additionally, the instruction specifies a count value of 3. Thus, the instruction at address 2 causes each neural processing unit 126 to receive the rotated word into its multiplex register 705 and determine the maximum of the rotated word and the accumulator 202 value, and then to repeat this operation two more times. That is, each neural processing unit 126 three times receives the rotated word into its multiplex register 705 and determines the maximum of the rotated word and the accumulator 202 value. Thus, for example, assuming the current weight random access memory 124 row at the beginning of the instruction loop is 36, after execution of the instructions at addresses 1 and 2, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the beginning of the loop and the four weight random access memory 124 words D36,8, D36,9, D36,10, and D36,11.

The maxwacc instructions at addresses 3 and 4 perform operations similar to those of the instructions at addresses 1 and 2; however, by virtue of the weight random access memory 124 row increment indicator, they operate on the next row of the weight random access memory 124. That is, assuming the current weight random access memory 124 row at the beginning of the instruction loop is 36, after completion of the instructions at addresses 1 through 4, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the beginning of the loop and the eight weight random access memory 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10, and D37,11.

The maxwacc instructions at addresses 5 through 8 perform operations similar to those of the instructions at addresses 1 through 4, but on the next two rows of the weight random access memory 124. That is, assuming the current weight random access memory 124 row at the beginning of the instruction loop is 36, after completion of the instructions at addresses 1 through 8, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the beginning of the loop and the sixteen weight random access memory 124 words D36,8 through D39,11. In other words, neural processing unit 8 will have determined the maximum of the following 4x4 sub-matrix:

D36,8   D36,9   D36,10   D36,11
D37,8   D37,9   D37,10   D37,11
D38,8   D38,9   D38,10   D38,11
D39,8   D39,9   D39,10   D39,11

More generally, after completion of the instructions at addresses 1 through 8, each of the 128 used neural processing units 126 will have determined the maximum of the following 4x4 sub-matrix:

Dr,n     Dr,n+1     Dr,n+2     Dr,n+3
Dr+1,n   Dr+1,n+1   Dr+1,n+2   Dr+1,n+3
Dr+2,n   Dr+2,n+1   Dr+2,n+2   Dr+2,n+3
Dr+3,n   Dr+3,n+1   Dr+3,n+2   Dr+3,n+3

where r is the weight random access memory 124 row address value at the beginning of the instruction loop and n is the number of the neural processing unit 126.
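The sequence of read-and-compare steps followed by rotate-and-compare steps can be mimicked in software to check that neural processing unit n ends up holding the maximum of the sub-matrix whose left-most column is n. The Python sketch below is a simplified model under stated assumptions: `pool_one_group` is a hypothetical name, the accumulators start at zero as the text describes (so the example data is non-negative), and a small 8-unit array stands in for the 512 units.

```python
def pool_one_group(weight_ram, first_row, n_npus, k=4):
    """Model one pass of the instruction loop over rows first_row..first_row+k-1.
    Returns the accumulator of every modeled unit, used or not."""
    acc = [0] * n_npus                      # accumulators cleared to zero
    for row in range(first_row, first_row + k):
        mux = list(weight_ram[row])         # maxwacc: each unit reads its own word
        for n in range(n_npus):
            acc[n] = max(acc[n], mux[n])
        for _ in range(k - 1):              # rotating maxwacc, count = k - 1
            mux = mux[1:] + mux[:1]         # rotate left: unit J-1 gets unit J's word
            for n in range(n_npus):
                acc[n] = max(acc[n], mux[n])
    return acc

# 8 rows by 8 words; word [r][c] stands in for Dr,c.
ram = [[r * 8 + c for c in range(8)] for r in range(8)]
acc = pool_one_group(ram, first_row=4, n_npus=8)
```

Only `acc[0]` and `acc[4]` correspond to used units in this model; the other entries are the byproduct values that the text later calls holes.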

The instruction at address 9 passes the accumulator 202 value 217 through the activation function unit 212. The pass-through function outputs a word whose size (in bits) equals that of the words read from the weight random access memory 124 (16 bits in this example). In a preferred embodiment, the user can specify the output format, for example how many of the output bits are fractional bits, as described in more detail in later sections.

The instruction at address 10 writes the accumulator 202 value 217 to the row of the weight random access memory 124 specified by the current value of the output row register, which is initialized by the instruction at address 0 and is incremented on each pass through the loop by virtue of the increment indicator in the instruction. More specifically, the instruction at address 10 writes a wide word (e.g., 16 bits) of the accumulator 202 to the weight random access memory 124. In a preferred embodiment, the instruction writes the 16 bits as specified by the output binary point 2916, as described in more detail below with respect to Figures 29A and 29B.

As may be observed, each row written to the weight random access memory 124 by an iteration of the instruction loop includes holes that contain invalid values. That is, wide words 1 through 3, 5 through 7, 9 through 11, and so forth through wide words 509 through 511 of the result 133 are invalid, or unused. In one embodiment, the activation function unit 212 includes a multiplexer that enables the results to be packed into adjacent words of a row buffer, such as the row buffer 1104 of Figure 11, for writing back to the output weight random access memory 124 row. In a preferred embodiment, the activation function instruction specifies the number of words in each hole, and the number of words in the hole controls the multiplexer that packs the results. In one embodiment, the number of holes may be specified as a value from 2 to 6 in order to pack the output of pooled 3x3, 4x4, 5x5, 6x6, or 7x7 sub-matrices. Alternatively, the architectural program executing on the processor 100 reads the resulting sparse (i.e., with holes) result rows from the weight random access memory 124 and performs the packing function using other execution units 112, such as a media unit using architectural pack instructions, e.g., x86 Streaming SIMD Extensions (SSE) instructions. In a concurrent manner similar to those described above and exploiting the hybrid nature of the neural network unit 121, the architectural program executing on the processor 100 may read the status register 127 to monitor the most recently written row of the weight random access memory 124 (e.g., field 2602 of Figure 26B), read a resulting sparse result row, pack it, and write it back to the same row of the weight random access memory 124, so that it is ready to be used as an input data matrix for a next layer of the neural network, such as a convolution layer or a classic neural network layer (i.e., a multiply-accumulate layer). Furthermore, although the embodiment described here performs pooling operations on 4x4 sub-matrices, the invention is not limited to this; the neural network unit program of Figure 28 may be modified to perform pooling operations on sub-matrices of other sizes, such as 3x3, 5x5, 6x6, or 7x7.
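The packing step amounts to keeping one valid word and skipping the hole words that follow it. A minimal Python sketch, assuming a hypothetical helper `pack_row` and a hole width of 3 words for 4x4 pooling, might look like this; it models only the software-visible effect, not the multiplexer in the activation function unit 212 or any particular SSE instruction.

```python
def pack_row(sparse_row, hole_words=3):
    """Keep every (hole_words + 1)-th word, dropping the hole words between
    valid results. hole_words = 3 matches 4x4 pooling."""
    return sparse_row[::hole_words + 1]

# Valid results sit at word positions 0 and 4; the rest are hole values.
sparse = [59, 60, 61, 62, 63, 63, 63, 63]
packed = pack_row(sparse)
```

For a full 512-word result row this leaves 128 adjacent valid words, which is the form a following layer would consume.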

As may be observed, the number of result rows written to the weight random access memory 124 is one-fourth the number of rows of the input data matrix. Finally, the data random access memory 122 is not used in this example. However, the data random access memory 122, rather than the weight random access memory 124, may alternatively be used to perform the pooling operation.

In the embodiment of Figures 27 and 28, the pooling operation computes the maximum value of each sub-region. However, the program of Figure 28 may be modified to compute the average value of each sub-region by, for example, replacing the maxwacc instructions with sumwacc instructions (which sum the weight word with the accumulator 202 value 217) and changing the activation function instruction at address 9 to divide the accumulated result by the number of elements in each sub-region (preferably via reciprocal multiplication, as described below), which is sixteen in this example.
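Dividing by the element count through a reciprocal multiply can be illustrated generically. The Python sketch below assumes a fixed-point reciprocal with 8 fractional bits and a hypothetical function name `average_by_reciprocal`; the document describes its own reciprocal scheme in a later section, so this is only one common way to do it and not necessarily the one used by the hardware.

```python
def average_by_reciprocal(total, count, frac_bits=8):
    """Approximate total / count by multiplying with a fixed-point reciprocal
    and shifting the fractional bits back out (result is truncated)."""
    reciprocal = round((1 << frac_bits) / count)   # 1/count scaled by 2**frac_bits
    return (total * reciprocal) >> frac_bits

# Sum of a 4x4 sub-region whose 16 elements add up to 216 (true mean 13.5).
avg = average_by_reciprocal(216, 16)
```

For a count of sixteen the reciprocal is exactly representable, so the result equals the truncated true average; for counts such as nine the reciprocal is rounded and the result is an approximation.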

As may be observed from the operation of the neural network unit 121 according to Figures 27 and 28, each execution of the program of Figure 28 takes approximately 6000 clock cycles to perform a pooling operation on the entire 512x1600 data matrix of Figure 27, considerably fewer than the number of clock cycles required to perform a similar task by conventional methods.

Alternatively, the architectural program can have the neural network unit program write the results of the pooling operation back to rows of data random access memory 122 rather than to weight random access memory 124, and the architectural program reads the results from data random access memory 122 as neural network unit 121 writes them (for example, using the address of the most recently written data random access memory 122 row 2606 of FIG. 26B). This arrangement is suitable for embodiments having a single-ported weight random access memory 124 and a dual-ported data random access memory 122.

Fixed-point arithmetic with user-supplied binary points, full-precision fixed-point accumulation, user-specified reciprocal value, stochastic rounding of accumulator value, and selectable activation/output functions

Generally speaking, hardware units that perform arithmetic in digital computing devices are divided into "integer" units and "floating-point" units, according to whether they operate on integers or floating-point numbers. A floating-point number has a magnitude (or mantissa) and an exponent, and usually a sign. The exponent indicates the position of the radix point (typically the binary point) relative to the magnitude. In contrast, an integer has no exponent, only a magnitude and usually a sign. A floating-point unit lets the programmer work with numbers drawn from a very wide range of values, while the hardware takes care of adjusting the exponents as needed without programmer involvement. For example, suppose the two floating-point numbers 0.111 x 10²⁹ and 0.81 x 10³¹ are multiplied. (Although floating-point units typically operate on base-2 floating-point numbers, this example uses decimal, or base-10, numbers.) The floating-point unit automatically multiplies the mantissas, adds the exponents, and then normalizes the result to 0.8991 x 10⁵⁹. As another example, suppose the same two numbers are added. The floating-point unit automatically aligns the radix points of the mantissas before adding them, producing a sum of 0.81111 x 10³¹.
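The two decimal examples above can be checked with a small model of what a floating-point unit does. The sketch below is illustrative only: it represents a number as a hypothetical `(mantissa, exponent)` pair in base 10, keeps the mantissa exact with `Fraction`, and ignores rounding, signs of zero, and special values.

```python
from fractions import Fraction


def normalize(mantissa, exponent):
    """Scale the mantissa into the range 0.1 <= |mantissa| < 1."""
    while mantissa != 0 and abs(mantissa) >= 1:
        mantissa /= 10
        exponent += 1
    while mantissa != 0 and abs(mantissa) < Fraction(1, 10):
        mantissa *= 10
        exponent -= 1
    return mantissa, exponent


def fmul(a, b):
    """Multiply the mantissas, add the exponents, then normalize."""
    return normalize(a[0] * b[0], a[1] + b[1])


def fadd(a, b):
    """Align the smaller operand to the larger exponent, add, then normalize."""
    if a[1] < b[1]:
        a, b = b, a
    shift = a[1] - b[1]
    return normalize(a[0] + b[0] / 10 ** shift, a[1])


x = (Fraction("0.111"), 29)
y = (Fraction("0.81"), 31)
print(fmul(x, y))  # mantissa 8991/10000, exponent 59
print(fadd(x, y))  # mantissa 81111/100000, exponent 31
```

The exponent addition, the alignment shift, and the normalization loops correspond to the exponent adders, alignment shifters, and normalization shifters that make a hardware floating-point unit costly.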

However, it is well known that this complexity increases the size, power consumption, and clocks per instruction of the floating-point unit, and/or lengthens its cycle time. For this reason, many devices (such as embedded processors, microcontrollers, and relatively low-cost and/or low-power microprocessors) do not include a floating-point unit. As the examples above show, the complexity of a floating-point unit includes logic that performs the exponent calculations associated with floating-point addition and multiplication/division (adders that add or subtract operand exponents to produce the exponent of a floating-point product or quotient, and subtractors that subtract operand exponents to determine the binary point alignment shift amount for floating-point addition), shifters that align the binary points of the mantissas for floating-point addition, and shifters that normalize floating-point results. In addition, a floating-point unit typically requires logic to round floating-point results, logic to convert between integer and floating-point formats and between different floating-point formats (for example extended precision, double precision, single precision, half precision), leading-zero and leading-one detectors, and logic to handle special floating-point numbers such as denormals, NaNs, and infinities.

Furthermore, verifying the correctness of a floating-point unit is significantly more complex because of the larger number space that must be verified, which can lengthen the product development cycle and time to market. Moreover, as described above, floating-point arithmetic requires a separate mantissa field and exponent field to be stored and used for every floating-point number involved in the computation, which increases the storage required and/or reduces precision for a given amount of storage compared with storing integers. Many of these disadvantages are avoided by performing arithmetic on integer units.

Programmers frequently need to write programs that process fractional numbers, that is, numbers that are not whole numbers. Such programs may need to run on processors that have no floating-point unit, or on processors that have one but on which integer instructions executed by the integer units are faster. To exploit the performance advantages of integer processors, programmers apply the well-known technique of fixed-point arithmetic to fixed-point numbers. Such programs consist of instructions that execute on integer units to process integers, or integer data. The software knows that the data is fractional, and it includes instructions that operate on the integer data to account for the fact that the data is actually fractional, for example alignment shifts. Essentially, fixed-point software manually performs some or all of the functions that a floating-point unit performs.

As used herein, a "fixed-point" number (or value, operand, input, or output) is a number whose bits of storage are understood to include bits that represent a fractional portion of the fixed-point number, referred to herein as "fractional bits." The bits of storage of the fixed-point number are held in a memory or register, for example as an 8-bit or 16-bit word in a memory or register. Furthermore, the bits of storage of the fixed-point number are all used to represent a magnitude, and in some cases one bit is used to represent a sign, but none of the bits of storage is used to represent an exponent of the number. Furthermore, the number of fractional bits, or binary point position, of the fixed-point number is specified in storage that is separate from the bits of storage of the fixed-point number, and it indicates the number of fractional bits, or binary point position, in a shared or global fashion for a set of fixed-point numbers to which the fixed-point number belongs, such as the set of input operands, accumulated values, or output results of the array of processing units.
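A minimal sketch of this definition follows. It assumes signed two's-complement words and saturation on overflow, neither of which is required by the definition itself; the helper names `to_fixed` and `from_fixed` are hypothetical. The point to notice is that the number of fractional bits is stored once, outside the words, and applies to the whole set.

```python
def to_fixed(value, fractional_bits, word_bits=8):
    """Encode a real number as a signed word_bits-wide fixed-point word."""
    scaled = round(value * (1 << fractional_bits))
    lowest = -(1 << (word_bits - 1))
    highest = (1 << (word_bits - 1)) - 1
    return max(lowest, min(highest, scaled))  # assumption: saturate, do not wrap


def from_fixed(word, fractional_bits):
    """Interpret a fixed-point word using the shared binary point."""
    return word / (1 << fractional_bits)


# One shared binary point for the whole set; no per-word exponent is stored.
DATA_FRACTIONAL_BITS = 5
words = [to_fixed(v, DATA_FRACTIONAL_BITS) for v in (1.25, -0.5, 3.0)]
print(words)
print([from_fixed(w, DATA_FRACTIONAL_BITS) for w in words])
```

Each stored word is an ordinary integer, so integer adders and multipliers can operate on it directly; only the interpretation of the result depends on the shared binary point.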

In the embodiments described herein, the arithmetic logic units are integer units, whereas the activation function units include fixed-point arithmetic hardware assist, or acceleration. This allows the arithmetic logic unit portion to be smaller and faster, which makes it possible to fit more arithmetic logic units in a given die area. This in turn means more neurons per unit of die area, which is particularly advantageous in a neural network unit.

Furthermore, in contrast to floating-point numbers, each of which requires exponent storage bits, the fixed-point numbers of the embodiments described herein use a single indication of the number of storage bits that are fractional bits for an entire set of numbers; the indication resides in a single, shared storage location and globally indicates the number of fractional bits for all the numbers of the set, for example the set of inputs to a series of operations, the set of accumulated values of the series, or the set of outputs. Preferably, the user of the neural network unit can specify the number of fractional storage bits for the set of numbers. Thus, it should be understood that although in many contexts (such as ordinary mathematics) the term "integer" refers to a signed whole number, that is, a number without a fractional portion, in the present context the term "integer" may refer to numbers that have a fractional portion. Furthermore, in the present context the term "integer" is intended to distinguish from floating-point numbers, for which a portion of the bits of their individual storage is used to represent an exponent. Similarly, integer arithmetic operations, such as an integer multiply, add, or compare performed by an integer unit, assume that the operands have no exponent; therefore the integer elements of the integer unit, such as integer multipliers, integer adders, and integer comparators, need not include logic to deal with exponents: they need not shift mantissas to align binary points for addition or comparison, and they need not add exponents for multiplication.

Additionally, the embodiments described herein include a large hardware integer accumulator that accumulates a large series of integer operations (for example, on the order of 1000 multiply-accumulates) without loss of precision. This allows the neural network unit to avoid dealing with floating-point numbers while retaining full precision in the accumulated values, without saturating them or producing inaccurate results due to overflow. Once the series of integer operations has accumulated a result in the full-precision accumulator, the fixed-point hardware assist performs the necessary scaling and saturating to convert the full-precision accumulated value to an output value, using the user-specified indication of the number of fractional bits of the accumulated value and the desired number of fractional bits of the output value, as described in more detail below.
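Why a wide accumulator avoids overflow can be checked with simple worst-case arithmetic. The sketch below takes the accumulator widths (28 bits for 8-bit words, 41 bits for 16-bit words) and the series lengths (1024 and 512 multiply-accumulates) from the later discussion of the output command in this document; treating the words as signed two's-complement values is an assumption, and the function names are hypothetical.

```python
def fits_signed(value, bits):
    """True if value is representable as a signed two's-complement number."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def worst_case_sum(word_bits, count):
    """Largest magnitude reachable by accumulating count products of signed words."""
    most_negative = -(1 << (word_bits - 1))
    return count * most_negative * most_negative


narrow = worst_case_sum(word_bits=8, count=1024)   # 1024 * 2**14 = 2**24
wide = worst_case_sum(word_bits=16, count=512)     # 512 * 2**30 = 2**39
print(fits_signed(narrow, 28), fits_signed(wide, 41))
print(fits_signed(narrow, 16))  # a 16-bit accumulator would overflow
```

Under these assumptions both worst cases fit with room to spare, which is consistent with the statement that the accumulated value keeps full precision without saturating.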

When the accumulated value needs to be compressed from its full-precision form for use as an input to an activation function or for being passed through, the activation function unit can preferably perform stochastic rounding on the accumulated value selectively, as described in more detail below. Finally, the neural processing units can be selectively instructed to apply different activation functions and/or to output a variety of different forms of the accumulated value, according to the different needs of a given layer of a neural network.

FIG. 29A is a block diagram illustrating an embodiment of the control register 127 of FIG. 1. The control register 127 may comprise a plurality of control registers 127. As shown, the control register 127 includes the following fields: configuration 2902, signed data 2912, signed weight 2914, data binary point 2922, weight binary point 2924, arithmetic logic unit function 2926, round control 2932, activation function 2934, reciprocal 2942, shift amount 2944, output random access memory 2952, output binary point 2954, and output command 2956. The control register 127 values may be written both by an MTNN instruction 1400 and by an instruction of an NNU program, such as an initiate instruction.

The configuration 2902 value specifies whether neural network unit 121 is in a narrow configuration, a wide configuration, or a funnel configuration, as described above. The configuration 2902 also implies the size of the input words received from data random access memory 122 and weight random access memory 124. In the narrow and funnel configurations the input words are narrow (for example 8 or 9 bits), whereas in the wide configuration they are wide (for example 12 or 16 bits). Furthermore, the configuration 2902 implies the size of the output result 133, which is the same as the input word size.

The signed data value 2912, if true, indicates that the data words received from data random access memory 122 are signed values; if false, it indicates that they are unsigned values. The signed weight value 2914, if true, indicates that the weight words received from weight random access memory 124 are signed values; if false, it indicates that they are unsigned values.

The data binary point 2922 value indicates the location of the binary point of the data words received from data random access memory 122. Preferably, the data binary point 2922 value indicates the number of bit positions from the right at which the binary point is located. Stated alternatively, the data binary point 2922 indicates how many of the least significant bits of a data word are fractional bits, that is, lie to the right of the binary point. Similarly, the weight binary point 2924 value indicates the location of the binary point of the weight words received from weight random access memory 124. Preferably, when the arithmetic logic unit function 2926 is a multiply and accumulate or an output accumulator, neural processing unit 126 determines the number of bits to the right of the binary point of the value held in accumulator 202 as the sum of the data binary point 2922 and the weight binary point 2924. Thus, for example, if the data binary point 2922 is 5 and the weight binary point 2924 is 3, the value in accumulator 202 has 8 bits to the right of the binary point. When the arithmetic logic unit function 2926 is a sum or maximum of the accumulator and a data or weight word, or a pass-through of a data or weight word, neural processing unit 126 determines the number of bits to the right of the binary point of the accumulator 202 value as the data binary point 2922 or the weight binary point 2924, respectively. In an alternate embodiment, a single accumulator binary point 2923 is specified rather than an individual data binary point 2922 and weight binary point 2924, as described in more detail below with respect to FIG. 29B.
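The rule for the accumulator's binary point can be written down directly. The sketch below is an illustration of the rule as stated, not of the hardware; the function-name strings are hypothetical labels for the arithmetic logic unit functions, and the example words encode 1.25 with 5 fractional bits and 0.5 with 3 fractional bits.

```python
SUM_OF_BOTH = {"multiply_accumulate", "output_accumulator"}
DATA_ONLY = {"sum_data", "max_data", "pass_data"}
WEIGHT_ONLY = {"sum_weight", "max_weight", "pass_weight"}


def accumulator_fractional_bits(alu_function, data_bp, weight_bp):
    """Number of accumulator bits to the right of the binary point."""
    if alu_function in SUM_OF_BOTH:
        return data_bp + weight_bp  # fractional bits add when words are multiplied
    if alu_function in DATA_ONLY:
        return data_bp
    if alu_function in WEIGHT_ONLY:
        return weight_bp
    raise ValueError(f"unknown ALU function: {alu_function}")


data_word = 40    # 1.25 with 5 fractional bits
weight_word = 4   # 0.5 with 3 fractional bits
product = data_word * weight_word
bits = accumulator_fractional_bits("multiply_accumulate", 5, 3)
print(bits, product / (1 << bits))  # 8 fractional bits, value 0.625
```

The integer multiplier never sees the binary points; only the bookkeeping of how many low-order bits are fractional changes.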

Arithmetic logic unit function 2926 specifies the function performed by arithmetic logic unit 204 of neural processing unit 126. As described above, the arithmetic logic unit function 2926 may include, but is not limited to: multiply data word 209 by weight word 203 and accumulate the product with accumulator 202; sum accumulator 202 and weight word 203; sum accumulator 202 and data word 209; maximum of accumulator 202 and data word 209; maximum of accumulator 202 and weight word 203; output accumulator 202; pass through data word 209; pass through weight word 203; output zero. In one embodiment, the arithmetic logic unit function 2926 is specified by a neural network unit initialize instruction and used by arithmetic logic unit 204 in response to an execute instruction (not shown). In one embodiment, the arithmetic logic unit function 2926 is specified by individual neural network unit instructions, such as the multiply-accumulate and maxwacc instructions described above.

Round control 2932 specifies the form of rounding used by rounder 3004 (of FIG. 30). In one embodiment, the rounding modes that may be specified include, but are not limited to, no rounding, round to nearest, and stochastic rounding. Preferably, processor 100 includes a random bit source 3003 (see FIG. 30) that generates random bits 3005, which are sampled and used to perform the stochastic rounding in order to reduce the likelihood of a rounding bias. In one embodiment, when the round bit is one and the sticky bit is zero, neural processing unit 126 rounds up if the sampled random bit 3005 is true and does not round up if it is false. In one embodiment, random bit source 3003 generates the random bits 3005 by sampling random electrical characteristics of processor 100, such as thermal noise across a semiconductor diode or resistor, although the invention is not limited thereto.
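The tie-breaking behavior described for stochastic rounding can be sketched as follows. The text only states what happens when the round bit is one and the sticky bit is zero; rounding up when both are one and truncating when the round bit is zero are assumptions made here (ordinary round-to-nearest behavior). The function name is hypothetical, the value is assumed non-negative, and the random bit is passed in as a parameter so that the example is deterministic.

```python
def round_shift_right(value, shift, random_bit):
    """Drop the low `shift` bits of a non-negative integer, rounding the result.

    round bit  = most significant dropped bit
    sticky bit = OR of all remaining dropped bits
    An exact tie (round = 1, sticky = 0) is resolved by the random bit.
    """
    if shift == 0:
        return value
    kept = value >> shift
    round_bit = (value >> (shift - 1)) & 1
    sticky = (value & ((1 << (shift - 1)) - 1)) != 0
    if round_bit and (sticky or random_bit):
        kept += 1
    return kept


print(round_shift_right(0b1011, 2, False))  # 2.75: above the tie, rounds up
print(round_shift_right(0b1010, 2, True))   # 2.5: tie, random bit true, rounds up
print(round_shift_right(0b1010, 2, False))  # 2.5: tie, random bit false, stays
print(round_shift_right(0b1001, 2, True))   # 2.25: below the tie, stays
```

Because ties go up or down with equal probability when the random bit is unbiased, repeated rounding of many values does not drift systematically in one direction.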

Activation function 2934 specifies the function applied to the accumulator 202 value 217 to generate the output 133 of neural processing unit 126. As described herein, the activation functions 2934 include, but are not limited to: sigmoid; hyperbolic tangent; softplus; rectify; divide by a specified power of two; multiply by a user-specified reciprocal value to accomplish an effective division; pass through the full accumulator; and pass through the accumulator at the canonical size, as described in more detail below. In one embodiment, the activation function is specified by a neural network unit activation function instruction. Alternatively, the activation function is specified by the initialize instruction and applied in response to an output instruction, for example the activation function unit output instruction at address 4 of FIG. 4, in which embodiment the activation function instruction at address 3 of FIG. 4 is subsumed by the output instruction.

The reciprocal 2942 value specifies a value that is multiplied by the accumulator 202 value 217 to accomplish a division of the accumulator 202 value 217. That is, the user specifies the reciprocal 2942 value as the reciprocal of the divisor actually desired. This is useful, for example, in conjunction with the convolution and pooling operations described herein. Preferably, the user specifies the reciprocal 2942 value in two parts, as described in more detail below with respect to FIG. 29C. In one embodiment, the control register 127 includes a field (not shown) that lets the user select a divisor from among a plurality of built-in divisor values that correspond to the sizes of commonly used convolution kernels, such as 9, 25, 36 or 49. In such an embodiment, activation function unit 212 stores the reciprocals of the built-in divisors for multiplication by the accumulator 202 value 217.

Shift amount 2944 specifies the number of bits by which a shifter of activation function unit 212 shifts the accumulator 202 value 217 right, to accomplish a divide by a power of two. This is useful in conjunction with convolution kernels whose size is a power of two.

The output random access memory 2952 value specifies which of data random access memory 122 and weight random access memory 124 is to receive the output result 133.

The output binary point 2954 value indicates the location of the binary point of the output result 133. Preferably, the output binary point 2954 value indicates the number of bit positions from the right at which the binary point of the output result 133 is located. Stated alternatively, the output binary point 2954 indicates how many of the least significant bits of the output result 133 are fractional bits, that is, lie to the right of the binary point. Activation function unit 212 performs rounding, compression, saturation, and size conversion based on the value of the output binary point 2954 (and, in most cases, also based on the values of the data binary point 2922, the weight binary point 2924, the activation function 2934, and/or the configuration 2902).
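The scaling and saturation implied by the output binary point can be sketched as a shift followed by a clamp. This is a simplified model under stated assumptions: the output is a signed two's-complement word, rounding is round-half-up (the document describes several selectable rounding modes), and the name `convert_output` and its parameters are hypothetical.

```python
def convert_output(acc, acc_fractional_bits, out_fractional_bits, out_bits):
    """Rescale an accumulator value to the output binary point and saturate."""
    shift = acc_fractional_bits - out_fractional_bits
    if shift > 0:
        acc = (acc + (1 << (shift - 1))) >> shift  # assumption: round half up
    elif shift < 0:
        acc <<= -shift                             # output has more fractional bits
    lowest = -(1 << (out_bits - 1))
    highest = (1 << (out_bits - 1)) - 1
    return max(lowest, min(highest, acc))          # saturate to the output width


# 0.625 held with 8 fractional bits, written out as an 8-bit word with 7.
print(convert_output(160, 8, 7, 8))      # 80, which is 0.625 * 2**7
# A value too large for the output word saturates at the largest positive word.
print(convert_output(100000, 8, 7, 8))   # 127
```

Saturating rather than wrapping keeps an out-of-range result at the nearest representable value instead of producing a value of the wrong sign.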

Output command 2956 controls various aspects of the output result 133. In one embodiment, activation function unit 212 employs the notion of a canonical size, which is twice the width (in bits) specified by configuration 2902. Thus, for example, if configuration 2902 implies that the input words received from data random access memory 122 and weight random access memory 124 are 8 bits, the canonical size is 16 bits; as another example, if the input words are 16 bits, the canonical size is 32 bits. As described herein, accumulator 202 is large (for example, the narrow accumulator 202B is 28 bits and the wide accumulator 202A is 41 bits) in order to preserve the full precision of intermediate computations, such as 1024 and 512 neural network unit multiply-accumulate instructions, respectively. Consequently, the accumulator 202 value 217 is larger (in bits) than the canonical size, and for most values of activation function 2934 (all except pass-through of the full accumulator), activation function unit 212 (for example the canonical size compressor 3008 described below with respect to FIG. 30) compresses the accumulator 202 value 217 down to the canonical size. A first predetermined value of output command 2956 instructs activation function unit 212 to perform the specified activation function 2934 to generate an internal result that is the same size as the original input words, that is, half the canonical size, and to output the internal result as output result 133. A second predetermined value of output command 2956 instructs activation function unit 212 to perform the specified activation function 2934 to generate an internal result that is twice the size of the original input words, that is, the canonical size, and to output the lower half of the internal result as output result 133; a third predetermined value of output command 2956 instructs activation function unit 212 to output the upper half of the canonical-size internal result as output result 133. A fourth predetermined value of output command 2956 instructs activation function unit 212 to output the raw least significant word of accumulator 202 as output result 133; a fifth predetermined value instructs it to output the raw middle significant word of accumulator 202 as output result 133; and a sixth predetermined value instructs it to output the raw most significant word of accumulator 202 as output result 133 (the width of the word being specified by configuration 2902), as described in more detail above with respect to FIGS. 8 through 10. As described above, outputting the full accumulator 202 size or the canonical-size internal result is useful because it enables other execution units 112 of processor 100 to perform activation functions, such as the softmax activation function.
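The relationship between the input word size, the canonical size, and the two halves of a canonical-size result can be illustrated briefly. The sketch covers only the second and third output command values described above (lower and upper half of a canonical-size internal result); how the raw accumulator words are laid out is not modeled, and the function names are hypothetical.

```python
def canonical_size(input_word_bits):
    """The canonical size is twice the configured input word width."""
    return 2 * input_word_bits


def split_canonical(internal_result, input_word_bits):
    """Return (lower half, upper half) of a canonical-size internal result."""
    mask = (1 << input_word_bits) - 1
    lower = internal_result & mask
    upper = (internal_result >> input_word_bits) & mask
    return lower, upper


print(canonical_size(8), canonical_size(16))     # 16 and 32 bits
lower, upper = split_canonical(0xABCD, 8)        # a 16-bit canonical result
print(hex(lower), hex(upper))                    # 0xcd and 0xab
```

Writing the two halves in separate steps lets a full canonical-size result be stored through a memory interface whose words are only as wide as the input words.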

FIG. 29A (and FIGS. 29B and 29C) describe the fields as residing in control register 127; however, the invention is not limited thereto, and one or more of the fields may reside in other parts of neural network unit 121. Preferably, many of the fields may be included in the neural network unit instructions themselves and decoded by sequencer 128 to generate a micro-instruction 3416 (see FIG. 34) that controls arithmetic logic unit 204 and/or activation function unit 212. Additionally, the fields may be included in a micro-operation 3414 (see FIG. 34) stored in media register 118 that controls arithmetic logic unit 204 and/or activation function unit 212. Such embodiments can reduce use of the initialize neural network unit instruction, and in other embodiments the initialize neural network unit instruction is eliminated.

As described above, a neural network unit instruction can specify that an arithmetic logic operation be performed on memory operands (for example words from data random access memory 122 and/or weight random access memory 124) or on a rotated operand (for example from multiplexed register 208/705). In one embodiment, a neural network unit instruction can also specify an operand as the registered output of an activation function (for example the output of register 3038 of FIG. 30). Additionally, as described above, a neural network unit instruction can specify that the current row address of data random access memory 122 or weight random access memory 124 be incremented. In one embodiment, the neural network unit instruction can specify an immediate signed integer delta value that is added to the current row, to accomplish incrementing or decrementing by a value other than one.

FIG. 29B is a block diagram illustrating an alternate embodiment of the control register 127 of FIG. 1. The control register 127 of FIG. 29B is similar to the control register 127 of FIG. 29A, except that it includes an accumulator binary point 2923. The accumulator binary point 2923 indicates the location of the binary point of accumulator 202. Preferably, the accumulator binary point 2923 value indicates the number of bit positions from the right at which the binary point is located. Stated alternatively, the accumulator binary point 2923 indicates how many of the least significant bits of accumulator 202 are fractional bits, that is, lie to the right of the binary point. In this embodiment the accumulator binary point 2923 is specified explicitly, rather than being determined implicitly as in the embodiment of FIG. 29A.

FIG. 29C is a block diagram showing an embodiment in which the reciprocal 2942 of FIG. 29A is stored in two parts. The first part 2962 is a shift value that indicates the number of suppressed leading zeros 2962 in the true reciprocal value by which the user wants to multiply the accumulator 202 value 217. The number of leading zeros is the number of consecutive zeros immediately to the right of the binary point. The second part 2964 is the leading zero-suppressed reciprocal value, that is, the true reciprocal value with all leading zeros removed. In one embodiment, the number of suppressed leading zeros 2962 is stored as four bits and the leading zero-suppressed reciprocal value 2964 is stored as an 8-bit unsigned value.

For example, assume the user wants to multiply the accumulator 202 value 217 by the reciprocal of 49. The reciprocal of 49, represented in binary with 13 fractional bits, is 0.0000010100111, which has five leading zeros. In this case, the user sets the number of suppressed leading zeros 2962 to 5 and sets the leading zero-suppressed reciprocal value 2964 to 10100111. After the reciprocal multiplier ("divider A") 3014 (see FIG. 30) multiplies the accumulator 202 value 217 by the leading zero-suppressed reciprocal value 2964, the resulting product is shifted right according to the number of suppressed leading zeros 2962. Such an embodiment allows the reciprocal 2942 value to be expressed with high precision using relatively few bits.
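The two-part reciprocal encoding in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `encode_reciprocal` and `multiply_by_reciprocal` are hypothetical, the reciprocal is truncated rather than rounded, and the final shift also drops the fractional bits of the 8-bit value so that the result is an integer. The text itself only says that the product is shifted right according to the number of suppressed leading zeros.

```python
def encode_reciprocal(divisor: int, mantissa_bits: int = 8) -> tuple[int, int]:
    """Split 1/divisor into (suppressed leading zeros, zero-suppressed value)."""
    if divisor < 2:
        raise ValueError("divisor must be at least 2 so that 1/divisor < 1")
    # Count the zeros that directly follow the binary point of 1/divisor.
    zeros = 0
    while (1 << (zeros + 1)) < divisor:
        zeros += 1
    # Keep mantissa_bits significant bits after the suppressed zeros (truncated).
    mantissa = (1 << (zeros + mantissa_bits)) // divisor
    return zeros, mantissa


def multiply_by_reciprocal(value: int, zeros: int, mantissa: int,
                           mantissa_bits: int = 8) -> int:
    """Approximate value / divisor: multiply, then shift the product right."""
    # Shifting by `zeros` undoes the suppression; the extra mantissa_bits
    # drop the fractional bits of the mantissa (integer result assumed here).
    return (value * mantissa) >> (zeros + mantissa_bits)


zeros, mantissa = encode_reciprocal(49)
print(zeros, format(mantissa, "08b"))  # 5 10100111
```

With only 8 stored bits the quotient is approximate: under these assumptions `multiply_by_reciprocal(4900, zeros, mantissa)` yields 99 rather than 100.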

FIG. 30 is a block diagram showing an embodiment of the activation function unit (AFU) 212 of FIG. 2. The activation function unit 212 includes: the control register 127 of FIG. 1; a positive form converter (PFC) and output binary point aligner (OBPA) 3002 that receives the accumulator 202 value 217; a rounder 3004 that receives the accumulator 202 value 217 and an indication of the number of bits shifted out by the output binary point aligner 3002; a random bit source 3003, as described above, that generates random bits 3005; a first multiplexer 3006 that receives the output of the positive form converter and output binary point aligner 3002 and the output of the rounder 3004; a canonical size compressor (CCS) and saturator 3008 that receives the output of the first multiplexer 3006; a bit selector and saturator 3012 that receives the output of the canonical size compressor and saturator 3008; a rectifier 3018 that receives the output of the canonical size compressor and saturator 3008; a reciprocal multiplier 3014 that receives the output of the canonical size compressor and saturator 3008; a right shifter 3016 that receives the output of the canonical size compressor and saturator 3008; a hyperbolic tangent (tanh) module 3022 that receives the output of the bit selector and saturator 3012; a sigmoid module 3024 that receives the output of the bit selector and saturator 3012; a softplus module 3026 that receives the output of the bit selector and saturator 3012; a second multiplexer 3032 that receives the outputs of the tanh module 3022, the sigmoid module 3024, the softplus module 3026, the rectifier 3018, the reciprocal multiplier 3014 and the right shifter 3016, as well as the canonical size pass-through output 3028 of the canonical size compressor and saturator 3008; a sign restorer 3034 that receives the output of the second multiplexer 3032; a size converter and saturator 3036 that receives the output of the sign restorer 3034; a third multiplexer 3037 that receives the output of the size converter and saturator 3036 and the accumulator output 217; and an output register 3038 that receives the output of the multiplexer 3037 and whose output is the result 133 of FIG. 1.

The positive form converter and output binary point aligner 3002 receives the accumulator 202 value 217. Preferably, as described above, the accumulator 202 value 217 is a full-precision value. That is, the accumulator 202 has enough storage bits to hold an accumulated value that is the sum, generated by the integer adder 244, of a series of products generated by the integer multiplier 242, without discarding any bit of the individual products of the multiplier 242 or of the sums of the adder, so that no precision is lost. Preferably, the accumulator 202 has at least enough bits to hold the sum of the maximum number of product accumulations that the neural network unit 121 can be programmed to perform. For example, referring to the program of FIG. 4, in the wide configuration the maximum number of product accumulations that the neural network unit 121 can be programmed to perform is 512, and the accumulator 202 is 41 bits wide. As another example, referring to the program of FIG. 20, in the narrow configuration the maximum number of product accumulations that the neural network unit 121 can be programmed to perform is 1024, and the accumulator 202 is 28 bits wide. In general, the full-precision accumulator 202 has at least Q bits, where Q is the sum of M and log₂P, M is the bit width of the integer product of the multiplier 242 (for example, 16 bits for a narrow multiplier 242 or 32 bits for a wide multiplier 242), and P is the maximum permissible number of products that may be accumulated into the accumulator 202. Preferably, the maximum number of product accumulations is specified through the programming specification given to the programmer of the neural network unit 121. In one embodiment, the sequencer 128 enforces a maximum count value of, for example, 511 for a multiply-accumulate neural network unit instruction (such as the instruction at address 2 of FIG. 4), on the assumption that one preceding multiply-accumulate instruction (such as the instruction at address 1 of FIG. 4) loads the row of data/weight words 206/207 from the data/weight RAM 122/124.
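As a quick check of the relation Q = M + log₂P stated above, the following sketch computes the minimum full-precision width for the two configurations mentioned in the text. The function name `full_precision_bits` is hypothetical, and rounding log₂P up when P is not a power of two is an assumption added here for generality.

```python
def full_precision_bits(product_bits: int, max_products: int) -> int:
    """Q = M + log2(P), with log2(P) rounded up when P is not a power of two."""
    if max_products < 1:
        raise ValueError("at least one product must be accumulated")
    # (P - 1).bit_length() equals ceil(log2(P)) for every P >= 1.
    return product_bits + (max_products - 1).bit_length()


print(full_precision_bits(32, 512))   # 41, the wide accumulator width
print(full_precision_bits(16, 1024))  # 26, within the 28-bit narrow accumulator
```

The narrow case shows why the text says "at least Q bits": the 28-bit accumulator is wider than the 26-bit minimum.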

Using an accumulator 202 whose bit width is sufficient to accumulate a full-precision value for the maximum number of allowable accumulations simplifies the design of the arithmetic logic unit 204 of the neural processing unit 126. In particular, it alleviates the need for logic that saturates sums generated by the integer adder 244 that would overflow a smaller accumulator, and for logic that keeps track of the accumulator's binary point location in order to determine whether an overflow has occurred and whether saturation is needed. To illustrate the problem with a design that has a non-full-precision accumulator and instead relies on saturating logic to handle overflows of that accumulator, assume the following.

(1) The data word values range between 0 and 1, and all the storage bits are used to store fractional bits. The weight word values range between -8 and +8, and all but three of the storage bits are used to store fractional bits. The accumulated values that are input to a hyperbolic tangent activation function range between -8 and +8, and all but three of the storage bits are used to store fractional bits.

(2) The bit width of the accumulator is non-full precision (for example, only the bit width of the products).

(3) If the accumulator were full precision, the final accumulated value would be somewhere between -8 and +8 (for example, +4.2); however, the products before a "point A" in the series tend to be positive much more frequently, whereas the products after point A tend to be negative much more frequently.

In such a situation, an inaccurate result (that is, a result other than +4.2) might be obtained. This is because at some point before point A the accumulator would need to hold a value above its saturation maximum of +8, such as +8.2; it is saturated to the maximum instead, and the extra 0.2 is lost. The accumulator may even remain at the saturated value for further product accumulations, losing still more positive value. Consequently, the final value of the accumulator may be smaller than the value that would have been computed with a full-precision accumulator (that is, less than +4.2).
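The loss described above can be reproduced with a small sketch. The product series below is invented for illustration (it is not taken from the source) and is expressed in tenths so that plain integers can be used; `accumulate` is a hypothetical helper, and clamping symmetrically at ±8 is an assumption.

```python
def accumulate(products, limit=None):
    """Sum products; if limit is given, saturate the running sum to +/-limit."""
    total = 0
    for p in products:
        total += p
        if limit is not None:
            total = max(-limit, min(limit, total))
    return total


# Values in tenths: mostly positive before "point A", negative after it.
products = [30, 30, 22, -20, -20]    # +3.0, +3.0, +2.2, -2.0, -2.0

print(accumulate(products))          # 42 -> +4.2 with a full-precision accumulator
print(accumulate(products, 80))      # 40 -> +4.0, the 0.2 above +8.0 was lost
```

The running sum reaches +8.2 after the third product; the saturating version clamps it to +8.0 and never recovers the difference.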

The positive form converter 3002 converts the accumulator 202 value 217 to a positive form when the value is negative, and generates an additional bit that indicates whether the original value was positive or negative; this bit is passed down the activation function unit 212 pipeline along with the value. Converting negative values to a positive form simplifies subsequent operations of the activation function unit 212. For example, it ensures that only positive values are input to the hyperbolic tangent module 3022 and the sigmoid module 3024, which simplifies the design of those modules. It also simplifies the rounder 3004 and the saturator 3008.

The output binary point aligner 3002 shifts right, or scales, the positive-form value to align it with the output binary point 2954 specified in the control register 127. Preferably, the output binary point aligner 3002 computes the shift amount as the difference between the number of fractional bits of the accumulator 202 value 217 (specified, for example, by the accumulator binary point 2923, or by the sum of the data binary point 2922 and the weight binary point 2924) and the number of fractional bits of the output (specified, for example, by the output binary point 2954). Thus, for example, if the accumulator 202 binary point 2923 is 8 (as in the embodiment described above) and the output binary point 2954 is 3, the output binary point aligner 3002 shifts the positive-form value right by 5 bits to generate the result that is provided to the multiplexer 3006 and to the rounder 3004.
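The shift computation described above amounts to a single subtraction followed by a right shift. The sketch below is illustrative only: `align_to_output` is a hypothetical name, the input value is made up, and returning the shifted-out bits (which a rounder would need) is a choice made for this example.

```python
def align_to_output(value: int, acc_frac_bits: int,
                    out_frac_bits: int) -> tuple[int, int]:
    """Right-shift a positive-form value so it has out_frac_bits fractional bits.

    Returns the aligned value and the bits that were shifted out.
    """
    shift = acc_frac_bits - out_frac_bits
    if shift < 0:
        raise ValueError("this sketch only handles right shifts")
    return value >> shift, value & ((1 << shift) - 1)


# Accumulator binary point of 8, output binary point of 3: shift right by 5.
aligned, removed = align_to_output(0b110110101, acc_frac_bits=8, out_frac_bits=3)
print(format(aligned, "b"), format(removed, "05b"))  # 1101 10101
```

When the accumulator binary point is implicit, `acc_frac_bits` would instead be the sum of the data and weight binary points, as the text describes.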

The rounder 3004 rounds the accumulator 202 value 217. Preferably, the rounder 3004 generates a rounded version of the positive-form value produced by the positive form converter and output binary point aligner 3002 and provides the rounded version to the multiplexer 3006. The rounder 3004 rounds according to the round control 2932 described above, which, as described herein, may include stochastic rounding using the random bit 3005. The multiplexer 3006 selects one of its inputs according to the round control 2932 (which, as described herein, may include stochastic rounding), namely either the positive-form value from the positive form converter and output binary point aligner 3002 or the rounded version from the rounder 3004, and provides the selected value to the canonical size compressor and saturator 3008. Preferably, if the round control 2932 specifies no rounding, the multiplexer 3006 selects the output of the positive form converter and output binary point aligner 3002; otherwise it selects the output of the rounder 3004. Other embodiments are contemplated in which the activation function unit 212 performs additional rounding. For example, in one embodiment, the bit selector 3012 rounds based on the lost low-order bits when it compresses the bits of the canonical size compressor and saturator 3008 output (described below). In another example, the product of the reciprocal multiplier 3014 (described below) is rounded. In yet another example, the size converter 3036 rounds when it converts to the proper output size (described below), since the conversion may discard low-order bits that are used to determine the rounding.

The canonical size compressor 3008 compresses the multiplexer 3006 output value to the canonical size. Thus, for example, if the neural processing unit 126 is in a narrow or funnel configuration 2902, the canonical size compressor 3008 compresses the 28-bit multiplexer 3006 output value to 16 bits; and if the neural processing unit 126 is in a wide configuration 2902, the canonical size compressor 3008 compresses the 41-bit multiplexer 3006 output value to 32 bits. However, before compressing to the canonical size, if the pre-compressed value is greater than the largest value expressible in the canonical form, the saturator 3008 saturates the pre-compressed value to the largest value expressible in the canonical form. For example, if any bit of the pre-compressed value to the left of the most significant canonical-form bit is 1, the saturator 3008 saturates to the maximum value (for example, to all ones).
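The saturate-then-compress rule above can be sketched as follows. This is an illustration under assumptions, not the circuit itself: `compress_to_canonical` is a hypothetical name, the value is taken to be in positive form (non-negative), and the input values are made up.

```python
def compress_to_canonical(value: int, canonical_bits: int) -> int:
    """Saturating compression of a non-negative (positive-form) value."""
    if value < 0:
        raise ValueError("expects a positive-form value")
    if value >> canonical_bits:            # a 1 lies left of the canonical field
        return (1 << canonical_bits) - 1   # saturate to all ones
    return value


print(format(compress_to_canonical(0b11101011, 16), "016b"))  # 0000000011101011
print(format(compress_to_canonical(1 << 20, 16), "016b"))     # 1111111111111111
```

Testing `value >> canonical_bits` is equivalent to asking whether any bit to the left of the most significant canonical-form bit is set.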

Preferably, the hyperbolic tangent module 3022, the sigmoid module 3024 and the softplus module 3026 comprise lookup tables, for example programmable logic arrays (PLA), read-only memories (ROM), combinational logic gates, and so forth. In one embodiment, in order to simplify and reduce the size of the modules 3022/3024/3026, they are provided with an input value in 3.4 form, that is, three integer bits and four fractional bits; in other words, the input value has four bits to the right of the binary point and three bits to the left of the binary point. These values are chosen because at the extremes of the input value range (-8, +8) of the 3.4 form the output values asymptotically approach their minimum/maximum values. However, the invention is not limited to this form, and other embodiments are contemplated that place the binary point at a different location, for example a 4.3 form or a 2.5 form. The bit selector 3012 selects the bits of the canonical size compressor and saturator 3008 output that satisfy the 3.4 form criterion. This involves compression, that is, some bits are lost, because the canonical form has a larger number of bits. However, before selecting/compressing the canonical size compressor and saturator 3008 output value, if the pre-compressed value is greater than the largest value expressible in the 3.4 form, the saturator 3012 saturates the pre-compressed value to the largest value expressible in the 3.4 form. For example, if any bit of the pre-compressed value to the left of the most significant 3.4-form bit is 1, the saturator 3012 saturates to the maximum value (for example, to all ones).

The hyperbolic tangent module 3022, the sigmoid module 3024 and the softplus module 3026 perform their respective activation functions (described above) on the 3.4-form value output by the bit selector and saturator 3012 to generate a result. Preferably, the result of the hyperbolic tangent module 3022 and the sigmoid module 3024 is a 7-bit result in 0.7 form, that is, zero integer bits and seven fractional bits; in other words, the result has seven bits to the right of the binary point. Preferably, the result of the softplus module 3026 is a 7-bit result in 3.4 form, that is, in the same form as the input to the module 3026. Preferably, the outputs of the hyperbolic tangent module 3022, the sigmoid module 3024 and the softplus module 3026 are extended to the canonical form (for example, by adding leading zeros as necessary) and aligned so that the binary point is at the location specified by the output binary point 2954 value.

The rectifier 3018 generates a rectified version of the output value of the canonical size compressor and saturator 3008. That is, if the output value of the canonical size compressor and saturator 3008 is negative (its sign is piped down the pipeline as described above), the rectifier 3018 outputs zero; otherwise, the rectifier 3018 outputs its input value. Preferably, the output of the rectifier 3018 is in canonical form and has its binary point at the location specified by the output binary point 2954 value.

The reciprocal multiplier 3014 multiplies the output of the canonical size compressor and saturator 3008 by the user-specified reciprocal value held in the reciprocal value 2942 to generate a canonical size product. This product is effectively the quotient of the canonical size compressor and saturator 3008 output and the divisor whose reciprocal is the reciprocal value 2942. Preferably, the output of the reciprocal multiplier 3014 is in canonical form and has its binary point at the location specified by the output binary point 2954 value.

The right shifter 3016 shifts the output of the canonical size compressor and saturator 3008 by the user-specified number of bits held in the shift amount value 2944 to generate a canonical size quotient. Preferably, the output of the right shifter 3016 is in canonical form and has its binary point at the location specified by the output binary point 2954 value.

The multiplexer 3032 selects the appropriate input as specified by the activation function 2934 value and provides the selection to the sign restorer 3034. If the original accumulator 202 value 217 was negative, the sign restorer 3034 converts the positive-form output of the multiplexer 3032 to a negative form, for example to two's complement form.
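The sign handling that brackets the pipeline (positive-form conversion at the input, rectification, and sign restoration at the output) can be sketched as below. The function names are hypothetical, and representing the piped-down sign as a Python boolean is an assumption made for illustration.

```python
def to_positive_form(value: int) -> tuple[int, bool]:
    """Return the magnitude and a flag recording whether value was negative."""
    return (-value, True) if value < 0 else (value, False)


def rectify(magnitude: int, was_negative: bool) -> int:
    """Rectifier: zero for an originally negative value, else pass through."""
    return 0 if was_negative else magnitude


def restore_sign(magnitude: int, was_negative: bool, width: int) -> int:
    """Re-apply the sign and encode the result as width-bit two's complement."""
    signed = -magnitude if was_negative else magnitude
    return signed & ((1 << width) - 1)


magnitude, negative = to_positive_form(-5)
print(magnitude, negative)                                  # 5 True
print(rectify(magnitude, negative))                         # 0
print(format(restore_sign(magnitude, negative, 8), "08b"))  # 11111011
```

Because every stage between the two ends sees only a magnitude, the lookup-table modules need to cover non-negative inputs only, which is the simplification the text points out.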

The size converter 3036 converts the output of the sign restorer 3034 to the proper size based on the value of the output command 2956 described with respect to FIG. 29A. Preferably, the output of the sign restorer 3034 has its binary point at the location specified by the output binary point 2954 value. Preferably, for the first predetermined value of the output command 2956, the size converter 3036 discards the upper half of the bits of the sign restorer 3034 output. Furthermore, if the output of the sign restorer 3034 is positive and exceeds the maximum value expressible in the word size specified by the configuration 2902, or is negative and is less than the minimum value expressible in that word size, the saturator 3036 saturates its output to the maximum or minimum value, respectively, expressible in that word size. For the second and third predetermined values, the size converter 3036 passes the sign restorer 3034 output through.

The multiplexer 3037 selects either the output of the size converter and saturator 3036 or the accumulator 202 output 217, based on the output command 2956, for provision to the output register 3038. More specifically, for the first and second predetermined values of the output command 2956, the multiplexer 3037 selects the lower word (whose size is specified by the configuration 2902) of the output of the size converter and saturator 3036. For the third predetermined value, the multiplexer 3037 selects the upper word of the output of the size converter and saturator 3036. For the fourth predetermined value, the multiplexer 3037 selects the lower word of the raw accumulator 202 value 217; for the fifth predetermined value, it selects the middle word of the raw accumulator 202 value 217; and for the sixth predetermined value, it selects the upper word of the raw accumulator 202 value 217. As described above, the activation function unit 212 preferably pads the upper bits of the upper word of the raw accumulator 202 value 217 with zeros.

FIG. 31 shows an example of the operation of the activation function unit 212 of FIG. 30. As shown, the configuration 2902 of the neural processing unit 126 is set to the narrow configuration. Additionally, the signed data 2912 and signed weight 2914 values are true. Additionally, the data binary point 2922 value indicates that the binary point for the data RAM 122 words is located such that there are 7 bits to the right of the binary point, and an example value of the first data word received by the neural processing unit 126 is shown as 0.1001110. Furthermore, the weight binary point 2924 value indicates that the binary point for the weight RAM 124 words is located such that there are 3 bits to the right of the binary point, and an example value of the first weight word received by the neural processing unit 126 is shown as 00001.010.

The 16-bit product of the first data and weight words (which is added to the initial zero value of the accumulator 202) is shown as 000000.1100001100. Because the data binary point 2922 is 7 and the weight binary point 2924 is 3, the implied accumulator 202 binary point is located such that there are 10 bits to the right of the binary point. In the narrow configuration, as in this example, the accumulator 202 is 28 bits wide. In the example, the accumulator 202 value 217 after all the arithmetic/logic operations have been performed (for example, all 1024 multiply-accumulate operations of FIG. 20) is shown as 000000000000000001.1101010100.
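The product and its implied binary point can be verified with integer arithmetic: multiplying the raw bit patterns and adding the fractional-bit counts (7 + 3 = 10) reproduces the value in the text. The helpers `parse_fixed` and `format_fixed` are hypothetical names, and the sketch ignores sign handling since both example words are positive.

```python
def parse_fixed(bits: str) -> tuple[int, int]:
    """Parse 'iii.fff' into (raw integer value, number of fractional bits)."""
    whole, _, frac = bits.partition(".")
    return int(whole + frac, 2), len(frac)


def format_fixed(value: int, frac_bits: int, total_bits: int) -> str:
    """Render a raw integer as a binary string with the binary point inserted."""
    digits = format(value, f"0{total_bits}b")
    return digits[:-frac_bits] + "." + digits[-frac_bits:]


data, data_frac = parse_fixed("0.1001110")       # 7 fractional bits
weight, weight_frac = parse_fixed("00001.010")   # 3 fractional bits

product = data * weight
product_frac = data_frac + weight_frac           # binary points add: 10

print(format_fixed(product, product_frac, 16))   # 000000.1100001100
```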

The output binary point 2954 value indicates that the binary point for the output is located such that there are 7 bits to the right of the binary point. Therefore, after passing through the output binary point aligner 3002 and the canonical size compressor 3008, the accumulator 202 value 217 is scaled, rounded and compressed to the canonical-form value 000000001.1101011. In this example, the output binary point location indicates 7 fractional bits and the accumulator 202 binary point location indicates 10 fractional bits. Therefore, the output binary point aligner 3002 computes a difference of 3 and scales the accumulator 202 value 217 by shifting it right 3 bits. This is indicated in FIG. 31 by the loss of the 3 least significant bits (binary 100) of the accumulator 202 value 217. Further in this example, the round control 2932 value indicates that stochastic rounding is used, and it is assumed that the sampled random bit 3005 is true. Consequently, as described above, the least significant bit is rounded up, because the round bit of the accumulator 202 value 217 (the most significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) is one and the sticky bit (the Boolean OR of the 2 least significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) is zero.
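The round-bit and sticky-bit logic of this example can be sketched as follows. The text only states the outcome for the case shown (round bit one, sticky bit zero, random bit true: round up). Treating the random bit as a tie-breaker, and rounding up unconditionally when both the round bit and the sticky bit are set, are assumptions of this sketch; `round_shifted` is a hypothetical name.

```python
def round_shifted(value: int, shift: int, random_bit: bool) -> int:
    """Shift right by `shift` bits; an exact tie is decided by random_bit."""
    if shift == 0:
        return value
    kept = value >> shift
    round_bit = (value >> (shift - 1)) & 1                # MSB of removed bits
    sticky = (value & ((1 << (shift - 1)) - 1)) != 0      # OR of the other bits
    if round_bit and (sticky or random_bit):
        kept += 1
    return kept


acc = 0b11101010100   # 1.1101010100, i.e. 10 fractional bits
print(format(round_shifted(acc, 3, random_bit=True), "b"))   # 11101011
print(format(round_shifted(acc, 3, random_bit=False), "b"))  # 11101010
```

With the random bit true, the result corresponds to 1.1101011, the fractional part of the canonical-form value given in the text.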

In this example, the activation function 2934 indicates that a sigmoid function is to be used. Consequently, the bit selector 3012 selects the bits of the canonical-form value such that the input to the sigmoid module 3024 has three integer bits and four fractional bits, as described above, that is, the value 001.1101 as shown. The output value of the sigmoid module 3024 is placed in canonical form as the value 000000000.1101110 shown.
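The bit selection and sigmoid step of this example can be approximated in software. Note the assumption: the text describes the modules as lookup tables, whereas this sketch computes the sigmoid in floating point and rounds it to 0.7 form as a stand-in for the table contents. The names `select_3_4` and `sigmoid_0_7` are hypothetical.

```python
import math


def select_3_4(canonical: int, frac_bits: int) -> int:
    """Reduce a non-negative canonical value to 3.4 form, saturating on overflow."""
    if frac_bits < 4:
        raise ValueError("this sketch needs at least four fractional bits")
    value = canonical >> (frac_bits - 4)   # keep four fractional bits
    return min(value, 0b1111111)           # saturate to the 7-bit maximum


def sigmoid_0_7(x_3_4: int) -> int:
    """Sigmoid of a 3.4-form input, returned as a 7-bit 0.7-form value."""
    x = x_3_4 / 16                         # four fractional bits
    y = 1 / (1 + math.exp(-x))
    return min(round(y * 128), 127)        # seven fractional bits, clamped


x = select_3_4(0b11101011, frac_bits=7)    # canonical value 1.1101011
print(format(x, "07b"))                    # 0011101 -> 001.1101
print(format(sigmoid_0_7(x), "07b"))       # 1101110 -> 0.1101110
```

The input 001.1101 is 1.8125, whose sigmoid is about 0.8597; scaled by 128 this is about 110.04, which gives the 7-bit pattern 1101110 shown in the text.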

The output command 2956 in this example specifies the first predetermined value, that is, to output the word size indicated by the configuration 2902, which in this case is a narrow word (8 bits). Consequently, the size converter 3036 converts the canonical sigmoid output value to an 8-bit quantity with an implied binary point located such that there are 7 bits to the right of the binary point, yielding the output value 01101110 as shown.

FIG. 32 shows a second example of the operation of the activation function unit 212 of FIG. 30. The example of FIG. 32 illustrates the operation of the activation function unit 212 when the activation function 2934 indicates that the accumulator 202 value 217 is to be passed through in the canonical size. As shown, the configuration 2902 is set to the narrow configuration of the neural processing unit 126.

In this example, the accumulator 202 is 28 bits wide, and the accumulator 202 binary point is located such that there are 10 bits to the right of the binary point (either because the sum of the data binary point 2922 and the weight binary point 2924 is 10 in one embodiment, or because the accumulator binary point 2923 is explicitly specified to have a value of 10 in another embodiment). In the example, FIG. 32 shows the accumulator 202 value 217 after all the arithmetic/logic operations have been performed as 000001100000011011.1101111010.

In this example, the output binary point 2954 value indicates that the binary point for the output is located such that there are 4 bits to the right of the binary point. Therefore, after passing through the output binary point aligner 3002 and the canonical size compressor 3008, the accumulator 202 value 217 is saturated and compressed to the canonical-form value 111111111111.1111 as shown, which is received by the multiplexer 3032 as the canonical size pass-through value 3028.

Two output commands 2956 are shown in this example. The first output command 2956 specifies the second predetermined value, i.e., to output the lower word of the canonical form size. Because the size indicated by the configuration 2902 is a narrow word (8 bits), the canonical size is 16 bits, and the size converter 3036 selects the lower 8 bits of the canonical size pass-through value 3028 to yield the 8-bit value 11111111, as shown. The second output command 2956 specifies the third predetermined value, i.e., to output the upper word of the canonical form size. Accordingly, the size converter 3036 selects the upper 8 bits of the canonical size pass-through value 3028 to yield the 8-bit value 11111111, as shown.
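The arithmetic of the Figure 32 example can be reproduced with a short sketch. The Python below is illustrative only and is not part of the disclosed apparatus. It assumes an unsigned accumulator value, truncation of the discarded fraction bits, and saturation to the all-ones pattern, none of which this passage states explicitly; all function and variable names are hypothetical.

```python
# Illustrative model of the Figure 32 data path; all names are hypothetical.
ACC_TEXT = "000001100000011011.1101111010"  # accumulator 202 value 217, 28 bits
ACC_FRAC_BITS = 10   # bits to the right of the accumulator binary point
OUT_FRAC_BITS = 4    # output binary point 2954
WORD_BITS = 8        # narrow configuration 2902
CANONICAL_BITS = 2 * WORD_BITS


def align_and_saturate(acc, acc_frac_bits, out_frac_bits, width):
    """Drop surplus fraction bits, then clamp to an unsigned `width`-bit value."""
    # Assumption: acc_frac_bits >= out_frac_bits and discarded bits are truncated.
    aligned = acc >> (acc_frac_bits - out_frac_bits)
    return min(aligned, (1 << width) - 1)


acc = int(ACC_TEXT.replace(".", ""), 2)
canonical = align_and_saturate(acc, ACC_FRAC_BITS, OUT_FRAC_BITS, CANONICAL_BITS)
lower_word = canonical & ((1 << WORD_BITS) - 1)   # second predetermined value
upper_word = canonical >> WORD_BITS               # third predetermined value

print(format(canonical, "016b"))
print(format(lower_word, "08b"), format(upper_word, "08b"))
```

After alignment the value needs 17 significant bits, which exceeds the 16-bit canonical size, so under these assumptions the sketch saturates and both selected words are all ones, in agreement with the values given for Figure 32.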

Figure 33 shows a third example of the operation of the activation function unit 212 of Figure 30. The example of Figure 33 illustrates the operation of the activation function unit 212 when the activation function 2934 indicates that the full raw accumulator 202 value 217 is to be passed through. As shown, the configuration 2902 is set to a wide configuration of the neural processing units 126 (e.g., 16-bit input words).

In this example, the accumulator 202 is 41 bits wide, and the accumulator 202 binary point is positioned so that there are 8 bits to its right (either because, in one embodiment, the sum of the data binary point 2912 and the weight binary point 2914 is 8, or because, in another embodiment, the accumulator binary point 2923 is explicitly specified as having a value of 8). For example, after all the arithmetic logic operations are performed, the accumulator 202 value 217 shown in Figure 33 is 001000000000000000001100000011011.11011110.

Three output commands 2956 are shown in this example. The first output command specifies the fourth predetermined value, i.e., to output the lower word of the raw accumulator 202 value; the second output command specifies the fifth predetermined value, i.e., to output the middle word of the raw accumulator 202 value; and the third output command specifies the sixth predetermined value, i.e., to output the upper word of the raw accumulator 202 value. Because the size indicated by the configuration 2902 is a wide word (16 bits), as shown in Figure 33, the multiplexer 3037 selects the 16-bit value 0001101111011110 in response to the first output command 2956, the 16-bit value 0000000000011000 in response to the second output command 2956, and the 16-bit value 0000000001000000 in response to the third output command 2956.
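The three word selections of the Figure 33 example can be checked with a minimal sketch. The Python below is illustrative only; the helper name `select_word` is hypothetical, and the sketch assumes that the upper word is zero-extended from the remaining 9 accumulator bits, which is consistent with the value given above.

```python
# Illustrative check of the Figure 33 word selection; names are hypothetical.
ACC_TEXT = "001000000000000000001100000011011.11011110"  # 41-bit accumulator 202 value 217
WORD_BITS = 16  # wide configuration 2902


def select_word(acc, index, word_bits=WORD_BITS):
    """Return word `index` of the raw accumulator (0 = lower, 1 = middle, 2 = upper)."""
    return (acc >> (index * word_bits)) & ((1 << word_bits) - 1)


acc = int(ACC_TEXT.replace(".", ""), 2)
lower, middle, upper = (select_word(acc, i) for i in range(3))

for name, word in (("lower", lower), ("middle", middle), ("upper", upper)):
    print(name, format(word, "016b"))
```

Because the raw value is passed through, the binary point plays no role in the selection; it simply falls inside the lower word.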

As described above, the neural network unit 121 operates on integer data rather than floating-point data. This helps simplify each neural processing unit 126, or at least its arithmetic logic unit 204 portion. For example, the arithmetic logic unit 204 need not include the adder that a floating-point implementation would need to add the exponents of the multiplicands for the multiplier 242. Similarly, the arithmetic logic unit 204 need not include the shifter that a floating-point implementation would need to align the binary points of the addends for the adder 234. As one skilled in the art will appreciate, floating-point units are generally very complex; the examples given here are only simplifications of the arithmetic logic unit 204, and the integer embodiments described, with hardware fixed-point assist that lets the user specify the relevant binary points, can enjoy other simplifications as well. Compared with a floating-point embodiment, using integer units as the arithmetic logic units 204 can yield a smaller (and faster) neural processing unit 126, which facilitates integrating a large array of neural processing units 126 into the neural network unit 121. The activation function unit 212 portion handles scaling and saturation of the accumulator 202 value 217 based on the number of fractional bits desired in the accumulated value and the number of fractional bits desired in the output value, both of which are preferably user-specified. Any additional complexity, and the accompanying increase in size and in power and/or time consumption, in the fixed-point hardware assist of the activation function unit 212 can be amortized by sharing the activation function unit 212 among the arithmetic logic units 204, because a shared embodiment, such as that of Figure 11, can reduce the number of activation function units 1112.

The embodiments described herein enjoy many of the benefits of reduced hardware complexity that come from using integer arithmetic units rather than floating-point arithmetic units, while still supporting arithmetic on fractional numbers, i.e., numbers with a binary point. An advantage of floating-point arithmetic is that it accommodates data whose individual values may fall anywhere within a very wide range (limited in practice only by the size of the exponent range, which can be very large). That is, each floating-point number has its own, potentially unique, exponent value. However, the embodiments described herein recognize and exploit the fact that in certain applications the input data are highly parallelized and their values fall within a relatively narrow range, so that the "exponent" can be the same for all of the parallelized values. These embodiments therefore let the user specify the binary point location once for all of the input values and/or accumulated values. Similarly, by recognizing and exploiting the similar-range characteristics of the parallelized outputs, they let the user specify the binary point location once for all of the output values. An artificial neural network is one example of such an application, although the embodiments may also be used to perform computations for other applications. By specifying the binary point location once for multiple inputs rather than for each individual input number, the embodiments can use memory more efficiently than a floating-point implementation (e.g., require less memory) and/or provide greater precision for a similar amount of memory, because the bits that a floating-point implementation would spend on an exponent can instead be used for additional numeric precision.
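The idea of a single binary point shared by all inputs can be illustrated with a minimal sketch. The Python below is illustrative only; the helper names and the chosen binary point positions (7 fractional bits for data, 3 for weights) are assumptions made for the example. It relies on the property, stated for one embodiment above, that the accumulator binary point equals the sum of the data and weight binary points.

```python
# Illustrative fixed-point multiply-accumulate with shared binary points.
# All names and the chosen bit counts are hypothetical.
DATA_FRAC_BITS = 7     # one binary point for every data word
WEIGHT_FRAC_BITS = 3   # one binary point for every weight word


def to_fixed(value, frac_bits):
    """Encode a real number as an integer with `frac_bits` fractional bits."""
    return round(value * (1 << frac_bits))


def from_fixed(raw, frac_bits):
    """Decode an integer with `frac_bits` fractional bits back to a real number."""
    return raw / (1 << frac_bits)


data = [0.5, 0.25, 0.75]
weights = [1.5, 2.0, 0.5]

acc = 0  # plain integer accumulator, no per-value exponent
for d, w in zip(data, weights):
    acc += to_fixed(d, DATA_FRAC_BITS) * to_fixed(w, WEIGHT_FRAC_BITS)

# Each product carries DATA_FRAC_BITS + WEIGHT_FRAC_BITS fractional bits.
result = from_fixed(acc, DATA_FRAC_BITS + WEIGHT_FRAC_BITS)
print(acc, result)
```

Only integer multiplies and adds occur inside the loop; the binary point positions are metadata applied once, outside the loop, rather than stored with every value.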

Furthermore, the embodiments recognize that precision can be lost when accumulating a large series of integer operations (e.g., through overflow or loss of the less significant fractional bits) and provide a solution, primarily in the form of an accumulator large enough to avoid the loss of precision.
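As a rough guide to what "large enough" can mean, the sketch below computes a worst-case accumulator width for a sum of products. It is illustrative only: the function name is hypothetical, unsigned operands are assumed, and the product counts of 512 and 4096 are assumptions chosen for the example. This passage does not state how the 41-bit and 28-bit accumulator widths used in the examples above were derived.

```python
# Illustrative worst-case accumulator sizing; names and counts are hypothetical.
def accumulator_bits(word_bits, num_products):
    """Bits needed to sum `num_products` products of two unsigned `word_bits` values."""
    product_bits = 2 * word_bits
    # ceil(log2(num_products)) extra bits absorb the carries of the summation.
    growth_bits = (num_products - 1).bit_length()
    return product_bits + growth_bits


wide = accumulator_bits(16, 512)     # 16-bit words, 512 accumulated products
narrow = accumulator_bits(8, 4096)   # 8-bit words, 4096 accumulated products
print(wide, narrow)
```

With these assumed counts the bound evaluates to 41 and 28 bits respectively.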

Direct execution of neural network unit micro-operations

Figure 34 is a block diagram showing partial details of the processor 100 of Figure 1 and of the neural network unit 121. The neural network unit 121 includes the pipeline stages 3401 of the neural processing units 126. The pipeline stages 3401 are separated by staging registers and include combinational logic that accomplishes the operations of the neural processing units 126 described herein, such as Boolean logic gates, multiplexers, adders, multipliers, comparators, and so forth. The pipeline stages 3401 receive a micro-operation 3418 from a multiplexer 3402. The micro-operation 3418 flows down the pipeline stages 3401 and controls their combinational logic. The micro-operation 3418 is a collection of bits. Preferably, the micro-operation 3418 includes the bits of the data random access memory 122 memory address 123, the weight random access memory 124 memory address 125, and the program memory 129 memory address 131, the multiplexed register 208/705 control signals 213/713, and many of the fields of the control register 127 (e.g., the control register of Figures 29A through 29C). In one embodiment, the micro-operation 3418 comprises approximately 120 bits. The multiplexer 3402 receives micro-operations from three different sources and selects one of them as the micro-operation 3418 provided to the pipeline stages 3401.

One source of micro-operations for the multiplexer 3402 is the sequencer 128 of Figure 1. The sequencer 128 decodes the neural network unit instructions received from the program memory 129 and in response generates a micro-operation 3416 that is provided to a first input of the multiplexer 3402.

The second source of micro-operations for the multiplexer 3402 is a decoder 3404 that receives microinstructions 105 from the reservation station 108 of Figure 1, along with operands from the general purpose register 116 and the media register 118. Preferably, as described above, the microinstructions 105 are generated by the instruction translator 104 in response to translating MTNN instructions 1400 and MFNN instructions 1500. A microinstruction 105 may include an immediate field that specifies a particular function (specified by an MTNN instruction 1400 or an MFNN instruction 1500), such as starting or stopping execution of a program in the program memory 129, directly executing a micro-operation from the media register 118, or reading or writing a memory of the neural network unit, as described above. The decoder 3404 decodes the microinstruction 105 and in response generates a micro-operation 3412 that is provided to a second input of the multiplexer 3402. Preferably, for some functions 1432/1532 of an MTNN instruction 1400 or MFNN instruction 1500, the decoder 3404 need not generate a micro-operation 3412 to send down the pipeline 3401; examples are writing to the control register 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, waiting for a program in the program memory 129 to complete, reading from the status register 127, and resetting the neural network unit 121.

The third source of micro-operations for the multiplexer 3402 is the media register 118 itself. Preferably, as described above with respect to Figure 14, an MTNN instruction 1400 may specify a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided from the media register 118 to a third input of the multiplexer 3402. Direct execution of a micro-operation 3414 provided by the architectural media register 118 is useful for testing the neural network unit 121, e.g., built-in self test (BIST), and for debugging.

Preferably, the decoder 3404 generates a mode indicator 3422 that controls the selection made by the multiplexer 3402. When an MTNN instruction 1400 specifies a function to start running a program from the program memory 129, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3416 from the sequencer 128, until either an error occurs or the decoder 3404 encounters an MTNN instruction 1400 that specifies a function to stop running the program from the program memory 129. When an MTNN instruction 1400 specifies a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided by the media register 118, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3414 from the specified media register 118. Otherwise, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3412 from the decoder 3404.
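The selection among the three micro-operation sources can be summarized in a small behavioral sketch. The Python below is illustrative only and is not a description of the actual logic: the enum, the function and its two boolean inputs are hypothetical, and giving a running program priority over a direct-execution request is an assumption, since the passage does not say how the two interact.

```python
# Illustrative model of mode indicator 3422 and multiplexer 3402; names are hypothetical.
from enum import Enum


class Source(Enum):
    SEQUENCER = "micro-operation 3416 from sequencer 128"
    DECODER = "micro-operation 3412 from decoder 3404"
    MEDIA_REGISTER = "micro-operation 3414 from media register 118"


def select_source(program_running, direct_execute):
    """Return which multiplexer input is selected.

    program_running: a start-program function was seen, with no stop or error since.
    direct_execute: the current MTNN function requests direct micro-operation execution.
    """
    if program_running:        # assumption: a running program takes priority
        return Source.SEQUENCER
    if direct_execute:
        return Source.MEDIA_REGISTER
    return Source.DECODER      # default case ("otherwise")


for running, direct in ((True, False), (False, True), (False, False)):
    print(running, direct, select_source(running, direct).name)
```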

Variable rate neural network unit

In many situations, the neural network unit 121 runs a program and then sits idle, waiting for the processor 100 to do something it must do before the next program can run. For example, assume a situation similar to that described with respect to Figures 3 through 6A, in which the neural network unit 121 runs a multiply-accumulate-activation-function program (which may also be referred to as a feed-forward neural network layer program) two or more times in succession. The processor 100 may take significantly longer to write the 512KB of weight values into the weight random access memory 124 for the next run of the neural network unit program than the neural network unit 121 takes to run the program. In other words, the neural network unit 121 runs the program in a relatively short time and then sits idle until the processor 100 finishes writing the next weight values into the weight random access memory 124 for the next run. This situation is illustrated in Figure 36A, described in more detail below. In such a situation, the neural network unit 121 can run at a lower clock rate to stretch out the time it takes to run the program, spreading the energy consumed by running the program over a longer period and thereby keeping the neural network unit 121, and perhaps the entire processor 100, at a lower temperature. This situation is referred to as relaxed mode and is illustrated in Figure 36B, described in more detail below.

Figure 35 is a block diagram showing a processor 100 with a variable rate neural network unit 121. The processor 100 is similar to the processor 100 of Figure 1, and like-numbered elements are similar. The processor 100 of Figure 35 also includes clock generation logic 3502 coupled to the functional units of the processor 100, namely the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106, the reservation station 108, the neural network unit 121, the other execution units 112, the memory subsystem 114, the general purpose register 116 and the media register 118. The clock generation logic 3502 includes a clock generator, such as a phase-locked loop (PLL), that generates a clock signal having a primary clock rate, or clock frequency. For example, the primary clock rate may be 1 GHz, 1.5 GHz, 2 GHz and so forth. The clock rate indicates the number of cycles per second, e.g., the number of oscillations of the clock signal between a high and a low state. Preferably, the clock signal has a balanced duty cycle, i.e., it is high for half the cycle and low for the other half; alternatively, the clock signal has an unbalanced duty cycle in which it is in the high state longer than in the low state, or vice versa. Preferably, the PLL is configurable to generate the primary clock signal at multiple clock rates. Preferably, the processor 100 includes a power management module that automatically adjusts the primary clock rate based on various factors, including the dynamically detected operating temperature of the processor 100, its utilization, and commands from system software (e.g., operating system, BIOS) indicating the desired performance and/or power savings. In one embodiment, the power management module includes microcode of the processor 100.

The clock generation logic 3502 also includes a clock distribution network, or clock tree. The clock tree distributes the primary clock signal to the functional units of the processor 100, as shown in Figure 35: clock signal 3506-1 to the instruction fetch unit 101, clock signal 3506-2 to the instruction cache 102, clock signal 3506-10 to the instruction translator 104, clock signal 3506-9 to the rename unit 106, clock signal 3506-8 to the reservation station 108, clock signal 3506-7 to the neural network unit 121, clock signal 3506-4 to the other execution units 112, clock signal 3506-3 to the memory subsystem 114, clock signal 3506-5 to the general purpose register 116, and clock signal 3506-6 to the media register 118. These signals are referred to collectively as clock signals 3506. The clock tree includes nodes, or wires, that transmit the primary clock signals 3506 to their respective functional units. Additionally, the clock generation logic 3502 preferably includes clock buffers that regenerate the primary clock signal where needed to provide a cleaner clock signal and/or to boost the voltage levels of the primary clock signal, particularly for distant nodes. Furthermore, each functional unit may also include its own sub-clock tree that regenerates and/or boosts the voltage levels of the respective primary clock signal 3506 it receives, as needed.

The neural network unit 121 includes clock reduction logic 3504, which receives a relax indicator 3512 and the primary clock signal 3506-7 and in response generates a secondary clock signal. The secondary clock signal has a clock rate that is either the same as the primary clock rate or, in relaxed mode, reduced relative to the primary clock rate by an amount programmed into the relax indicator 3512, which reduces heat generation. The clock reduction logic 3504 is similar to the clock generation logic 3502 in that it includes a clock distribution network, or clock tree, that distributes the secondary clock signal to the various functional blocks of the neural network unit 121: clock signal 3508-1 to the array of neural processing units 126, clock signal 3508-2 to the sequencer 128, and clock signal 3508-3 to the interface logic 3514. These signals are referred to collectively as secondary clock signals 3508. Preferably, the neural processing units 126 include a plurality of pipeline stages 3401, as described with respect to Figure 34, and the pipeline stages 3401 include pipeline staging registers that receive the secondary clock signal 3508-1 from the clock reduction logic 3504.

The neural network unit 121 also includes interface logic 3514 that receives the primary clock signal 3506-7 and the secondary clock signal 3508-3. The interface logic 3514 is coupled between the lower portions of the front end of the processor 100 (e.g., the reservation station 108, the media register 118 and the general purpose register 116) and the various functional blocks of the neural network unit 121, namely the clock reduction logic 3504, the data random access memory 122, the weight random access memory 124, the program memory 129 and the sequencer 128. The interface logic 3514 includes a data random access memory buffer 3522, a weight random access memory buffer 3524, the decoder 3404 of Figure 34, and the relax indicator 3512. The relax indicator 3512 holds a value that specifies how much more slowly, if at all, the array of neural processing units 126 will execute neural network unit program instructions. Preferably, the relax indicator 3512 specifies a divisor value N by which the clock reduction logic 3504 divides the primary clock signal 3506-7 to generate the secondary clock signal 3508, so that the secondary clock rate is 1/N of the primary clock rate. Preferably, the value of N may be programmed to any one of a plurality of different predetermined values, which cause the clock reduction logic 3504 to generate the secondary clock signal 3508 at a corresponding plurality of different clock rates that are lower than the primary clock rate.

In one embodiment, the clock reduction logic 3504 comprises a clock divider circuit that divides the primary clock signal 3506-7 by the relax indicator 3512 value. In one embodiment, the clock reduction logic 3504 comprises clock gates (e.g., AND gates) that gate the primary clock signal 3506-7 with an enable signal that is true only once in every N cycles of the primary clock signal. For example, a circuit that includes a counter that counts up to N may be used to generate the enable signal. When accompanying logic detects that the output of the counter matches N, the logic generates a true pulse on the secondary clock signal 3508 and resets the counter. Preferably, the relax indicator 3512 value is programmable by an architectural instruction, such as the MTNN instruction 1400 of Figure 14. Preferably, the architectural program running on the processor 100 programs the relax value into the relax indicator 3512 just before it instructs the neural network unit 121 to start running the neural network unit program, as described in more detail below with respect to Figure 37.
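The counter-based clock gating described above can be modeled cycle by cycle. The Python below is a behavioral sketch only, not a circuit description; the function name is hypothetical, and it assumes the counter increments on every primary clock cycle and that the enable is asserted on the cycle in which the count reaches N.

```python
# Illustrative behavioral model of counter-based clock gating; names are hypothetical.
def gated_clock_pulses(primary_cycles, n):
    """Return one entry per primary cycle: 1 if a secondary clock pulse passes, else 0."""
    pulses = []
    count = 0
    for _ in range(primary_cycles):
        count += 1
        if count == n:         # counter output matches N: enable the gate
            pulses.append(1)
            count = 0          # reset the counter
        else:
            pulses.append(0)   # primary clock pulse is gated off
    return pulses


print(gated_clock_pulses(8, 4))   # one secondary pulse every 4 primary cycles
print(gated_clock_pulses(4, 1))   # N = 1: secondary rate equals primary rate
```

Over any window of N primary cycles exactly one pulse passes, so the secondary clock rate is 1/N of the primary clock rate.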

The weight random access memory buffer 3524 is coupled between the weight random access memory 124 and the media register 118 and buffers the transfer of data between them. Preferably, the weight random access memory buffer 3524 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the weight random access memory buffer 3524 that receives data from the media register 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the weight random access memory buffer 3524 that receives data from the weight random access memory 124 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the primary clock rate depending on the value programmed into the relax indicator 3512, i.e., depending on whether the neural network unit 121 is operating in relaxed or normal mode. In one embodiment, the weight random access memory 124 is single-ported, as described above with respect to Figure 17, and is accessible in an arbitrated fashion both by the media register 118 via the weight random access memory buffer 3524 and by the neural processing units 126 or the row buffer 1104 of Figure 11. In an alternate embodiment, the weight random access memory 124 is dual-ported, as described above with respect to Figure 16, and each port is accessible in a concurrent fashion both by the media register 118 via the weight random access memory buffer 3524 and by the neural processing units 126 or the row buffer 1104.

Like the weight random access memory buffer 3524, the data random access memory buffer 3522 is coupled between the data random access memory 122 and the media register 118 and buffers the transfer of data between them. Preferably, the data random access memory buffer 3522 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the data random access memory buffer 3522 that receives data from the media register 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the data random access memory buffer 3522 that receives data from the data random access memory 122 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the primary clock rate depending on the value programmed into the relax indicator 3512, i.e., depending on whether the neural network unit 121 is operating in relaxed or normal mode. In one embodiment, the data random access memory 122 is single-ported, as described above with respect to Figure 17, and is accessible in an arbitrated fashion both by the media register 118 via the data random access memory buffer 3522 and by the neural processing units 126 or the row buffer 1104 of Figure 11. In an alternate embodiment, the data random access memory 122 is dual-ported, as described above with respect to Figure 16, and each port is accessible in a concurrent fashion both by the media register 118 via the data random access memory buffer 3522 and by the neural processing units 126 or the row buffer 1104.

Preferably, regardless of whether the data RAM 122 and/or the weight RAM 124 is single-ported or dual-ported, the interface logic 3514 includes the data RAM buffer 3522 and the weight RAM buffer 3524 to synchronize the primary clock domain with the secondary clock domain. Preferably, the data RAM 122, the weight RAM 124 and the program memory 129 each comprise a static random access memory (SRAM) with its own read enable, write enable and memory select enable signals.

As described above, the neural network unit 121 is an execution unit of the processor 100. An execution unit is a functional unit of a processor that executes the microinstructions into which architectural instructions are translated, or that executes the architectural instructions themselves, for example the microinstructions 105 translated from the architectural instructions 103 of Figure 1, or the architectural instructions 103 themselves. An execution unit receives operands from the registers of the processor, such as the general purpose registers 116 and the media registers 118. When an execution unit executes a microinstruction or architectural instruction it produces a result, which is written to a general purpose register. The MTNN instruction 1400 and the MFNN instruction 1500 described with respect to Figures 14 and 15 are examples of architectural instructions 103. Microinstructions implement architectural instructions. More precisely, the collective execution by the execution units of the one or more microinstructions into which an architectural instruction is translated performs the operation specified by the architectural instruction on the inputs specified by the architectural instruction, producing the result defined by the architectural instruction.

Figure 36A is a timing diagram showing an example of operation of the processor 100 with the neural network unit 121 operating in normal mode, that is, at the primary clock rate. Time progresses from left to right in the timing diagram. The processor 100 runs the architectural program at the primary clock rate. More precisely, the front end of the processor 100 (for example, the instruction fetch unit 101, instruction cache 102, instruction translator 104, rename unit 106 and reservation stations 108) fetches, decodes and issues architectural instructions to the neural network unit 121 and the other execution units 112 at the primary clock rate.

Initially, the architectural program executes an architectural instruction (for example, an MTNN instruction 1400) that the front end of the processor 100 issues to the neural network unit 121, instructing the neural network unit 121 to begin executing the neural network unit program in its program memory 129. Before that, the architectural program executed an architectural instruction to write the mitigation indicator 3512 with a value that specifies the primary clock rate, that is, to place the neural network unit 121 in normal mode. More precisely, the value programmed into the mitigation indicator 3512 causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at the primary clock rate of the primary clock signal 3506. Preferably, in this example the clock buffers of the clock reduction logic 3504 simply boost the voltage level of the primary clock signal 3506. Also beforehand, the architectural program executed architectural instructions to write the data RAM 122 and the weight RAM 124 and to write the neural network unit program into the program memory 129. In response to the MTNN instruction 1400 that starts the neural network unit program, the neural network unit 121 begins executing the program at the primary clock rate, because the mitigation indicator 3512 was programmed with the primary clock rate value. After the neural network unit 121 starts, the architectural program continues to execute architectural instructions at the primary clock rate, chiefly MTNN instructions 1400 that write and/or read the data RAM 122 and the weight RAM 124, in preparation for the next instance (also called an invocation or run) of the neural network unit program.

In the example of Figure 36A, the neural network unit 121 can finish executing the neural network unit program in significantly less time (for example, one quarter of the time) than the architectural program takes to finish writing and reading the data RAM 122 and the weight RAM 124. For example, running at the primary clock rate, the neural network unit 121 takes approximately 1000 clock cycles to execute the neural network unit program, whereas the architectural program takes approximately 4000 clock cycles. The neural network unit 121 therefore sits idle for the rest of the time, which in this example is a considerable interval of approximately 3000 primary clock cycles. As shown in the example of Figure 36A, this pattern repeats, possibly many times, depending on the size and configuration of the neural network. Because the neural network unit 121 is a relatively large and transistor-dense functional unit of the processor 100, its operation generates a significant amount of heat, particularly when it runs at the primary clock rate.

Figure 36B is a timing diagram showing an example of operation of the processor 100 with the neural network unit 121 operating in mitigation mode, in which it runs at a clock rate lower than the primary clock rate. The timing diagram of Figure 36B is similar to that of Figure 36A in that the processor 100 runs an architectural program at the primary clock rate. The example assumes that the architectural program and the neural network unit program of Figure 36B are the same as those of Figure 36A. However, before starting the neural network unit program, the architectural program executes an MTNN instruction 1400 that programs the mitigation indicator 3512 with a value that causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at a secondary clock rate lower than the primary clock rate. That is, the architectural program places the neural network unit 121 in the mitigation mode of Figure 36B rather than the normal mode of Figure 36A. Consequently, the neural processing units 126 execute the neural network unit program at the secondary clock rate, which in mitigation mode is lower than the primary clock rate. The example assumes that the mitigation indicator 3512 is programmed with a value that specifies a secondary clock rate of one quarter of the primary clock rate. As a comparison of Figures 36A and 36B shows, the neural network unit 121 then takes four times as long to execute the neural network unit program in mitigation mode as in normal mode, and the time it spends idle is significantly shortened. The period over which the neural network unit 121 consumes energy executing the program in Figure 36B is therefore approximately four times as long as in the normal mode of Figure 36A. Accordingly, the heat generated per unit time by the neural network unit 121 while executing the program in Figure 36B is approximately one quarter of that in Figure 36A, which provides the advantages described herein.
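The arithmetic behind the comparison of Figures 36A and 36B can be sketched as follows. This is only an illustration using the example figures from the text (about 1000 cycles for the neural network unit program and about 4000 cycles for the architectural program); the function name `nnu_timing` and the assumption that heat per unit time scales linearly with the clock rate are simplifications introduced here, not part of the described apparatus.

```python
def nnu_timing(nnu_cycles_at_primary, arch_cycles, clock_divisor):
    """Toy model of one invocation, measured in primary-clock cycles.

    clock_divisor = 1 models normal mode; clock_divisor = 4 models a
    mitigation mode running at one quarter of the primary clock rate.
    Assumption: heat per unit time is proportional to the clock rate.
    """
    busy = nnu_cycles_at_primary * clock_divisor   # time the unit spends executing
    idle = max(arch_cycles - busy, 0)              # time the unit waits for the architectural program
    relative_heat_rate = 1.0 / clock_divisor       # heat per unit time vs. normal mode
    return busy, idle, relative_heat_rate


normal = nnu_timing(1000, 4000, clock_divisor=1)
mitigated = nnu_timing(1000, 4000, clock_divisor=4)
print(normal)      # (1000, 3000, 1.0)
print(mitigated)   # (4000, 0, 0.25)
```

In this model the same work is spread over the whole 4000-cycle window, so the idle time disappears and the heat generated per unit time falls to about one quarter.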

Figure 37 is a flowchart showing operation of the processor 100 of Figure 35. The operation it describes is similar to that described above with respect to Figures 35, 36A and 36B. Flow begins at step 3702.

At step 3702, the processor 100 executes MTNN instructions 1400 to write weights to the weight RAM 124 and data to the data RAM 122. Flow proceeds to step 3704.

At step 3704, the processor 100 executes an MTNN instruction 1400 to program the mitigation indicator 3512 with a value that specifies a clock rate lower than the primary clock rate, that is, to place the neural network unit 121 in mitigation mode. Flow proceeds to step 3706.

At step 3706, the processor 100 executes an MTNN instruction 1400 that instructs the neural network unit 121 to begin executing the neural network unit program, in a manner similar to that shown in Figure 36B. Flow proceeds to step 3708.

At step 3708, the neural network unit 121 begins executing the neural network unit program. In parallel, the processor 100 executes MTNN instructions 1400 to write new weights to the weight RAM 124 (and possibly new data to the data RAM 122), and/or executes MFNN instructions 1500 to read results from the data RAM 122 (and possibly from the weight RAM 124). Flow proceeds to step 3712.

At step 3712, the processor 100 executes an MFNN instruction 1500 (for example, to read the status register 127) to detect that the neural network unit 121 has finished executing its program. Assuming the architectural program chose a good value for the mitigation indicator 3512, the time the neural network unit 121 takes to execute the neural network unit program is about the same as the time the processor 100 takes to execute the portion of the architectural program that accesses the weight RAM 124 and/or the data RAM 122, as shown in Figure 36B. Flow proceeds to step 3714.

At step 3714, the processor 100 executes an MTNN instruction 1400 to program the mitigation indicator 3512 with a value that specifies the primary clock rate, that is, to place the neural network unit 121 in normal mode. Flow proceeds to step 3716.

At step 3716, the processor 100 executes an MTNN instruction 1400 that instructs the neural network unit 121 to begin executing the neural network unit program, in a manner similar to that shown in Figure 36A. Flow proceeds to step 3718.

At step 3718, the neural network unit 121 begins executing the neural network unit program in normal mode. Flow ends at step 3718.
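The sequence of steps 3702 through 3718 can be summarized in a small software model. This is a sketch only: `ToyNNU`, its methods and the trace format are hypothetical stand-ins for the effects of the MTNN and MFNN instructions, and the divisor of 4 is taken from the quarter-rate example of Figure 36B.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNNU:
    """Hypothetical software stand-in for the neural network unit."""
    clock_divisor: int = 1            # 1 = normal mode, >1 = mitigation mode
    running: bool = False
    trace: list = field(default_factory=list)

    def set_mitigation_indicator(self, divisor):
        self.clock_divisor = divisor
        self.trace.append(("mitigation_indicator", divisor))

    def start_program(self):
        self.running = True
        mode = "mitigation" if self.clock_divisor > 1 else "normal"
        self.trace.append(("start", mode))

    def finish_program(self):
        self.running = False


def run_flow(nnu):
    nnu.trace.append(("load", "weights+data"))  # step 3702: fill weight and data memories
    nnu.set_mitigation_indicator(4)             # step 3704: select a reduced clock rate
    nnu.start_program()                         # step 3706: start the unit's program
    # step 3708: the unit runs while the architectural program accesses the memories
    nnu.finish_program()                        # step 3712: completion seen via the status register
    nnu.set_mitigation_indicator(1)             # step 3714: back to the primary clock rate
    nnu.start_program()                         # steps 3716/3718: run again in normal mode
    return nnu.trace


for event in run_flow(ToyNNU()):
    print(event)
```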

As described above, compared with executing the neural network unit program in normal mode (that is, at the primary clock rate of the processor), executing it in mitigation mode spreads out the execution time and avoids high temperatures. More specifically, when the neural network unit executes the program in mitigation mode it generates heat at a lower rate, and that heat can be dissipated more readily through the neural network unit (for example, the semiconductor devices, metal layers and underlying substrate), the surrounding package and the cooling solution (for example, heat sink and fan). As a result, the devices in the neural network unit (for example, transistors, capacitors and wires) are likely to operate at lower temperatures. Overall, operating in mitigation mode also helps reduce device temperatures in other parts of the processor die. Lower operating temperatures, in particular lower junction temperatures, reduce leakage current. In addition, because less current is drawn per unit time, inductive noise and IR-drop noise are reduced. Lower temperatures also have a positive effect on the negative bias temperature instability (NBTI) and positive bias temperature instability (PBTI) of the MOSFETs in the processor, which improves the reliability and/or lifetime of the devices and therefore of the processor part. Lower temperatures also reduce Joule heating and electromigration in the metal layers of the processor.

Communication mechanism between architectural and non-architectural programs regarding shared resources of the neural network unit

As described above, in the examples of Figures 24 through 28 and 35 through 37, the data RAM 122 and the weight RAM 124 are shared resources. The neural processing units 126 and the front end of the processor 100 share the data RAM 122 and the weight RAM 124. More precisely, both the neural processing units 126 and the front end of the processor 100, for example the media registers 118, read and write the data RAM 122 and the weight RAM 124. In other words, the architectural program running on the processor 100 and the neural network unit program running on the neural network unit 121 share the data RAM 122 and the weight RAM 124, and in some cases, as described above, flow control between the architectural program and the neural network unit program is required. The program memory 129 is also shared to some extent, because the architectural program writes it and the sequencer 128 reads it. The embodiments described herein provide a high-performance solution for controlling the flow of accesses to the shared resources by the architectural program and the neural network unit program.

In the embodiments described herein, neural network unit programs are also referred to as non-architectural programs, neural network unit instructions are also referred to as non-architectural instructions, and the neural network unit instruction set (also referred to above as the neural processing unit instruction set) is also referred to as the non-architectural instruction set. The non-architectural instruction set is distinct from the architectural instruction set. In embodiments in which the processor 100 includes an instruction translator 104 that translates architectural instructions into microinstructions, the non-architectural instruction set is also distinct from the microinstruction set.

Figure 38 is a block diagram showing the sequencer 128 of the neural network unit 121 in more detail. The sequencer 128 provides a memory address to the program memory 129 to select a non-architectural instruction that is provided to the sequencer 128, as described above. As shown in Figure 38, the memory address is held in a program counter 3802 of the sequencer 128. The sequencer 128 generally increments through the addresses of the program memory 129 sequentially unless it encounters a non-architectural control instruction, such as a loop or branch instruction, in which case it updates the program counter 3802 to the target address of the control instruction, that is, to the address of the non-architectural instruction at the target of the control instruction. Thus, the address 131 held in the program counter 3802 specifies the address in the program memory 129 of the non-architectural instruction of the non-architectural program that is currently being fetched for execution by the neural processing units 126. The value of the program counter 3802 can be obtained by the architectural program through the neural network unit program counter field 3912 of the status register 127, as described below with respect to Figure 39. This allows the architectural program to decide where to read or write data in the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program.

The sequencer 128 also includes a loop counter 3804 that operates in conjunction with a non-architectural loop instruction, such as the loop-to-1 instruction at address 10 of Figure 26A and the loop-to-1 instruction at address 11 of Figure 28. In the examples of Figures 26A and 28, the loop counter 3804 is loaded with the value specified by the non-architectural initialize instruction at address 0, for example 400. Each time the sequencer 128 encounters the loop instruction and jumps to the target instruction (for example, the multiply-accumulate instruction at address 1 of Figure 26A or the maxwacc instruction at address 1 of Figure 28), the sequencer 128 decrements the loop counter 3804. Once the loop counter 3804 reaches zero, the sequencer 128 proceeds to the next sequential non-architectural instruction. In an alternative embodiment, when the loop instruction is first encountered the loop counter is loaded with a loop count value specified in the loop instruction, which removes the need to initialize the loop counter 3804 with a non-architectural initialize instruction. The value of the loop counter 3804 thus indicates how many more times the loop body of the non-architectural program remains to be executed. The value of the loop counter 3804 can be obtained by the architectural program through the loop count field 3914 of the status register 127, as described below with respect to Figure 39. This allows the architectural program to decide where to read or write data in the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program. In one embodiment, the sequencer includes three additional loop counters to accommodate nested loops in the non-architectural program, and the values of these three loop counters are also readable through the status register 127. A bit field in the loop instruction indicates which of the four loop counters is used by the current loop instruction.
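The interaction between the program counter and the loop counter can be illustrated with a toy model. The sketch below is not the sequencer's actual logic: the function `run_loop` and its parameters are hypothetical, and it assumes the loop instruction decrements the counter first and branches back only while the new value is greater than zero.

```python
def run_loop(loop_count, body_start, loop_addr):
    """Toy sequencer: address 0 initializes the loop counter, the loop
    instruction at loop_addr decrements it and branches to body_start
    while the counter is still positive.

    Assumes loop_count >= 1 and 0 < body_start < loop_addr.
    Returns (number of times the loop body ran, final counter value).
    """
    pc = 0
    counter = 0
    body_runs = 0
    while pc <= loop_addr:
        if pc == 0:
            counter = loop_count           # initialize instruction loads the loop counter
            pc += 1
        elif pc == loop_addr:
            counter -= 1                   # loop instruction decrements the counter
            pc = body_start if counter > 0 else pc + 1
        else:
            if pc == body_start:
                body_runs += 1             # first instruction of the loop body
            pc += 1
    return body_runs, counter


print(run_loop(400, body_start=1, loop_addr=10))   # (400, 0)
```

With the values from the example of Figure 26A (count 400, body at addresses 1 through 9, loop instruction at address 10), the body runs 400 times and the counter ends at zero, at which point the model falls through to the next address.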

The sequencer 128 also includes an iteration counter 3806. The iteration counter 3806 operates in conjunction with non-architectural instructions such as the multiply-accumulate instruction at address 2 of Figures 4, 9, 20 and 26A and the maxwacc instruction at address 2 of Figure 28, referred to hereafter as "execute" instructions. In those examples, the execute instructions specify iteration counts of 511, 511, 1023, 2 and 3, respectively. When the sequencer 128 encounters an execute instruction that specifies a non-zero iteration count, it loads the iteration counter 3806 with the specified value. The sequencer 128 then generates an appropriate micro-operation 3418 to control the logic in the pipeline stages 3401 of the neural processing units 126 of Figure 34 and decrements the iteration counter 3806. If the iteration counter 3806 is still greater than zero, the sequencer 128 again generates an appropriate micro-operation 3418 to control the logic in the neural processing units 126 and decrements the iteration counter 3806. The sequencer 128 continues in this manner until the iteration counter 3806 reaches zero. The value of the iteration counter 3806 thus indicates how many of the operations specified by the non-architectural execute instruction remain to be performed (for example, multiply-accumulate, maximum or sum operations on the accumulator and a data or weight word). The value of the iteration counter 3806 can be obtained by the architectural program through the iteration count field 3916 of the status register 127, as described below with respect to Figure 39. This allows the architectural program to decide where to read or write data in the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program.
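The load, issue and decrement cycle of the iteration counter can be sketched as a small generator. The name `issue_micro_ops` and the tuple it yields are hypothetical; the sketch only mirrors the behavior described above, in which one micro-operation is generated per decrement until the counter reaches zero.

```python
def issue_micro_ops(iteration_count):
    """Yield (micro_op_index, remaining_after_issue) for one execute instruction.

    remaining_after_issue models the value an architectural program could
    observe in the iteration count field after that micro-operation.
    """
    counter = iteration_count          # loaded when the execute instruction is encountered
    index = 0
    while counter > 0:
        counter -= 1                   # one micro-operation issued, counter decremented
        yield index, counter
        index += 1


ops = list(issue_micro_ops(511))
print(len(ops))    # 511 micro-operations for an iteration count of 511
print(ops[0])      # (0, 510)
print(ops[-1])     # (510, 0)
```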

Figure 39 is a block diagram showing certain fields of the control and status register 127 of the neural network unit 121. The fields include the address 2602 of the weight RAM row most recently written by the neural processing units 126 executing the non-architectural program, the address 2604 of the weight RAM row most recently read by the neural processing units 126 executing the non-architectural program, the address 2606 of the data RAM row most recently written by the neural processing units 126 executing the non-architectural program, and the address 2608 of the data RAM row most recently read by the neural processing units 126 executing the non-architectural program, as shown in Figure 26B above. The fields also include a neural network unit program counter field 3912, a loop counter field 3914 and an iteration counter field 3916. As described above, the architectural program can read the contents of the status register 127 into the media registers 118 and/or the general purpose registers 116, for example with an MFNN instruction 1500, including the values of the neural network unit program counter field 3912, the loop counter field 3914 and the iteration counter field 3916. The value of the program counter field 3912 reflects the value of the program counter 3802 of Figure 38. The value of the loop counter field 3914 reflects the value of the loop counter 3804. The value of the iteration counter field 3916 reflects the value of the iteration counter 3806. In one embodiment, the sequencer 128 updates the program counter field 3912, the loop counter field 3914 and the iteration counter field 3916 each time it modifies the program counter 3802, the loop counter 3804 or the iteration counter 3806, so that the field values are current when the architectural program reads them. In another embodiment, when the neural network unit 121 executes an architectural instruction that reads the status register 127, it simply obtains the values of the program counter 3802, the loop counter 3804 and the iteration counter 3806 and provides them back to the architectural instruction (for example, into a media register 118 or a general purpose register 116).
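The fields of Figure 39 can be pictured as a record that the architectural program reads in one access. The sketch below models the second embodiment, in which the counters are sampled at the moment of the read. The class and field names are hypothetical, and no register layout or field width is implied.

```python
from dataclasses import dataclass


@dataclass
class Sequencer:
    """Hypothetical holder of the three sequencer counters."""
    program_counter: int = 0
    loop_counter: int = 0
    iteration_counter: int = 0


@dataclass(frozen=True)
class StatusSnapshot:
    """Values an architectural program would see in one status register read."""
    weight_ram_last_written_row: int   # field 2602
    weight_ram_last_read_row: int      # field 2604
    data_ram_last_written_row: int     # field 2606
    data_ram_last_read_row: int        # field 2608
    program_counter: int               # field 3912
    loop_count: int                    # field 3914
    iteration_count: int               # field 3916


def read_status(seq, rows):
    """Sample-on-read variant: copy the live counters when the read happens."""
    return StatusSnapshot(*rows, seq.program_counter,
                          seq.loop_counter, seq.iteration_counter)


seq = Sequencer(program_counter=2, loop_counter=120, iteration_counter=7)
snap = read_status(seq, rows=(10, 11, 12, 13))
print(snap.loop_count, snap.iteration_count)   # 120 7
```

Because the snapshot is frozen, later changes to the sequencer counters do not alter a value that has already been read, which matches the idea of a result returned to a register.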

As may be observed, the values of the fields of the status register 127 of Figure 39 can be characterized as information about the progress made by the non-architectural program as it is executed by the neural network unit. Specific aspects of that progress have been described above, such as the program counter 3802 value, the loop counter 3804 value, the iteration counter 3806 value, the fields 2602/2604 for the most recently written/read weight RAM 124 address 125, and the fields 2606/2608 for the most recently written/read data RAM 122 address 123. The architectural program running on the processor 100 can read the non-architectural program progress values of Figure 39 from the status register 127 and use them to make decisions, for example by means of architectural instructions such as compare and branch instructions. For example, the architectural program decides which rows of the data RAM 122 and/or the weight RAM 124 to write data or weights to, or read them from, in order to control the flow of data into and out of the data RAM 122 or the weight RAM 124, particularly for large data sets and/or for overlapped execution of different non-architectural programs. Examples of such decisions made by the architectural program are described in the sections above and below.

For example, as described above with respect to Figure 26A, the architectural program configures the non-architectural program to write the results of the convolution back to rows of the data RAM 122 above the convolution kernel 2402 (for example, above row 8), and the architectural program reads the results from the data RAM 122 as the neural network unit 121 writes them, using the address of the most recently written data RAM 122 row 2606.

In another example, as described above with respect to Figure 26B, the architectural program uses information from the fields of the status register 127 of Figure 38 to determine the progress of a non-architectural program that performs a convolution of the data array 2404 of Figure 24 in five blocks of 512 x 1600. The architectural program writes the first 512 x 1600 block of the 2560 x 1600 data array into the weight RAM 124 and starts the non-architectural program with a loop count of 1600 and an initial weight RAM 124 output row of 0. While the neural network unit 121 executes the non-architectural program, the architectural program reads the status register 127 to determine the most recently written weight RAM 124 row 2602. It can then read the valid convolution results written by the non-architectural program and, once they have been read, overwrite them with the next 512 x 1600 block. In this way, when the neural network unit 121 finishes executing the non-architectural program on the first 512 x 1600 block, the processor 100 can immediately update the non-architectural program if necessary and start it again to process the next 512 x 1600 block.
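The block-by-block flow in this example can be sketched as a host-side polling loop. The sketch is a simulation under stated assumptions: the function `stream_blocks` and the parameter `rows_per_poll` are hypothetical, the unit is assumed to write result rows in ascending order, and the "most recently written row" is advanced artificially instead of being read from hardware.

```python
def stream_blocks(num_blocks, rows_per_block, rows_per_poll):
    """Simulate the architectural program consuming result rows as they
    become valid and refilling each consumed row with the next block.

    Returns the list of (block, row) pairs in the order they were consumed.
    """
    consumed = []
    for block in range(num_blocks):
        last_written = -1              # models field 2602: no result row written yet
        next_to_read = 0
        while next_to_read < rows_per_block:
            # Toy progress model: the unit completes rows_per_poll more rows
            # between two successive reads of the status register.
            last_written = min(last_written + rows_per_poll, rows_per_block - 1)
            while next_to_read <= last_written:
                # Read the valid result row, then overwrite it with the
                # corresponding row of the next block.
                consumed.append((block, next_to_read))
                next_to_read += 1
    return consumed


order = stream_blocks(num_blocks=5, rows_per_block=1600, rows_per_poll=100)
print(len(order))            # 8000 rows handled in total
print(order[0], order[-1])   # (0, 0) (4, 1599)
```

The point of the design is that a row is only touched after the status field shows it has been written, so reading results and loading the next block can overlap with execution of the current block.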

In another example, assume the architectural program has the neural network unit 121 perform a series of typical neural network multiply-accumulate-activation-function operations in which the weights are stored in the weight RAM 124 and the results are written back to the data RAM 122. In this case, once a row of the weight RAM 124 has been read, it is not read again. Thus, once the current weights have been read/used by the non-architectural program, the architectural program can begin overwriting the weights in the weight RAM 124 with new weights for the next instance of a non-architectural program (e.g., the next neural network layer). In this case, the architectural program reads the status register 127 to obtain the address of the most recently read weight RAM row 2604 and thereby decides where it may write the new set of weights into the weight RAM 124.
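The bookkeeping implied by this example can be sketched as follows. This is only an illustration under stated assumptions: it assumes the non-architectural program consumes weight RAM rows in ascending order starting at row 0, and the function name and its parameters are hypothetical rather than part of the described apparatus.

```python
def rows_safe_to_overwrite(most_recent_read_row, next_row_to_refill):
    """Return the weight RAM rows that have been consumed but not yet refilled.

    most_recent_read_row: value obtained from the status register field
        (most recently read weight RAM row 2604).
    next_row_to_refill: first row the architectural program has not yet
        overwritten with new weights (hypothetical bookkeeping variable).
    Assumes rows are read once each, in ascending order.
    """
    return range(next_row_to_refill, most_recent_read_row + 1)


# Rows 0 and 1 were refilled earlier; the status register now reports row 5.
print(list(rows_safe_to_overwrite(5, 2)))
```

If the status register reports a row below `next_row_to_refill`, the range is empty and there is nothing new to overwrite yet.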

In another example, assume the architectural program knows that the non-architectural program includes an execute instruction with a large iteration count, such as the non-architectural multiply-accumulate instruction at address 2 of Figure 20. In this case, the architectural program needs to know the iteration count 3916 in order to know approximately how many more clock cycles are required to complete the non-architectural instruction, so that it can decide which of two or more actions to take next. For example, if completion will take a long time, the architectural program may relinquish control to another architectural program, such as the operating system. Similarly, assume the architectural program knows that the non-architectural program includes a loop body with a relatively large loop count, such as the non-architectural program of Figure 28. In this case, the architectural program needs to know the loop count 3914 in order to know approximately how many more clock cycles are required for completion, so that it can decide which of two or more actions to take next.

In another example, assume the architectural program has the neural network unit 121 perform a pooling operation similar to that described with respect to Figures 27 and 28, in which the data to be pooled is stored in the weight RAM 124 and the results are written back to the weight RAM 124. Unlike the example of Figures 27 and 28, however, assume the results of this example are written back to the top 400 rows of the weight RAM 124, e.g., rows 1600 through 1999. In this case, once the non-architectural program has read the four rows of weight RAM 124 data that it pools, it does not read them again. Therefore, once the current four rows have been read/used by the non-architectural program, the architectural program can begin overwriting the weight RAM 124 data with new data (e.g., the weights for the next instance of a non-architectural program, such as one that performs typical multiply-accumulate-activation-function operations on the pooled data). In this case, the architectural program reads the status register 127 to obtain the address of the most recently read weight RAM row 2604 and thereby decides where to write the new set of weights into the weight RAM 124.

Recurrent Neural Network Acceleration

A traditional feed-forward neural network has no memory of previous inputs to the network. Feed-forward networks are generally used for tasks in which the various inputs presented to the network over time are independent of one another, as are the outputs. In contrast, recurrent neural networks are generally helpful for tasks in which the order of the inputs presented to the network over time matters. (The positions in the sequence are commonly referred to as time steps.) Consequently, a recurrent neural network includes a notional memory, or internal state, that holds information based on the computations the network performed in response to previous inputs in the sequence, and the output of the recurrent neural network depends on this internal state as well as on the input of the next time step. Speech recognition, language modeling, text generation, language translation, image caption generation, and certain forms of handwriting recognition are examples of tasks that recurrent neural networks tend to perform well.

Three well-known examples of recurrent neural networks are the Elman recurrent neural network, the Jordan recurrent neural network, and the long short-term memory (LSTM) neural network. An Elman recurrent neural network includes context nodes that remember the state of the hidden layer for the current time step; this state is provided as an input to the hidden layer in the next time step. A Jordan recurrent neural network is similar, except that its context nodes remember the state of the output layer rather than the hidden layer. An LSTM neural network includes an LSTM layer made up of LSTM cells. Each LSTM cell has a current state and a current output for the current time step, and a new state and a new output for a new, or subsequent, time step. An LSTM cell includes an input gate and an output gate, as well as a forget gate that enables the cell to forget its remembered state. These three types of recurrent neural network are described in more detail below.

As described herein, for a recurrent neural network such as an Elman or Jordan recurrent neural network, each execution of the neural network unit covers one time step: it takes a set of input layer node values and performs the computations necessary to propagate them through the recurrent neural network, generating the output layer node values as well as the hidden layer and context layer node values. Thus, input layer node values are associated with the time step in which they are used to compute the hidden, output, and context layer node values, and the hidden, output, and context layer node values are associated with the time step in which they are generated. Input layer node values are sampled values of the system modeled by the recurrent neural network, e.g., an image, a speech sample, or a snapshot of financial market data. For an LSTM neural network, each execution of the neural network unit likewise covers one time step: it takes a set of cell input values and performs the computations necessary to generate the cell output values (as well as the cell state and the input gate, forget gate, and output gate values), which may also be thought of as propagating the cell input values through the cells of the LSTM layer. Thus, cell input values are associated with the time step in which they are used to compute the cell state and the input gate, forget gate, and output gate values, and the cell state and gate values are associated with the time step in which they are generated.

Context layer node values, also referred to as state nodes, are the state of the neural network. This state is based on the input layer node values associated with previous time steps, not only on the input layer node values associated with the current time step. The computations performed by the neural network unit for a time step (e.g., the hidden layer node value computations for an Elman or Jordan recurrent neural network) are a function of the context layer node values generated in the previous time step. Therefore, the state of the network (the context node values) at the beginning of a time step influences the output layer node values generated during the time step. Furthermore, the state of the network at the end of the time step is affected by both the input node values of the time step and the state of the network at the beginning of the time step. Similarly, for LSTM cells, the cell state values are based on the cell input values of previous time steps, not only on the cell input values of the current time step. Because the computations performed by the neural network unit for a time step (e.g., the computation of the next cell state) are a function of the cell state values generated in the previous time step, the state of the network (the cell state values) at the beginning of the time step influences the cell output values generated during the time step, and the state of the network at the end of the time step is affected by both the cell input values of the time step and the previous state of the network.

Figure 40 is a block diagram illustrating an example of an Elman recurrent neural network. The Elman recurrent neural network of Figure 40 includes input layer nodes, or neurons, denoted D0, D1 through Dn, referred to collectively as input layer nodes D and individually as input layer node D; hidden layer nodes/neurons, denoted Z0, Z1 through Zn, referred to collectively as hidden layer nodes Z and individually as hidden layer node Z; output layer nodes/neurons, denoted Y0, Y1 through Yn, referred to collectively as output layer nodes Y and individually as output layer node Y; and context layer nodes/neurons, denoted C0, C1 through Cn, referred to collectively as context layer nodes C and individually as context layer node C. In the example of Figure 40, each hidden layer node Z has an input connection from the output of each input layer node D and an input connection from the output of each context layer node C; each output layer node Y has an input connection from the output of each hidden layer node Z; and each context layer node C has an input connection from the output of a corresponding hidden layer node Z.

In many respects, an Elman recurrent neural network operates like a traditional feed-forward artificial neural network. That is, for a given node, each input connection to the node has an associated weight; the value received on an input connection is multiplied by its associated weight to generate a product; the node adds the products associated with all of its input connections to generate a sum (which may also include a bias term); and, typically, an activation function is applied to the sum to generate the node's output value, sometimes referred to as the node's activation. In a traditional feed-forward network, data always flows in one direction, from the input layer to the output layer. That is, the input layer provides values to the hidden layer (typically there are multiple hidden layers), the hidden layer generates output values that are provided to the output layer, and the output layer generates an output that can be captured.
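The per-node computation described above can be written out as a minimal sketch. The function name, the default `tanh` activation, and the default bias of zero are illustrative assumptions; the text does not prescribe a particular activation function.

```python
import math


def node_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Weighted sum of the input connections, plus bias, then activation."""
    total = bias + sum(x * w for x, w in zip(inputs, weights))
    return activation(total)


# Two input connections with a tanh activation (assumed for illustration).
print(node_output([1.0, 2.0], [0.5, 0.25]))
```

Passing `activation=lambda s: s` exposes the raw accumulated sum, which is convenient when checking the arithmetic by hand.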

However, unlike a traditional feed-forward network, an Elman recurrent neural network also includes feedback connections, namely the connections from the hidden layer nodes Z to the context layer nodes C in Figure 40. The Elman recurrent neural network operates as follows: when the input layer nodes D provide input values to the hidden layer nodes Z in a new time step, the context nodes C provide to the hidden layer Z the output values that the hidden layer nodes Z generated in response to the previous input, that is, the input of the preceding time step. In this sense, the context nodes C of an Elman recurrent neural network are a memory based on the input values of previous time steps. An embodiment of the operation of the neural network unit 121 to perform the computations associated with the Elman recurrent neural network of Figure 40 is described with respect to Figures 41 and 42.

For purposes of the present disclosure, an Elman recurrent neural network is a recurrent neural network comprising at least an input node layer, a hidden node layer, an output node layer, and a context node layer. For a given time step, the context node layer stores the results that the hidden node layer generated in the previous time step and fed back to the context node layer. The results fed back to the context layer may be the results of an activation function, or they may be the results of the accumulations performed by the hidden node layer without an activation function having been applied.
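One time step of such a network can be sketched in a few lines. This is an illustrative model only: the function and parameter names are hypothetical, weights are indexed as `w[source][destination]`, a `tanh` activation is assumed, and the sketch feeds back the post-activation hidden values (the text equally allows feeding back the raw accumulated sums).

```python
import math


def elman_step(d, c, w_dz, w_cz, w_zy, act=math.tanh):
    """Compute one Elman time step.

    d: input layer values, c: context layer values (previous hidden values).
    Returns (y, z); z is what the context layer holds for the next step.
    """
    n_hidden = len(w_dz[0])
    z = [
        act(sum(d[i] * w_dz[i][j] for i in range(len(d)))
            + sum(c[i] * w_cz[i][j] for i in range(len(c))))
        for j in range(n_hidden)
    ]
    y = [
        act(sum(z[i] * w_zy[i][k] for i in range(n_hidden)))
        for k in range(len(w_zy[0]))
    ]
    return y, z


# Two time steps; the context starts at zero and then carries z forward.
w_dz = [[1.0, 0.0], [0.0, 1.0]]
w_cz = [[1.0, 1.0], [1.0, 1.0]]
w_zy = [[1.0], [1.0]]
context = [0.0, 0.0]
for sample in ([1.0, 2.0], [0.0, 0.0]):
    y, context = elman_step(sample, context, w_dz, w_cz, w_zy)
    print(y, context)
```

The second step receives an all-zero input yet still produces a nonzero output, which is the memory effect of the context layer.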

Figure 41 is a block diagram illustrating an example of the layout of data within the data RAM 122 and the weight RAM 124 of the neural network unit 121 as it performs the computations associated with the Elman recurrent neural network of Figure 40. The example of Figure 41 assumes that the Elman recurrent neural network of Figure 40 has 512 input nodes D, 512 hidden nodes Z, 512 context nodes C, and 512 output nodes Y. It also assumes that the network is fully connected, i.e., all 512 input nodes D are connected as inputs to each hidden node Z, all 512 context nodes C are connected as inputs to each hidden node Z, and all 512 hidden nodes Z are connected as inputs to each output node Y. In addition, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. Finally, the example assumes that the weights associated with the connections from the context nodes C to the hidden nodes Z all have a value of 1, so these unity weight values need not be stored.

As shown, the lower 512 rows of the weight RAM 124 (rows 0 through 511) hold the weight values associated with the connections between the input nodes D and the hidden nodes Z. More precisely, row 0 holds the weights associated with the input connections from input node D0 to the hidden nodes Z: word 0 holds the weight of the connection between input node D0 and hidden node Z0, word 1 holds the weight of the connection between input node D0 and hidden node Z1, word 2 holds the weight of the connection between input node D0 and hidden node Z2, and so on through word 511, which holds the weight of the connection between input node D0 and hidden node Z511. Row 1 holds the weights associated with the input connections from input node D1 to the hidden nodes Z in the same way: word 0 holds the weight of the connection between input node D1 and hidden node Z0, word 1 that between D1 and Z1, word 2 that between D1 and Z2, and so on through word 511, which holds the weight of the connection between D1 and Z511. This continues through row 511, which holds the weights associated with the input connections from input node D511 to the hidden nodes Z: word 0 holds the weight of the connection between D511 and Z0, word 1 that between D511 and Z1, word 2 that between D511 and Z2, and so on through word 511, which holds the weight of the connection between D511 and Z511. This layout and its use are similar to those of the embodiments described above with respect to Figures 4 through 6A.

As shown, the next 512 rows of the weight RAM 124 (rows 512 through 1023) hold, in a similar manner, the weights associated with the connections between the hidden nodes Z and the output nodes Y.
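The layout just described amounts to a simple address calculation, sketched below. The function name and the string labels for the two groups of connections are hypothetical. Reading "in a similar manner" to mean that row 512 + i holds the weights leaving hidden node Zi is an assumption of this sketch.

```python
N = 512  # nodes per layer in the example of Figure 41


def weight_location(connection, src, dst):
    """Return (row, word) of the weight for the connection src -> dst.

    connection: "D->Z" (rows 0..511) or "Z->Y" (rows 512..1023).
    """
    base_row = {"D->Z": 0, "Z->Y": N}[connection]
    return base_row + src, dst


print(weight_location("D->Z", 1, 511))  # weight between D1 and Z511
print(weight_location("Z->Y", 0, 2))    # weight between Z0 and Y2
```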

The data RAM 122 holds the Elman recurrent neural network node values for a sequence of time steps. More specifically, a group of three rows of the data RAM 122 holds the node values for a given time step. As shown, in an embodiment in which the data RAM 122 has 64 rows, it can hold the node values for 20 different time steps. In the example of Figure 41, rows 0 through 2 hold the node values for time step 0, rows 3 through 5 hold the node values for time step 1, and so on through rows 57 through 59, which hold the node values for time step 19. The first row of each group holds the input node D values of the time step. The second row of each group holds the hidden node Z values of the time step. The third row of each group holds the output node Y values of the time step. As shown, each column of the data RAM 122 holds the node values for its corresponding neuron, or neural processing unit 126. That is, column 0 holds the node values associated with nodes D0, Z0, and Y0, whose computations are performed by neural processing unit 0; column 1 holds the node values associated with nodes D1, Z1, and Y1, whose computations are performed by neural processing unit 1; and so on through column 511, which holds the node values associated with nodes D511, Z511, and Y511, whose computations are performed by neural processing unit 511, as described in more detail below with respect to Figure 42.
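The three-row grouping can be expressed as a small helper, shown here as a sketch with hypothetical names. It only restates the row arithmetic given above for the example of Figure 41.

```python
ROWS_PER_TIME_STEP = 3  # one row each for D, Z, and Y


def rows_for_time_step(t):
    """Return the data RAM rows holding the D, Z, and Y values of time step t."""
    base = ROWS_PER_TIME_STEP * t
    return {"D": base, "Z": base + 1, "Y": base + 2}


print(rows_for_time_step(0))   # first group of three rows
print(rows_for_time_step(19))  # last group used in the example
```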

As indicated in Figure 41, for a given time step, the hidden node Z values in the second row of each three-row group are the context node C values of the next time step. That is, the node Z values that a neural processing unit 126 computes and writes during a time step become the node C values that the neural processing unit 126 uses (along with the input node D values of the next time step) to compute the node Z values during the next time step. The initial value of the context nodes C (i.e., the node C values used in time step 0 to compute the node Z values in row 1) is assumed to be zero. This is described in more detail below in the discussion of the non-architectural program of Figure 42.

Preferably, the input node D values (rows 0, 3, and so on through row 57 in the example of Figure 41) are written/populated into the data RAM 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 42. Conversely, the hidden/output node Z/Y values (rows 1 and 2, 4 and 5, and so on through rows 58 and 59 in the example of Figure 41) are written/populated into the data RAM 122 by the non-architectural program running on the neural network unit 121, and are read/used by the architectural program running on the processor 100 via MFNN instructions 1500. The example of Figure 41 assumes that the architectural program performs the following steps: (1) populates the data RAM 122 with the input node D values for 20 different time steps (rows 0, 3, and so on through row 57); (2) starts the non-architectural program of Figure 42; (3) detects that the non-architectural program has completed; (4) reads the output node Y values from the data RAM 122 (rows 2, 5, and so on through row 59); and (5) repeats steps (1) through (4) as many times as needed to complete a task, e.g., the computations required to recognize an utterance made by a mobile phone user.
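Steps (1) through (5) can be sketched as a host-side loop. Everything here is an illustrative assumption: the four callbacks are hypothetical stand-ins for writing a data RAM row (via MTNN), starting the non-architectural program, waiting for completion, and reading a data RAM row (via MFNN), and they do not model the actual instruction interface.

```python
STEPS_PER_RUN = 20  # time steps handled by one run of the program


def run_task(samples, write_row, start_program, wait_done, read_row):
    """Process a sequence of input samples in batches of up to 20 time steps."""
    outputs = []
    for start in range(0, len(samples), STEPS_PER_RUN):
        batch = samples[start:start + STEPS_PER_RUN]
        for t, d_values in enumerate(batch):   # step (1): rows 0, 3, ... 57
            write_row(3 * t, d_values)
        start_program()                        # step (2)
        wait_done()                            # step (3)
        # step (4): rows 2, 5, ... 59; a short final batch reads fewer rows
        outputs.extend(read_row(3 * t + 2) for t in range(len(batch)))
    return outputs                             # step (5) is the outer loop
```

A dictionary can stand in for the data RAM when trying the loop out, for example `run_task(samples, ram.__setitem__, fake_start, lambda: None, ram.__getitem__)` with a user-supplied `fake_start` that fills in the Y rows.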

In an alternative approach, the architectural program performs the following steps: (1) populates the data RAM 122 with the input node D values for a single time step (e.g., row 0); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 42 that does not require the loop and accesses only a single group of three data RAM 122 rows); (3) detects that the non-architectural program has completed; (4) reads the output node Y values from the data RAM 122 (e.g., row 2); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable may depend on how the input values of the recurrent neural network are sampled. For example, if the task tolerates sampling the input for multiple time steps (e.g., on the order of 20 time steps) and then performing the computations, the first approach may be preferable, since it is likely to be more efficient in its use of computational resources and/or to perform better. If the task tolerates sampling at only a single time step, however, the second approach is required.

A third embodiment is similar to the second approach, except that rather than using a single group of three data RAM 122 rows, the non-architectural program uses multiple three-row groups, i.e., a different three-row group for each time step, similar to the first approach. In this third embodiment, the architectural program preferably includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data RAM 122 row in the instruction at address 1 to point to the next three-row group.

Figure 42 is a table illustrating a program stored in the program memory 129 of the neural network unit 121. The program is executed by the neural network unit 121 and uses the data and weights according to the layout of Figure 41 to accomplish an Elman recurrent neural network. Several of the instructions of the non-architectural program of Figure 42 (and of Figures 45, 48, 51, 54, and 57) have been described in detail above (e.g., the MULT-ACCUM, LOOP, and INITIALIZE instructions), and the following paragraphs assume that those descriptions apply unless otherwise noted.

The example program of Figure 42 contains 13 non-architectural instructions at addresses 0 through 12. The instruction at address 0 (INITIALIZE NPU, LOOPCNT=20) clears the accumulator 202 and initializes the loop counter 3804 to a value of 20 so that the loop body (the instructions at addresses 4 through 11) is performed 20 times. Preferably, the initialize instruction also puts the neural network unit 121 in a wide configuration, so that the neural network unit 121 is configured as 512 neural processing units 126. As described below, the 512 neural processing units 126 operate as 512 corresponding hidden layer nodes Z during execution of the instructions at addresses 1 through 3 and addresses 7 through 11, and operate as 512 corresponding output layer nodes Y during execution of the instructions at addresses 4 through 6.

The instructions at addresses 1 through 3 are outside the program's loop body and execute only once. They compute the initial values of the hidden layer nodes Z and write them to row 1 of the data RAM 122 for use by the first execution of the instructions at addresses 4 through 6, which compute the output layer nodes Y of the first time step (time step 0). Additionally, the hidden layer node Z values computed and written to row 1 of the data RAM 122 by the instructions at addresses 1 through 3 become the context layer node C values used by the first execution of the instructions at addresses 7 and 8 to compute the hidden layer node Z values for the second time step (time step 1).

During execution of the instructions at addresses 1 and 2, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in row 0 of the data RAM 122 by the weights in the column of rows 0 through 511 of the weight RAM 124 that corresponds to the neural processing unit 126, to generate 512 products that are accumulated into the accumulator 202 of that neural processing unit 126. During execution of the instruction at address 3, the values of the 512 accumulators 202 of the 512 neural processing units are passed through and written to row 1 of the data RAM 122. That is, the output instruction at address 3 writes the accumulator 202 value of each of the 512 neural processing units 126 to row 1 of the data RAM 122; these values are the initial hidden layer Z values. The instruction then clears the accumulator 202.

The operations performed by the instructions at addresses 1 and 2 of the non-architectural program of Figure 42 are similar to those performed by the instructions at addresses 1 and 2 of the non-architectural program of Figure 4. More specifically, the instruction at address 1 (MULT_ACCUM DR ROW 0) instructs each of the 512 neural processing units 126 to read its corresponding word of row 0 of the data random access memory 122 into its multiplex register 208, read its corresponding word of row 0 of the weight random access memory 124 into its multiplex register 705, multiply the data word by the weight word to produce a product, and add the product to the accumulator 202. The instruction at address 2 (MULT-ACCUM ROTATE,WR ROW+1,COUNT=511) instructs each of the 512 neural processing units 126 to rotate the word from the adjacent neural processing unit 126 into its multiplex register 208 (using the 512-word rotator formed by the collective operation of the 512 multiplex registers 208 of the neural network unit 121, which are the registers into which the instruction at address 1 read the row of the data random access memory 122), read its corresponding word of the next row of the weight random access memory 124 into its multiplex register 705, multiply the data word by the weight word to produce a product, add the product to the accumulator 202, and perform this operation 511 times.
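The multiply-accumulate sequence of addresses 1 and 2 can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the hardware implementation: the function name `mult_accum_rotate` is hypothetical, a small word count stands in for 512, and the rotation direction (unit j receives the word held by unit j-1) is an assumption, since the passage above only says the word comes from the adjacent unit.

```python
def mult_accum_rotate(data_row, weight_rows):
    """Model of one MULT-ACCUM followed by MULT-ACCUM ROTATE with COUNT = n - 1.

    data_row    -- the n data words read from one row of the data memory
    weight_rows -- n rows of n weight words; row r is used on step r
    Returns the n accumulator values, one per processing unit.
    """
    n = len(data_row)
    mux = list(data_row)          # address 1: each unit latches its own data word
    acc = [0] * n                 # accumulators assumed cleared beforehand
    for r in range(n):
        if r > 0:
            # address 2: rotate the data words by one position
            # (assumed direction: unit j takes the word from unit j - 1)
            mux = [mux[(j - 1) % n] for j in range(n)]
        for j in range(n):
            # every unit multiplies its current data word by its weight
            # word from weight row r and accumulates the product
            acc[j] += mux[j] * weight_rows[r][j]
    return acc
```

Under this assumed rotation direction, unit j ends with the sum over r of `data_row[(j - r) % n] * weight_rows[r][j]`, so after n steps every unit has seen every data word exactly once.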

In addition, the single non-architectural output instruction at address 3 of Figure 42 (OUTPUT PASSTHRU,DR OUT ROW 1,CLR ACC) combines the operations of the activation function instruction and the write output instruction at addresses 3 and 4 of Figure 4 (although the program of Figure 42 passes the accumulator 202 value through, whereas the program of Figure 4 performs an activation function on the accumulator 202 value). That is, in the program of Figure 42, the activation function performed on the accumulator 202 value, if any, is specified in the output instruction (as it also is in the output instructions at addresses 6 and 11) rather than in a separate non-architectural activation function instruction as in the program of Figure 4. Alternative embodiments of the non-architectural programs of Figure 4 (and of Figures 20, 26A and 28), in which the operations of the activation function instruction and the write output instruction (e.g., addresses 3 and 4 of Figure 4) are combined into a single non-architectural output instruction as in Figure 42, are also within the scope of the invention. The example of Figure 42 assumes that the nodes of the hidden layer (Z) perform no activation function on the accumulator values. However, embodiments in which the hidden layer (Z) performs an activation function on the accumulator values are also within the scope of the invention; such embodiments may specify the activation function (e.g., sigmoid, hyperbolic tangent or rectify) in the instructions at addresses 3 and 11.

Whereas the instructions at addresses 1 through 3 execute only once, the instructions at addresses 4 through 11 are inside the program loop and execute a number of times specified by the loop count (e.g., 20). The first nineteen executions of the instructions at addresses 7 through 11 compute the hidden layer node Z values and write them to the data random access memory 122 for use by the second through twentieth executions of the instructions at addresses 4 through 6, which compute the output layer nodes Y of the remaining time steps (time steps 1 through 19). (The last, i.e., twentieth, execution of the instructions at addresses 7 through 11 computes hidden layer node Z values and writes them to row 61 of the data random access memory 122, but these values are not used.)

During the first execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+1,WR ROW 512 and MULT-ACCUM ROTATE,WR ROW+1,COUNT=511), which corresponds to time step 0, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in row 1 of the data random access memory 122 (which were generated and written by the single execution of the instructions at addresses 1 through 3) by the weights in that neural processing unit's column of rows 512 through 1023 of the weight random access memory 124, to produce 512 products that are accumulated in the accumulator 202 of that neural processing unit 126. During the first execution of the instruction at address 6 (OUTPUT ACTIVATION FUNCTION,DR OUT ROW+1,CLR ACC), an activation function (e.g., sigmoid, hyperbolic tangent or rectify) is performed on the 512 accumulated values to compute the output layer node Y values, and the results are written to row 2 of the data random access memory 122.

During the second execution of the instructions at addresses 4 and 5 (corresponding to time step 1), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in row 4 of the data random access memory 122 (which were generated and written by the first execution of the instructions at addresses 7 through 11) by the weights in that neural processing unit's column of rows 512 through 1023 of the weight random access memory 124, to produce 512 products that are accumulated in the accumulator 202 of that neural processing unit 126; and during the second execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, which are written to row 5 of the data random access memory 122. During the third execution of the instructions at addresses 4 and 5 (corresponding to time step 2), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in row 7 of the data random access memory 122 (which were generated and written by the second execution of the instructions at addresses 7 through 11) by the weights in that neural processing unit's column of rows 512 through 1023 of the weight random access memory 124, to produce 512 products that are accumulated in the accumulator 202 of that neural processing unit 126; and during the third execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, which are written to row 8 of the data random access memory 122. And so forth, until during the twentieth execution of the instructions at addresses 4 and 5 (corresponding to time step 19), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in row 58 of the data random access memory 122 (which were generated and written by the nineteenth execution of the instructions at addresses 7 through 11) by the weights in that neural processing unit's column of rows 512 through 1023 of the weight random access memory 124, to produce 512 products that are accumulated in the accumulator 202 of that neural processing unit 126; and during the twentieth execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, which are written to row 59 of the data random access memory 122.
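The row numbers quoted above follow a regular pattern: each time step occupies three consecutive rows of the data random access memory 122, holding D, Z and Y in that order. The helper below is a minimal sketch of that arithmetic; the function name `elman_rows` is a hypothetical label used only for illustration.

```python
def elman_rows(t):
    """Data-memory rows used by time step t in the Figure 42 example.

    Three rows per time step: input nodes D, hidden nodes Z, output nodes Y.
    """
    base = 3 * t
    return {"D": base, "Z": base + 1, "Y": base + 2}


# Time step 0 reads D from row 0, Z from row 1 and writes Y to row 2;
# time step 19 reads Z from row 58 and writes Y to row 59.
for t in (0, 1, 2, 19):
    print(t, elman_rows(t))
```

The same formula gives row 61 for the hidden node values written by the twentieth execution of addresses 7 through 11, which the text describes as written but never used.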

During the first execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values in row 1 of the data random access memory 122, which were generated by the single execution of the instructions at addresses 1 through 3. More specifically, the instruction at address 7 (ADD_D_ACC DR ROW+0) instructs each of the 512 neural processing units 126 to read its corresponding word of the current row of the data random access memory 122 (row 0 during the first execution) into its multiplex register 208 and add the word to the accumulator 202. The instruction at address 8 (ADD_D_ACC ROTATE,COUNT=511) instructs each of the 512 neural processing units 126 to rotate the word from the adjacent neural processing unit 126 into its multiplex register 208 (using the 512-word rotator formed by the collective operation of the 512 multiplex registers 208 of the neural network unit 121, which are the registers into which the instruction at address 7 read the row of the data random access memory 122), add the word to the accumulator 202, and perform this operation 511 times.

During the second execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values in row 4 of the data random access memory 122, which were generated and written by the first execution of the instructions at addresses 9 through 11. During the third execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values in row 7 of the data random access memory 122, which were generated and written by the second execution of the instructions at addresses 9 through 11. And so forth, until during the twentieth execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values in row 58 of the data random access memory 122, which were generated and written by the nineteenth execution of the instructions at addresses 9 through 11.
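Because the context-to-hidden weights are assumed to be one, the effect of addresses 7 and 8 is that every accumulator receives the sum of all context node values. The following is a minimal sketch of that behavior; the function name `add_d_acc_rotate` is hypothetical, the accumulators are assumed to start at zero, and the rotation direction is an assumption (the result does not depend on it).

```python
def add_d_acc_rotate(context_row):
    """Model of ADD_D_ACC followed by ADD_D_ACC ROTATE with COUNT = n - 1."""
    n = len(context_row)
    mux = list(context_row)       # address 7: each unit reads its own word
    acc = list(mux)               # ... and adds it to a cleared accumulator
    for _ in range(n - 1):        # address 8: repeated n - 1 times
        mux = [mux[(j - 1) % n] for j in range(n)]   # rotate by one unit
        acc = [a + m for a, m in zip(acc, mux)]      # add the rotated word
    return acc


# With four context values, every unit accumulates their total.
print(add_d_acc_rotate([1, 2, 3, 4]))
```

This is why no weight rows need to be stored for the context connections in this example: a plain add replaces the multiply.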

As stated above, the example of Figure 42 assumes that the weights associated with the connections from the context nodes C to the hidden layer nodes Z have a value of one. However, in an alternative embodiment in which these connections within the Elman recurrent neural network have non-zero weight values, the weights are placed in the weight random access memory 124 (e.g., in rows 1024 through 1535) before the program of Figure 42 executes, the program instruction at address 7 is MULT-ACCUM DR ROW+0,WR ROW 1024, and the program instruction at address 8 is MULT-ACCUM ROTATE,WR ROW+1,COUNT=511. Preferably, the instruction at address 8 does not access the weight random access memory 124, but instead rotates the values that the instruction at address 7 read from the weight random access memory 124 into the multiplex registers 705. Not accessing the weight random access memory 124 during the 511 clock cycles in which the instruction at address 8 executes leaves more bandwidth for the architectural program to access the weight random access memory 124.

During the first execution of the instructions at addresses 9 and 10 (MULT-ACCUM DR ROW+2,WR ROW 0 and MULT-ACCUM ROTATE,WR ROW+1,COUNT=511), which corresponds to time step 1, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in row 3 of the data random access memory 122 by the weights in that neural processing unit's column of rows 0 through 511 of the weight random access memory 124 to produce 512 products which, together with the accumulation of the 512 context node C values performed by the instructions at addresses 7 and 8, are accumulated in the accumulator 202 of that neural processing unit 126 to compute the hidden layer node Z values; and during the first execution of the instruction at address 11 (OUTPUT PASSTHRU,DR OUT ROW+2,CLR ACC), the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to row 4 of the data random access memory 122, and the accumulators 202 are cleared. During the second execution of the instructions at addresses 9 and 10 (corresponding to time step 2), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in row 6 of the data random access memory 122 by the weights in that neural processing unit's column of rows 0 through 511 of the weight random access memory 124 to produce 512 products which, together with the accumulation of the 512 context node C values performed by the instructions at addresses 7 and 8, are accumulated in the accumulator 202 of that neural processing unit 126 to compute the hidden layer node Z values; and during the second execution of the instruction at address 11, the 512 accumulator 202 values are passed through and written to row 7 of the data random access memory 122, and the accumulators 202 are cleared. And so forth, until during the nineteenth execution of the instructions at addresses 9 and 10 (corresponding to time step 19), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in row 57 of the data random access memory 122 by the weights in that neural processing unit's column of rows 0 through 511 of the weight random access memory 124 to produce 512 products which, together with the accumulation of the 512 context node C values performed by the instructions at addresses 7 and 8, are accumulated in the accumulator 202 of that neural processing unit 126 to compute the hidden layer node Z values; and during the nineteenth execution of the instruction at address 11, the 512 accumulator 202 values are passed through and written to row 58 of the data random access memory 122, and the accumulators 202 are cleared. As stated above, the hidden layer node Z values generated and written during the twentieth execution of the instructions at addresses 9 and 10 are not used.
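Taken together, the instructions at addresses 1 through 11 implement a simple recurrence. The sketch below models it at the level of whole layers rather than individual rotations. It is an illustration under assumptions, not the program itself: the names `matvec` and `elman_run` are hypothetical, `w[i][j]` is taken to be the weight from source node i to destination node j (ignoring how the weights are laid out for the rotator), the context weights are one, and the final unused hidden-layer computation is omitted.

```python
def matvec(w, x):
    """Return y[j] = sum over i of x[i] * w[i][j]."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def elman_run(inputs, w_dz, w_zy, act):
    """inputs: one list of input node D values per time step."""
    z = matvec(w_dz, inputs[0])                 # addresses 1-3: initial Z
    outputs = []
    for t in range(len(inputs)):
        # addresses 4-6: Y = act(W_zy * Z)
        outputs.append([act(v) for v in matvec(w_zy, z)])
        if t + 1 < len(inputs):
            c_sum = sum(z)                      # addresses 7-8: add all C (= previous Z)
            # addresses 9-11: add W_dz * D of the next time step, pass through
            z = [c_sum + v for v in matvec(w_dz, inputs[t + 1])]
    return outputs
```

For example, with 2x2 identity weight matrices, an identity activation and inputs `[[1, 2], [3, 4]]`, the first hidden values are `[1, 2]`, so the second time step adds the context sum 3 to each input and yields `[6, 7]`.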

The instruction at address 12 (LOOP 4) decrements the loop counter 3804 and loops back to the instruction at address 4 if the new loop counter 3804 value is greater than zero.

Figure 43 is a block diagram showing an example of a Jordan recurrent neural network. The Jordan recurrent neural network of Figure 43 is similar to the Elman recurrent neural network of Figure 40 in that it has input layer nodes/neurons D, hidden layer nodes/neurons Z, output layer nodes/neurons Y, and context layer nodes/neurons C. However, in the Jordan recurrent neural network of Figure 43, the input connections of the context layer nodes C are fed back from the outputs of their corresponding output layer nodes Y, rather than from the outputs of the hidden layer nodes Z as in the Elman recurrent neural network of Figure 40.

For purposes of this disclosure, a Jordan recurrent neural network is a recurrent neural network comprising at least an input node layer, a hidden node layer, an output node layer and a context node layer. At the beginning of a given time step, the context node layer holds the results that the output node layer generated during the previous time step and fed back to the context node layer. The results fed back to the context layer may be the results of an activation function, or they may be the results of the accumulations performed by the output node layer without an activation function being applied.

Figure 44 is a block diagram showing an example of the layout of data within the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as it performs the calculations associated with the Jordan recurrent neural network of Figure 43. The example of Figure 44 assumes that the Jordan recurrent neural network of Figure 43 has 512 input nodes D, 512 hidden nodes Z, 512 context nodes C and 512 output nodes Y. It further assumes that the Jordan recurrent neural network is fully connected, i.e., all 512 input nodes D are connected as inputs to each hidden node Z, all 512 context nodes C are connected as inputs to each hidden node Z, and all 512 hidden nodes Z are connected as inputs to each output node Y. Although the example Jordan recurrent neural network of Figure 44 applies an activation function to the accumulator 202 values to generate the output layer node Y values, the example assumes that the accumulator 202 values before the activation function is applied, rather than the actual output layer node Y values, are passed to the context layer nodes C. In addition, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. Finally, the example assumes that the weights associated with the connections from the context nodes C to the hidden nodes Z all have a value of one; consequently, there is no need to store these unity weight values.

As in the example of Figure 41, and as shown, the lower 512 rows of the weight random access memory 124 (rows 0 through 511) hold the weight values associated with the connections between the input nodes D and the hidden nodes Z, and the next 512 rows of the weight random access memory 124 (rows 512 through 1023) hold the weight values associated with the connections between the hidden nodes Z and the output nodes Y.

The data random access memory 122 holds the Jordan recurrent neural network node values for a sequence of time steps, similar to the example of Figure 41; however, in the example of Figure 44, a group of four rows holds the node values for a given time step. As shown, in an embodiment in which the data random access memory 122 has 64 rows, the data random access memory 122 can hold the node values for 15 different time steps. In the example of Figure 44, rows 0 through 3 hold the node values for time step 0, rows 4 through 7 hold the node values for time step 1, and so forth to rows 60 through 63, which hold the node values for time step 15. The first row of a four-row group holds the input node D values of the time step. The second row of a four-row group holds the hidden node Z values of the time step. The third row of a four-row group holds the context node C values of the time step. The fourth row of a four-row group holds the output node Y values of the time step. As shown, each column of the data random access memory 122 holds the node values for its corresponding neuron, or neural processing unit 126. That is, column 0 holds the node values associated with nodes D0, Z0, C0 and Y0, whose computations are performed by neural processing unit 0; column 1 holds the node values associated with nodes D1, Z1, C1 and Y1, whose computations are performed by neural processing unit 1; and so forth to column 511, which holds the node values associated with nodes D511, Z511, C511 and Y511, whose computations are performed by neural processing unit 511. This is described in more detail below with respect to Figure 44.
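The four-row grouping described above reduces to a small address calculation. The helper below is a minimal sketch; the name `jordan_rows` is a hypothetical label used only for illustration.

```python
def jordan_rows(t):
    """Data-memory rows holding the node values of time step t (Figure 44 layout).

    Each time step owns four consecutive rows, in the order D, Z, C, Y.
    """
    base = 4 * t
    return {"D": base, "Z": base + 1, "C": base + 2, "Y": base + 3}


# Rows 0-3 belong to time step 0, rows 4-7 to time step 1, and so on.
for t in (0, 1, 15):
    print(t, jordan_rows(t))
```

Within any one of these rows, word j belongs to node j and is handled by processing unit j, so a single row read or write touches all 512 nodes of a layer at once.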

The context node C values for a given time step in Figure 44 are generated during that time step and used as inputs in the next time step. That is, the node C values that a neural processing unit 126 computes and writes during a time step become the node C values that the neural processing unit 126 uses, together with the input node D values of the next time step, to compute the node Z values during the next time step. The initial value of the context nodes C (i.e., the node C values used to compute the row 1 node Z values of time step 0) is assumed to be zero. This is described in more detail below in the discussion of the non-architectural program of Figure 45.
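The timing of the context values can be illustrated with a layer-level sketch of the Jordan recurrence. It is only a model under assumptions: the names `matvec` and `jordan_run` are hypothetical, `w[i][j]` is taken as the weight from source node i to destination node j, the context weights are one, the initial context is zero, and the value fed back is the output layer's accumulation before the activation function, as this example assumes.

```python
def matvec(w, x):
    """Return y[j] = sum over i of x[i] * w[i][j]."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def jordan_run(inputs, w_dz, w_zy, act):
    """inputs: one list of input node D values per time step."""
    context = [0] * len(w_zy[0])        # initial context values assumed zero
    outputs = []
    for d in inputs:
        c_sum = sum(context)            # unit weights: each Z sees the sum of all C
        z = [c_sum + v for v in matvec(w_dz, d)]
        pre_activation = matvec(w_zy, z)
        context = pre_activation        # fed back for use in the next time step
        outputs.append([act(v) for v in pre_activation])
    return outputs
```

With 2x2 identity weights, a doubling activation and inputs `[[1, 2], [3, 4]]`, the first step yields outputs `[2, 4]` and context `[1, 2]`; the second step adds the context sum 3 to each input, giving hidden values `[6, 7]` and outputs `[12, 14]`.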

As described above with respect to Figure 41, preferably the input node D values (in the example of Figure 44, the values in rows 0, 4, and so forth to row 60) are written/populated into the data random access memory 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 45. Conversely, the hidden node Z/context node C/output node Y values (in the example of Figure 44, the values in rows 1/2/3, 5/6/7, and so forth to rows 61/62/63, respectively) are written/populated into the data random access memory 122 by the non-architectural program running on the neural network unit 121, and are read/used by the architectural program running on the processor 100 via MFNN instructions 1500. The example of Figure 44 assumes that the architectural program performs the following steps: (1) populates the data random access memory 122 with the input node D values for 15 different time steps (rows 0, 4, and so forth to row 60); (2) starts the non-architectural program of Figure 45; (3) detects that the non-architectural program has completed; (4) reads the output node Y values out of the data random access memory 122 (rows 3, 7, and so forth to row 63); and (5) repeats steps (1) through (4) as many times as needed to complete a task, e.g., the computations required to recognize a statement made by a mobile phone user.
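The control flow of steps (1) through (4) can be sketched as a small driver loop. Everything in this sketch is hypothetical: `MockNNU` is a stand-in written purely for illustration, its `write_row` and `read_row` methods merely play the roles of the MTNN and MFNN instructions, and its `start` method is a placeholder that copies D to Y instead of running the actual non-architectural program.

```python
class MockNNU:
    """Hypothetical stand-in for the neural network unit; not a real API."""

    def __init__(self, rows=64, width=4):
        self.data_ram = [[0] * width for _ in range(rows)]
        self.finished = False

    def write_row(self, row, words):     # plays the role of an MTNN write
        self.data_ram[row] = list(words)

    def read_row(self, row):             # plays the role of an MFNN read
        return list(self.data_ram[row])

    def start(self, steps):
        # Placeholder for the non-architectural program: it only copies the
        # D row of each four-row group to the Y row so the flow can be run.
        for t in range(steps):
            self.data_ram[4 * t + 3] = list(self.data_ram[4 * t])
        self.finished = True


def run_batch(nnu, inputs):
    for t, d in enumerate(inputs):       # step (1): fill rows 0, 4, 8, ...
        nnu.write_row(4 * t, d)
    nnu.finished = False
    nnu.start(len(inputs))               # step (2): start the program
    while not nnu.finished:              # step (3): wait for completion
        pass
    # step (4): read back rows 3, 7, 11, ...
    return [nnu.read_row(4 * t + 3) for t in range(len(inputs))]
```

Step (5) corresponds to calling `run_batch` again with the next batch of input samples until the task is complete.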

In an alternative approach, the architectural program performs the following steps: (1) populates the data random access memory 122 with the input node D values for a single time step (e.g., row 0); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 45 that does not loop and accesses only a single group of four rows of the data random access memory 122); (3) detects that the non-architectural program has completed; (4) reads the output node Y values out of the data random access memory 122 (e.g., row 3); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable depends on how the input values to the recurrent neural network are sampled. For example, if the task tolerates sampling the inputs over multiple time steps (e.g., on the order of 15 time steps) and then performing the computations, the first approach is preferable because it is likely to be more efficient in computational resources and/or higher performing; whereas, if the task only tolerates sampling at a single time step, the second approach is required.

A third embodiment is similar to the second approach, except that rather than using a single group of four rows of the data random access memory 122, the non-architectural program uses multiple groups of four rows, i.e., a different group of four rows for each time step, in this respect resembling the first approach. In the third embodiment, preferably the architectural program includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data random access memory 122 row in the instruction at address 1 to point to the next group of four rows.

Figure 45 is a table showing a program stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121, which uses the data and weights according to the layout of Figure 44 to implement a Jordan recurrent neural network. The non-architectural program of Figure 45 is similar to the non-architectural program of Figure 42; the differences between them are described below.

The example program of Figure 45 includes 14 non-architectural instructions, at addresses 0 through 13. The instruction at address 0 is an initialize instruction that clears the accumulator 202 and initializes the loop counter 3804 to a value of 15 so that the loop body (the instructions at addresses 4 through 12) is performed 15 times. Preferably, the initialize instruction also places the neural network unit 121 in a wide configuration so that it is configured as 512 neural processing units 126. As described herein, the 512 neural processing units 126 correspond to and operate as the 512 hidden layer nodes Z during execution of the instructions at addresses 1 through 3 and addresses 8 through 12, and correspond to and operate as the 512 output layer nodes Y during execution of the instructions at addresses 4, 5 and 7.

The instructions at addresses 1 through 5 and address 7 are the same as the instructions at addresses 1 through 6 of Figure 42 and perform the same functions. The instructions at addresses 1 through 3 compute the initial values of the hidden layer nodes Z and write them to row 1 of the data random access memory 122, for use by the first execution of the instructions at addresses 4, 5, and 7 in computing the output layer nodes Y of the first time step (time step 0).

During the first execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which the output instruction at address 7 subsequently uses to compute and write the output layer node Y values) are passed through and written to row 2 of the data random access memory 122. These are the context layer node C values produced in the first time step (time step 0) and used in the second time step (time step 1). During the second execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (again, subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to row 6 of the data random access memory 122. These are the context layer node C values produced in the second time step (time step 1) and used in the third time step (time step 2). And so on, until during the fifteenth execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 are passed through and written to row 58 of the data random access memory 122. These are the context layer node C values produced in the fifteenth time step (time step 14); they are read by the instruction at address 8 but are not used.

The instructions at addresses 8 through 12 are substantially the same as the instructions at addresses 7 through 11 of Figure 42 and perform the same functions, with one difference: the instruction at address 8 of Figure 45 (ADD_D_ACC DR ROW+1) increments the data random access memory 122 row by one, whereas the instruction at address 7 of Figure 42 (ADD_D_ACC DR ROW+0) increments it by zero. This difference arises from the different data layouts in the data random access memory 122. Specifically, the four-row groups of Figure 44 include a separate row for the context layer node C values (e.g., rows 2, 6, 10, and so on), whereas the three-row groups of Figure 41 have no such separate row; instead, the context layer node C values share a row with the hidden layer node Z values (e.g., rows 1, 4, 7, and so on). The fifteen executions of the instructions at addresses 8 through 12 compute the hidden layer node Z values and write them to the data random access memory 122 (rows 5, 9, 13, and so on through row 57) for use by the second through fifteenth executions of the instructions at addresses 4, 5, and 7 in computing the output layer nodes Y of the second through fifteenth time steps (time steps 1 through 14). (The last, fifteenth, execution of the instructions at addresses 8 through 12 computes hidden layer node Z values and writes them to row 61 of the data random access memory 122, but these values are not used.)

The loop instruction at address 13 decrements the loop counter 3804 and, if the new value of the loop counter 3804 is greater than zero, returns to the instruction at address 4.
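
The row addressing described for the program of Figure 45 can be summarized in a short sketch. It is illustrative only: the function and constant names are assumptions introduced here, and only the context layer node C rows and hidden layer node Z rows stated in the text are modeled.

```python
ROWS_PER_STEP = 4   # four data RAM rows per time step (Figure 44 layout)
LOOP_COUNT = 15     # initial value of the loop counter 3804


def context_row(step):
    """Row written by the output instruction at address 6 in time step `step`."""
    return ROWS_PER_STEP * step + 2


def hidden_row_written(step):
    """Row written by addresses 8-12 in time step `step` (Z for the next step)."""
    return ROWS_PER_STEP * (step + 1) + 1


context_rows = [context_row(t) for t in range(LOOP_COUNT)]
hidden_rows = [hidden_row_written(t) for t in range(LOOP_COUNT)]
```

Under these assumptions the context rows run from row 2 to row 58 and the hidden rows from row 5 to row 61, where the final row 61 is the one the text notes is written but never used.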

In another embodiment, the Jordan recurrent neural network is designed so that the context nodes C hold the activation function values of the output nodes Y, that is, the accumulated values after the activation function has been applied. In this embodiment, because the output node Y values are the same as the context node C values, the non-architectural instruction at address 6 is not included in the non-architectural program, and fewer rows of the data random access memory 122 are used. More precisely, the rows of Figure 44 that hold context node C values (e.g., rows 2, 6, 59) do not exist in this embodiment. In addition, each time step of this embodiment requires only three rows of the data random access memory 122, so 20 time steps can be accommodated rather than 15, and the addresses of the instructions of the non-architectural program of Figure 45 are adjusted accordingly.

Long Short-Term Memory (LSTM) Cells

The use of long short-term memory (LSTM) cells in recurrent neural networks is well known in the art. See, for example, Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, Neural Computation, November 15, 1997, Vol. 9, No. 8, Pages 1735-1780; and Learning to Forget: Continual Prediction with LSTM, Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins, Neural Computation, October 2000, Vol. 12, No. 10, Pages 2451-2471; both are available from MIT Press Journals. LSTM cells can be constructed in many different forms. The LSTM cell 4600 of Figure 46, described below, is modeled on the LSTM cell described in the tutorial titled LSTM Networks for Sentiment Analysis at http://deeplearning.net/tutorial/lstm.html, a copy of which was downloaded on October 19, 2015 (hereafter the "LSTM tutorial") and is provided in the Information Disclosure Statement of the U.S. application of this case. The LSTM cell 4600 is used to illustrate generally the ability of the neural network unit 121 embodiments described herein to efficiently perform the computations associated with LSTMs. Notably, the embodiments of the neural network unit 121, including the embodiment of Figure 49, can also efficiently perform the computations associated with LSTM cells other than the one described in Figure 46.

Preferably, the neural network unit 121 can perform computations for a recurrent neural network in which a layer of LSTM cells is connected to other layers. For example, in the LSTM tutorial, the network includes a mean pooling layer that receives the outputs (H) of the LSTM cells of the LSTM layer, and a logistic regression layer that receives the output of the mean pooling layer.

Figure 46 is a block diagram showing an embodiment of an LSTM cell 4600.

As shown, the LSTM cell 4600 includes a cell input (X), a cell output (H), an input gate (I), an output gate (O), a forget gate (F), a cell state (C), and a candidate cell state (C'). The input gate (I) gates the cell input (X) to the cell state (C), and the output gate (O) gates the cell state (C) to the cell output (H). The cell state (C) is fed back as the candidate cell state (C') of a time step. The forget gate (F) gates the candidate cell state (C'), which is fed back and becomes the cell state (C) of the next time step.

The embodiment of Figure 46 uses the following equations to compute the values described above:

(1) I = SIGMOID(Wi * X + Ui * H + Bi)

(2) F = SIGMOID(Wf * X + Uf * H + Bf)

(3) C' = TANH(Wc * X + Uc * H + Bc)

(4) C = I * C' + F * C

(5) O = SIGMOID(Wo * X + Uo * H + Bo)

(6) H = O * TANH(C)

Wi and Ui are the weight values associated with the input gate (I), and Bi is the bias value associated with the input gate (I). Wf and Uf are the weight values associated with the forget gate (F), and Bf is the bias value associated with the forget gate (F). Wo and Uo are the weight values associated with the output gate (O), and Bo is the bias value associated with the output gate (O). As shown, equations (1), (2), and (5) compute the input gate (I), forget gate (F), and output gate (O), respectively. Equation (3) computes the candidate cell state (C'), and equation (4) computes the new cell state (C) from the candidate cell state (C'), taking as input the current cell state (C), that is, the cell state (C) of the current time step. Equation (6) computes the cell output (H). The invention is not limited to these equations, however; embodiments of LSTM cells that compute the input gate, forget gate, output gate, candidate cell state, cell state, and cell output in other ways are also contemplated.
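
Equations (1) through (6) can be sketched for a single cell as follows. This is a minimal illustration rather than the patented implementation: it assumes scalar floating-point values per cell (one X word and one H word per cell, in the spirit of the one-word-per-unit layout described for Figure 47), and the function name `lstm_cell_step` and the parameter dictionary `p` are hypothetical names introduced here.

```python
import math


def sigmoid(v):
    """Logistic function used for the three gates."""
    return 1.0 / (1.0 + math.exp(-v))


def lstm_cell_step(x, h_prev, c_prev, p):
    """One time step of equations (1)-(6) for a single cell.

    x      -- cell input X of the current time step
    h_prev -- cell output H of the previous time step
    c_prev -- cell state C of the previous time step
    p      -- dict of scalar weights and biases keyed "Wi", "Ui", "Bi", ...
    """
    i = sigmoid(p["Wi"] * x + p["Ui"] * h_prev + p["Bi"])          # (1) input gate
    f = sigmoid(p["Wf"] * x + p["Uf"] * h_prev + p["Bf"])          # (2) forget gate
    c_cand = math.tanh(p["Wc"] * x + p["Uc"] * h_prev + p["Bc"])   # (3) candidate state
    c = i * c_cand + f * c_prev                                    # (4) new cell state
    o = sigmoid(p["Wo"] * x + p["Uo"] * h_prev + p["Bo"])          # (5) output gate
    h = o * math.tanh(c)                                           # (6) cell output
    return h, c
```

Calling the function once per time step and feeding the returned `h` and `c` back in as `h_prev` and `c_prev` reproduces the feedback of the cell state and cell output described above.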

For purposes of this description, an LSTM cell includes a cell input, a cell output, a cell state, a candidate cell state, an input gate, an output gate, and a forget gate. For each time step, the input gate, output gate, forget gate, and candidate cell state are functions of the cell input of the current time step, the cell output of the previous time step, and the associated weights. The cell state of the time step is a function of the cell state of the previous time step, the candidate cell state, the input gate, and the forget gate, consistent with equation (4). In this sense, the cell state is fed back and used to compute the cell state of the next time step. The cell output of the time step is a function of the cell state computed for the time step and the output gate. An LSTM neural network is a neural network that has a layer of LSTM cells.

Figure 47 is a block diagram showing an example of the data layout in the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as it performs the computations associated with a layer of LSTM cells 4600 of Figure 46 in an LSTM neural network. In the example of Figure 47, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. However, only the values produced by 128 of the neural processing units 126 (e.g., neural processing units 0 through 127) are used, because the LSTM layer in this example has only 128 LSTM cells 4600.

As shown, columns 0 through 127 of the weight random access memory 124 hold the weight, bias, and intermediate values for the corresponding neural processing units 0 through 127 of the neural network unit 121. Rows 0 through 14 each hold 128 of the following values, which correspond to equations (1) through (6) above, for provision to neural processing units 0 through 127: Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, C', TANH(C), C, Wo, Uo, Bo. Preferably, the weight and bias values (Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo, Bo, in rows 0 through 8 and rows 12 through 14) are written to the weight random access memory 124 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read and used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the intermediate values (C', TANH(C), C, in rows 9 through 11) are written to the weight random access memory 124 and also read and used by the non-architectural program running on the neural network unit 121, as described below.
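
The weight random access memory 124 layout just described can be written down as a small lookup table. The sketch below is only an illustration; the names `WEIGHT_RAM_ROWS`, `ROW_OF`, and `INTERMEDIATE` are assumptions introduced here, while the row order follows the list in the text.

```python
# Row order of the weight RAM 124 in the Figure 47 example (rows 0 through 14).
WEIGHT_RAM_ROWS = [
    "Wi", "Ui", "Bi",        # input gate weights and bias
    "Wf", "Uf", "Bf",        # forget gate weights and bias
    "Wc", "Uc", "Bc",        # candidate cell state weights and bias
    "C'", "TANH(C)", "C",    # intermediate values
    "Wo", "Uo", "Bo",        # output gate weights and bias
]

ROW_OF = {name: row for row, name in enumerate(WEIGHT_RAM_ROWS)}

# Intermediate values are produced by the non-architectural program itself.
INTERMEDIATE = {"C'", "TANH(C)", "C"}

# Rows populated by the architectural program (weights and biases).
architectural_rows = sorted(ROW_OF[n] for n in WEIGHT_RAM_ROWS if n not in INTERMEDIATE)

# Rows written and read only by the non-architectural program.
non_architectural_rows = sorted(ROW_OF[n] for n in INTERMEDIATE)
```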

As shown, the data random access memory 122 holds the input (X), output (H), input gate (I), forget gate (F), and output gate (O) values for a series of time steps. More specifically, a group of five rows holds the X, H, I, F, and O values for a given time step. Taking a data random access memory 122 with 64 rows as an example, it can hold the cell values for 12 different time steps, as shown. In the example of Figure 47, rows 0 through 4 hold the cell values for time step 0, rows 5 through 9 hold the cell values for time step 1, and so on through rows 55 through 59, which hold the cell values for time step 11. Within each five-row group, the first row holds the X values of the time step, the second row holds the H values, the third row holds the I values, the fourth row holds the F values, and the fifth row holds the O values. As shown, each column of the data random access memory 122 holds the values used by its corresponding neuron, or neural processing unit 126. That is, column 0 holds the values associated with LSTM cell 0, whose computations are performed by neural processing unit 0; column 1 holds the values associated with LSTM cell 1, whose computations are performed by neural processing unit 1; and so on through column 127, which holds the values associated with LSTM cell 127, whose computations are performed by neural processing unit 127, as described below with respect to Figure 48.
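
The five-row grouping of the data random access memory 122 reduces to simple index arithmetic, sketched below. The names `DATA_RAM_ROWS`, `GROUP`, and `data_row` are assumptions introduced for illustration; the 64-row capacity and the X, H, I, F, O ordering are taken from the text.

```python
DATA_RAM_ROWS = 64                  # example capacity used in the text
GROUP = ("X", "H", "I", "F", "O")   # row order within each five-row group


def data_row(step, name):
    """Data RAM row holding value `name` for time step `step`."""
    return len(GROUP) * step + GROUP.index(name)


# Number of complete five-row groups that fit in the data RAM.
max_steps = DATA_RAM_ROWS // len(GROUP)
```

With 64 rows, twelve complete groups fit, which matches the 12 time steps of the example; the remaining four rows do not form a complete group.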

Preferably, the X values (in rows 0, 5, 10, and so on through row 55) are written to the data random access memory 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read and used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the I, F, and O values (in rows 2/3/4, 7/8/9, 12/13/14, and so on through rows 57/58/59) are written to the data random access memory 122 by the non-architectural program running on the neural network unit 121, as described below. Preferably, the H values (in rows 1, 6, 11, and so on through row 56) are written to the data random access memory 122 and read and used by the non-architectural program running on the neural network unit 121, and are read by the architectural program running on the processor 100 via MFNN instructions 1500.

The example of Figure 47 assumes that the architectural program performs the following steps: (1) for 12 different time steps, populates the data random access memory 122 with the input X values (rows 0, 5, and so on through row 55); (2) starts the non-architectural program of Figure 48; (3) detects that the non-architectural program has completed; (4) reads the output H values from the data random access memory 122 (rows 1, 6, and so on through row 59); and (5) repeats steps (1) through (4) as many times as needed to complete a task, for example the computations required to recognize an utterance by a mobile phone user.

In an alternative approach, the architectural program performs the following steps: (1) for a single time step, populates the data random access memory 122 with the input X values (e.g., row 0); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 48 that does not loop and accesses only a single group of five rows of the data random access memory 122); (3) detects that the non-architectural program has completed; (4) reads the output H values from the data random access memory 122 (e.g., row 1); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable depends on how the input X values to the LSTM layer are sampled. For example, if the task tolerates sampling the input over multiple time steps (e.g., on the order of 12 time steps) and then performing the computations, the first approach is preferable, since it is likely to be more efficient in computational resources and/or to perform better. If the task can only tolerate sampling at a single time step, the second approach is required.

A third embodiment is similar to the second approach described above. However, instead of using a single group of five rows of the data random access memory 122, the non-architectural program uses multiple five-row groups, that is, a different group of five rows for each time step, which is similar to the first approach. In this third embodiment, the architectural program preferably includes a step before step (2) in which it updates the non-architectural program before starting it, for example by updating the data random access memory 122 row in the instruction at address 0 to point to the next group of five rows.

Figure 48 is a table showing a program stored in the program memory 129 of the neural network unit 121. The neural network unit 121 executes this program, using data and weights arranged as in Figure 47, to perform the computations associated with an LSTM cell layer. The example program of Figure 48 includes 24 non-architectural instructions at addresses 0 through 23. The instruction at address 0 (INITIALIZE NPU, CLR ACC, LOOPCNT=12, DR IN ROW=-1, DR OUT ROW=2) clears the accumulator 202 and initializes the loop counter 3804 to 12, so that the loop body (the instructions at addresses 1 through 22) executes 12 times. The initialization instruction also initializes the data random access memory 122 row to be read to -1, which is incremented to zero by the first execution of the instruction at address 1, and initializes the data random access memory 122 row to be written (e.g., register 2606 of Figures 26 and 39) to row 2. Preferably, the initialization instruction also places the neural network unit 121 in a wide configuration, so that it is configured as 512 neural processing units 126. As described below, during execution of the instructions at addresses 0 through 23, 128 of the 512 neural processing units 126 correspond to and operate as the 128 LSTM cells 4600.

During the first execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 (i.e., neural processing units 0 through 127) computes the input gate (I) value of its corresponding LSTM cell 4600 for the first time step (time step 0) and writes the I value to the corresponding word of row 2 of the data random access memory 122. During the second execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 computes the I value of its corresponding LSTM cell 4600 for the second time step (time step 1) and writes it to the corresponding word of row 7 of the data random access memory 122. And so on, until during the twelfth execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 computes the I value of its corresponding LSTM cell 4600 for the twelfth time step (time step 11) and writes it to the corresponding word of row 57 of the data random access memory 122, as shown in Figure 47.

More specifically, the multiply-accumulate instruction at address 1 reads the row following the current data random access memory 122 row (row 0 on the first execution, row 5 on the second, and so on through row 55 on the twelfth), which holds the cell input (X) values associated with the current time step; reads row 0 of the weight random access memory 124, which holds the Wi values; and multiplies them to produce a first product that is accumulated into the accumulator 202, which was just cleared by either the initialization instruction at address 0 or the instruction at address 22. Next, the multiply-accumulate instruction at address 2 reads the next data random access memory 122 row (row 1 on the first execution, row 6 on the second, and so on through row 56 on the twelfth), which holds the cell output (H) values associated with the current time step; reads row 1 of the weight random access memory 124, which holds the Ui values; and multiplies them to produce a second product that is accumulated into the accumulator 202. The H values associated with the current time step, which the instruction at address 2 (and the instructions at addresses 6, 10, and 18) reads from the data random access memory 122, were generated during the previous time step and written to the data random access memory 122 by the output instruction at address 22. On the first execution, however, the instruction at address 2 reads row 1 of the data random access memory 122, which has been written with an initial value that serves as the H value. Preferably, the architectural program writes the initial H values to row 1 of the data random access memory 122 (e.g., using MTNN instructions 1400) before starting the non-architectural program of Figure 48; however, the invention is not limited to this, and other embodiments in which the non-architectural program includes initialization instructions that write the initial H values to row 1 of the data random access memory 122 are also contemplated. In one embodiment, the initial H values are zero. Next, the add-weight-word-to-accumulator instruction at address 3 (ADD_W_ACC WR ROW 2) reads row 2 of the weight random access memory 124, which holds the Bi values, and adds them to the accumulator 202. Finally, the output instruction at address 4 (OUTPUT SIGMOID, DR OUT ROW+0, CLR ACC) applies a sigmoid activation function to the accumulator 202 values, writes the results to the current output row of the data random access memory 122 (row 2 on the first execution, row 7 on the second, and so on through row 57 on the twelfth), and clears the accumulator 202.
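
The effect of the instructions at addresses 1 through 4 on the two memories can be emulated in a few lines. The sketch below is a simplification under stated assumptions: it uses floating-point values rather than the fixed-point arithmetic of the hardware, it models the memories as plain lists of rows, it processes the 128 units in a loop rather than in parallel, and it computes row numbers directly from the time step instead of modeling the row pointers. The names `input_gate_pass`, `N_CELLS`, and `ROWS_PER_STEP` are hypothetical.

```python
import math

N_CELLS = 128                       # LSTM cells, one per neural processing unit 0..127
ROWS_PER_STEP = 5                   # X, H, I, F, O rows per time step (Figure 47 layout)
WI_ROW, UI_ROW, BI_ROW = 0, 1, 2    # weight RAM rows holding Wi, Ui, Bi


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def input_gate_pass(data_ram, weight_ram, step):
    """Emulate addresses 1-4 for one time step; returns the cleared accumulators."""
    x_row = ROWS_PER_STEP * step    # cell input X of this time step
    h_row = x_row + 1               # cell output H produced by the previous step
    out_row = x_row + 2             # input gate I of this time step
    acc = [0.0] * N_CELLS           # accumulators start out cleared
    for n in range(N_CELLS):
        acc[n] += data_ram[x_row][n] * weight_ram[WI_ROW][n]   # address 1: X * Wi
        acc[n] += data_ram[h_row][n] * weight_ram[UI_ROW][n]   # address 2: H * Ui
        acc[n] += weight_ram[BI_ROW][n]                        # address 3: + Bi
    data_ram[out_row] = [sigmoid(a) for a in acc]              # address 4: OUTPUT SIGMOID
    return [0.0] * N_CELLS                                     # address 4: CLR ACC
```

The same pattern, with different weight rows and a different output row, would cover the forget gate and output gate passes; only the rows read and written change.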

During the first execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the forget gate (F) value of its corresponding LSTM cell 4600 for the first time step (time step 0) and writes the F value to the corresponding word of row 3 of the data random access memory 122. During the second execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the F value of its corresponding LSTM cell 4600 for the second time step (time step 1) and writes it to the corresponding word of row 8 of the data random access memory 122. And so on, until during the twelfth execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the F value of its corresponding LSTM cell 4600 for the twelfth time step (time step 11) and writes it to the corresponding word of row 58 of the data random access memory 122, as shown in Figure 47. The instructions at addresses 5 through 8 compute the F value in a manner similar to the instructions at addresses 1 through 4 described above, except that the instructions at addresses 5 through 7 read the Wf, Uf, and Bf values from rows 3, 4, and 5 of the weight random access memory 124, respectively, to perform the multiply and/or add operations.

In each of the twelve executions of the instructions at addresses 9 through 12, each of the 128 neural processing units 126 computes the candidate cell state (C') value of its corresponding long short-term memory (LSTM) cell 4600 for the corresponding time step and writes the C' value to the corresponding word of row 9 of the weight random access memory 124. The instructions at addresses 9 through 12 compute the C' value in the same manner as the instructions at addresses 1 through 4 described above, except that the instructions at addresses 9 through 11 read the Wc, Uc and Bc values from rows 6, 7 and 8 of the weight random access memory 124, respectively, to perform the multiply and/or add operations. In addition, the output instruction at address 12 applies a hyperbolic tangent activation function rather than the sigmoid activation function applied by the output instruction at address 4.

More specifically, the multiply-accumulate instruction at address 9 reads the current row of the data random access memory 122 (row 0 in the first execution, row 5 in the second execution, and so on, up to row 55 in the twelfth execution), which holds the cell input (X) values associated with the current time step. The instruction also reads row 6 of the weight random access memory 124, which holds the Wc values, and multiplies the two to produce a first product that is accumulated into the accumulator 202, which was just cleared by the instruction at address 8. Next, the multiply-accumulate instruction at address 10 reads the next row of the data random access memory 122 (row 1 in the first execution, row 6 in the second execution, and so on, up to row 56 in the twelfth execution), which holds the cell output (H) values associated with the current time step. The instruction also reads row 7 of the weight random access memory 124, which holds the Uc values, and multiplies the two to produce a second product that is accumulated into the accumulator 202. Next, the add-weight-word-to-accumulator instruction at address 11 reads row 8 of the weight random access memory 124, which holds the Bc values, and adds them to the accumulator 202. Finally, the output instruction at address 12 (OUTPUT TANH, WR OUT ROW 9, CLR ACC) applies a hyperbolic tangent activation function to the accumulator 202 value, writes the result to row 9 of the weight random access memory 124, and clears the accumulator 202.
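
The four-instruction sequence at addresses 9 through 12 can be summarized for a single word by the following Python sketch. The function name and the use of floating-point scalars are assumptions made for illustration only; the hardware described here operates on fixed-point words in all neural processing units in parallel.

```python
import math

def candidate_cell_state(x, h, wc, uc, bc):
    """Illustrative scalar model of addresses 9-12 for one word (hypothetical helper)."""
    acc = 0.0          # accumulator 202, already cleared by the instruction at address 8
    acc += wc * x      # address 9: first product, Wc * X
    acc += uc * h      # address 10: second product, Uc * H
    acc += bc          # address 11: add the bias word Bc
    return math.tanh(acc)  # address 12: OUTPUT TANH (the hardware then clears the accumulator)
```

The same accumulate-then-activate pattern, with a sigmoid in place of the hyperbolic tangent, would describe the gate computations.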

In each of the twelve executions of the instructions at addresses 13 through 16, each of the 128 neural processing units 126 computes the new cell state (C) value of its corresponding long short-term memory (LSTM) cell 4600 for the corresponding time step and writes the new C value to the corresponding word of row 11 of the weight random access memory 124. Each neural processing unit 126 also computes tanh(C) and writes it to the corresponding word of row 10 of the weight random access memory 124. More specifically, the multiply-accumulate instruction at address 13 reads the row two past the current row of the data random access memory 122 (row 2 in the first execution, row 7 in the second execution, and so on, up to row 57 in the twelfth execution), which holds the input gate (I) values associated with the current time step. The instruction also reads row 9 of the weight random access memory 124, which holds the candidate cell state (C') values just written by the instruction at address 12, and multiplies the two to produce a first product that is accumulated into the accumulator 202, which was just cleared by the instruction at address 12. Next, the multiply-accumulate instruction at address 14 reads the next row of the data random access memory 122 (row 3 in the first execution, row 8 in the second execution, and so on, up to row 58 in the twelfth execution), which holds the forget gate (F) values associated with the current time step. The instruction also reads row 11 of the weight random access memory 124, which holds the current cell state (C) values computed during the previous time step (written by the most recent execution of the instruction at address 15), and multiplies the two to produce a second product that is added to the accumulator 202. Next, the output instruction at address 15 (OUTPUT PASSTHRU, WR OUT ROW 11) passes the accumulator 202 value through and writes it to row 11 of the weight random access memory 124. It should be understood that the C value that the instruction at address 14 reads from row 11 of the weight random access memory 124 is the C value generated and written by the most recent execution of the instructions at addresses 13 through 15. The output instruction at address 15 does not clear the accumulator 202, so its value remains available to the instruction at address 16. Finally, the output instruction at address 16 (OUTPUT TANH, WR OUT ROW 10, CLR ACC) applies a hyperbolic tangent activation function to the accumulator 202 value and writes the result to row 10 of the weight random access memory 124, where the instruction at address 21 uses it to compute the cell output (H) value. The instruction at address 16 clears the accumulator 202.
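
The cell state update performed at addresses 13 through 16 can be sketched for a single word as follows. The function name and floating-point arithmetic are illustrative assumptions, not part of the described hardware.

```python
import math

def update_cell_state(i_gate, f_gate, c_candidate, c_prev):
    """Illustrative scalar model of addresses 13-16 for one word (hypothetical helper)."""
    acc = 0.0                    # accumulator 202, cleared by the instruction at address 12
    acc += i_gate * c_candidate  # address 13: I * C'
    acc += f_gate * c_prev       # address 14: F * C from the previous time step
    c_new = acc                  # address 15: OUTPUT PASSTHRU, accumulator is not cleared
    tanh_c = math.tanh(acc)      # address 16: OUTPUT TANH, after which the accumulator is cleared
    return c_new, tanh_c
```

Leaving the accumulator intact at address 15 is what lets address 16 apply the hyperbolic tangent to the same sum without recomputing it.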

In the first execution of the instructions at addresses 17 through 20, each of the 128 neural processing units 126 computes the output gate (O) value of its corresponding long short-term memory (LSTM) cell 4600 for the first time step (time step 0) and writes the O value to the corresponding word of row 4 of the data random access memory 122. In the second execution, each neural processing unit 126 computes the O value for the second time step (time step 1) and writes it to the corresponding word of row 9. This continues until the twelfth execution, in which each neural processing unit 126 computes the O value for the twelfth time step (time step 11) and writes it to the corresponding word of row 59, as shown in Figure 47. The instructions at addresses 17 through 20 compute the O value in the same manner as the instructions at addresses 1 through 4 described above, except that the instructions at addresses 17 through 19 read the Wo, Uo and Bo values from rows 12, 13 and 14 of the weight random access memory 124, respectively, to perform the multiply and/or add operations.

In the first execution of the instructions at addresses 21 and 22, each of the 128 neural processing units 126 computes the cell output (H) value of its corresponding long short-term memory (LSTM) cell 4600 for the first time step (time step 0) and writes the H value to the corresponding word of row 6 of the data random access memory 122. In the second execution, each neural processing unit 126 computes the H value for the second time step (time step 1) and writes it to the corresponding word of row 11. This continues until the twelfth execution, in which each neural processing unit 126 computes the H value for the twelfth time step (time step 11) and writes it to the corresponding word of row 61, as shown in Figure 47.

More specifically, the multiply-accumulate instruction at address 21 reads the row three past the current row of the data random access memory 122 (row 4 in the first execution, row 9 in the second execution, and so on, up to row 59 in the twelfth execution), which holds the output gate (O) values associated with the current time step. The instruction also reads row 10 of the weight random access memory 124, which holds the tanh(C) values written by the instruction at address 16, and multiplies the two to produce a product that is accumulated into the accumulator 202, which was just cleared by the instruction at address 20. The output instruction at address 22 then passes the accumulator 202 value through, writes it to the second next output row of the data random access memory 122 (row 6 in the first execution, row 11 in the second execution, and so on, up to row 61 in the twelfth execution), and clears the accumulator 202. It should be understood that the H value written to a row of the data random access memory 122 by the instruction at address 22 (row 6 in the first execution, row 11 in the second execution, and so on, up to row 61 in the twelfth execution) is the H value consumed/read by the next execution of the instructions at addresses 2, 6, 10 and 18. However, the H value written to row 61 in the twelfth execution is not consumed/read by the instructions at addresses 2, 6, 10 and 18; in a preferred embodiment, it is consumed/read by the architectural program.
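
The row numbers quoted for the data random access memory 122 follow a regular pattern of five rows per time step. The following sketch captures that pattern; the function and key names are hypothetical, and the layout is inferred from the row numbers given in the text rather than taken from a formal specification.

```python
def fig47_rows(execution):
    """Data RAM rows used in one pass of the loop body (execution is 0-based)."""
    base = 5 * execution  # five rows (X, H, I, F, O) per time step
    return {
        "X": base,          # current row, read at address 9
        "H_in": base + 1,   # next row, read at address 10
        "I": base + 2,      # read at address 13
        "F": base + 3,      # read at address 14
        "O": base + 4,      # read at address 21
        "H_out": base + 6,  # written at address 22
    }
```

Because `H_out` of one execution equals `H_in` of the next, each pass of the loop leaves the cell output where the following time step expects to read it.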

The instruction at address 23 (LOOP 1) decrements the loop counter 3804 and, if the new value of the loop counter 3804 is greater than zero, loops back to the instruction at address 1.
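
The behavior of the LOOP instruction can be modeled as a decrement-then-test loop. In the sketch below, the function name is hypothetical, and the initial count of 12 used in the usage note is an assumption consistent with the twelve executions described in the text.

```python
def count_loop_body_executions(initial_count):
    """Model of LOOP: decrement the counter, branch back while the new value is > 0."""
    counter = initial_count  # loop counter 3804, assumed to be initialized before the loop
    executions = 0
    while True:
        executions += 1      # the loop body (addresses 1 through 22) runs once
        counter -= 1         # address 23 decrements the counter
        if counter <= 0:     # fall through when the new value is not greater than zero
            break
    return executions
```

With an initial count of 12, the loop body runs twelve times, once per time step.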

Figure 49 is a block diagram showing an embodiment of the neural network unit 121 that has output buffer masking and feedback capability within its neural processing unit groups. Figure 49 shows a single neural processing unit group 4901 made up of four neural processing units 126. Although Figure 49 shows only one neural processing unit group 4901, it should be understood that every neural processing unit 126 in the neural network unit 121 belongs to a neural processing unit group 4901, so that there are N/J neural processing unit groups 4901 in total, where N is the number of neural processing units 126 (for example, 512 in a wide configuration or 1024 in a narrow configuration) and J is the number of neural processing units 126 in a single group 4901 (for example, four in the embodiment of Figure 49). In Figure 49, the four neural processing units 126 of the neural processing unit group 4901 are referred to as neural processing unit 0, neural processing unit 1, neural processing unit 2 and neural processing unit 3.
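
The grouping arithmetic can be illustrated with a short sketch. The helper names are hypothetical, and the default group size of four matches the embodiment of Figure 49.

```python
def group_count(n_npus, group_size=4):
    """Number of neural processing unit groups, N / J."""
    return n_npus // group_size

def npu_group(npu_index, group_size=4):
    """Return (group index, position within the group) for a neural processing unit."""
    return divmod(npu_index, group_size)
```

For example, a wide configuration of 512 units yields 128 groups, and unit 7 is position 3 of group 1.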

Each neural processing unit in the embodiment of Figure 49 is similar to the neural processing unit 126 of Figure 7 described above, and like-numbered elements are similar. However, the multiplexed register 208 is modified to include four additional inputs 4905, the multiplexed register 705 is modified to include four additional inputs 4907, the selection input 213 is modified to select among the original inputs 211 and 207 and the additional inputs 4905 for provision on the output 209, and the selection input 713 is modified to select among the original inputs 711 and 206 and the additional inputs 4907 for provision on the output 203.

As shown, the row buffer 1104 of Figure 11 is referred to as the output buffer 1104 in Figure 49. More specifically, words 0, 1, 2 and 3 of the output buffer 1104 shown in the figure receive the respective outputs of the four activation function units 212 associated with neural processing units 0, 1, 2 and 3. The portion of the output buffer 1104 comprising the J words that correspond to a neural processing unit group 4901 is referred to as an output buffer word group. In the embodiment of Figure 49, J is four. These four words of the output buffer 1104 are fed back to the multiplexed registers 208 and 705, where they are received as the four additional inputs 4905 by the multiplexed register 208 and as the four additional inputs 4907 by the multiplexed register 705. Feeding an output buffer word group back to its corresponding neural processing unit group 4901 enables an arithmetic instruction of the non-architectural program to select as its inputs one or two words of the output buffer 1104 associated with the neural processing unit group 4901 (that is, of the output buffer word group); examples are the instructions at addresses 4, 8, 11, 12 and 15 of the non-architectural program of Figure 51, described below. That is, the output buffer 1104 word specified in the non-architectural instruction determines the value generated on the selection inputs 213/713. This capability effectively allows the output buffer 1104 to serve as a kind of scratch pad memory, which lets a non-architectural program reduce the number of writes to, and subsequent reads from, the data random access memory 122 and/or the weight random access memory 124, for example for values that are generated and used in intermediate steps. Preferably, the output buffer 1104, or row buffer 1104, comprises a one-dimensional array of registers that stores either 1024 narrow words or 512 wide words. Preferably, the output buffer 1104 can be read in a single clock cycle and written in a single clock cycle. Unlike the data random access memory 122 and the weight random access memory 124, which are accessible by both the architectural program and the non-architectural program, the output buffer 1104 is not accessible by the architectural program and is accessible only by the non-architectural program.

The output buffer 1104 is modified to receive a mask input 4903. Preferably, the mask input 4903 includes four bits corresponding to the four words of the output buffer 1104 that are associated with the four neural processing units 126 of the neural processing unit group 4901. Preferably, if the mask input 4903 bit corresponding to a word of the output buffer 1104 is true, that word retains its current value; otherwise, that word is updated with the output of the activation function unit 212. That is, if the mask input 4903 bit corresponding to a word of the output buffer 1104 is false, the output of the activation function unit 212 is written to that word. In this way, an output instruction of the non-architectural program can selectively write the outputs of the activation function units 212 to some words of the output buffer 1104 while leaving the current values of the other words unchanged; examples are the instructions at addresses 6, 10, 13 and 14 of the non-architectural program of Figure 51, described below. That is, the output buffer 1104 words specified in the non-architectural instruction determine the value generated on the mask input 4903.
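
The masking rule can be expressed compactly for one output buffer word group. The following sketch uses a hypothetical function name and plain Python lists in place of hardware registers.

```python
def masked_write(outbuf_words, afu_outputs, mask_bits):
    """Update one output buffer word group under the mask input 4903.

    A true mask bit keeps the current word; a false bit takes the
    corresponding activation function unit output.
    """
    return [
        current if masked else new
        for current, new, masked in zip(outbuf_words, afu_outputs, mask_bits)
    ]
```

For instance, a mask of `[True, False, False, True]` updates only the two middle words of the group.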

For simplicity, Figure 49 does not show the input 1811 of the multiplexed registers 208/705 (shown in Figures 18, 19 and 23). However, embodiments that support both dynamically configurable neural processing units 126 and feedback/masking of the output buffer 1104 also fall within the scope of the invention. Preferably, in such embodiments the output buffer word groups are correspondingly dynamically configurable.

It should be understood that although the number of neural processing units 126 in a neural processing unit group 4901 is four in this embodiment, the invention is not limited thereto; embodiments with more or fewer neural processing units 126 per group fall within the scope of the invention. Furthermore, in an embodiment that includes shared activation function units 1112, as shown in Figure 52, there may be a synergistic relationship between the number of neural processing units 126 in a neural processing unit group 4901 and the number of neural processing units 126 in an activation function unit 212 group. The masking and feedback capability of the output buffer 1104 within neural processing unit groups is particularly beneficial for improving the efficiency of computations associated with long short-term memory (LSTM) cells 4600, as described below with respect to Figures 50 and 51.

Figure 50 is a block diagram showing an example of the layout of data within the data random access memory 122, the weight random access memory 124 and the output buffer 1104 of the neural network unit 121 of Figure 49 as it performs computations associated with a layer of 128 long short-term memory (LSTM) cells 4600 of Figure 46. In the example of Figure 50, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, for example in a wide configuration. As in the examples of Figures 47 and 48, the LSTM layer in the example of Figures 50 and 51 has only 128 LSTM cells 4600. However, in the example of Figure 50, the values generated by all 512 neural processing units 126 (that is, neural processing units 0 through 511) are used. When the non-architectural program of Figure 51 executes, each neural processing unit group 4901 collectively operates as one LSTM cell 4600.

As shown, the data random access memory 122 holds the cell input (X) and cell output (H) values for a sequence of time steps. More specifically, for a given time step, a pair of rows holds the X values and the H values, respectively. Taking a data random access memory 122 with 64 rows as an example, as shown, the cell values it holds suffice for 31 different time steps. In the example of Figure 50, rows 2 and 3 hold the values for time step 0, rows 4 and 5 hold the values for time step 1, and so on, up to rows 62 and 63, which hold the values for time step 30. The first row of each pair holds the X values for the time step, and the second row holds the H values for the time step. As shown, each group of four columns of the data random access memory 122 that corresponds to a neural processing unit group 4901 holds the values for the corresponding long short-term memory (LSTM) cell 4600. That is, columns 0 through 3 hold the values associated with LSTM cell 0, whose computations are performed by neural processing units 0 through 3, that is, by neural processing unit group 0; columns 4 through 7 hold the values associated with LSTM cell 1, whose computations are performed by neural processing units 4 through 7, that is, by neural processing unit group 1; and so on, up to columns 508 through 511, which hold the values associated with LSTM cell 127, whose computations are performed by neural processing units 508 through 511, that is, by neural processing unit group 127, as described below with respect to Figure 51. As shown, row 1 is unused and row 0 holds the initial cell output (H) values. In a preferred embodiment, the architectural program fills row 0 with zeros; however, the invention is not limited thereto, and embodiments in which non-architectural program instructions fill in the initial cell output (H) values of row 0 also fall within the scope of the invention.
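
The layout of the data random access memory 122 in the example of Figure 50 reduces to simple index arithmetic. The sketch below uses hypothetical helper names and assumes the 64-row, four-column-per-cell arrangement described above.

```python
def fig50_rows(time_step):
    """Return the (X row, H row) pair for a time step; rows 0 and 1 are reserved."""
    return 2 + 2 * time_step, 3 + 2 * time_step

def fig50_columns(cell, group_size=4):
    """Columns that hold the values of one LSTM cell."""
    start = cell * group_size
    return list(range(start, start + group_size))

def max_time_steps(n_rows):
    """Time steps that fit once rows 0 and 1 are set aside."""
    return (n_rows - 2) // 2
```

Under these assumptions, a 64-row memory holds 31 time steps, and time step 30 occupies rows 62 and 63.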

Preferably, the X values (in rows 2, 4, 6 and so on through row 62) are written/populated into the data random access memory 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 51. Preferably, the H values (in rows 3, 5, 7 and so on through row 63) are written/populated into the data random access memory 122, and also read/used, by the non-architectural program running on the neural network unit 121, as described below. Preferably, the H values are also read by the architectural program running on the processor 100 via MFNN instructions 1500. Note that the non-architectural program of Figure 51 assumes that, within each group of four columns corresponding to a neural processing unit group 4901 (for example, columns 0-3, columns 4-7, columns 8-11 and so on through columns 508-511), the four X values in a given row hold the same value (for example, as populated by the architectural program). Similarly, the non-architectural program of Figure 51 computes and writes the same value to the four H values in a given row within each group of four columns corresponding to a neural processing unit group 4901.

As shown, the weight random access memory 124 holds the weight, bias and cell state (C) values needed by the neural processing units of the neural network unit 121. Within each group of four columns corresponding to a neural processing unit group 4901 (for example, columns 0-3, columns 4-7, columns 8-11 and so on through columns 508-511): (1) the column whose index modulo 4 equals 3 holds the Wc, Uc, Bc and C values in rows 0, 1, 2 and 6, respectively; (2) the column whose index modulo 4 equals 2 holds the Wo, Uo and Bo values in rows 3, 4 and 5, respectively; (3) the column whose index modulo 4 equals 1 holds the Wf, Uf and Bf values in rows 3, 4 and 5, respectively; and (4) the column whose index modulo 4 equals 0 holds the Wi, Ui and Bi values in rows 3, 4 and 5, respectively. Preferably, the weight and bias values (Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo and Bo, in rows 0 through 5) are written/populated into the weight random access memory 124 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 51. Preferably, the intermediate C values are written/populated into the weight random access memory 124, and also read/used, by the non-architectural program running on the neural network unit 121, as described below.
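
The column-modulo-4 arrangement of the weight random access memory 124 can be written as a small lookup table. The table and function names below are hypothetical; the contents simply restate the four cases listed in the text.

```python
# Maps (column index mod 4) -> {row: value held in that row}.
WEIGHT_RAM_LAYOUT = {
    3: {0: "Wc", 1: "Uc", 2: "Bc", 6: "C"},
    2: {3: "Wo", 4: "Uo", 5: "Bo"},
    1: {3: "Wf", 4: "Uf", 5: "Bf"},
    0: {3: "Wi", 4: "Ui", 5: "Bi"},
}

def weight_ram_content(row, column):
    """Name of the value at (row, column), or None if the location is not described."""
    return WEIGHT_RAM_LAYOUT[column % 4].get(row)
```

For example, column 7 (7 mod 4 = 3) holds the cell state C in row 6, while column 4 holds Wi in row 3.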

The example of Figure 50 assumes that the architectural program performs the following steps: (1) populates the data random access memory 122 with the input X values for 31 different time steps (rows 2, 4 and so on through row 62); (2) starts the non-architectural program of Figure 51; (3) detects that the non-architectural program has completed; (4) reads the output H values from the data random access memory 122 (rows 3, 5 and so on through row 63); and (5) repeats steps (1) through (4) as many times as needed to complete a task, such as the computations required to recognize an utterance of a mobile phone user.
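
Step (5) implies that a long input sequence is processed in runs of at most 31 time steps. The sketch below only plans those runs and the rows involved; it does not model the MTNN/MFNN instructions, completion detection, or how state carries over between runs. The function name and the 31-step default are illustrative assumptions.

```python
def plan_batches(n_time_steps, steps_per_run=31):
    """Split a sequence into runs and list the X and H rows used by each run."""
    batches = []
    for start in range(0, n_time_steps, steps_per_run):
        count = min(steps_per_run, n_time_steps - start)
        x_rows = [2 + 2 * t for t in range(count)]  # rows written in step (1)
        h_rows = [3 + 2 * t for t in range(count)]  # rows read back in step (4)
        batches.append((start, x_rows, h_rows))
    return batches
```

A sequence of 70 time steps would thus need three runs, of 31, 31 and 8 time steps.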

In an alternative approach, the architectural program performs the following steps: (1) populates the data random access memory 122 with the input X values for a single time step (for example, row 2); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 51 that needs no loop and accesses only a single pair of rows of the data random access memory 122); (3) detects that the non-architectural program has completed; (4) reads the output H values from the data random access memory 122 (for example, row 3); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable depends on how the input X values of the long short-term memory (LSTM) layer are sampled. For example, if the task tolerates sampling the input over multiple time steps (for example, on the order of 31 time steps) before performing the computations, the first approach is preferable because it is likely to be more efficient in computational resources and/or to perform better; if the task tolerates sampling at only a single time step, the second approach is required.

A third embodiment resembles the second approach, except that rather than using a single pair of rows of the data random access memory 122, the non-architectural program uses multiple pairs of rows, i.e., a different pair for each time step, as in the first approach. Preferably, the architectural program of this third embodiment includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data random access memory 122 row in the instruction at address 1 to point to the next pair of rows.

As shown in the figure, for neural processing units 0 through 511 of the neural network unit 121, the output buffer 1104 holds the cell output (H), candidate cell state (C'), input gate (I), forget gate (F), output gate (O), cell state (C), and tanh(C) intermediate values after instructions at different addresses of the non-architectural program of Figure 51 have executed. Within each output buffer word group (i.e., the group of four words of the output buffer 1104 that corresponds to a neural processing unit group 4901: words 0-3, 4-7, 8-11, and so on through 508-511), the word whose index modulo 4 equals 3 is denoted OUTBUF[3], the word whose index modulo 4 equals 2 is denoted OUTBUF[2], the word whose index modulo 4 equals 1 is denoted OUTBUF[1], and the word whose index modulo 4 equals 0 is denoted OUTBUF[0].
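The OUTBUF[k] naming amounts to a quotient and remainder on the word index. The following minimal sketch assumes one output buffer word per neural processing unit (512 in total); the function name `group_and_slot` is hypothetical.

```python
WORDS = 512       # one output buffer word per neural processing unit
GROUP_SIZE = 4    # words per neural processing unit group 4901


def group_and_slot(word):
    """Return (group index, k) such that the word is OUTBUF[k] of that group."""
    return divmod(word, GROUP_SIZE)


# Collect the word indices that belong to each group.
groups = {}
for w in range(WORDS):
    g, k = group_and_slot(w)
    groups.setdefault(g, []).append(w)

print(len(groups), "groups; group 2 holds words", groups[2])
```

Under this reading, word 11 is OUTBUF[3] of the third group and word 508 is OUTBUF[0] of the last group.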

As shown in the figure, after the instruction at address 2 of the non-architectural program of Figure 51 executes, all four words of the output buffer 1104 for each neural processing unit group 4901 are written with the initial cell output (H) value of the corresponding long short-term memory (LSTM) cell 4600. After the instruction at address 6 executes, the OUTBUF[3] word of each group is written with the candidate cell state (C') value of the corresponding LSTM cell 4600, and the other three words retain their previous values. After the instruction at address 10 executes, OUTBUF[0] is written with the input gate (I) value, OUTBUF[1] with the forget gate (F) value, and OUTBUF[2] with the output gate (O) value of the corresponding LSTM cell 4600, while OUTBUF[3] retains its previous value. After the instruction at address 13 executes, OUTBUF[3] is written with the new cell state (C) value of the corresponding LSTM cell 4600 (the output buffer 1104, including the C value in slot 3, is written to row 6 of the weight random access memory 124, as described below with respect to Figure 51), and the other three words retain their previous values. After the instruction at address 14 executes, OUTBUF[3] is written with the tanh(C) value of the corresponding LSTM cell 4600, and the other three words retain their previous values. After the instruction at address 16 executes, all four words of the output buffer 1104 for each group are written with the new cell output (H) value of the corresponding LSTM cell 4600. The sequence for addresses 6 through 16 (i.e., excluding address 2, which is not part of the program loop) repeats thirty more times as the loop instruction at address 17 branches back to address 3.

Figure 51 is a table showing a program stored in the program memory 129 of the neural network unit 121. The program is executed by the neural network unit 121 of Figure 49 and uses data and weights according to the layout of Figure 50 to perform the computations associated with a long short-term memory (LSTM) cell layer. The example program of Figure 51 comprises 18 non-architectural instructions at addresses 0 through 17. The instruction at address 0 is an initialize instruction that clears the accumulator 202 and initializes the loop counter 3804 to a value of 31 so that the loop body (the instructions at addresses 3 through 17) executes 31 times. The initialize instruction also initializes the data random access memory 122 row to be written (e.g., register 2606 of Figures 26 and 39) to a value of 1, which increases to 3 after the first execution of the instruction at address 16. Preferably, the initialize instruction also places the neural network unit 121 in a wide configuration, so that it is configured as 512 neural processing units 126. As described below, during execution of the instructions at addresses 0 through 17, the 128 neural processing unit groups 4901 formed from the 512 neural processing units 126 operate as 128 corresponding LSTM cells 4600.

The instructions at addresses 1 and 2 are outside the loop body and execute only once. They generate the initial cell output (H) value (e.g., 0) and write it to all words of the output buffer 1104. The instruction at address 1 reads the initial H values from row 0 of the data random access memory 122 and places them in the accumulator 202, which was cleared by the instruction at address 0. The instruction at address 2 (OUTPUT PASSTHRU, NOP, CLR ACC) passes the accumulator 202 value through to the output buffer 1104, as shown in Figure 50. The "NOP" designation in the output instruction at address 2 (and in the other output instructions of Figure 51) indicates that the value is written only to the output buffer 1104 and not to memory, i.e., neither to the data random access memory 122 nor to the weight random access memory 124. The instruction at address 2 also clears the accumulator 202.

The instructions at addresses 3 through 17 form the loop body, which executes the number of times given by the loop count (e.g., 31).

Each execution of the instructions at addresses 3 through 6 computes the tanh(C') value for the current time step and writes it to word OUTBUF[3], where it will be used by the instruction at address 11. More specifically, the multiply-accumulate instruction at address 3 reads the cell input (X) value associated with the time step from the current read row of the data random access memory 122 (e.g., row 2, 4, 6, and so on through row 62), reads the Wc value from row 0 of the weight random access memory 124, and multiplies them to produce a product that is added to the accumulator 202, which was cleared by the instruction at address 2.

The multiply-accumulate instruction at address 4 (MULT-ACCUM OUTBUF[0], WR ROW 1) reads the H value from word OUTBUF[0] (in all four neural processing units 126 of the neural processing unit group 4901), reads the Uc value from row 1 of the weight random access memory 124, and multiplies them to produce a second product that is added to the accumulator 202.

The add-weight-word-to-accumulator instruction at address 5 (ADD_W_ACC WR ROW 2) reads the Bc value from row 2 of the weight random access memory 124 and adds it to the accumulator 202.

The output instruction at address 6 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) applies a hyperbolic tangent activation function to the accumulator 202 value and writes the result only to word OUTBUF[3] (i.e., only the neural processing unit 126 of each neural processing unit group 4901 whose index modulo 4 equals 3 writes its result), and the accumulator 202 is cleared. That is, the output instruction at address 6 masks words OUTBUF[0], OUTBUF[1], and OUTBUF[2] (as indicated by the MASK[0:2] designation) so that they retain their current values, as shown in Figure 50. In addition, the output instruction at address 6 does not write to memory (as indicated by the NOP designation).
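The following sketch models the instructions at addresses 3 through 6 for a single neural processing unit group as plain Python lists. It is a behavioral illustration only: the numeric values are made up, the helper `output` is hypothetical, and for simplicity the same Wc, Uc, and Bc values are used in all four lanes, although only lane 3 matters because the other lanes are masked.

```python
import math


def output(outbuf, acc, activation, masked):
    """Write activation(acc[k]) to each unmasked word k; masked words keep their value."""
    return [old if k in masked else activation(a)
            for k, (old, a) in enumerate(zip(outbuf, acc))]


x, h = 0.5, 0.1                  # illustrative X and initial H values
wc, uc, bc = 0.8, -0.3, 0.05     # illustrative Wc, Uc, Bc values
outbuf = [h, h, h, h]            # state after address 2: every word holds H

acc = [0.0] * 4                                  # accumulators start cleared
acc = [a + x * wc for a in acc]                  # address 3: add X * Wc
acc = [a + outbuf[0] * uc for a in acc]          # address 4: add OUTBUF[0] (H) * Uc
acc = [a + bc for a in acc]                      # address 5: add Bc
outbuf = output(outbuf, acc, math.tanh, masked={0, 1, 2})   # address 6: MASK[0:2]
acc = [0.0] * 4                                  # address 6: CLR ACC

print(outbuf)
```

After this sequence only OUTBUF[3] has changed, which is the effect the mask is meant to achieve: the fed-back H values in the other three words stay available.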

Each execution of the instructions at addresses 7 through 10 computes the input gate (I), forget gate (F), and output gate (O) values for the current time step and writes them to words OUTBUF[0], OUTBUF[1], and OUTBUF[2], respectively, where they will be used by the instructions at addresses 11, 12, and 15. More specifically, the multiply-accumulate instruction at address 7 reads the cell input (X) value associated with the time step from the current read row of the data random access memory 122 (e.g., row 2, 4, 6, and so on through row 62), reads the Wi, Wf, and Wo values from row 3 of the weight random access memory 124, and multiplies them to produce a product that is added to the accumulator 202, which was cleared by the instruction at address 6. That is, within a neural processing unit group 4901, the neural processing unit 126 whose index modulo 4 equals 0 computes the product of X and Wi, the neural processing unit 126 whose index modulo 4 equals 1 computes the product of X and Wf, and the neural processing unit 126 whose index modulo 4 equals 2 computes the product of X and Wo.

The multiply-accumulate instruction at address 8 reads the H value from word OUTBUF[0] (in all four neural processing units 126 of the neural processing unit group 4901), reads the Ui, Uf, and Uo values from row 4 of the weight random access memory 124, and multiplies them to produce a second product that is added to the accumulator 202. That is, within a neural processing unit group 4901, the neural processing unit 126 whose index modulo 4 equals 0 computes the product of H and Ui, the neural processing unit 126 whose index modulo 4 equals 1 computes the product of H and Uf, and the neural processing unit 126 whose index modulo 4 equals 2 computes the product of H and Uo.

The add-weight-word-to-accumulator instruction at address 9 (ADD_W_ACC WR ROW 5) reads the Bi, Bf, and Bo values from row 5 of the weight random access memory 124 and adds them to the accumulator 202. That is, within a neural processing unit group 4901, the neural processing unit 126 whose index modulo 4 equals 0 adds the Bi value, the neural processing unit 126 whose index modulo 4 equals 1 adds the Bf value, and the neural processing unit 126 whose index modulo 4 equals 2 adds the Bo value.

The output instruction at address 10 (OUTPUT SIGMOID, NOP, MASK[3], CLR ACC) applies a sigmoid activation function to the accumulator 202 values and writes the computed I, F, and O values to words OUTBUF[0], OUTBUF[1], and OUTBUF[2], respectively. It also clears the accumulator 202 and does not write to memory. That is, the output instruction at address 10 masks word OUTBUF[3] (as indicated by the MASK[3] designation) so that the word retains its current value (i.e., C'), as shown in Figure 50.
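The per-lane behavior of addresses 7 through 10 can be sketched for one neural processing unit group as below. The weight values are invented for illustration, and lane 3 of rows 3 through 5 is set to zero here only because its result is masked; the text does not say what those words contain.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


x, h, c_cand = 0.5, 0.1, 0.4       # illustrative X, H and C' values
outbuf = [h, h, h, c_cand]         # state after address 6
# Lane k holds the weights for gate k; lane 3 is a placeholder (result is masked).
w = [0.6, -0.2, 0.9, 0.0]          # Wi, Wf, Wo from weight RAM row 3
u = [0.1, 0.4, -0.5, 0.0]          # Ui, Uf, Uo from weight RAM row 4
b = [0.0, 1.0, 0.2, 0.0]           # Bi, Bf, Bo from weight RAM row 5

h_fb = outbuf[0]                                    # every lane reads H from OUTBUF[0]
acc = [x * w[k] for k in range(4)]                  # address 7
acc = [acc[k] + h_fb * u[k] for k in range(4)]      # address 8
acc = [acc[k] + b[k] for k in range(4)]             # address 9
# Address 10: sigmoid into OUTBUF[0..2]; MASK[3] leaves C' in OUTBUF[3].
outbuf = [outbuf[3] if k == 3 else sigmoid(acc[k]) for k in range(4)]
acc = [0.0] * 4                                     # CLR ACC

gate_i, gate_f, gate_o = outbuf[:3]
print(gate_i, gate_f, gate_o, outbuf[3])
```

The three gates are produced by the same three instructions because each lane of the group sees a different weight word from the same row.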

Each execution of the instructions at addresses 11 through 13 computes the new cell state (C) value generated by the current time step and writes it to row 6 of the weight random access memory 124 for use in the next time step (i.e., by the instruction at address 12 during the next loop iteration). More specifically, the value is written to the word of row 6 whose index modulo 4 equals 3 among the four columns of words corresponding to the neural processing unit group 4901. In addition, each execution of the instruction at address 14 writes the tanh(C) value to OUTBUF[3] for use by the instruction at address 15.

More specifically, the multiply-accumulate instruction at address 11 (MULT-ACCUM OUTBUF[0], OUTBUF[3]) reads the input gate (I) value from word OUTBUF[0], reads the candidate cell state (C') value from word OUTBUF[3], and multiplies them to produce a first product that is added to the accumulator 202, which was cleared by the instruction at address 10. That is, each of the four neural processing units 126 of the neural processing unit group 4901 computes the first product of the I value and the C' value.

The multiply-accumulate instruction at address 12 (MULT-ACCUM OUTBUF[1], WR ROW 6) instructs the neural processing units 126 to read the forget gate (F) value from word OUTBUF[1], read their respective word from row 6 of the weight random access memory 124, and multiply them to produce a second product that is added to the first product already in the accumulator 202 from the instruction at address 11. More specifically, for the neural processing unit 126 of the neural processing unit group 4901 whose index modulo 4 equals 3, the word read from row 6 is the current cell state (C) value computed in the previous time step, so the sum of the first and second products is the new cell state (C). For the other three neural processing units 126 of the group, however, the words read from row 6 are don't-care values, because the accumulated values they produce will not be used: they are not placed in the output buffer 1104 by the instructions at addresses 13 and 14, and they are cleared by the instruction at address 14. That is, only the new cell state (C) value produced by the neural processing unit 126 whose index modulo 4 equals 3 will be used, namely by the instructions at addresses 13 and 14. For the second through thirty-first executions of the instruction at address 12, the C value read from row 6 of the weight random access memory 124 is the value written by the instruction at address 13 during the previous iteration of the loop body. For the first execution of the instruction at address 12, however, the C value in row 6 is an initial value written either by the architectural program before it starts the non-architectural program of Figure 51 or by a modified version of the non-architectural program.

The output instruction at address 13 (OUTPUT PASSTHRU, WR ROW 6, MASK[0:2]) passes through the accumulator 202 value, i.e., the computed C value, only to word OUTBUF[3] (that is, only the neural processing unit 126 of the neural processing unit group 4901 whose index modulo 4 equals 3 writes its computed C value to the output buffer 1104), and row 6 of the weight random access memory 124 is written with the updated output buffer 1104, as shown in Figure 50. That is, the output instruction at address 13 masks words OUTBUF[0], OUTBUF[1], and OUTBUF[2] so that they retain their current values (i.e., the I, F, and O values). As described above, only the C value in the word of row 6 whose index modulo 4 equals 3, among the four columns of words corresponding to the neural processing unit group 4901, is used, namely by the instruction at address 12. The non-architectural program therefore does not care about the values in columns 0-2, 4-6, and so on through 508-510 of row 6 of the weight random access memory 124 (i.e., the I, F, and O values), as shown in Figure 50.

The output instruction at address 14 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) applies a hyperbolic tangent activation function to the accumulator 202 value and writes the computed tanh(C) value to word OUTBUF[3]. It also clears the accumulator 202 and does not write to memory. Like the output instruction at address 13, the output instruction at address 14 masks words OUTBUF[0], OUTBUF[1], and OUTBUF[2] so that they retain their original values, as shown in Figure 50.
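As a rough behavioral sketch of addresses 11 through 14 for one neural processing unit group, the cell state update C = I * C' + F * C_prev and the write-back to row 6 might look as follows. The values are illustrative, the list `wr_row6` is a hypothetical stand-in for the group's four words of weight RAM row 6, and its first three words are set to zero only because they are don't-care values.

```python
import math

gate_i, gate_f, gate_o = 0.6, 0.7, 0.5     # illustrative I, F, O values
c_cand, c_prev = 0.4, -0.2                 # illustrative C' and previous C values
outbuf = [gate_i, gate_f, gate_o, c_cand]  # state after address 10
wr_row6 = [0.0, 0.0, 0.0, c_prev]          # only word 3 is meaningful

acc = [outbuf[0] * outbuf[3]] * 4                           # address 11: I * C'
acc = [a + outbuf[1] * w for a, w in zip(acc, wr_row6)]     # address 12: add F * row-6 word
# Address 13: pass lane 3 through to OUTBUF[3] only, then copy the buffer to row 6.
outbuf = outbuf[:3] + [acc[3]]
wr_row6 = list(outbuf)
c_new = wr_row6[3]
# Address 14: tanh into OUTBUF[3] only, then clear the accumulators.
outbuf = outbuf[:3] + [math.tanh(acc[3])]
acc = [0.0] * 4

print(c_new, outbuf[3], wr_row6)
```

Note that row 6 ends up holding the I, F, and O values in its first three words, which is why the program treats those columns as don't-care on the next read.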

Each execution of the instructions at addresses 15 and 16 computes the cell output (H) value generated by the current time step and writes it to the second row past the current output row of the data random access memory 122, where it will be read by the architectural program and used in the next time step (i.e., by the instructions at addresses 3 and 7 during the next loop iteration). More specifically, the multiply-accumulate instruction at address 15 reads the output gate (O) value from word OUTBUF[2], reads the tanh(C) value from word OUTBUF[3], and multiplies them to produce a product that is added to the accumulator 202, which was cleared by the instruction at address 14. That is, each of the four neural processing units 126 of the neural processing unit group 4901 computes the product of the O value and tanh(C).

The output instruction at address 16 passes through the accumulator 202 values and writes the computed H values to row 3 of the data random access memory 122 on its first execution, to row 5 on its second execution, and so on to row 63 on its thirty-first execution, as shown in Figure 50; these values are subsequently used by the instructions at addresses 4 and 8. In addition, as shown in Figure 50, the computed H values are placed in the output buffer 1104 for subsequent use by the instructions at addresses 4 and 8. The output instruction at address 16 also clears the accumulator 202. In one embodiment, the long short-term memory (LSTM) cell 4600 is designed such that the output instruction at address 16 (and/or the output instruction at address 22 of Figure 48) applies an activation function, e.g., sigmoid or hyperbolic tangent, rather than passing through the accumulator 202 value.
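A minimal sketch of addresses 15 and 16 for one neural processing unit group on the first loop iteration follows. The dictionary `data_ram` and the variable `out_row` are hypothetical stand-ins for the data random access memory 122 and its output row register, and the way the row advances by two is one reading of the text, not a statement about the hardware.

```python
import math

gate_o, c_new = 0.5, 0.1                          # illustrative O and C values
outbuf = [0.6, 0.7, gate_o, math.tanh(c_new)]     # state after address 14
data_ram = {}                                     # row number -> four words of one group
out_row = 1                                       # value set by the initialize instruction

acc = [outbuf[2] * outbuf[3]] * 4    # address 15: O * tanh(C) in every lane
# Address 16: pass through to all four words and to the row two past out_row.
out_row += 2
outbuf = list(acc)
data_ram[out_row] = list(acc)
acc = [0.0] * 4                      # CLR ACC

print(out_row, outbuf)
```

Because no mask is applied at address 16, all four words of the group hold H afterwards, so any lane can read it back from OUTBUF[0] in the next iteration.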

The loop instruction at address 17 decrements the loop counter 3804 and branches back to the instruction at address 3 if the new loop counter 3804 value is greater than zero.
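The control flow described for the 18-instruction program can be sketched with a simple program counter model. The function `run` and its constants are hypothetical; the sketch only counts how often each address executes under the decrement-then-branch rule stated above.

```python
LOOP_START = 3     # first address of the loop body
LOOP_INSTR = 17    # address of the loop instruction (also the last address)


def run(loop_count=31):
    """Return how many times each of addresses 0..17 executes."""
    executed = [0] * (LOOP_INSTR + 1)
    counter = 0
    pc = 0
    while pc <= LOOP_INSTR:
        executed[pc] += 1
        if pc == 0:
            counter = loop_count      # initialize instruction sets the loop counter
        if pc == LOOP_INSTR:
            counter -= 1              # decrement, then branch if still above zero
            if counter > 0:
                pc = LOOP_START
                continue
        pc += 1
    return executed


executed = run()
print(executed)
```

With a loop count of 31, addresses 0 through 2 run once and addresses 3 through 17 run 31 times each, matching the thirty repetitions after the first pass described earlier.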

It may be observed that, because of the feedback and masking capability of the output buffer 1104 in the neural network unit 121 embodiment of Figure 49, the loop body of the non-architectural program of Figure 51 contains approximately 34% fewer instructions than that of the non-architectural program of Figure 48. Furthermore, for the same reason, the memory layout in the data random access memory 122 for the non-architectural program of Figure 51 accommodates approximately three times as many time steps as that of Figure 48. These improvements may benefit certain architectural program applications that use the neural network unit 121 to perform long short-term memory (LSTM) cell layer computations, particularly applications in which the number of LSTM cells 4600 in the layer is less than or equal to 128.

The embodiments of Figures 47 through 51 assume that the weight and bias values remain the same across time steps. However, the invention is not so limited. Embodiments in which the weight and bias values vary across time steps are also within its scope: rather than being populated with a single set of weight and bias values as shown in Figures 47 through 50, the weight random access memory 124 is populated with a different set of weight and bias values for each time step, and the weight random access memory 124 addresses in the non-architectural programs of Figures 48 through 51 are modified accordingly.

In general, in the embodiments of Figures 47 through 51, the weights, biases, and intermediate values (e.g., the C and C' values) are stored in the weight random access memory 124, whereas the input and output values (e.g., the X and H values) are stored in the data random access memory 122. This arrangement favors embodiments in which the data random access memory 122 is dual-ported and the weight random access memory 124 is single-ported, because there is more traffic from the non-architectural and architectural programs to the data random access memory 122. However, because the weight random access memory 124 is larger, other embodiments of the invention swap the memories into which the non-architectural and architectural programs write values (i.e., the data random access memory 122 and the weight random access memory 124 are exchanged). That is, the W, U, B, C', tanh(C), and C values are stored in the data random access memory 122 and the X, H, I, F, and O values are stored in the weight random access memory 124 (a modified embodiment of Figure 47); and the W, U, B, and C values are stored in the data random access memory 122 and the X and H values are stored in the weight random access memory 124 (a modified embodiment of Figure 50). Because the weight random access memory 124 is larger, these embodiments can process more time steps in a batch. For architectural program applications that use the neural network unit 121 to perform computations, this is advantageous for applications that benefit from a larger number of time steps and for which a single-ported memory (e.g., the weight random access memory 124) provides sufficient bandwidth.

Figure 52 is a block diagram showing an embodiment of a neural network unit 121 that has output buffer masking and feedback capability within neural processing unit groups and that shares activation function units 1112. The neural network unit 121 of Figure 52 is similar to the neural network unit 121 of Figure 47, and like-numbered elements are similar. However, the four activation function units 212 of Figure 49 are replaced in this embodiment by a single shared activation function unit 1112, which receives the four outputs 217 of the four accumulators 202 and generates four outputs to words OUTBUF[0], OUTBUF[1], OUTBUF[2], and OUTBUF[3]. The neural network unit 121 of Figure 52 operates in a manner similar to the embodiments described above with respect to Figures 49 through 51, and its use of the shared activation function unit 1112 is similar to the embodiments described above with respect to Figures 11 through 13.

Figure 53 is a block diagram showing another embodiment of the layout of data within the data random access memory 122, the weight random access memory 124, and the output buffer 1104 of the neural network unit 121 of Figure 49 as it performs the computations associated with a layer of 128 long short-term memory (LSTM) cells 4600 of Figure 46. The example of Figure 53 is similar to that of Figure 50. However, in Figure 53 the Wi, Wf, and Wo values are in row 0 (rather than row 3 as in Figure 50); the Ui, Uf, and Uo values are in row 1 (rather than row 4 as in Figure 50); the Bi, Bf, and Bo values are in row 2 (rather than row 5 as in Figure 50); and the C values are in row 3 (rather than row 6 as in Figure 50). In addition, the contents of the output buffer 1104 in Figure 53 are similar to those of Figure 50. However, because of the differences between the non-architectural programs of Figures 54 and 51, the contents of the third row (i.e., the I, F, O, and C' values) are present in the output buffer 1104 after the instruction at address 7 executes (rather than address 10 as in Figure 50); the contents of the fourth row (i.e., the I, F, O, and C values) are present after the instruction at address 10 executes (rather than address 13 as in Figure 50); the contents of the fifth row (i.e., the I, F, O, and tanh(C) values) are present after the instruction at address 11 executes (rather than address 14 as in Figure 50); and the contents of the sixth row (i.e., the H values) are present after the instruction at address 13 executes (rather than address 16 as in Figure 50), as described below.

Figure 54 is a table showing a program stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121 of Figure 49, which uses data and weights according to the layout of Figure 53 to accomplish the calculations associated with an LSTM cell layer. The example program of Figure 54 is similar to the program of Figure 51. More specifically, the instructions at addresses 0 through 5 are the same in Figure 54 and Figure 51; the instructions at addresses 7 and 8 of Figure 54 are the same as the instructions at addresses 10 and 11 of Figure 51; and the instructions at addresses 10 through 14 of Figure 54 are the same as the instructions at addresses 13 through 17 of Figure 51.

However, the instruction at address 6 of Figure 54 does not clear the accumulator 202 (whereas the instruction at address 6 of Figure 51 does). In addition, the instructions at addresses 7 through 9 of Figure 51 are absent from the non-architectural program of Figure 54. Finally, the instruction at address 9 of Figure 54 is the same as the instruction at address 12 of Figure 51, except that the former reads row 3 of the weight random access memory 124 whereas the latter reads row 6.

Because of the differences between the non-architectural programs of Figure 54 and Figure 51, the layout of Figure 53 uses three fewer rows of the weight random access memory 124, and the program loop contains three fewer instructions. The loop body of the non-architectural program of Figure 54 is essentially only half the size of the loop body of the non-architectural program of Figure 48, and approximately 80% of the size of the loop body of the non-architectural program of Figure 51.

Figure 55 is a block diagram showing portions of a neural processing unit 126 according to another embodiment of the invention. More specifically, for a single neural processing unit 126 of the plurality of neural processing units 126 of Figure 49, the figure shows the multiplexed register 208 with its associated inputs 207, 211 and 4905, and the multiplexed register 705 with its associated inputs 206, 711 and 4907. In addition to the inputs of Figure 49, the multiplexed register 208 and the multiplexed register 705 of the neural processing unit 126 each receive an index_within_group input 5599. The index_within_group input 5599 indicates the index of the particular neural processing unit 126 within its neural processing unit group 4901. For example, in an embodiment in which each neural processing unit group 4901 has four neural processing units 126, within each group one neural processing unit 126 receives a value of zero on its index_within_group input 5599, one receives a value of one, one receives a value of two, and one receives a value of three. In other words, the index_within_group input 5599 value received by a neural processing unit 126 is its index within the neural network unit 121 modulo J, where J is the number of neural processing units 126 in a neural processing unit group 4901. Thus, for example, neural processing unit 73 receives a value of one on its index_within_group input 5599, neural processing unit 353 receives a value of three, and neural processing unit 6 receives a value of two.
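The modulo rule stated above can be illustrated with a minimal Python sketch. The function name and the default group size are illustrative assumptions, not part of the disclosure. Under this rule NPU 73 and NPU 6 evaluate to one and two, as stated; NPU 353 evaluates to one (353 = 4 × 88 + 1), which differs from the value three quoted for it, so that example is best checked against the drawings.

```python
def index_within_group(npu_index: int, group_size: int = 4) -> int:
    """Return an NPU's position inside its group: its global index modulo J.

    `group_size` plays the role of J; the default of 4 mirrors the
    four-NPU-per-group embodiment and is only an illustrative choice.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return npu_index % group_size


if __name__ == "__main__":
    for npu in (73, 6, 353):
        print(npu, index_within_group(npu))
```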

Furthermore, when the control input 213 specifies a predetermined value, denoted herein as "SELF," the multiplexed register 208 selects the output buffer 1104 output 4905 that corresponds to the index_within_group input 5599 value. Thus, when a non-architectural instruction specifies receiving data from the output buffer 1104 with the SELF value (denoted OUTBUF[SELF] in the instructions at addresses 2 and 7 of Figure 57), the multiplexed register 208 of each neural processing unit 126 receives its corresponding word from the output buffer 1104. For example, when the neural network unit 121 executes the non-architectural instructions at addresses 2 and 7 of Figure 57, the multiplexed register 208 of neural processing unit 73 selects the second (index 1) of its four inputs 4905 to receive word 73 from the output buffer 1104, the multiplexed register 208 of neural processing unit 353 selects the fourth (index 3) of its four inputs 4905 to receive word 353, and the multiplexed register 208 of neural processing unit 6 selects the third (index 2) of its four inputs 4905 to receive word 6. Although not used in the non-architectural program of Figure 57, a non-architectural instruction may also specify receiving data from the output buffer 1104 with the SELF value (OUTBUF[SELF]) so that the control input 713 specifies the predetermined value and the multiplexed register 705 of each neural processing unit 126 receives its corresponding word from the output buffer 1104.
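As a rough illustration of the SELF selection, the sketch below models the output buffer as a Python list and assumes that the J inputs 4905 of an NPU are the J output buffer words belonging to its own group (an assumption drawn from the word numbers in the example above). The function name is hypothetical.

```python
def select_self_word(outbuf, npu_index, group_size=4):
    """Model of OUTBUF[SELF]: pick, among the group's J inputs, the one whose
    position equals the NPU's index within its group."""
    group_base = (npu_index // group_size) * group_size
    # The J candidate inputs (modelled on inputs 4905) seen by this NPU.
    group_inputs = outbuf[group_base:group_base + group_size]
    return group_inputs[npu_index % group_size]


if __name__ == "__main__":
    outbuf = [f"word{n}" for n in range(512)]
    print(select_self_word(outbuf, 73))  # word73
    print(select_self_word(outbuf, 6))   # word6
```

Under this model every NPU ends up with the output buffer word that carries its own global index, which is the behavior the text describes for OUTBUF[SELF].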

Figure 56 is a block diagram showing an example of the layout of data within the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as it performs calculations associated with the Jordan recurrent neural network of Figure 43 using the embodiment of Figure 55. The layout of the weights within the weight random access memory 124 is the same as in the example of Figure 44. The layout of the values within the data random access memory 122 is similar to the example of Figure 44, except that in this example each time step has an associated pair of rows of memory that hold the input layer node D values and the output layer node Y values, rather than a set of four rows as in the example of Figure 44. That is, in this example the hidden layer Z values and the context layer C values are not written to the data random access memory 122. Instead, the output buffer 1104 serves as a scratchpad for the hidden layer Z values and the context layer C values, as described with respect to the non-architectural program of Figure 57. The OUTBUF[SELF] feedback feature of the output buffer 1104 can make the non-architectural program run faster (because two writes and two reads of the data random access memory 122 are replaced by two writes and two reads of the output buffer 1104) and reduces the data random access memory 122 space used per time step, so that the data random access memory 122 of this embodiment can hold data for approximately twice as many time steps as the embodiment of Figures 44 and 45, namely 32 time steps, as shown.
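The two-rows-per-time-step layout can be sketched as simple address arithmetic. The helper names are hypothetical, and the assumption that each pair is stored as consecutive rows (D first, then Y) is taken from the row numbers quoted for the program of Figure 57 (rows 0, 2, ... 62 for D and rows 1, 3, ... 63 for Y).

```python
ROWS_PER_TIME_STEP = 2  # one row of D values and one row of Y values


def d_row(time_step: int) -> int:
    """Data RAM row assumed to hold the input layer node D values."""
    return ROWS_PER_TIME_STEP * time_step


def y_row(time_step: int) -> int:
    """Data RAM row assumed to hold the output layer node Y values."""
    return ROWS_PER_TIME_STEP * time_step + 1


if __name__ == "__main__":
    for t in (0, 1, 31):
        print(t, d_row(t), y_row(t))
```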

Figure 57 is a table showing a program stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121, which uses data and weights according to the layout of Figure 56 to accomplish a Jordan recurrent neural network. The non-architectural program of Figure 57 is similar to the non-architectural program of Figure 45; the differences are described below.

The example program of Figure 57 has 12 non-architectural instructions at addresses 0 through 11. The initialize instruction at address 0 clears the accumulator 202 and initializes the loop counter 3804 to a value of 32, which causes the loop body (the instructions at addresses 2 through 11) to be performed 32 times. The output instruction at address 1 puts the zero values of the accumulators 202 (cleared by the instruction at address 0) into the output buffer 1104. As may be observed, the 512 neural processing units 126 correspond to and operate as the 512 hidden layer nodes Z during execution of the instructions at addresses 2 through 6, and correspond to and operate as the 512 output layer nodes Y during execution of the instructions at addresses 7 through 10. That is, the 32 execution instances of the instructions at addresses 2 through 6 compute the hidden layer node Z values for the 32 corresponding time steps and put them into the output buffer 1104. There they are used by the corresponding 32 execution instances of the instructions at addresses 7 through 9 to compute the output layer node Y values of the 32 corresponding time steps and write them to the data random access memory 122, and by the corresponding 32 execution instances of the instruction at address 10 to put the context layer node C values of the 32 corresponding time steps into the output buffer 1104. (The context layer node C values of the 32nd time step that are put into the output buffer 1104 are not used.)

During the first execution instance of the instructions at addresses 2 and 3 (ADD_D_ACC OUTBUF[SELF] and ADD_D_ACC ROTATE, COUNT=511), each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values of the output buffer 1104, which were generated and written by execution of the instructions at addresses 0 through 1. During the second and subsequent execution instances of the instructions at addresses 2 and 3, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values of the output buffer 1104, which were generated and written by execution of the instructions at addresses 7 through 8 and 10. More specifically, the instruction at address 2 instructs the multiplexed register 208 of each neural processing unit 126 to select its corresponding output buffer 1104 word, as described above, and add it to the accumulator 202; the instruction at address 3 instructs the neural processing units 126 to rotate the context node C values in the 512-word rotator collectively formed by the connected multiplexed registers 208 of the 512 neural processing units 126, so that each neural processing unit 126 can accumulate all 512 context node C values into its accumulator 202. The instruction at address 3 does not clear the accumulator 202, so that the instructions at addresses 4 and 5 can add the input layer node D values (multiplied by their corresponding weights) to the context layer node C values accumulated by the instructions at addresses 2 and 3.
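The add-then-rotate behavior of the instructions at addresses 2 and 3 can be modelled in a few lines of Python. This is a behavioral sketch only: the function name is hypothetical, the accumulators are assumed to start cleared, and the rotation direction (each NPU takes the word held by its lower-numbered neighbor) is an assumption.

```python
def add_with_rotation(words):
    """Each modelled NPU adds its own word, then the words rotated past it.

    With n words, one initial add plus n - 1 rotations (COUNT = n - 1) leaves
    every accumulator holding the sum of all n words.
    """
    n = len(words)
    regs = list(words)   # address 2: each NPU latches its OUTBUF[SELF] word
    acc = list(regs)     # ... and adds it to its cleared accumulator
    for _ in range(n - 1):  # address 3: ROTATE with COUNT = n - 1
        # Assumed direction: NPU i receives the word held by NPU i - 1.
        regs = [regs[(i - 1) % n] for i in range(n)]
        acc = [a + r for a, r in zip(acc, regs)]
    return acc


if __name__ == "__main__":
    print(add_with_rotation([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

For 512 words this corresponds to COUNT=511, matching the instruction text.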

During each execution instance of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), each of the 512 neural processing units 126 performs 512 multiply operations of the 512 input node D values in the row of the data random access memory 122 associated with the current time step (e.g., row 0 for time step 0, row 2 for time step 1, and so forth to row 62 for time step 31) by the weights from rows 0 through 511 of the weight random access memory 124 in the column corresponding to the neural processing unit 126. The resulting 512 products, together with the accumulation of the 512 context node C values performed by the instructions at addresses 2 and 3, are accumulated into the accumulator 202 of the respective neural processing unit 126 to compute the hidden layer node Z values.
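A small Python model may help show why one weight row per rotation step yields a full weighted sum in every accumulator. The weight layout used here, in which weight row k, column i holds the weight from source node (i - k) mod n to NPU i, is an assumption chosen to match the assumed rotation direction; the actual arrangement is defined by the weight layout figures, and all names are hypothetical.

```python
def rotated_layout(w):
    """Convert logical weights w[src][dst] into assumed weight-RAM rows."""
    n = len(w)
    return [[w[(i - k) % n][i] for i in range(n)] for k in range(n)]


def mult_accum_with_rotation(data, weight_rows, acc):
    """One initial multiply-accumulate followed by n - 1 rotate-and-accumulate steps."""
    n = len(data)
    regs = list(data)
    acc = list(acc)  # accumulators may already hold the summed context values
    for k in range(n):
        if k > 0:
            # Assumed direction: NPU i receives the word held by NPU i - 1.
            regs = [regs[(i - 1) % n] for i in range(n)]
        acc = [acc[i] + regs[i] * weight_rows[k][i] for i in range(n)]
    return acc


if __name__ == "__main__":
    w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    data = [1, 10, 100]
    print(mult_accum_with_rotation(data, rotated_layout(w), [0, 0, 0]))
```

Each accumulator i ends up with the sum over j of data[j] * w[j][i], that is, one column of a matrix-vector product per NPU.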

During each execution instance of the instruction at address 6 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to their respective words of the output buffer 1104, and the accumulators 202 are cleared.

During each execution instance of the instructions at addresses 7 and 8 (MULT-ACCUM OUTBUF[SELF], WR ROW 512 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), each of the 512 neural processing units 126 performs 512 multiply operations of the 512 hidden node Z values in the output buffer 1104 (which were generated and written by the corresponding execution instance of the instructions at addresses 2 through 6) by the weights from rows 512 through 1023 of the weight random access memory 124 in the column corresponding to the neural processing unit 126, to generate 512 products that are accumulated into the accumulator 202 of the respective neural processing unit 126.

During each execution instance of the instruction at address 9 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+2), an activation function (e.g., hyperbolic tangent, sigmoid, rectify) is performed on the 512 accumulated values to compute the output node Y values, which are written to the row of the data random access memory 122 associated with the current time step (e.g., row 1 for time step 0, row 3 for time step 1, and so forth to row 63 for time step 31). The instruction at address 9 does not clear the accumulator 202.

During each execution instance of the instruction at address 10 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 values accumulated by the instructions at addresses 7 and 8 are put into the output buffer 1104 for use by the next execution instance of the instructions at addresses 2 and 3, and the accumulators 202 are cleared.

The loop instruction at address 11 decrements the loop counter 3804 and, if the new loop counter 3804 value is still greater than zero, loops back to the instruction at address 2.
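The instruction walk-through above can be summarized as a compact behavioral model. This is a sketch under stated assumptions rather than a model of the hardware: it uses plain Python floats instead of the unit's integer arithmetic, replaces the rotator with direct sums, takes logical weight matrices `w_in[src][dst]` and `w_out[src][dst]` (hypothetical names), and stores the results in a Python list instead of the data random access memory.

```python
import math


def jordan_program(d_steps, w_in, w_out, activation=math.tanh):
    """Behavioral model of the loop described for addresses 0 through 11."""
    n = len(w_in)
    outbuf = [0.0] * n  # addresses 0-1: cleared accumulators copied to OUTBUF
    y_steps = []
    for d in d_steps:  # loop body, addresses 2-11, once per time step
        # Addresses 2-3: every NPU accumulates all context values (unweighted).
        c_sum = sum(outbuf)
        # Addresses 4-5: add the weighted input layer node D values.
        acc = [c_sum + sum(d[j] * w_in[j][i] for j in range(n)) for i in range(n)]
        outbuf = acc  # address 6: hidden Z values passed through to OUTBUF
        # Addresses 7-8: weighted sum of the hidden Z values (accumulators were cleared).
        acc = [sum(outbuf[j] * w_out[j][i] for j in range(n)) for i in range(n)]
        # Address 9: activation function gives Y, which the hardware writes to data RAM.
        y_steps.append([activation(a) for a in acc])
        outbuf = acc  # address 10: pre-activation values become the next context C
    return y_steps


if __name__ == "__main__":
    identity = [[1, 0], [0, 1]]
    print(jordan_program([[1, 2], [0, 0]], identity, identity))
```

Passing `activation=lambda x: x` makes the data flow easy to trace by hand, since the second time step then shows the summed context values directly.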

As described in the section corresponding to Figure 44, in the example in which the non-architectural program of Figure 57 performs the Jordan recurrent neural network, an activation function is applied to the accumulator 202 values to generate the output layer node Y values. However, the example assumes that the accumulator 202 values prior to application of the activation function, rather than the actual output layer node Y values, are passed through to the context layer nodes C. For a Jordan recurrent neural network in which the activation function is applied to the accumulator 202 values to produce the context layer nodes C, the instruction at address 10 would be removed from the non-architectural program of Figure 57. In the embodiments described herein, the Elman or Jordan recurrent neural network has a single hidden node layer (e.g., Figures 40 and 42); however, it should be understood that embodiments of the processor 100 and the neural network unit 121 can efficiently perform, in a manner similar to that described herein, the calculations associated with recurrent neural networks that have multiple hidden layers.

As described above in the section corresponding to Figure 2, each neural processing unit 126 operates as a neuron of an artificial neural network, and all the neural processing units 126 of the neural network unit 121 operate in a massively parallel fashion to efficiently compute the neuron output values for a layer of the network. This parallel manner of operation, in particular the use of the rotator collectively formed by the multiplexed registers of the neural processing units, is not intuitive relative to the conventional manner of computing neuron layer outputs. More specifically, conventional approaches typically perform the computations associated with a single neuron, or a very small subset of neurons (e.g., using parallel arithmetic units to perform the multiplies and adds), then move on to the computations associated with the next neuron in the same layer, and so forth in a serial fashion until the computations have been performed for all the neurons in the layer. In contrast, in each clock cycle all the neural processing units 126 (neurons) of the neural network unit 121 perform in parallel a small subset of the computations (e.g., a single multiply and accumulate) needed to generate all the neuron outputs. At the end of approximately M clock cycles, where M is the number of nodes connected into the current layer, the neural network unit 121 has computed the outputs of all the neurons. For many artificial neural network configurations, because of the large number of neural processing units 126, the neural network unit 121 can compute the neuron output values for all the neurons of an entire layer by the end of the M clock cycles. As described herein, this computation is efficient for all types of artificial neural network computations, including but not limited to feed-forward and recurrent neural networks such as Elman, Jordan and LSTM networks. Finally, although in the embodiments described herein the neural network unit 121 is configured as 512 neural processing units 126 (e.g., in a wide word configuration) to perform the recurrent neural network computations, the invention is not limited thereto. Embodiments in which the neural network unit 121 is configured as 1024 neural processing units 126 (e.g., in a narrow word configuration) to perform recurrent neural network computations, as well as neural network units 121 having numbers of neural processing units 126 other than 512 and 1024 as described above, are also within the scope of the invention.
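The cycle-count argument can be made concrete with an idealized count. The model below is an assumption for illustration only: it ignores instruction issue, memory latency and activation function time, and the serial baseline with a configurable number of multiply-accumulates per cycle is hypothetical.

```python
import math


def parallel_cycles(neurons: int, inputs: int, npus: int) -> int:
    """Idealized cycles when every NPU performs one multiply-accumulate per clock."""
    passes = math.ceil(neurons / npus)  # extra passes if the layer exceeds the array
    return passes * inputs


def serial_cycles(neurons: int, inputs: int, macs_per_cycle: int = 1) -> int:
    """Idealized cycles when neurons are processed one after another."""
    return math.ceil(neurons * inputs / macs_per_cycle)


if __name__ == "__main__":
    print(parallel_cycles(512, 512, 512))  # 512
    print(serial_cycles(512, 512))         # 262144
```

With M = 512 connected nodes and 512 NPUs, the parallel count is M cycles for the whole layer, which is the "approximately M clock cycles" figure in the text.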

The foregoing describes only preferred embodiments of the invention and should not be taken to limit the scope of the invention; simple equivalent changes and modifications made in accordance with the claims and the description of the invention all remain within the scope covered by the patent. For example, software can enable the function, fabrication, modeling, simulation, description and/or testing of the apparatus and methods described herein. This can be accomplished through the use of general programming languages (e.g., C, C++), hardware description languages (HDL) including Verilog HDL, VHDL and so on, or other available programs. Such software can be disposed in any known computer-usable medium such as magnetic tape, semiconductor, magnetic disk, optical disc (e.g., CD-ROM, DVD-ROM, etc.), or a network, wire line, wireless or other communications medium. Embodiments of the apparatus and methods described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (e.g., embodied in a hardware description language), and transformed to hardware in the production of integrated circuits. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software. Thus, none of the embodiments described herein should limit the scope of the invention. Furthermore, the invention may be implemented within a microprocessor device of a general-purpose computer. Finally, those skilled in the art can use the disclosed concepts and embodiments as a basis for designing or modifying other structures to carry out the same purposes without departing from the scope of the invention.

100‧‧‧processor

101‧‧‧instruction fetch unit

102‧‧‧instruction cache

103‧‧‧architectural instructions

104‧‧‧instruction translator

105‧‧‧microinstructions

106‧‧‧rename unit

108‧‧‧reservation stations

112‧‧‧other execution units

114‧‧‧memory subsystem

116‧‧‧general purpose registers

118‧‧‧media registers

121‧‧‧neural network unit

122‧‧‧data random access memory

126‧‧‧neural processing unit

127‧‧‧control and status register

128‧‧‧sequencer

129‧‧‧program memory

123,125,131‧‧‧memory address

133‧‧‧result

Claims

An apparatus, comprising: a plurality of arithmetic logic units, each of the arithmetic logic units having: an accumulator; and an integer arithmetic unit that receives integer inputs and performs integer arithmetic operations thereon, and accumulates integer results of a series of the integer arithmetic operations into the accumulator as an integer accumulated value; a register programmable with an indicator of a number of fractional bits of the integer accumulated value and an indicator of a number of fractional bits of an integer output; wherein a first bit width of the accumulator is greater than twice a second bit width of the integer output; and a plurality of adjustment units that, based on the indicator of the number of fractional bits of the integer accumulated value and the indicator of the number of fractional bits of the integer output programmed into the register, scale and saturate the first-bit-width integer accumulated value to generate the second-bit-width integer output.

The apparatus of claim 1, wherein the adjustment units compute a difference between the number of fractional bits of the output and the number of fractional bits of the integer accumulated value, and the adjustment units shift the integer accumulated value right by an amount equal to the difference to scale the integer accumulated value.

The apparatus of claim 1, wherein, when the right-shifted integer accumulated value is greater than/less than the maximum/minimum value representable in the second bit width, the adjustment units saturate the integer accumulated value to the maximum/minimum value representable in the second bit width.

The apparatus of claim 1, wherein the accumulator has at least Q storage bits to store the integer accumulated value, wherein Q is a number of bits sufficient to accumulate a series of P integer results without loss of precision.

The apparatus of claim 4, wherein Q is equal to M plus log2 P, wherein M is a bit width of the integer results.

The apparatus of claim 5, wherein P is a predetermined maximum allowable number of the series of integer results that generate the integer accumulated value.

The apparatus of claim 5, wherein when the integer input has a bit width of 8, P is 1023 and Q is 28.

The apparatus of claim 5, wherein when the integer input has a bit width of 16, P is 511 and Q is 41.

The apparatus of claim 1, further comprising: a memory that holds the integer inputs and provides them to the plurality of arithmetic logic units.

The device of claim 9, wherein the plurality of adjustment units write the integer input into the memory.

The apparatus of claim 1, wherein each of the arithmetic logic units further comprises: an integer multiplier that performs an integer multiplication of the integer inputs to generate an integer product; and an integer adder that performs an addition of the integer product and the integer accumulated value to generate an integer sum that is stored back to the accumulator.

The apparatus of claim 1, wherein each of the adjustment units further comprises: a rounder that rounds the integer accumulated value based on the least significant J bits of the integer accumulated value, wherein J is the difference between the number of fractional bits of the output and the number of fractional bits of the integer accumulated value.

The apparatus of claim 1, wherein the register is further programmable with a first indicator and a second indicator, the first indicator indicating whether data words received from a first memory are signed or unsigned, and the second indicator indicating whether weight words received from a second memory are signed or unsigned.

The apparatus of claim 1, wherein a bit width of the integer inputs is the same as the second bit width of the integer output.

The apparatus of claim 1, wherein the apparatus comprises a neural network unit, and the adjustment units comprise activation function units that perform an activation function on the scaled or saturated integer accumulated value to normalize it in a non-linear fashion and generate a result that falls within a predetermined range of values.

A method comprising: programstizing a temporary register by using one of an indicator of the number of fractional bits of the integer accumulated value and the number of fractional bits of the integer output; using a plurality of arithmetic logic units, each having one An accumulator and an arithmetic logic unit of an integer arithmetic unit: using the integer arithmetic unit, performing an integer arithmetic operation on the integer input; and accumulating a series of integer result of the integer arithmetic operations to the accumulator as an integer accumulated value; Wherein the first bit width of one of the accumulators is greater than twice the width of the second bit of the integer output; and the adjusting unit of the plurality of adjusting units is programmed according to the temporary register The indicator of the number of decimal places of the integer accumulated value and the number of decimal places of the integer output, scaling and filling the first bit width accumulated value to generate the second bit width integer output .

The method of claim 16, further comprising: calculating, using the adjusting unit, a difference between the number of fractional bits of the integer output and the number of fractional bits of the integer accumulated value; wherein the step of using the adjusting unit to scale the integer accumulated value includes shifting the integer accumulated value to the right by the difference.

The method of claim 16, wherein the step of saturating the integer accumulated value comprises: when the right-shifted integer accumulated value is greater than/less than a maximum/minimum value that can be represented by the second bit width, saturating, by the adjusting unit, the integer accumulated value to the maximum/minimum value that can be represented by the second bit width.
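
By way of illustration only, the right shift and the clamping (saturation) to the second bit width recited above might be sketched as follows. The function name `scale_and_saturate` and its parameters are hypothetical and do not appear in the claims; the sketch assumes signed two's-complement values and an accumulated value that has at least as many fractional bits as the output.

```python
def scale_and_saturate(acc: int, acc_frac_bits: int,
                       out_frac_bits: int, out_width: int) -> int:
    """Shift a signed accumulated value to the output binary point, then clamp.

    All names are illustrative assumptions, not terms defined by the claims.
    """
    # Difference between the two user-specified fractional-bit counts.
    shift = acc_frac_bits - out_frac_bits
    if shift < 0:
        raise ValueError("sketch assumes the accumulator has >= fractional bits")

    # Python's >> on int is an arithmetic (sign-preserving) right shift.
    scaled = acc >> shift

    # Largest and smallest values representable in a signed out_width-bit word.
    max_out = (1 << (out_width - 1)) - 1
    min_out = -(1 << (out_width - 1))

    # Saturate: out-of-range values are pinned to the nearest representable bound.
    return max(min_out, min(max_out, scaled))


if __name__ == "__main__":
    print(scale_and_saturate(4660, 10, 3, 8))      # in range after the shift
    print(scale_and_saturate(100000, 10, 3, 8))    # clamps to the 8-bit maximum
    print(scale_and_saturate(-100000, 10, 3, 8))   # clamps to the 8-bit minimum
```

This sketch truncates the shifted-out bits; any rounding of those bits would be a separate step.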

The method of claim 16, wherein the accumulator has at least Q storage bits to store the integer accumulated value, wherein Q is a number of bits sufficient to accumulate a series of P integer results without loss of precision.

The method of claim 19, wherein Q is equal to M plus log2(P), wherein M is the bit width of the integer result.
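
As a non-limiting arithmetic check of the relation Q = M + log2(P), the following sketch computes Q and confirms that the worst-case sum of P signed M-bit results fits in Q bits. The helper names `accumulator_bits` and `fits_signed` are hypothetical, and rounding log2(P) up when P is not a power of two is an assumption of this sketch rather than something the claim states.

```python
def accumulator_bits(m: int, p: int) -> int:
    """Return Q = M + ceil(log2(P)) for P >= 1 (ceiling is an assumption)."""
    if p < 1:
        raise ValueError("P must be at least 1")
    # (p - 1).bit_length() equals ceil(log2(p)) for every integer p >= 1.
    return m + (p - 1).bit_length()


def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a signed two's-complement word."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


if __name__ == "__main__":
    m, p = 16, 1024
    q = accumulator_bits(m, p)
    # Worst cases: every one of the P results is the M-bit minimum or maximum.
    worst_negative = p * -(1 << (m - 1))
    worst_positive = p * ((1 << (m - 1)) - 1)
    print(q, fits_signed(worst_negative, q), fits_signed(worst_positive, q))
```

With M = 16 and P = 1024 the sketch gives Q = 26; one bit fewer would no longer hold the most negative possible sum.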

The method of claim 16, further comprising: using each of the adjusting units, rounding the integer accumulated value according to the least significant J bits of the integer accumulated value, wherein J is the difference between the number of fractional bits of the integer output and the number of fractional bits of the integer accumulated value.
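
For illustration, one possible way to round according to the least significant J bits is round-to-nearest with ties rounded upward, sketched below. The claim does not fix a rounding mode, so that choice, like the function name `round_by_low_bits`, is an assumption of this sketch.

```python
def round_by_low_bits(acc: int, j: int) -> int:
    """Drop the J least significant bits of acc, rounding to nearest.

    Ties round toward positive infinity; this mode is an assumption.
    """
    if j < 0:
        raise ValueError("J must be non-negative")
    if j == 0:
        return acc  # nothing is dropped, so nothing to round

    # The J bits about to be shifted out (non-negative even for negative acc,
    # because Python ints behave as infinite two's-complement values).
    dropped = acc & ((1 << j) - 1)
    half = 1 << (j - 1)

    # Arithmetic right shift floors the value; add 1 if the dropped bits
    # amount to one half or more of an output unit.
    return (acc >> j) + (1 if dropped >= half else 0)


if __name__ == "__main__":
    for value in (11, 9, -9):
        print(value, round_by_low_bits(value, 2))
```

Other modes, such as round-to-nearest-even, inspect the same J bits and differ only in how a tie is resolved.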

The method of claim 16, further comprising: using the adjusting unit, performing an activation function on the scaled or saturated integer accumulated value to normalize it in a non-linear manner and generate a result that falls within a predetermined range of values.

A computer program product encoded in at least one non-transitory computer usable medium for use with a computing device, comprising: computer usable program code embodied in the medium for describing a neural network unit, the computer usable program code comprising: first program code for describing a plurality of arithmetic logic units, each of the arithmetic logic units having an accumulator and an integer arithmetic unit, the integer arithmetic unit receiving an integer input and performing an integer arithmetic operation thereon, and accumulating a series of integer results of the integer arithmetic operations into the accumulator as an integer accumulated value; second program code for describing a register that is programmable with an indicator of the number of fractional bits of the integer accumulated value and an indicator of the number of fractional bits of an integer output; wherein a first bit width of the accumulator is greater than twice a second bit width of the integer output; and third program code for describing a plurality of adjusting units that, according to the indicator of the number of fractional bits of the integer accumulated value and the indicator of the number of fractional bits of the integer output programmed in the register, scale and saturate the first-bit-width integer accumulated value to produce the second-bit-width integer output.