TWI616825B

TWI616825B - Neural network unit with output buffer feedback and masking capability

Info

Publication number: TWI616825B
Application number: TW105132064A
Authority: TW
Inventors: G. Glenn Henry; Terry Parks; Kyle T. O'Brien
Original assignee: 上海兆芯集成電路有限公司; Via Alliance Semiconductor Co., Ltd.
Priority date: 2015-10-08
Filing date: 2016-10-04
Publication date: 2018-03-01
Also published as: CN106598545A; TWI579694B; TWI608429B; CN106599992A; CN106599990A; TW201714120A; TWI626587B; CN106599991B; TWI650707B; CN106598545B; TW201714078A; CN106650923B; CN106599992B; TWI601062B; TW201714091A; TW201714080A; CN106599991A; CN106599989A; CN106599989B; TW201714081A

Abstract

An output buffer holds N words arranged as N/J mutually exclusive output buffer word groups. N processing units are arranged as N/J mutually exclusive processing unit groups. Each processing unit includes first and second multiplexed registers, an accumulator, and an arithmetic unit. Each multiplexed register has at least J+1 inputs and an output; a first input receives an operand from a memory, and the other inputs receive the words of the corresponding output buffer word group. The accumulator provides its output to the corresponding output buffer word. The arithmetic unit performs an operation on the outputs of the first and second multiplexed registers and the output of the accumulator to generate a result that is accumulated into the accumulator. The output buffer includes a mask input that controls which of the N words retain their current value and which are updated with the corresponding accumulator output.

Description

Neural network unit with output buffer feedback and masking capability

This application claims international priority to the following U.S. provisional applications, each of which is incorporated herein by reference in its entirety.

This application is related to the following concurrently filed U.S. provisional applications, each of which is incorporated herein by reference in its entirety.

In recent years, artificial neural networks (ANN) have attracted renewed interest. Such research is commonly referred to as deep learning, computer learning, and similar terms. The increase in the computing power of general purpose processors has also contributed to this renewed interest, decades on, in artificial neural networks. Recent applications of artificial neural networks include language and image recognition. There appears to be a growing demand for improved performance and efficiency of artificial neural network computations.

In view of this, the present invention provides an apparatus that includes an output buffer and an array of N processing units. The output buffer holds N words arranged as N/J mutually exclusive output buffer word groups, each having J of the N words, where J is greater than 2 and N is at least twice J. The N processing units of the array are arranged as N/J mutually exclusive processing unit groups, each having J of the N processing units, and each processing unit group corresponds to one of the N/J output buffer word groups. Each processing unit includes first and second multiplexed registers, an accumulator, and an arithmetic unit. Each multiplexed register has at least J+1 inputs, an output, and a control input. A first input of the J+1 inputs receives an operand from a memory, and the other J inputs receive the J words of the corresponding output buffer word group. The control input controls which of the J+1 inputs is selected for provision on the output. The accumulator has an output that is provided to a corresponding one of the N output buffer words. The arithmetic unit has first, second, and third inputs: the first and second inputs receive the outputs of the first and second multiplexed registers, respectively, and the third input receives the output of the accumulator. The arithmetic unit performs an operation on the first, second, and third inputs to generate a result that is accumulated into the accumulator. The output buffer includes a mask input that controls which of the N words retain their current value and which are updated with the output of their corresponding accumulator.
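The arrangement described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware described in the patent: the class and function names (`OutputBuffer`, `mux_reg_select`, `arithmetic_step`) are hypothetical, the mask polarity (True means "retain") is assumed, and multiply-accumulate is assumed as the arithmetic operation because the text only says "an operation".

```python
class OutputBuffer:
    """Holds N words split into N/J mutually exclusive groups of J words."""

    def __init__(self, n, j):
        # Constraints stated in the text: J > 2 and N at least twice J.
        # N being a multiple of J is implied by the N/J groups.
        if j <= 2 or n < 2 * j or n % j != 0:
            raise ValueError("need J > 2, N >= 2*J, and N divisible by J")
        self.n, self.j = n, j
        self.words = [0] * n

    def group_words(self, pu_index):
        """Return the J words of the group that processing unit pu_index belongs to."""
        start = (pu_index // self.j) * self.j
        return self.words[start:start + self.j]

    def update(self, accumulators, mask):
        # ASSUMPTION: mask[i] == True means word i retains its current value;
        # False means it is updated with the corresponding accumulator output.
        self.words = [old if keep else acc
                      for old, acc, keep in zip(self.words, accumulators, mask)]


def mux_reg_select(memory_operand, group_words, control):
    """Model a multiplexed register with J+1 inputs.

    Input 0 is the memory operand; inputs 1..J are the J words of the
    corresponding output buffer word group. `control` picks one of them.
    """
    inputs = [memory_operand] + list(group_words)
    return inputs[control]


def arithmetic_step(acc, a, b):
    # ASSUMPTION: multiply-accumulate stands in for "an operation" on the
    # two multiplexed register outputs and the accumulator output.
    return acc + a * b


ob = OutputBuffer(n=8, j=4)
ob.update(accumulators=[1, 2, 3, 4, 5, 6, 7, 8], mask=[False] * 8)
ob.update(accumulators=[10] * 8, mask=[True, False] * 4)
print(ob.words)           # even-indexed words retained, odd-indexed words updated
print(ob.group_words(5))  # the four words fed back to processing units 4..7
```

In this model, feeding a group's words back through `mux_reg_select` is what lets one processing unit consume a value produced by another unit in the same group, while the mask lets a program update only some of the words.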

The invention also provides a processor that includes an execution unit, the execution unit including an output buffer and an array of N processing units. The output buffer holds N words arranged as N/J mutually exclusive output buffer word groups, each having J of the N words, where J is greater than 2 and N is at least twice J. The N processing units of the array are arranged as N/J mutually exclusive processing unit groups, each having J of the N processing units, and each processing unit group corresponds to one of the N/J output buffer word groups. Each processing unit includes first and second multiplexed registers, an accumulator, and an arithmetic unit. Each multiplexed register has at least J+1 inputs, an output, and a control input. A first input of the J+1 inputs receives an operand from a memory, and the other J inputs receive the J words of the corresponding output buffer word group. The control input controls which of the J+1 inputs is selected for provision on the output. The accumulator has an output that is provided to a corresponding one of the N output buffer words. The arithmetic unit has first, second, and third inputs: the first and second inputs receive the outputs of the first and second multiplexed registers, respectively, and the third input receives the output of the accumulator. The arithmetic unit performs an operation on the first, second, and third inputs to generate a result that is accumulated into the accumulator. The output buffer includes a mask input that controls which of the N words retain their current value and which are updated with the output of their corresponding accumulator.

The invention also provides a computer program product encoded in at least one non-transitory computer usable medium for use with a computing device. The computer program product includes computer usable program code embodied in the medium for specifying an apparatus. The computer usable program code includes first program code and second program code. The first program code specifies an output buffer that holds N words arranged as N/J mutually exclusive output buffer word groups, each having J of the N words, where J is greater than 2 and N is at least twice J. The second program code specifies an array of N processing units arranged as N/J mutually exclusive processing unit groups, each having J of the N processing units, and each processing unit group corresponds to one of the N/J output buffer word groups. Each processing unit includes first and second multiplexed registers, an accumulator, and an arithmetic unit. Each multiplexed register has at least J+1 inputs, an output, and a control input. A first input of the J+1 inputs receives an operand from a memory, and the other J inputs receive the J words of the corresponding output buffer word group. The control input controls which of the J+1 inputs is selected for provision on the output. The accumulator has an output that is provided to a corresponding one of the N output buffer words. The arithmetic unit has first, second, and third inputs: the first and second inputs receive the outputs of the first and second multiplexed registers, respectively, and the third input receives the output of the accumulator. The arithmetic unit performs an operation on the first, second, and third inputs to generate a result that is accumulated into the accumulator. The output buffer includes a mask input that controls which of the N words retain their current value and which are updated with the output of their corresponding accumulator.

The specific embodiments of the present invention are further described by the following embodiments and drawings.

100‧‧‧Processor

101‧‧‧Instruction fetch unit

102‧‧‧Instruction cache

103‧‧‧Architectural instruction

104‧‧‧Instruction translator

105‧‧‧Microinstruction

106‧‧‧Rename unit

108‧‧‧Reservation station

112‧‧‧Other execution units

114‧‧‧Memory subsystem

116‧‧‧General purpose register

118‧‧‧Media register

121‧‧‧Neural network unit

122‧‧‧Data random access memory

124‧‧‧Weight random access memory

126‧‧‧Neural processing unit

127‧‧‧Control and status register

128‧‧‧Sequencer

129‧‧‧Program memory

123,125,131‧‧‧Memory address

133,215,133A,133B,215A,215B‧‧‧Result

202‧‧‧Accumulator

204‧‧‧Arithmetic logic unit

203,209,217,203A,203B,209A,209B,217A,217B‧‧‧Output

205,205A,205B‧‧‧Register

206,207,211,1811,206A,206B,207A,207B,211A,211B,1811A,1811B,711‧‧‧Input

208,705,208A,208B‧‧‧Multiplexed register

213,713,803‧‧‧Control input

212‧‧‧Activation function unit

242‧‧‧Multiplier

244‧‧‧Adder

246,246A,246B‧‧‧Product

802‧‧‧Multiplexer

1104‧‧‧Row buffer

1112‧‧‧Activation function unit

1400‧‧‧MTNN instruction

1500‧‧‧MFNN instruction

1432,1532‧‧‧Function

1402,1502‧‧‧Opcode field

1404‧‧‧src1 field

1406‧‧‧src2 field

1408,1508‧‧‧gpr field

1412,1512‧‧‧Immediate field

1422,1522‧‧‧Address

1424,1426,1524‧‧‧Data block

1428,1528‧‧‧Selected row

1434‧‧‧Control logic

1504‧‧‧dst field

1602‧‧‧Read port

1604‧‧‧Write port

1606‧‧‧Memory array

1702‧‧‧Port

1704‧‧‧Buffer

1898‧‧‧Operand selection logic

1896A‧‧‧Wide multiplexer

1896B‧‧‧Narrow multiplexer

242A‧‧‧Wide multiplier

242B‧‧‧Narrow multiplier

244A‧‧‧Wide adder

244B‧‧‧Narrow adder

204A‧‧‧Wide arithmetic logic unit

204B‧‧‧Narrow arithmetic logic unit

202A‧‧‧Wide accumulator

202B‧‧‧Narrow accumulator

212A‧‧‧Wide activation function unit

212B‧‧‧Narrow activation function unit

2402‧‧‧Convolution kernel

2404‧‧‧Data array

2406A,2406B‧‧‧Data matrix

2602,2604,2606,2608,2902,2912,2914,2922,2924,2926,2932,2934,2942,2944,2952,2954,2956,2923,2962,2964‧‧‧Field

3003‧‧‧Random bit source

3005‧‧‧Random bit

3002‧‧‧Positive-form converter and output binary point aligner

3004‧‧‧Rounder

3006‧‧‧Multiplexer

3008‧‧‧Standard-size compressor and saturator

3012‧‧‧Bit selector and saturator

3018‧‧‧Rectifier

3014‧‧‧Reciprocal multiplier

3016‧‧‧Right shifter

3028‧‧‧Standard-size pass-through value

3022‧‧‧Hyperbolic tangent module

3024‧‧‧Sigmoid module

3026‧‧‧Softplus module

3032‧‧‧Multiplexer

3034‧‧‧Sign restorer

3036‧‧‧Size converter and saturator

3037‧‧‧Multiplexer

3038‧‧‧Output register

3402‧‧‧Multiplexer

3401‧‧‧Neural processing unit pipeline stage

3404‧‧‧Decoder

3412,3414,3418‧‧‧Micro-operation

3416‧‧‧Microinstruction

3422‧‧‧Mode indicator

3502‧‧‧Clock generation logic

3504‧‧‧Clock reduction logic

3514‧‧‧Interface logic

3522‧‧‧Data random access memory buffer

3524‧‧‧Weight random access memory buffer

3512‧‧‧Relaxation indicator

3802‧‧‧Program counter

3804‧‧‧Loop counter

3806‧‧‧Iteration counter

3912,3914,3916‧‧‧Field

4901‧‧‧Neural processing unit group

4903‧‧‧Mask

4905,4907,5599‧‧‧Input

Figure 1 is a block diagram of a processor that includes a neural network unit (NNU).

Figure 2 is a block diagram of a neural processing unit (NPU) of Figure 1.

Figure 3 is a block diagram showing the N multiplexed registers of the N NPUs of the NNU of Figure 1 operating as an N-word rotator, or circular shifter, on a row of data words received from the data random access memory (RAM) of Figure 1.

Figure 4 is a table showing a program stored in the program memory of the NNU of Figure 1 and executed by the NNU.

Figure 5 is a timing diagram showing the NNU executing the program of Figure 4.

Figure 6A is a block diagram showing the NNU of Figure 1 executing the program of Figure 4.

Figure 6B is a flowchart showing the processor of Figure 1 executing an architectural program that uses the NNU to perform the multiply-accumulate-activation-function computations typically associated with the neurons of a hidden layer of an artificial neural network, such as those performed by the program of Figure 4.

Figure 7 is a block diagram showing an alternate embodiment of the NPU of Figure 1.

Figure 8 is a block diagram showing another alternate embodiment of the NPU of Figure 1.

Figure 9 is a table showing a program stored in the program memory of the NNU of Figure 1 and executed by the NNU.

Figure 10 is a timing diagram showing the NNU executing the program of Figure 9.

Figure 11 is a block diagram showing an embodiment of the NNU of Figure 1. In the embodiment of Figure 11, a neuron is split into two parts, an activation function unit (AFU) part and an arithmetic logic unit (ALU) part (which also includes the shift register part), and each AFU part is shared by multiple ALU parts.

Figure 12 is a timing diagram showing the NNU of Figure 11 executing the program of Figure 4.

Figure 13 is a timing diagram showing the NNU of Figure 11 executing the program of Figure 4.

Figure 14 is a block diagram showing a move to neural network (MTNN) architectural instruction and its operation with respect to portions of the NNU of Figure 1.

Figure 15 is a block diagram showing a move from neural network (MFNN) architectural instruction and its operation with respect to portions of the NNU of Figure 1.

Figure 16 is a block diagram showing an embodiment of the data RAM of Figure 1.

Figure 17 is a block diagram showing an embodiment of the weight RAM of Figure 1 and a buffer.

Figure 18 is a block diagram showing a dynamically configurable NPU of Figure 1.

Figure 19 is a block diagram showing, according to the embodiment of Figure 18, the 2N multiplexed registers of the N NPUs of the NNU of Figure 1 operating as a rotator on a row of data words received from the data RAM of Figure 1.

Figure 20 is a table showing a program stored in the program memory of, and executed by, the NNU of Figure 1 having NPUs according to the embodiment of Figure 18.

Figure 21 is a timing diagram showing the program of Figure 20 being executed by an NNU whose NPUs, as shown in Figure 18, operate in a narrow configuration.

Figure 22 is a block diagram showing the NNU of Figure 1 having the NPUs of Figure 18 to execute the program of Figure 20.

Figure 23 is a block diagram showing an alternate embodiment of a dynamically configurable NPU of Figure 1.

Figure 24 is a block diagram showing an example of the data structures used by the NNU of Figure 1 to perform a convolution operation.

Figure 25 is a flowchart showing the processor of Figure 1 executing an architectural program that uses the NNU to convolve the convolution kernel with the data array of Figure 24.

Figure 26A is a program listing of an NNU program that convolves a data matrix with the convolution kernel of Figure 24 and writes the result back to the weight random access memory (RAM).

Figure 26B is a block diagram showing an embodiment of certain fields of the control register of the NNU of Figure 1.

Figure 27 is a block diagram showing an example of the weight RAM of Figure 1 populated with input data on which the NNU of Figure 1 performs a pooling operation.

Figure 28 is a program listing of an NNU program that performs a pooling operation on the input data matrix of Figure 27 and writes the result back to the weight RAM.

Figure 29A is a block diagram showing an embodiment of the control register of Figure 1.

Figure 29B is a block diagram showing an alternate embodiment of the control register of Figure 1.

Figure 29C is a block diagram showing an embodiment in which the reciprocal of Figure 29A is stored in two parts.

Figure 30 is a block diagram showing an embodiment of an activation function unit (AFU) of Figure 2.

Figure 31 shows an example of the operation of the AFU of Figure 30.

Figure 32 shows a second example of the operation of the AFU of Figure 30.

Figure 33 shows a third example of the operation of the AFU of Figure 30.

Figure 34 is a block diagram showing the processor of Figure 1 and portions of the NNU in more detail.

Figure 35 is a block diagram showing a processor with a variable-rate NNU.

Figure 36A is a timing diagram showing an example of the operation of a processor having an NNU in normal mode, that is, at the primary clock rate.

Figure 36B is a timing diagram showing an example of the operation of a processor having an NNU in relaxed mode, in which the clock rate is lower than the primary clock rate.

Figure 37 is a flowchart showing the operation of the processor of Figure 35.

Figure 38 is a block diagram showing the sequencer of the NNU in more detail.

Figure 39 is a block diagram showing certain fields of the control and status register of the NNU.

Figure 40 is a block diagram showing an example of an Elman recurrent neural network (RNN).

Figure 41 is a block diagram showing an example of the layout of data in the data random access memory (RAM) and weight RAM of the NNU as it performs the calculations associated with the Elman RNN of Figure 40.

Figure 42 is a table showing a program stored in the program memory of the NNU and executed by the NNU, which uses the data and weights laid out according to Figure 41 to implement the Elman RNN.

Figure 43 is a block diagram showing an example of a Jordan RNN.

Figure 44 is a block diagram showing an example of the layout of data in the data RAM and weight RAM of the NNU as it performs the calculations associated with the Jordan RNN of Figure 43.

Figure 45 is a table showing a program stored in the program memory of the NNU and executed by the NNU, which uses the data and weights laid out according to Figure 44 to implement the Jordan RNN.

Figure 46 is a block diagram showing an embodiment of a long short term memory (LSTM) cell.

Figure 47 is a block diagram showing an example of the layout of data in the data RAM and weight RAM of the NNU as it performs the calculations associated with a layer of the LSTM cells of Figure 46.

Figure 48 is a table showing a program stored in the program memory of the NNU and executed by the NNU, which uses the data and weights laid out according to Figure 47 to perform the calculations associated with a layer of LSTM cells.

Figure 49 is a block diagram showing an embodiment of an NNU that has output buffer masking and feedback capability within NPU groups.

Figure 50 is a block diagram showing an example of the layout of data in the data RAM, weight RAM, and output buffer of the NNU of Figure 49 as it performs the calculations associated with a layer of the LSTM cells of Figure 46.

Figure 51 is a table showing a program stored in the program memory of the NNU and executed by the NNU of Figure 49, which uses the data and weights laid out according to Figure 50 to perform the calculations associated with a layer of LSTM cells.

Figure 52 is a block diagram showing an embodiment of an NNU that has output buffer masking and feedback capability within NPU groups and that shares activation function units.

Figure 53 is a block diagram showing an alternate example of the layout of data in the data RAM, weight RAM, and output buffer of the NNU of Figure 49 as it performs the calculations associated with a layer of the LSTM cells of Figure 46.

Figure 54 is a table showing a program stored in the program memory of the NNU and executed by the NNU of Figure 49, which uses the data and weights laid out according to Figure 53 to perform the calculations associated with a layer of LSTM cells.

Figure 55 is a block diagram showing a portion of an NPU according to an alternate embodiment of the invention.

Figure 56 is a block diagram showing an example of the layout of data in the data RAM and weight RAM of the NNU as it performs the calculations associated with the Jordan RNN of Figure 43 using the embodiment of Figure 55.

Figure 57 is a table showing a program stored in the program memory of the NNU and executed by the NNU, which uses the data and weights laid out according to Figure 56 to implement the Jordan RNN.
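The rotator (circular shifter) operation named in the description of the third figure can be pictured with a short sketch. It is illustrative only and rests on assumptions: the function names are hypothetical, the rotation direction (each register takes the word held by its lower-indexed neighbor) is assumed, and the multiply-accumulate loop is an assumed example of how a rotating row of data words might be combined with rows of weights.

```python
def rotate_once(mux_regs):
    """Rotate N words by one position, as N multiplexed registers would
    if each selected its neighbor's output instead of a memory operand.

    ASSUMPTION: register i receives the word previously held by register
    i-1, with register 0 wrapping around to the last register.
    """
    return [mux_regs[-1]] + mux_regs[:-1]


def rotating_mac(data_row, weight_rows):
    """Accumulate data*weight products while the data row rotates.

    Each step multiplies the word currently held by every register with
    that unit's weight, adds the product to the unit's accumulator, then
    rotates the data words by one position.
    """
    regs = list(data_row)
    acc = [0] * len(regs)
    for weights in weight_rows:
        acc = [a + d * w for a, d, w in zip(acc, regs, weights)]
        regs = rotate_once(regs)
    return acc


print(rotate_once([1, 2, 3, 4]))
print(rotating_mac([1, 2, 3], [[1, 1, 1]] * 3))
```

After N steps every unit has seen every data word exactly once, which is why a rotator lets N units each form a sum over a full row without N separate memory reads.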

Processor with architectural neural network unit

Figure 1 is a block diagram of a processor 100 that includes a neural network unit (NNU) 121. As shown, the processor 100 includes an instruction fetch unit 101, an instruction cache 102, an instruction translator 104, a rename unit 106, a plurality of reservation stations 108, a plurality of media registers 118, a plurality of general purpose registers 116, a plurality of execution units 112 other than the NNU 121, and a memory subsystem 114.

The processor 100 is an electronic device that functions as a central processing unit (CPU) on an integrated circuit. The processor 100 receives digital data as input, processes the data according to instructions fetched from a memory, and generates as output the results of the operations prescribed by the instructions. The processor 100 may be employed in a desktop computer, mobile device, or tablet computer, and for uses such as computation, text editing, multimedia display, and Internet browsing. The processor 100 may also be disposed in an embedded system to control a wide variety of devices, including appliances, mobile telephones, smart phones, automobiles, and industrial controllers. A CPU is the electronic circuitry (that is, hardware) that executes the instructions of a computer program (also known as a computer application or application) by performing operations on data, including arithmetic, logical, and input/output operations. An integrated circuit (IC) is a set of electronic circuits fabricated on a small piece of semiconductor material, typically silicon. An IC is also referred to as a chip, a microchip, or a die.

The instruction fetch unit 101 controls the fetching of architectural instructions 103 from system memory (not shown) into the instruction cache 102. The instruction fetch unit 101 provides a fetch address to the instruction cache 102 that specifies the memory address of a cache line of architectural instruction bytes that the processor 100 fetches into the instruction cache 102. The fetch address is based on the current value of the instruction pointer (not shown), or program counter, of the processor 100. Normally, the program counter is incremented sequentially by the size of each instruction, unless a control instruction such as a branch, call, or return appears in the instruction stream, or an exception condition such as an interrupt, trap, exception, or fault occurs, in which case the program counter is updated with a non-sequential address such as a branch target address, return address, or exception vector. Generally speaking, the program counter is updated in response to the execution of instructions by the execution units 112/121. The program counter may also be updated when an exception condition is detected, for example when the instruction translator 104 encounters an instruction 103 that is not defined by the instruction set architecture of the processor 100.

The instruction cache 102 caches the architectural instructions 103 fetched from a system memory coupled to the processor 100. The architectural instructions 103 include a move to neural network (MTNN) instruction and a move from neural network (MFNN) instruction, described in more detail below. In one embodiment, the architectural instructions 103 are instructions of the x86 instruction set architecture (ISA), with the addition of the MTNN and MFNN instructions. In the context of the present disclosure, an x86 ISA processor is a processor that generates the same results at the instruction set architecture level as an Intel® 80386® processor when it executes the same machine language instructions. However, other embodiments contemplate other instruction set architectures, such as Advanced RISC Machines (ARM), Sun SPARC, or PowerPC. The instruction cache 102 provides the architectural instructions 103 to the instruction translator 104, which translates them into microinstructions 105.

The microinstructions 105 are provided to the rename unit 106 and are eventually executed by the execution units 112/121. The microinstructions 105 implement the architectural instructions. Preferably, the instruction translator 104 includes a first portion that translates frequently executed and/or relatively less complex architectural instructions 103 into microinstructions 105. The instruction translator 104 also includes a second portion that has a microcode unit (not shown). The microcode unit includes a microcode memory that holds microcode instructions that implement the complex and/or infrequently used instructions of the architectural instruction set. The microcode unit also includes a microsequencer that provides a non-architectural micro-program counter (micro-PC) to the microcode memory. Preferably, the microcode instructions are translated by a microtranslator (not shown) into the microinstructions 105. A selector selects the microinstructions 105 from either the first portion or the second portion for provision to the rename unit 106, depending on whether the microcode unit currently has control.

The rename unit 106 renames the architectural registers specified in the architectural instructions 103 to physical registers of the processor 100. Preferably, the processor 100 includes a reorder buffer (not shown). The rename unit 106 allocates a reorder buffer entry to each microinstruction 105 in program order. This enables the processor 100 to retire the microinstructions 105, and their corresponding architectural instructions 103, in program order. In one embodiment, the media registers 118 are 256 bits wide and the general-purpose registers 116 are 64 bits wide. In one embodiment, the media registers 118 are x86 media registers, such as Advanced Vector Extensions (AVX) registers.

In one embodiment, each entry of the reorder buffer includes storage for the result of the microinstruction 105. In addition, the processor 100 includes an architectural register file that has a physical register for each of the architectural registers, such as the media registers 118, the general-purpose registers 116, and the other architectural registers. (Preferably, separate register files are used for the media registers 118 and the general-purpose registers 116 because, for example, they differ in size.) For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with the reorder buffer index of the newest older microinstruction 105 that writes to that architectural register. When an execution unit 112/121 completes execution of the microinstruction 105, it writes the result to the reorder buffer entry of the microinstruction 105. When the microinstruction 105 retires, a retire unit (not shown) writes the result from the reorder buffer entry of the microinstruction to the register of the physical register file associated with the architectural destination register specified by the retiring microinstruction 105.

In another embodiment, the processor 100 includes a physical register file that has more physical registers than the number of architectural registers, but the processor 100 does not include an architectural register file, and the reorder buffer entries do not include result storage. (Preferably, separate physical register files are used for the media registers 118 and the general-purpose registers 116 because they differ in size.) The processor 100 also includes a pointer table with an associated pointer for each architectural register. For the destination operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the destination operand field of the microinstruction 105 with a pointer to a free register in the physical register file. If no register is free in the physical register file, the rename unit 106 stalls the pipeline. For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with a pointer to the register in the physical register file that was assigned to the newest older microinstruction 105 that writes to that architectural register. When an execution unit 112/121 completes execution of the microinstruction 105, it writes the result to the register of the physical register file pointed to by the destination operand field of the microinstruction 105. When the microinstruction 105 retires, the retire unit copies the destination operand field value of the microinstruction 105 to the pointer in the pointer table associated with the architectural destination register specified by the retiring microinstruction 105.
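The pointer-table renaming scheme of this second embodiment can be sketched in a few lines of Python. This is only an illustrative model, not the processor's implementation: the class name `PointerTableRenamer`, its method names, the dictionary-based tables, and in particular the policy of returning the previously mapped physical register to the free list at retirement are assumptions that the text does not state.

```python
class PointerTableRenamer:
    """Toy model of renaming with a physical register file and a pointer table."""

    def __init__(self, arch_regs, num_phys_regs):
        if num_phys_regs <= len(arch_regs):
            raise ValueError("need more physical than architectural registers")
        # Pointer table: architectural register -> physical register holding
        # its retired (committed) value.
        self.pointer_table = {reg: i for i, reg in enumerate(arch_regs)}
        # Physical register assigned to the newest writer of each
        # architectural register (hypothetical bookkeeping for this sketch).
        self.newest_writer = dict(self.pointer_table)
        self.free_list = list(range(len(arch_regs), num_phys_regs))

    def rename(self, dest, sources):
        """Return (dest_ptr, source_ptrs), or None to signal a pipeline stall."""
        if not self.free_list:
            return None
        # Sources read the register assigned to the newest older writer.
        source_ptrs = [self.newest_writer[s] for s in sources]
        dest_ptr = self.free_list.pop(0)
        self.newest_writer[dest] = dest_ptr
        return dest_ptr, source_ptrs

    def retire(self, dest, dest_ptr):
        """Copy the destination pointer into the pointer table at retirement."""
        previous = self.pointer_table[dest]
        self.pointer_table[dest] = dest_ptr
        self.free_list.append(previous)  # assumed reclamation policy
```

In this model a `None` return from `rename` corresponds to the rename unit stalling the pipeline because no physical register is free.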

The reservation stations 108 hold the microinstructions 105 until they are ready to be issued to an execution unit 112/121 for execution. A microinstruction 105 is ready to be issued when all of its source operands are available and an execution unit 112/121 is available to execute it. The execution units 112/121 receive register source operands from the reorder buffer or the architectural register file in the first embodiment described above, or from the physical register file in the second embodiment described above. Additionally, the execution units 112/121 may receive register source operands directly via result forwarding buses (not shown). Additionally, the execution units 112/121 may receive from the reservation stations 108 immediate operands specified by the microinstructions 105. The MTNN and MFNN architectural instructions 103 include an immediate operand that specifies a function to be performed by the NNU 121, which is conveyed by one or more of the microinstructions 105 into which the MTNN and MFNN architectural instructions 103 are translated, as described in more detail below.

The execution units 112 include one or more load/store units (not shown) that load data from the memory subsystem 114 and store data to the memory subsystem 114. Preferably, the memory subsystem 114 includes a memory management unit (not shown), which may include, for example, translation lookaside buffers, a tablewalk unit, a level-1 data cache (and the instruction cache 102), a level-2 unified cache, and a bus interface unit that interfaces the processor 100 to system memory. In one embodiment, the processor 100 of Figure 1 is representative of one of multiple processing cores of a multi-core processor that share a last-level cache memory. The execution units 112 may also include integer units, media units, floating-point units, and a branch unit.

The NNU 121 includes a weight random access memory (RAM) 124, a data RAM 122, N neural processing units (NPUs) 126, a program memory 129, a sequencer 128, and control and status registers 127. The NPUs 126 function conceptually as neurons in a neural network. The weight RAM 124, the data RAM 122, and the program memory 129 are all writable and readable by the MTNN and MFNN architectural instructions 103, respectively. The weight RAM 124 is arranged as W rows of N weight words, and the data RAM 122 is arranged as D rows of N data words. Each data word and each weight word is a plurality of bits, preferably 8, 9, 12, or 16 bits. Each data word functions as the output value (sometimes referred to as an activation) of a neuron of the previous layer in the network, and each weight word functions as the weight associated with a connection coming into a neuron of the current layer of the network. Although in many uses of the NNU 121 the words, or operands, held in the weight RAM 124 are in fact weights associated with a connection coming into a neuron, it should be understood that in other uses of the NNU 121 the words held in the weight RAM 124 are not weights, but are nevertheless referred to as "weight words" because they are stored in the weight RAM 124. For example, in some uses of the NNU 121, such as the convolution example of Figures 24 through 26A or the pooling example of Figures 27 through 28, the weight RAM 124 may hold non-weights, such as elements of a data matrix (for example, image pixel data). Similarly, although in many uses of the NNU 121 the words, or operands, held in the data RAM 122 are in fact the output values, or activations, of neurons, it should be understood that in other uses of the NNU 121 the words held in the data RAM 122 are not such, but are nevertheless referred to as "data words" because they are stored in the data RAM 122. For example, in some uses of the NNU 121, such as the convolution example of Figures 24 through 26A, the data RAM 122 may hold non-neuron outputs, such as elements of a convolution kernel.

In one embodiment, the NPUs 126 and the sequencer 128 comprise combinatorial logic, sequential logic, state machines, or a combination thereof. An architectural instruction (for example, the MFNN instruction 1500) loads the contents of the status register 127 into one of the general-purpose registers 116 to determine the status of the NNU 121, for example that the NNU 121 has completed a command or a program it was running from the program memory 129, or that the NNU 121 is free to receive a new command or start a new NNU program.

The number of NPUs 126 may be increased as needed, and the width and depth of the weight RAM 124 and the data RAM 122 may be scaled accordingly. Preferably, the weight RAM 124 is larger than the data RAM 122, since a typical neural network layer has many connections, and therefore many weights, associated with each neuron, which requires more storage. Various embodiments are disclosed herein regarding the size of the data and weight words, the sizes of the weight RAM 124 and the data RAM 122, and the number of NPUs 126. In one embodiment, the NNU 121 has a 64 KB (8192 bits x 64 rows) data RAM 122, a 2 MB (8192 bits x 2048 rows) weight RAM 124, and 512 NPUs 126. This NNU 121 is implemented in a Taiwan Semiconductor Manufacturing Company (TSMC) 16 nm process and occupies an area of approximately 3.3 mm².
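The capacities quoted for this embodiment follow directly from the row width and the row counts. The short Python check below works through that arithmetic; the function and constant names are illustrative, and the 16-bit word size used for the last line is an assumption taken from one of the word sizes mentioned in the text.

```python
ROW_BITS = 8192  # bits per RAM row in the described embodiment

def ram_size_bytes(rows, row_bits=ROW_BITS):
    """Capacity in bytes of a RAM organized as `rows` rows of `row_bits` bits."""
    return rows * row_bits // 8

data_ram_bytes = ram_size_bytes(64)      # 65,536 bytes = 64 KB
weight_ram_bytes = ram_size_bytes(2048)  # 2,097,152 bytes = 2 MB
words_per_row = ROW_BITS // 16           # 512 16-bit words, one per NPU
```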

The sequencer 128 fetches instructions from the program memory 129 and executes them, which includes generating address and control signals for provision to the data RAM 122, the weight RAM 124, and the NPUs 126. The sequencer 128 generates a memory address 123 and a read command for provision to the data RAM 122 to select one of the D rows of N data words for provision to the N NPUs 126. The sequencer 128 also generates a memory address 125 and a read command for provision to the weight RAM 124 to select one of the W rows of N weight words for provision to the N NPUs 126. The sequence of the addresses 123 and 125 that the sequencer 128 generates determines the "connections" between neurons. The sequencer 128 also generates a memory address 123 and a write command for provision to the data RAM 122 to select one of the D rows of N data words to be written by the N NPUs 126. The sequencer 128 also generates a memory address 125 and a write command for provision to the weight RAM 124 to select one of the W rows of N weight words to be written by the N NPUs 126. The sequencer 128 also generates a memory address 131 to the program memory 129 to select an NNU instruction that is provided to the sequencer 128, as described below. The memory address 131 corresponds to a program counter (not shown), which the sequencer 128 generally increments through sequential locations of the program memory 129 unless it encounters a control instruction, such as a loop instruction (see, for example, Figure 26A), in which case the sequencer 128 updates the program counter to the target address of the control instruction. The sequencer 128 also generates control signals to the NPUs 126 that instruct them to perform various operations or functions, such as initialization, arithmetic/logical operations, rotate/shift operations, activation functions, and write-back operations, examples of which are described in more detail below (see, for example, the micro-operations 3418 of Figure 34).

The N NPUs 126 generate N result words 133 that may be written back to a row of the weight RAM 124 or of the data RAM 122. Preferably, the weight RAM 124 and the data RAM 122 are directly coupled to the N NPUs 126. More specifically, the weight RAM 124 and the data RAM 122 are dedicated to the NPUs 126 and are not shared with the other execution units 112 of the processor 100, so that the NPUs 126 can consume a row from one or both of the weight RAM 124 and the data RAM 122 on every clock cycle in a sustained manner, preferably in a pipelined fashion. In one embodiment, each of the data RAM 122 and the weight RAM 124 can provide 8192 bits to the NPUs 126 each clock cycle. The 8192 bits may be consumed as either 512 16-bit words or 1024 8-bit words, as described in more detail below.
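As a rough illustration of the two ways a row can be consumed, the Python sketch below splits one 8192-bit row into either 512 16-bit words or 1024 8-bit words. The function name `split_row` and the little-endian byte order are assumptions made only for this example; the text does not specify how the bits of a word are ordered within a row.

```python
ROW_BITS = 8192

def split_row(row, word_bits):
    """Split one RAM row (a bytes object) into unsigned words of `word_bits` bits."""
    if len(row) * 8 != ROW_BITS:
        raise ValueError("a row must hold exactly 8192 bits")
    if word_bits not in (8, 16):
        raise ValueError("this sketch handles only 8-bit and 16-bit words")
    step = word_bits // 8
    # Byte order within a word is an assumption of this sketch.
    return [int.from_bytes(row[i:i + step], "little")
            for i in range(0, len(row), step)]

row = bytes(range(256)) * 4        # 1024 bytes = 8192 bits of sample data
wide_words = split_row(row, 16)    # 512 words
narrow_words = split_row(row, 8)   # 1024 words
```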

The size of the data set that the NNU 121 can process is not limited by the sizes of the weight RAM 124 and the data RAM 122, but only by the size of system memory, because data and weights can be moved between system memory and the weight RAM 124 and data RAM 122 using the MTNN and MFNN instructions (for example, through the media registers 118). In one embodiment, the data RAM 122 is dual-ported so that data words can be written to the data RAM 122 while data words are concurrently being read from or written to the data RAM 122. Furthermore, the large memory hierarchy of the memory subsystem 114, including the cache memories, provides very high data bandwidth for transfers between system memory and the NNU 121. Still further, the memory subsystem 114 preferably includes hardware data prefetchers that track memory access patterns, such as loads of neural data and weights from system memory, and perform data prefetches into the cache hierarchy to facilitate high-bandwidth, low-latency transfers to the weight RAM 124 and the data RAM 122.

Although in the embodiments described herein one of the operands provided to each NPU 126 comes from a weight memory and is denoted a weight, a term commonly used in neural networks, it should be understood that the operands may be other types of data associated with computations whose speed may be improved by the apparatuses described.

Figure 2 is a block diagram showing an NPU 126 of Figure 1. As shown, the NPU 126 operates to perform many functions, or operations. In particular, the NPU 126 can operate as a neuron, or node, in an artificial neural network to perform a classic multiply-accumulate function, or operation. That is, generally speaking, the NPU 126 (neuron) is configured to: (1) receive an input value from each neuron that has a connection to it, typically but not necessarily from the immediately previous layer of the artificial neural network; (2) multiply each input value by a corresponding weight value associated with the connection to generate a product; (3) add all the products to generate a sum; and (4) perform an activation function on the sum to generate the output of the neuron. However, rather than performing all the multiplies associated with all the connection inputs and then adding all the products together as in a conventional approach, each neuron described herein performs, in a given clock cycle, the weight multiply associated with one of the connection inputs and then adds (accumulates) the product to the accumulated value of the products associated with the connection inputs processed in previous clock cycles up to that point. Assuming there are M connections to the neuron, after all M products have been accumulated (which takes approximately M clock cycles), the neuron performs the activation function on the accumulated value to generate the output, or result. This has the advantage of requiring fewer multipliers and a smaller, simpler, and faster adder circuit (for example, a two-input adder) in the neuron than an adder that would be required to add all, or even a subset of, the products associated with all the connection inputs. This, in turn, facilitates a very large number (N) of neurons (NPUs 126) in the NNU 121, so that after approximately M clock cycles the NNU 121 has generated the outputs of all N neurons. Finally, an NNU 121 constructed of such neurons performs efficiently as an artificial neural network layer for a wide range of numbers of connection inputs. That is, as M increases or decreases from layer to layer, the number of clock cycles required to generate the neuron outputs increases or decreases correspondingly, and the resources (for example, the multipliers and accumulators) are fully utilized. In a conventional design, by contrast, some of the multipliers and part of the adder go unused for smaller values of M. Thus, the embodiments described herein offer both flexibility and efficiency with respect to the number of connection inputs to the neurons of the NNU, and provide extremely high performance.
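The one-product-per-clock scheme described above can be modelled as a simple loop. The Python sketch below is only an illustration: the names `neuron_output` and `relu`, the use of plain Python numbers, and the choice of ReLU as the activation function are assumptions and are not taken from the text.

```python
def relu(x):
    """Example activation function (an assumption; the text names none here)."""
    return x if x > 0 else 0

def neuron_output(inputs, weights, activation=relu):
    """Accumulate one weight*input product per clock, then apply the activation.

    Returns (output, clocks), where clocks counts the multiply-accumulate steps.
    """
    if len(inputs) != len(weights):
        raise ValueError("one weight is needed per connection input")
    accumulator = 0
    clocks = 0
    for x, w in zip(inputs, weights):
        accumulator += x * w   # a single multiplier and a two-input adder
        clocks += 1
    return activation(accumulator), clocks
```

With M = 3 connection inputs, `neuron_output([1, 2, 3], [4, 5, 6])` returns `(32, 3)`: three accumulate steps, after which the activation is applied once to the accumulated value.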

The NPU 126 includes a register 205, a two-input multiplexing register 208, an arithmetic logic unit (ALU) 204, an accumulator 202, and an activation function unit (AFU) 212. The register 205 receives a weight word 206 from the weight RAM 124 and provides it on its output 203 on a subsequent clock cycle. The multiplexing register 208 selects one of its two inputs 207 and 211 to store in its register and then provide on its output 209 on a subsequent clock cycle. The input 207 receives a data word from the data RAM 122. The other input 211 receives the output 209 of the adjacent NPU 126. The NPU 126 shown in Figure 2 is denoted NPU J among the N NPUs 126 of Figure 1. That is, NPU J is a representative instance of the N NPUs 126. Preferably, the input 211 of the multiplexing register 208 of NPU J receives the output 209 of the multiplexing register 208 of NPU J-1, and the output 209 of the multiplexing register 208 of NPU J is provided to the input 211 of the multiplexing register 208 of NPU J+1. In this manner, the multiplexing registers 208 of the N NPUs 126 collectively operate as an N-word rotator, or circular shifter, as described in more detail below with respect to Figure 3. A control input 213 controls which of the two inputs the multiplexing register 208 selects to store in its register and subsequently provide on the output 209.

The arithmetic logic unit 204 has three inputs. One input receives the weight word 203 from the register 205. Another input receives the output 209 of the multiplexing register 208. The third input receives the output 217 of the accumulator 202. The arithmetic logic unit 204 performs arithmetic and/or logical operations on its inputs to generate a result provided on its output. In a preferred embodiment, the arithmetic and/or logical operations performed by the arithmetic logic unit 204 are specified by instructions stored in the program memory 129. For example, the multiply-accumulate instruction of FIG. 4 specifies a multiply-accumulate operation, that is, the result 215 is the sum of the accumulator 202 value 217 and the product of the weight word 203 and the data word on the multiplexing register 208 output 209. Other operations may also be specified, including but not limited to: the result 215 is the passed-through value of the multiplexing register output 209; the result 215 is the passed-through value of the weight word 203; the result 215 is zero; the result 215 is the sum of the accumulator 202 value 217 and the weight word 203; the result 215 is the sum of the accumulator 202 value 217 and the multiplexing register output 209; the result 215 is the maximum of the accumulator 202 value 217 and the weight word 203; the result 215 is the maximum of the accumulator 202 value 217 and the multiplexing register output 209.
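The operations listed above can be summarized as a small dispatch table. The sketch below is a behavioral model only; the operation names (`mult_accum`, `pass_data`, and so on) and the `alu` function are hypothetical labels chosen for illustration and are not instruction mnemonics from the specification.

```python
# Each entry maps (accumulator value 217, weight word 203, data word 209)
# to the result 215 for one of the operations named in the text.
ALU_OPS = {
    "mult_accum":  lambda acc, weight, data: acc + weight * data,
    "pass_data":   lambda acc, weight, data: data,
    "pass_weight": lambda acc, weight, data: weight,
    "zero":        lambda acc, weight, data: 0,
    "add_weight":  lambda acc, weight, data: acc + weight,
    "add_data":    lambda acc, weight, data: acc + data,
    "max_weight":  lambda acc, weight, data: max(acc, weight),
    "max_data":    lambda acc, weight, data: max(acc, data),
}


def alu(op, acc, weight, data):
    """Return result 215 for the selected operation."""
    return ALU_OPS[op](acc, weight, data)


print(alu("mult_accum", acc=10, weight=3, data=4))  # 22
print(alu("max_data", acc=5, weight=0, data=9))     # 9
```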

The arithmetic logic unit 204 provides its output 215 to the accumulator 202 for storage. The arithmetic logic unit 204 includes a multiplier 242 that multiplies the weight word 203 by the data word on the multiplexing register 208 output 209 to generate a product 246. In one embodiment, the multiplier 242 multiplies two 16-bit operands to generate a 32-bit result. The arithmetic logic unit 204 also includes an adder 244 that adds the product 246 to the accumulator 202 output 217 to generate a sum, which is the result 215 accumulated in the accumulator 202 for storage. In one embodiment, the adder 244 adds the 32-bit result of the multiplier 242 to a 41-bit value 217 of the accumulator 202 to generate a 41-bit result. In this way, by using the rotator behavior of the multiplexing register 208 over the course of multiple clock cycles, the neural processing unit 126 accomplishes the sum of products that a neuron requires in a neural network. The arithmetic logic unit 204 may also include other circuit elements to perform the other arithmetic/logical operations described above. In one embodiment, a second adder subtracts the weight word 203 from the data word on the multiplexing register 208 output 209 to generate a difference, which the adder 244 then adds to the accumulator 202 output 217 to generate a result 215, which is the result accumulated in the accumulator 202. In this way, over the course of multiple clock cycles, the neural processing unit 126 accomplishes a sum of differences. In a preferred embodiment, although the weight word 203 and the data word 209 are the same size (in bits), they may have different binary point locations, as described in more detail below. In a preferred embodiment, the multiplier 242 and the adder 244 are an integer multiplier and an integer adder, which makes the arithmetic logic unit 204 less complex, smaller, faster, and lower in power consumption than an arithmetic logic unit that uses floating-point arithmetic. In other embodiments of the present invention, however, the arithmetic logic unit 204 may also perform floating-point operations.
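The 16-bit, 32-bit, and 41-bit widths mentioned above can be checked with a short worked calculation. The sketch assumes signed two's-complement operands and a worst case of 512 accumulated products (the figure used later in the description); the specification does not give this derivation, so it is offered only as one way to see why 32 + 9 = 41 bits is sufficient. The helper names `signed_range` and `fits` are hypothetical.

```python
def signed_range(bits):
    """Return (min, max) of a two's-complement integer of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def fits(value, bits):
    lo, hi = signed_range(bits)
    return lo <= value <= hi


lo16, hi16 = signed_range(16)
# Extreme products of two signed 16-bit operands.
products = [a * b for a in (lo16, hi16) for b in (lo16, hi16)]
largest, smallest = max(products), min(products)

# Worst-case sums if 512 such products are accumulated.
worst_high = 512 * largest
worst_low = 512 * smallest

product_fits_32 = fits(largest, 32) and fits(smallest, 32)
sum_fits_41 = fits(worst_high, 41) and fits(worst_low, 41)
sum_fits_40 = fits(worst_high, 40) and fits(worst_low, 40)

print(product_fits_32, sum_fits_41, sum_fits_40)  # True True False
```

Under these assumptions the largest product is 2^30, so 512 of them sum to 2^39, which needs a 41-bit signed accumulator and does not fit in 40 bits.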

Although FIG. 2 shows only a multiplier 242 and an adder 244 in the arithmetic logic unit 204, in a preferred embodiment the arithmetic logic unit 204 includes other elements to perform the other operations described above. For example, the arithmetic logic unit 204 may include a comparator (not shown) that compares the accumulator 202 with a data/weight word, and a multiplexer (not shown) that selects the larger (maximum) of the two values indicated by the comparator for storage in the accumulator 202. As another example, the arithmetic logic unit 204 may include selection logic (not shown) that bypasses the multiplier 242 with a data/weight word, so that the adder 244 adds the data/weight word to the accumulator 202 value 217 to generate a sum for storage in the accumulator 202. These additional operations are described in more detail below, for example with respect to FIGS. 18 through 29A, and are useful for performing operations such as convolution and pooling.

The activation function unit 212 receives the output 217 of the accumulator 202. The activation function unit 212 performs an activation function on the accumulator 202 output to generate the result 133 of FIG. 1. Generally speaking, the activation function in a neuron of an intermediate layer of an artificial neural network serves to normalize the accumulated sum of products, in particular in a non-linear fashion. To "normalize" the accumulated sum, the activation function of the current neuron produces a result value within the range of values that the other neurons connected to the current neuron expect to receive as input. (The normalized result is sometimes called an "activation." As used herein, an activation is the output of the current node, which a receiving node multiplies by a weight associated with the connection between the outputting node and the receiving node to generate a product that is accumulated with the products associated with the other input connections of the receiving node.) For example, where the receiving/connected neurons expect to receive as input a value between 0 and 1, the outputting neuron may need to non-linearly squash and/or adjust (for example, shift upward to transform negative values into positive values) an accumulated sum that lies outside the range of 0 to 1 so that it falls within the expected range. Thus, the operation that the activation function unit 212 performs on the accumulator 202 value 217 brings the result 133 within a known range. The results 133 of all N neural processing units 126 may be written back concurrently to either the data random access memory 122 or the weight random access memory 124. In a preferred embodiment, the activation function unit 212 is configured to perform multiple activation functions, and an input, for example from the control register 127, selects one of the activation functions to perform on the accumulator 202 output 217. The activation functions may include, but are not limited to, a step function, a rectify function, a sigmoid function, a hyperbolic tangent function, and a softplus function (also called a smooth rectify function). The softplus function is the analytic function f(x) = ln(1 + e^x), that is, the natural logarithm of the sum of 1 and e^x, where "e" is Euler's number and x is the input 217 to the function. In a preferred embodiment, the activation functions may also include a pass-through function that passes through the accumulator 202 value 217, or a portion of it, as described in more detail below. In one embodiment, the circuitry of the activation function unit 212 performs the activation function in a single clock cycle. In one embodiment, the activation function unit 212 includes tables that receive the accumulated value and output a value that, for some of the activation functions (for example, sigmoid, hyperbolic tangent, and softplus), closely approximates the value that the true activation function would provide.
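For reference, the named activation functions can be written out in floating-point Python as follows. This is a mathematical sketch, not a model of the activation function unit 212, which per the text may use table-based approximations. The step convention (1 for non-negative input), the dictionary `ACTIVATIONS`, and the function `apply_activation` are assumptions introduced here.

```python
import math


def step(x):
    # Assumed convention: 1 for non-negative input, otherwise 0.
    return 1.0 if x >= 0 else 0.0


def rectify(x):
    return max(x, 0.0)


def sigmoid(x):
    # Branch on the sign so that math.exp never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x):
    # Equal to ln(1 + e^x), rearranged to stay finite for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


ACTIVATIONS = {
    "step": step,
    "rectify": rectify,
    "sigmoid": sigmoid,
    "tanh": math.tanh,
    "softplus": softplus,
    "pass_through": lambda x: x,
}


def apply_activation(name, accumulator_value):
    """Select one activation function by name and apply it."""
    return ACTIVATIONS[name](accumulator_value)


print(apply_activation("sigmoid", 0.0))  # 0.5
```

The sigmoid output lies between 0 and 1, which matches the example in the text of a receiving neuron that expects inputs in that range.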

In a preferred embodiment, the width (in bits) of the accumulator 202 is greater than the width of the output 133 of the activation function unit 212. For example, in one embodiment the accumulator is 41 bits wide, to avoid loss of precision when accumulating up to 512 32-bit products (as described in more detail below, for example with respect to FIG. 30), and the result 133 is 16 bits wide. In one embodiment, described in more detail below with respect to FIG. 8, the activation function unit 212 passes through different portions of the unprocessed ("raw") accumulator 202 output 217 during successive clock cycles, and these portions are written back to the data random access memory 122 or the weight random access memory 124. This allows the raw accumulator 202 values to be loaded back into the media registers 118 via the MFNN instruction, so that instructions executing on the other execution units 112 of the processor 100 can perform complex activation functions that the activation function unit 212 cannot perform, such as the well-known softmax function, also called the normalized exponential function. In one embodiment, the instruction set architecture of the processor 100 includes an instruction that performs the exponential function, commonly written e^x or exp(x), which the other execution units 112 of the processor 100 may use to speed up the softmax activation function.
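As an illustration of the kind of post-processing that the other execution units might perform on raw accumulator values, here is a minimal softmax in Python. It only shows the mathematics of the normalized exponential function; the function name `softmax` and the sample values are hypothetical, and neither the MFNN instruction nor the exponential instruction of the processor is modeled.

```python
import math


def softmax(values):
    """Normalized exponential of a list of raw accumulator values."""
    peak = max(values)
    # Subtracting the maximum leaves the result unchanged and avoids overflow.
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.0])
print(probs)
```

Each call to `math.exp` corresponds to one use of the e^x operation that the text says the instruction set may provide.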

In one embodiment, the neural processing unit 126 is pipelined. For example, the neural processing unit 126 may include registers within the arithmetic logic unit 204, such as a register between the multiplier and the adder and/or other circuits of the arithmetic logic unit 204, and may also include a register that holds the output of the activation function unit 212. Other embodiments of the neural processing unit 126 are described below.

FIG. 3 is a block diagram showing how the N multiplexing registers 208 of the N neural processing units 126 of the neural network unit 121 of FIG. 1 operate as an N-word rotator, or circular shifter, on a row of data words 207 received from the data random access memory 122 of FIG. 1. In the embodiment of FIG. 3, N is 512, so the neural network unit 121 has 512 multiplexing registers 208, denoted 0 through 511, corresponding to the 512 neural processing units 126. Each multiplexing register 208 receives its corresponding data word 207 from one of the D rows of the data random access memory 122. That is, multiplexing register 0 receives data word 0 of the data random access memory 122 row, multiplexing register 1 receives data word 1, multiplexing register 2 receives data word 2, and so on, up to multiplexing register 511, which receives data word 511. In addition, multiplexing register 1 receives the output 209 of multiplexing register 0 on its other input 211, multiplexing register 2 receives the output 209 of multiplexing register 1 on its other input 211, multiplexing register 3 receives the output 209 of multiplexing register 2 on its other input 211, and so on, up to multiplexing register 511, which receives the output 209 of multiplexing register 510 on its other input 211; multiplexing register 0 receives the output 209 of multiplexing register 511 on its other input 211. Each multiplexing register 208 receives a control input 213 that controls whether it selects the data word 207 or the rotated input 211. In one mode of operation, during a first clock cycle the control input 213 causes each multiplexing register 208 to select the data word 207 for storage in the register and subsequent provision to the arithmetic logic unit 204, and during subsequent clock cycles (for example, the M-1 clock cycles described above) the control input 213 causes each multiplexing register 208 to select the rotated input 211 for storage in the register and subsequent provision to the arithmetic logic unit 204.
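The rotator behavior can be modeled with a list, where index j stands for multiplexing register j. The sketch below is a behavioral illustration under the N = 512 embodiment; the names `rotate_right` and `seen_by_npu0` are hypothetical, and the data words are simply numbered 0 through 511 so that their movement is easy to follow.

```python
N = 512


def rotate_right(regs):
    """One clock with control input 213 selecting input 211:
    register j takes the value previously held by register j-1."""
    n = len(regs)
    return [regs[(j - 1) % n] for j in range(n)]


data_row = list(range(N))      # data word i carries the value i
regs = list(data_row)          # first clock: every register selects input 207
seen_by_npu0 = [regs[0]]
for _ in range(N - 1):         # following clocks: select the rotated input 211
    regs = rotate_right(regs)
    seen_by_npu0.append(regs[0])

print(seen_by_npu0[:3], seen_by_npu0[-1])  # [0, 511, 510] 1
```

After 511 rotations, register 0 has held every one of the 512 data words exactly once, which is what allows each neural processing unit to see the whole row.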

Although FIG. 3 (and FIGS. 7 and 19 below) describe embodiments in which the neural processing units 126 rotate the values of the multiplexing registers 208/705 to the right, that is, from neural processing unit J toward neural processing unit J+1, the present invention is not limited thereto. In other embodiments (for example, the embodiment of FIGS. 24 through 26), the neural processing units 126 rotate the values of the multiplexing registers 208/705 to the left, that is, from neural processing unit J toward neural processing unit J-1. Furthermore, in other embodiments of the present invention, the neural processing units 126 may selectively rotate the values of the multiplexing registers 208/705 to the left or to the right, for example as specified by the neural network unit instruction.

FIG. 4 is a table showing a program stored in the program memory 129 of the neural network unit 121 of FIG. 1 and executed by that neural network unit 121. As described above, the example program performs the calculations associated with one layer of an artificial neural network. The table of FIG. 4 has five rows (addresses 0 through 4) and three columns. Each row corresponds to an address of the program memory 129, shown in the first column. The second column specifies the instruction, and the third column indicates the number of clock cycles associated with the instruction. In a preferred embodiment, the number of clock cycles indicates the effective number of clocks in a clocks-per-instruction sense in a pipelined embodiment, rather than the latency of the instruction. As shown, because of the pipelined nature of the neural network unit 121, each instruction has one associated clock cycle, with the exception of the instruction at address 2, which effectively repeats itself 511 times and therefore requires 511 clock cycles, as described in more detail below.

All of the neural processing units 126 process each instruction of the program in parallel. That is, all N neural processing units 126 execute the instruction in the first row during the same clock cycle, all N neural processing units 126 execute the instruction in the second row during the same clock cycle, and so on. The present invention is not limited thereto, however. In other embodiments described below, some instructions are executed in a partially parallel and partially sequential fashion; for example, in an embodiment in which multiple neural processing units 126 share an activation function unit, such as the embodiment of FIG. 11, the activation function and output instructions at addresses 3 and 4 are executed in this fashion. The example of FIG. 4 assumes a layer of 512 neurons (neural processing units 126), each with 512 connection inputs from the 512 neurons of the previous layer, for a total of 256K connections. Each neuron receives a 16-bit data value from each connection input and multiplies the 16-bit data value by an appropriate 16-bit weight value.

The first row, at address 0 (although other addresses may be specified), specifies an initialize neural processing unit instruction. The initialize instruction clears the accumulator 202 value to zero. In one embodiment, the initialize instruction can also load the accumulator 202 with the corresponding word of a row of the data random access memory 122 or the weight random access memory 124 specified by the instruction. The initialize instruction also loads configuration values into the control register 127, as described in more detail below with respect to FIGS. 29A and 29B. For example, the widths of the data word 207 and the weight word 206 may be loaded; the arithmetic logic unit 204 uses them to determine the sizes of the operations performed by its circuits, and they may also affect the result 215 stored in the accumulator 202. In one embodiment, the neural processing unit 126 includes a circuit that saturates the output 215 of the arithmetic logic unit 204 before it is stored in the accumulator 202, and the initialize instruction loads a configuration value into this circuit that affects the saturation. In one embodiment, the accumulator 202 may also be cleared to zero by so specifying in an arithmetic logic unit function instruction (such as the multiply-accumulate instruction at address 1) or in an output instruction (such as the write activation function unit output instruction at address 4).

The second row, at address 1, specifies a multiply-accumulate instruction that instructs the 512 neural processing units 126 to load a respective data word from a row of the data random access memory 122 and a respective weight word from a row of the weight random access memory 124, and to perform a first multiply-accumulate operation on the data word input 207 and the weight word input 206, the product being accumulated with the initialized zero value of the accumulator 202. More specifically, the instruction instructs the sequencer 128 to generate a value on the control input 213 that selects the data word input 207. In the example of FIG. 4, the specified row of the data random access memory 122 is row 17 and the specified row of the weight random access memory 124 is row 0, so the sequencer is instructed to output the value 17 as the data random access memory address 123 and the value 0 as the weight random access memory address 125. Consequently, the 512 data words from row 17 of the data random access memory 122 are provided to the corresponding data inputs 207 of the 512 neural processing units 126, and the 512 weight words from row 0 of the weight random access memory 124 are provided to the corresponding weight inputs 206 of the 512 neural processing units 126.

The third row, at address 2, specifies a multiply-accumulate rotate instruction with a count of 511, which instructs each of the 512 neural processing units 126 to perform 511 multiply-accumulate operations. The instruction instructs the 512 neural processing units 126 that the data word 209 input to the arithmetic logic unit 204 for each of the 511 multiply-accumulate operations is to be the rotated value 211 from the adjacent neural processing unit 126. That is, the instruction instructs the sequencer 128 to generate a value on the control input 213 that selects the rotated value 211. In addition, the instruction instructs the 512 neural processing units 126 to load a respective weight word for each of the 511 multiply-accumulate operations from the "next" row of the weight random access memory 124. That is, the instruction instructs the sequencer 128 to increment the weight random access memory address 125 by one relative to its value in the previous clock cycle, which in this example is row 1 on the first clock cycle of the instruction, row 2 on the next clock cycle, row 3 on the clock cycle after that, and so on, up to row 511 on the 511th clock cycle. For each of the 511 multiply-accumulate operations, the product of the rotated input 211 and the weight word input 206 is accumulated with the previous value of the accumulator 202. The 512 neural processing units 126 perform the 511 multiply-accumulate operations in 511 clock cycles. In each of them, each neural processing unit 126 performs a multiply-accumulate operation on a different data word from row 17 of the data random access memory 122, namely the data word on which the adjacent neural processing unit 126 operated in the previous clock cycle, and on a different weight word associated with that data word; conceptually, each such pair is a different connection input to the neuron. The example assumes that each neural processing unit 126 (neuron) has 512 connection inputs, and therefore involves the processing of 512 data words and 512 weight words. Once the last iteration of the multiply-accumulate rotate instruction at address 2 has been performed, the accumulator 202 holds the sum of the products for all 512 connection inputs. In one embodiment, rather than having a separate instruction for each type of arithmetic logic unit operation (for example, multiply-accumulate, maximum of accumulator and weight word, and so on, as described above), the instruction set of the neural processing unit 126 includes an "execute" instruction that instructs the arithmetic logic unit 204 to perform an arithmetic logic unit operation specified by the initialize neural processing unit instruction, such as that specified by the arithmetic logic unit function 2926 of FIG. 29A.
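The combined effect of the initialize, multiply-accumulate, and multiply-accumulate rotate instructions can be sketched as a small simulation and checked against a closed-form sum. This is an illustrative model with hypothetical function names (`run_layer`, `reference`); it uses unbounded Python integers and a small layer size, so it does not model the 16-bit words, the 41-bit accumulator, or the pipeline timing.

```python
def run_layer(data_row, weight_rows):
    """Simulate the accumulators of all NPUs for one layer.

    data_row:    the data RAM row (row 17 in the example of the text)
    weight_rows: weight RAM rows 0, 1, 2, ... in the order they are read
    """
    n = len(data_row)
    acc = [0] * n                  # address 0: clear every accumulator
    regs = list(data_row)          # address 1: registers select data word 207
    for k, w_row in enumerate(weight_rows):
        if k > 0:                  # address 2: select rotated input 211
            regs = [regs[(j - 1) % n] for j in range(n)]
        acc = [acc[j] + regs[j] * w_row[j] for j in range(n)]
    return acc


def reference(data_row, weight_rows):
    """Closed form: NPU j pairs data word (j - k) mod n with weight row k."""
    n = len(data_row)
    return [
        sum(data_row[(j - k) % n] * weight_rows[k][j]
            for k in range(len(weight_rows)))
        for j in range(n)
    ]


print(run_layer([1, 2], [[3, 4], [5, 6]]))  # [13, 14]
```

Because NPU j multiplies data word (j - k) mod n by word j of weight row k, the weights for a given neuron must be laid out along a diagonal of the weight rows for the accumulator to hold that neuron's sum of products.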

The fourth row, at address 3, specifies an activation function instruction. The activation function instruction instructs the activation function unit 212 to perform the specified activation function on the accumulator 202 value to generate the result 133. Embodiments of the activation function are described in more detail below.

The fifth row, at address 4, specifies a write activation function unit output instruction that instructs the 512 neural processing units 126 to write back their activation function unit 212 outputs as results 133 to a row of the data random access memory 122, which in this example is row 16. That is, the instruction instructs the sequencer 128 to output the value 16 as the data random access memory address 123 together with a write command (in contrast to the read command specified by the multiply-accumulate instruction at address 1). In a preferred embodiment, because of the pipelined nature of execution, the write activation function unit output instruction can be overlapped with the execution of other instructions, so that it effectively executes in a single clock cycle.
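The five instructions described above, together with the clock counts that the text associates with them, can be tabulated as data. The mnemonics and operand spellings below are hypothetical placeholders rather than the notation used in FIG. 4; only the addresses, the row numbers 17, 0, and 16, the count of 511, and the per-instruction clock counts come from the description.

```python
# (address, instruction, effective clock cycles)
PROGRAM = [
    (0, "INITIALIZE_NPU", 1),
    (1, "MULT_ACCUM data_row=17 weight_row=0", 1),
    (2, "MULT_ACCUM_ROTATE weight_row=next count=511", 511),
    (3, "ACTIVATION_FUNCTION", 1),
    (4, "WRITE_AFU_OUTPUT data_row=16", 1),
]

total_clocks = sum(clocks for _, _, clocks in PROGRAM)
mac_clocks = sum(clocks for _, text, clocks in PROGRAM
                 if text.startswith("MULT_ACCUM"))

print(total_clocks, mac_clocks)  # 515 512
```

The 512 multiply-accumulate clocks match the 512 connection inputs of each neuron in the example, and the remaining three clocks cover initialization, activation, and write-back in this effective-clock accounting.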

In a preferred embodiment, each neural processing unit 126 is configured as a pipeline that includes the various functional elements, for example the multiplexing register 208 (and the multiplexing register 705 of FIG. 7), the arithmetic logic unit 204, the accumulator 202, the activation function unit 212, the multiplexer 802 (see FIG. 8), and the row buffer 1104 and activation function units 1112 (see FIG. 11), some of which may themselves be pipelined. In addition to the data words 207 and the weight words 206, the pipeline receives instructions from the program memory 129. The instructions flow down the pipeline and control the various functional units. In an alternate embodiment, the program does not include an activation function instruction. Instead, the initialize neural processing unit instruction specifies the activation function to be performed on the accumulator 202 value 217, and a value indicating the specified activation function is saved in a configuration register for later use by the activation function unit 212 portion of the pipeline once the final accumulator 202 value 217 has been generated, that is, once the last iteration of the multiply-accumulate rotate instruction at address 2 has completed. In a preferred embodiment, to save power, the activation function unit 212 portion of the pipeline remains inactive until the write activation function unit output instruction reaches it, at which time the activation function unit 212 is powered up and performs the activation function specified by the initialize instruction on the accumulator 202 output 217.

FIG. 5 is a timing diagram showing the execution of the program of FIG. 4 by the neural network unit 121. Each row of the timing diagram corresponds to a successive clock cycle, indicated in the first column. Each of the other columns corresponds to a different one of the 512 neural processing units 126 and indicates its operation. To simplify the illustration, only the operations of neural processing units 0, 1, and 511 are shown.

At clock cycle 0, each of the 512 neural processing units 126 executes the initialize instruction of FIG. 4, which FIG. 5 shows as the assignment of a zero value to the accumulator 202.

At clock cycle 1, each of the 512 neural processing units 126 executes the multiply-accumulate instruction at address 1 of FIG. 4. As shown, neural processing unit 0 adds to the accumulator 202 value (which is zero) the product of word 0 of row 17 of the data RAM 122 and word 0 of row 0 of the weight RAM 124; neural processing unit 1 adds to the accumulator 202 value (which is zero) the product of word 1 of row 17 of the data RAM 122 and word 1 of row 0 of the weight RAM 124; and so forth, until neural processing unit 511 adds to the accumulator 202 value (which is zero) the product of word 511 of row 17 of the data RAM 122 and word 511 of row 0 of the weight RAM 124.

At clock cycle 2, each of the 512 neural processing units 126 performs the first iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 4. As shown, neural processing unit 0 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplexed register 208 output 209 of neural processing unit 511 (which is data word 511 received from the data RAM 122) and word 0 of row 1 of the weight RAM 124; neural processing unit 1 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplexed register 208 output 209 of neural processing unit 0 (which is data word 0 received from the data RAM 122) and word 1 of row 1 of the weight RAM 124; and so forth, until neural processing unit 511 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplexed register 208 output 209 of neural processing unit 510 (which is data word 510 received from the data RAM 122) and word 511 of row 1 of the weight RAM 124.

At clock cycle 3, each of the 512 neural processing units 126 performs the second iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 4. As shown, neural processing unit 0 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplexed register 208 output 209 of neural processing unit 511 (which is data word 510 received from the data RAM 122) and word 0 of row 2 of the weight RAM 124; neural processing unit 1 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplexed register 208 output 209 of neural processing unit 0 (which is data word 511 received from the data RAM 122) and word 1 of row 2 of the weight RAM 124; and so forth, until neural processing unit 511 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplexed register 208 output 209 of neural processing unit 510 (which is data word 509 received from the data RAM 122) and word 511 of row 2 of the weight RAM 124. As indicated by the ellipsis in FIG. 5, this continues for each of the following 509 clock cycles, through clock cycle 512.

At clock cycle 512, each of the 512 neural processing units 126 performs the 511th iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 4. As shown, neural processing unit 0 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplexed register 208 output 209 of neural processing unit 511 (which is data word 1 received from the data RAM 122) and word 0 of row 511 of the weight RAM 124; neural processing unit 1 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplexed register 208 output 209 of neural processing unit 0 (which is data word 2 received from the data RAM 122) and word 1 of row 511 of the weight RAM 124; and so forth, until neural processing unit 511 adds to the accumulator 202 value the product of the rotated data word 211 received from the multiplexed register 208 output 209 of neural processing unit 510 (which is data word 0 received from the data RAM 122) and word 511 of row 511 of the weight RAM 124. In one embodiment, multiple clock cycles are required to read the data words and weight words from the data RAM 122 and the weight RAM 124 in order to perform the multiply-accumulate instruction at address 1 of FIG. 4; however, the data RAM 122, the weight RAM 124 and the neural processing units 126 are pipelined, so that once the first multiply-accumulate operation has begun (as shown at clock cycle 1 of FIG. 5), the subsequent multiply-accumulate operations (as shown at clock cycles 2 through 512 of FIG. 5) begin in successive clock cycles. Preferably, the neural processing units 126 may briefly stall in response to an access of the data RAM 122 and/or the weight RAM 124 by an architectural instruction, e.g., an MTNN or MFNN instruction (described below with respect to FIGS. 14 and 15), or by a microinstruction into which the architectural instruction is translated.
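The rotation pattern described for clock cycles 1 through 512 can be checked with a small software model. The following is a minimal Python sketch, not the hardware implementation: the function name `run_mac_rotate`, the list-based stand-ins for the two memories, and the 4-unit size in the usage lines are assumptions made for illustration. Each multiplexed register 208 is modeled as one list element, and the list is rotated by one position per clock cycle.

```python
def run_mac_rotate(data_row, weight_rows):
    """Model N processing units: one multiply-accumulate, then N-1 multiply-accumulate rotates.

    data_row       -- the N data words of one data RAM row (row 17 in the example)
    weight_rows[k] -- the N weight words of weight RAM row k
    """
    n = len(data_row)
    acc = [0] * n                    # clock 0: every accumulator is initialized to zero
    mux_reg = list(data_row)         # clock 1: unit j latches data word j
    for j in range(n):
        acc[j] += mux_reg[j] * weight_rows[0][j]
    for k in range(1, n):            # clocks 2..N: rotate, then multiply-accumulate
        # Unit j takes the word held by unit j-1; unit 0 wraps around to unit N-1.
        mux_reg = [mux_reg[(j - 1) % n] for j in range(n)]
        for j in range(n):
            acc[j] += mux_reg[j] * weight_rows[k][j]
    return acc


# Usage with 4 units instead of 512: with all weights equal to 1, every
# accumulator ends up holding the sum of all data words.
print(run_mac_rotate([1, 2, 3, 4], [[1] * 4 for _ in range(4)]))  # [10, 10, 10, 10]
```

After k rotations, unit j holds data word (j - k) mod N, which matches the text: at clock cycle 2 unit 0 holds data word 511, and at clock cycle 512 it holds data word 1.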

At clock cycle 513, the activation function unit 212 of each of the 512 neural processing units 126 performs the activation function instruction at address 3 of FIG. 4. Finally, at clock cycle 514, each of the 512 neural processing units 126 performs the write activation function unit output instruction at address 4 of FIG. 4 by writing its result 133 back to its corresponding word of row 16 of the data RAM 122; that is, the result 133 of neural processing unit 0 is written to word 0 of the data RAM 122, the result 133 of neural processing unit 1 is written to word 1 of the data RAM 122, and so forth, until the result 133 of neural processing unit 511 is written to word 511 of the data RAM 122. The operation described above with respect to FIG. 5 is also shown in block diagram form in FIG. 6A.

FIG. 6A is a block diagram illustrating the neural network unit 121 of FIG. 1 executing the program of FIG. 4. The neural network unit 121 includes the 512 neural processing units 126, the data RAM 122 that receives its address input 123, and the weight RAM 124 that receives its address input 125. At clock cycle 0, the 512 neural processing units 126 perform the initialize instruction; this operation is not shown in the figure. As shown, at clock cycle 1, the 512 16-bit data words of row 17 are read out of the data RAM 122 and provided to the 512 neural processing units 126. During clock cycles 1 through 512, the 512 16-bit weight words of rows 0 through 511, respectively, are read out of the weight RAM 124 and provided to the 512 neural processing units 126. At clock cycle 1, the 512 neural processing units 126 perform their respective multiply-accumulate operations on the loaded data words and weight words; this operation is not shown in the figure. During clock cycles 2 through 512, the multiplexed registers 208 of the 512 neural processing units 126 operate as a rotater of 512 16-bit words, rotating the data words previously loaded from row 17 of the data RAM 122 to the adjacent neural processing unit 126, and the neural processing units 126 perform the multiply-accumulate operation on the respective rotated data word and the respective weight word loaded from the weight RAM 124. At clock cycle 513, the 512 activation function units 212 perform the activation function instruction; this operation is not shown in the figure. At clock cycle 514, the 512 neural processing units 126 write their respective 512 16-bit results 133 back to row 16 of the data RAM 122.

As may be observed, the number of clock cycles required to generate the result words (neuron outputs) and write them back to the data RAM 122 or the weight RAM 124 is approximately the square root of the number of data inputs (connections) received by the current layer of the neural network. For example, if the current layer has 512 neurons that each have 512 connections from the previous layer, the total number of connections is 256K, and the number of clock cycles required to generate the results for the current layer is slightly more than 512. Thus, the neural network unit 121 provides extremely high performance for neural network computations.
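The square-root relationship can be verified with the numbers from the example. This short Python sketch uses the cycle breakdown of the FIG. 5 timing (one initialize cycle, 512 multiply-accumulate cycles, one activation cycle, one write-back cycle); the variable names are illustrative only.

```python
import math

neurons = 512
connections_per_neuron = 512

# Every neuron is connected to every neuron of the previous layer.
total_connections = neurons * connections_per_neuron      # 262144, i.e. 256K

# Clock cycles 0 through 514 of the timing diagram:
# 1 initialize + 512 multiply-accumulates + 1 activation function + 1 write-back.
clocks = 1 + connections_per_neuron + 1 + 1

print(total_connections)              # 262144
print(math.isqrt(total_connections))  # 512
print(clocks)                         # 515
```

The 515 cycles are only slightly more than the square root of the connection count because all 512 units perform one multiply-accumulate in every cycle.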

FIG. 6B is a flowchart illustrating operation of the processor 100 of FIG. 1 to execute an architectural program that uses the neural network unit 121 to perform the multiply-accumulate-activation function computations classically associated with neurons of hidden layers of an artificial neural network, such as those performed by the program of FIG. 4. The example of FIG. 6B assumes four hidden layers (denoted by the variable NUM_LAYERS initialized at block 602), each having 512 neurons, each of which is fully connected to the 512 neurons of the previous layer (by use of the program of FIG. 4). However, it should be understood that these numbers of layers and neurons are selected for illustration, and the neural network unit 121 may be employed to perform similar computations for different numbers of hidden layers, for different numbers of neurons per layer, and for neurons that are not fully connected. In one embodiment, the weight values are set to zero for neurons that do not exist in a layer or for connections to a neuron that do not exist. Preferably, the architectural program writes a first set of weights to the weight RAM 124 and starts the neural network unit 121, and while the neural network unit 121 is performing the computations associated with the first layer, the architectural program writes a second set of weights to the weight RAM 124, so that as soon as the neural network unit 121 completes the computations for the first hidden layer, it can start the computations for the second layer. In this manner, the architectural program alternates between the two regions of the weight RAM 124 in order to keep the neural network unit 121 fully utilized. Flow begins at block 602.

At block 602, the processor 100, i.e., the architectural program running on the processor 100, writes the input values for the current hidden layer of neurons to the data RAM 122, e.g., into row 17 of the data RAM 122, as described with respect to FIG. 6A. Alternatively, the values may already be in row 17 of the data RAM 122 as results 133 of the operation of the neural network unit 121 for a previous layer (e.g., a convolution, pooling or input layer). Additionally, the architectural program initializes a variable N to a value of 1. The variable N denotes the current layer of the hidden layers being processed by the neural network unit 121. Additionally, the architectural program initializes the variable NUM_LAYERS to a value of 4, since there are four hidden layers in the example. Flow proceeds to block 604.

At block 604, the processor 100 writes the weight words for layer 1 to the weight RAM 124, e.g., to rows 0 through 511, as shown in FIG. 6A. Flow proceeds to block 606.

At block 606, the processor 100 writes a multiply-accumulate-activation function program (e.g., that of FIG. 4) to the program memory 129 of the neural network unit 121, using MTNN instructions 1400 that specify a function 1432 to write the program memory 129. The processor 100 then starts the neural network unit program using an MTNN instruction 1400 that specifies a function 1432 to start execution of the program. Flow proceeds to decision block 608.

At decision block 608, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to block 612; otherwise, flow proceeds to block 614.

At block 612, the processor 100 writes the weight words for layer N+1 to the weight RAM 124, e.g., to rows 512 through 1023. The architectural program can thus write the weight words for the next layer to the weight RAM 124 while the neural network unit 121 is performing the hidden layer computations for the current layer, so that the neural network unit 121 can immediately start performing the hidden layer computations for the next layer once the computations for the current layer are complete, i.e., written to the data RAM 122. Flow proceeds to block 614.

At block 614, the processor 100 determines whether the currently running neural network unit program (started at block 606 in the case of layer 1, and at block 618 in the case of layers 2 through 4) has completed. Preferably, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the neural network unit 121. In an alternate embodiment, the neural network unit 121 generates an interrupt to indicate that it has completed the multiply-accumulate-activation function layer program. Flow proceeds to decision block 616.

At decision block 616, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to block 618; otherwise, flow proceeds to block 622.

At block 618, the processor 100 updates the multiply-accumulate-activation function program so that it can perform the hidden layer computations for layer N+1. More specifically, the processor 100 updates the data RAM 122 row value of the multiply-accumulate instruction at address 1 of FIG. 4 to the row of the data RAM 122 to which the previous layer wrote its results (e.g., to row 16), and also updates the output row (e.g., to row 15). The processor 100 then starts the updated neural network unit program. In an alternate embodiment, the program of FIG. 4 specifies in the output instruction at address 4 the same row as the one specified in the multiply-accumulate instruction at address 1 (i.e., the row read from the data RAM 122). In this embodiment, the current row of input data words is overwritten, which is acceptable as long as the row of data words is not needed for some other purpose, because the row has already been read into the multiplexed registers 208 and is being rotated among the neural processing units 126 via the N-word rotater. In this case, no update of the neural network unit program is needed at block 618; the program only needs to be restarted. Flow proceeds to block 622.

At block 622, the processor 100 reads the results of the neural network unit program for layer N from the data RAM 122. However, if the results are only to be used by the next layer, the architectural program need not read them from the data RAM 122 and may instead leave them in the data RAM 122 for the next hidden layer computation. Flow proceeds to decision block 624.

At decision block 624, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to block 626; otherwise, flow ends.

At block 626, the architectural program increments N by one. Flow returns to decision block 608.
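The control flow of blocks 602 through 626 can be sketched as an ordinary loop. The Python below is an illustrative model only: it records the order of the actions as strings instead of issuing MTNN or MFNN instructions, and the function name `run_hidden_layers` and the trace format are assumptions, not part of the described architectural program.

```python
NUM_LAYERS = 4


def run_hidden_layers(num_layers=NUM_LAYERS):
    """Return the ordered actions of the FIG. 6B flow as a list of strings."""
    trace = ["602: write inputs to data RAM row 17"]
    n = 1                                                # block 602: N = 1
    trace.append("604: write layer 1 weights to weight RAM")
    trace.append("606: load and start the NNU program")
    while True:
        if n < num_layers:                               # decision block 608
            # Overlaps with the NNU computing layer n (second weight RAM region).
            trace.append(f"612: write layer {n + 1} weights")
        trace.append(f"614: wait until the NNU has finished layer {n}")
        if n < num_layers:                               # decision block 616
            trace.append(f"618: update and start the NNU program for layer {n + 1}")
        trace.append(f"622: read layer {n} results")
        if n < num_layers:                               # decision block 624
            n += 1                                       # block 626
        else:
            return trace


for action in run_hidden_layers():
    print(action)
```

In this model the weight write of block 612 always precedes the completion check of block 614 for the same layer, which is what lets the next layer start without waiting for weights.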

As may be determined from the example of FIG. 6B, approximately every 512 clock cycles the neural processing units 126 read once from and write once to the data RAM 122 (by virtue of the operation of the neural network unit program of FIG. 4). Additionally, the neural processing units 126 read the weight RAM 124 approximately every clock cycle to read a row of weight words. Thus, the entire bandwidth of the weight RAM 124 is consumed by the hybrid manner in which the neural network unit 121 performs the hidden layer operation. Additionally, assuming an embodiment that includes a write and read buffer such as the buffer 1704 of FIG. 17, the processor 100 writes the weight RAM 124 concurrently with the reads by the neural processing units 126, such that the buffer 1704 performs one write to the weight RAM 124 approximately every 16 clock cycles to write the weight words. Thus, in a single-ported embodiment of the weight RAM 124 (as described with respect to FIG. 17), approximately every 16 clock cycles the neural processing units 126 must be stalled from reading the weight RAM 124 so that the buffer 1704 can write the weight RAM 124. However, in an embodiment in which the weight RAM 124 is dual-ported, the neural processing units 126 need not be stalled.

FIG. 7 is a block diagram illustrating an alternate embodiment of the neural processing unit 126 of FIG. 1. The neural processing unit 126 of FIG. 7 is similar to the neural processing unit 126 of FIG. 2. However, the neural processing unit 126 of FIG. 7 additionally includes a second, two-input multiplexed register 705. The multiplexed register 705 selects one of its inputs 206 or 711 to store in its register and then provides it on its output 203 in a subsequent clock cycle. Input 206 receives the weight word from the weight RAM 124. The other input 711 receives the output 203 of the second multiplexed register 705 of the adjacent neural processing unit 126. Preferably, input 711 of neural processing unit J receives the multiplexed register 705 output 203 of neural processing unit 126 instance J-1, and the output 203 of neural processing unit J is provided to input 711 of the multiplexed register 705 of neural processing unit 126 instance J+1. In this manner, the multiplexed registers 705 of the N neural processing units 126 collectively operate as an N-word rotater, similar to the manner described above with respect to FIG. 3, but for the weight words rather than the data words. A control input 213 controls which of the two inputs the multiplexed register 705 selects to store in its register and subsequently provide on the output 203.
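One clock cycle of the N multiplexed registers 705 can be modeled as a single function that either loads a fresh row or passes each word to the next unit. This is a hedged Python sketch: the function name `step_weight_mux_regs` and the boolean `rotate` flag standing in for the control input are assumptions made for illustration.

```python
def step_weight_mux_regs(outputs, weight_row, rotate):
    """Model one clock cycle of the N second multiplexed registers.

    outputs    -- current register outputs, one per processing unit
    weight_row -- the row of weight words presented by the weight RAM (input 206)
    rotate     -- True selects the neighbor input 711, False selects input 206
    """
    n = len(outputs)
    if rotate:
        # Unit J latches the output of unit J-1; unit 0 wraps around to unit N-1.
        return [outputs[(j - 1) % n] for j in range(n)]
    return list(weight_row)


regs = step_weight_mux_regs([0, 0, 0, 0], [5, 6, 7, 8], rotate=False)
print(regs)                                                   # [5, 6, 7, 8]
print(step_weight_mux_regs(regs, [0, 0, 0, 0], rotate=True))  # [8, 5, 6, 7]
```

Because each unit only needs a connection to its neighbor, N rotate steps bring every word past every unit without any wide multiplexer.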

Using the multiplexed registers 208 and/or the multiplexed registers 705 (as well as the multiplexed registers of other embodiments, such as those of FIGS. 18 and 23) to effectively form a large rotater that rotates the data/weights of a row received from the data RAM 122 and/or the weight RAM 124 has the advantage that the neural network unit 121 does not require the extremely large multiplexer that would otherwise be needed between the data RAM 122 and/or the weight RAM 124 in order to provide the necessary data/weight words to the appropriate neural network unit.

Writing Back Accumulator Values in Addition to Activation Function Results

In some applications, it is useful for the processor 100 to receive back (e.g., into the media registers 118 via the MFNN instruction of FIG. 15) the raw accumulator 202 value 217, upon which instructions executing on the other execution units 112 can perform computations. For example, in one embodiment, the activation function unit 212 is not configured to perform the softmax activation function, in order to reduce the complexity of the activation function unit 212. Consequently, the neural network unit 121 may output the raw accumulator 202 value 217, or a subset thereof, to the data RAM 122 or the weight RAM 124, and the architectural program subsequently reads the raw value from the data RAM 122 or the weight RAM 124 and performs computations on it. However, use of the raw accumulator 202 value 217 is not limited to performing softmax; other uses are contemplated.

FIG. 8 is a block diagram illustrating a further embodiment of the neural processing unit 126 of FIG. 1. The neural processing unit 126 of FIG. 8 is similar to the neural processing unit 126 of FIG. 2. However, the neural processing unit 126 of FIG. 8 includes a multiplexer 802 within the activation function unit 212, and the activation function unit 212 has a control input 803. The width (in bits) of the accumulator 202 is greater than the width of a data word. The multiplexer 802 has multiple inputs that receive data-word-width portions of the accumulator 202 output 217. In one embodiment, the accumulator 202 is 41 bits wide and the neural processing unit 126 is configured to output a 16-bit result word 133; thus, for example, the multiplexer 802 (or the multiplexer 3032 and/or the multiplexer 3037 of FIG. 30) has three inputs that receive bits [15:0], bits [31:16] and bits [47:32] of the accumulator 202 output 217, respectively. Preferably, output bits not provided by the accumulator 202 (e.g., bits [47:41]) are forced to zero.

The sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the words (e.g., 16 bits) of the accumulator 202 in response to a write accumulator instruction, such as the write accumulator instructions at addresses 3 through 5 of FIG. 9 described below. Preferably, the multiplexer 802 also has one or more inputs that receive the outputs of activation function circuits (e.g., elements 3022, 3024, 3026, 3018, 3014 and 3016 of FIG. 30), which generate outputs that are the width of a data word. The sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the activation function circuit outputs, rather than one of the words of the accumulator 202, in response to an instruction such as the write activation function unit output instruction at address 4 of FIG. 4.

FIG. 9 is a table illustrating a program stored in the program memory 129 of the neural network unit 121 of FIG. 1 and executed by the neural network unit 121. The example program of FIG. 9 is similar to the program of FIG. 4. In particular, the instructions at addresses 0 through 2 are identical. However, the instructions at addresses 3 and 4 of FIG. 4 are replaced in FIG. 9 by write accumulator instructions that instruct the 512 neural processing units 126 to write their accumulator 202 output 217 back as results 133 to three rows of the data RAM 122, which are rows 16 through 18 in the example. That is, the write accumulator instruction instructs the sequencer 128 to output a data RAM address 123 value of 16 and a write command in a first clock cycle, a data RAM address 123 value of 17 and a write command in a second clock cycle, and a data RAM address 123 value of 18 and a write command in a third clock cycle. Preferably, execution of the write accumulator instruction may be overlapped with execution of other instructions, so that the write accumulator instruction effectively executes in three clock cycles, one for each row written to the data RAM 122. In one embodiment, the user specifies values of the activation function 2934 and output command 2956 fields of the control register 127 (of FIG. 29A) to write the desired portions of the accumulator 202 to the data RAM 122 or the weight RAM 124. Alternatively, rather than writing back the entire contents of the accumulator 202, the write accumulator instruction may optionally write back a subset of the accumulator 202. In one embodiment, a canonical form of the accumulator 202 may be written back. This is described in more detail below with respect to FIGS. 29 through 31.

FIG. 10 is a timing diagram illustrating the execution of the program of FIG. 9 by the neural network unit 121. The timing diagram of FIG. 10 is similar to the timing diagram of FIG. 5, and clock cycles 0 through 512 are the same. However, during clock cycles 513 through 515, the activation function unit 212 of each of the 512 neural processing units 126 performs one of the write accumulator instructions at addresses 3 through 5 of FIG. 9. Specifically, during clock cycle 513, each of the 512 neural processing units 126 writes back bits [15:0] of the accumulator 202 output 217 as its result 133 to its corresponding word of row 16 of the data random access memory 122; during clock cycle 514, each of the 512 neural processing units 126 writes back bits [31:16] of the accumulator 202 output 217 as its result 133 to its corresponding word of row 17; and during clock cycle 515, each of the 512 neural processing units 126 writes back bits [40:32] of the accumulator 202 output 217 as its result 133 to its corresponding word of row 18. Preferably, bits [47:41] are forced to zero.
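The three write cycles described above can be pictured with a minimal Python sketch. It assumes a 41-bit accumulator (inferred from bits [40:0] in the text) and 16-bit data words; the names `split_accumulator` and `write_acc` are hypothetical and are not part of the described hardware.

```python
ACC_BITS = 41    # assumption: accumulator width inferred from bits [40:0]
WORD_BITS = 16   # width of one data word in this example

def split_accumulator(acc: int) -> list[int]:
    """Return the three 16-bit words written to rows 16, 17 and 18."""
    acc &= (1 << ACC_BITS) - 1            # keep bits [40:0]; bits [47:41] become zero
    mask = (1 << WORD_BITS) - 1
    return [(acc >> (WORD_BITS * i)) & mask for i in range(3)]

def write_acc(data_ram: dict[int, list[int]], accumulators: list[int],
              first_row: int = 16) -> None:
    """Model the three write cycles: one data RAM row per clock cycle."""
    for i in range(3):                    # clock cycles 513, 514, 515
        data_ram[first_row + i] = [split_accumulator(a)[i] for a in accumulators]
```

Calling `write_acc(ram, accs)` with 512 accumulator values fills rows 16 through 18 of the dictionary `ram`, one row per modeled clock cycle.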

Shared activation function unit

FIG. 11 is a block diagram illustrating an embodiment of the neural network unit 121 of FIG. 1. In the embodiment of FIG. 11, a neuron is split into two portions, the activation function unit portion and the arithmetic logic unit portion (which also includes the shift register portion), and each activation function unit portion is shared by multiple arithmetic logic unit portions. In FIG. 11, the arithmetic logic unit portions are referred to as neural processing units 126, and the shared activation function unit portions are referred to as activation function units 1112. This contrasts with the embodiment of FIG. 2, for example, in which each neuron includes its own activation function unit 212. Hence, in one example of the FIG. 11 embodiment, the neural processing unit 126 (the arithmetic logic unit portion) includes the accumulator 202, arithmetic logic unit 204, multiplexing register 208, and register 205 of FIG. 2, but not the activation function unit 212. In the embodiment of FIG. 11, the neural network unit 121 includes 512 neural processing units 126, although the invention is not limited to this number. In the example of FIG. 11, the 512 neural processing units 126 are grouped into 64 groups of eight neural processing units 126 each, referred to as groups 0 through 63 in FIG. 11.

The neural network unit 121 also includes a row buffer 1104 and a plurality of shared activation function units 1112 coupled between the neural processing units 126 and the row buffer 1104. The row buffer 1104 is the same width (in bits) as a row of the data random access memory 122 or the weight random access memory 124, e.g., 512 words. There is one activation function unit 1112 per group of neural processing units 126, i.e., each activation function unit 1112 has a corresponding group; hence, in the embodiment of FIG. 11 there are 64 activation function units 1112 that correspond to the 64 groups of neural processing units 126. The eight neural processing units 126 of a group share the activation function unit 1112 that corresponds to that group. Embodiments with different numbers of activation function units and different numbers of neural processing units per group are also contemplated. For example, two, four, or sixteen neural processing units 126 in each group may share one activation function unit 1112.

Sharing the activation function units 1112 helps reduce the size of the neural network unit 121, at the cost of some performance. That is, depending on the sharing ratio, additional clock cycles may be required to generate the results 133 for the entire array of neural processing units 126; for example, as shown in FIG. 12 below, seven additional clock cycles are required with an 8:1 sharing ratio. However, the number of additional clock cycles (e.g., 7) is generally small relative to the number of clock cycles required to generate the accumulated sum (e.g., 512 clock cycles for a layer that has 512 connections per neuron). Hence, the performance impact of sharing is very small (e.g., an increase of roughly one percent in computation time), which may be a worthwhile trade-off for the reduced size of the neural network unit 121.
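The trade-off can be checked with a short calculation. The sketch below assumes that a sharing ratio of R:1 costs R-1 additional clock cycles, which generalizes the single 8:1 data point given in the text; the function names are hypothetical.

```python
def extra_cycles(sharing_ratio: int) -> int:
    """Additional clock cycles, assuming R-1 extra cycles for an R:1 ratio."""
    return sharing_ratio - 1

def relative_overhead(sharing_ratio: int, accumulate_cycles: int = 512) -> float:
    """Extra cycles as a fraction of the cycles spent accumulating."""
    return extra_cycles(sharing_ratio) / accumulate_cycles

# 8:1 sharing with 512 connections per neuron: 7 extra cycles out of 512
overhead_percent = 100 * relative_overhead(8)
```

For the 8:1 case this gives 7/512, which is about 1.4 percent and agrees with the "roughly one percent" figure stated above.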

In one embodiment, each of the neural processing units 126 includes an activation function unit 212 that performs relatively simple activation functions; these simple activation function units 212 are small enough to be included in each neural processing unit 126. The shared, complex activation function units 1112, by contrast, perform relatively complex activation functions and are significantly larger than the simple activation function units 212. In such an embodiment, the additional clock cycles are required only when a complex activation function is specified that must be performed by a shared complex activation function unit 1112; they are not required when the specified activation function can be performed by the simple activation function unit 212.

FIGS. 12 and 13 are timing diagrams illustrating the execution of the program of FIG. 4 by the neural network unit 121 of FIG. 11. The timing diagram of FIG. 12 is similar to the timing diagram of FIG. 5, and clock cycles 0 through 512 are the same. However, the operations at clock cycle 513 differ because the neural processing units 126 of FIG. 11 share the activation function units 1112; that is, the neural processing units 126 of a group share the activation function unit 1112 associated with that group, and FIG. 11 illustrates this sharing.

Each row of the timing diagram of FIG. 13 corresponds to a successive clock cycle indicated in the first column. Each of the other columns corresponds to a different one of the 64 activation function units 1112 and indicates its operation. To simplify the illustration, only the operations of activation function units 0, 1, and 63 are shown. The clock cycles of FIG. 13 correspond to the clock cycles of FIG. 12 but illustrate the sharing of the activation function units 1112 by the neural processing units 126 in a different manner. As shown in FIG. 13, during clock cycles 0 through 512, each of the 64 activation function units 1112 is inactive while the neural processing units 126 perform the initialize neural processing unit, multiply-accumulate, and multiply-accumulate rotate instructions.

As shown in both FIGS. 12 and 13, during clock cycle 513, activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of neural processing unit 0, which is the first neural processing unit 126 in group 0, and the output of the activation function unit 1112 will be stored to word 0 of the row buffer 1104. Also during clock cycle 513, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the first neural processing unit 126 in its corresponding group. Thus, as shown in FIG. 13, during clock cycle 513, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of neural processing unit 0 to generate a result that will be stored to word 0 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of neural processing unit 8 to generate a result that will be stored to word 8 of the row buffer 1104; and so forth to activation function unit 63, which begins to perform the specified activation function on the accumulator 202 of neural processing unit 504 to generate a result that will be stored to word 504 of the row buffer 1104.

During clock cycle 514, activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of neural processing unit 1, which is the second neural processing unit 126 in group 0, and the output of the activation function unit 1112 will be stored to word 1 of the row buffer 1104. Also during clock cycle 514, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the second neural processing unit 126 in its corresponding group. Thus, as shown in FIG. 13, during clock cycle 514, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of neural processing unit 1 to generate a result that will be stored to word 1 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of neural processing unit 9 to generate a result that will be stored to word 9 of the row buffer 1104; and so forth to activation function unit 63, which begins to perform the specified activation function on the accumulator 202 of neural processing unit 505 to generate a result that will be stored to word 505 of the row buffer 1104.

This pattern continues until clock cycle 520, when activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of neural processing unit 7, which is the eighth (last) neural processing unit 126 in group 0, and the output of the activation function unit 1112 will be stored to word 7 of the row buffer 1104. Also during clock cycle 520, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the eighth neural processing unit 126 in its corresponding group. Thus, as shown in FIG. 13, during clock cycle 520, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of neural processing unit 7 to generate a result that will be stored to word 7 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of neural processing unit 15 to generate a result that will be stored to word 15 of the row buffer 1104; and so forth to activation function unit 63, which begins to perform the specified activation function on the accumulator 202 of neural processing unit 511 to generate a result that will be stored to word 511 of the row buffer 1104.

During clock cycle 521, once all 512 results associated with the 512 neural processing units 126 have been generated and written to the row buffer 1104, the row buffer 1104 begins to write its contents to the data random access memory 122 or the weight random access memory 124. In this fashion, the activation function unit 1112 of each group of neural processing units 126 performs a portion of the activation function instruction at address 3 of FIG. 4.
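The schedule of clock cycles 513 through 521 follows a regular pattern that can be written down compactly. The following Python sketch is an illustration only; the function names are hypothetical, and the constants are taken from the 512-unit, 8:1 example above.

```python
GROUP_SIZE = 8          # neural processing units per shared activation function unit
NUM_AFUS = 64           # one activation function unit per group
FIRST_AF_CYCLE = 513    # first clock cycle in which the activation function units run

def npu_for(afu: int, clock: int) -> int:
    """Index of the neural processing unit (and row buffer word) that
    activation function unit `afu` begins to process in cycle `clock`."""
    step = clock - FIRST_AF_CYCLE
    if not 0 <= afu < NUM_AFUS or not 0 <= step < GROUP_SIZE:
        raise ValueError("activation function unit is idle in this cycle")
    return afu * GROUP_SIZE + step

def row_buffer_write_cycle() -> int:
    """Clock cycle in which the filled row buffer starts writing to memory."""
    return FIRST_AF_CYCLE + GROUP_SIZE
```

For example, `npu_for(63, 520)` evaluates to 511, matching the last entry of the walkthrough, and `row_buffer_write_cycle()` evaluates to 521.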

Embodiments such as that of FIG. 11, in which the activation function units 1112 are shared among groups of arithmetic logic units 204, may be particularly advantageous in conjunction with integer arithmetic logic units 204, as described in more detail below with respect to FIGS. 29A through 33.

MTNN and MFNN architectural instructions

FIG. 14 is a block diagram illustrating a move to neural network (MTNN) architectural instruction 1400 and its operation with respect to portions of the neural network unit 121 of FIG. 1. The MTNN instruction 1400 includes an opcode field 1402, a src1 field 1404, a src2 field 1406, a gpr field 1408, and an immediate field 1412. The MTNN instruction 1400 is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. Preferably, the instruction set architecture associates a predetermined value of the opcode field 1402 with the MTNN instruction 1400 to distinguish it from other instructions in the instruction set architecture. The opcode 1402 of the MTNN instruction 1400 may or may not include prefixes, such as are common, for example, in the x86 architecture.

The immediate field 1412 provides a value that specifies a function 1432 to the control logic 1434 of the neural network unit 121. Preferably, the function 1432 is provided as an immediate operand of a microinstruction 105 of FIG. 1. The functions 1432 that may be performed by the neural network unit 121 include, but are not limited to, writing to the data random access memory 122, writing to the weight random access memory 124, writing to the program memory 129, writing to the control register 127, starting execution of a program in the program memory 129, pausing the execution of a program in the program memory 129, requesting notification (e.g., an interrupt) of completion of the execution of a program in the program memory 129, and resetting the neural network unit 121. Preferably, the neural network unit instruction set includes an instruction whose result indicates that the neural network unit program is complete. Alternatively, the neural network unit instruction set includes an explicit generate interrupt instruction. Preferably, resetting the neural network unit 121 effectively forces it back to a reset state (e.g., internal state machines are cleared and set to an idle state), except that the contents of the data random access memory 122, weight random access memory 124, and program memory 129 remain intact. Additionally, internal registers such as the accumulator 202 are not affected by the reset function and must be explicitly cleared, e.g., by the initialize neural processing unit instruction at address 0 of FIG. 4. In one embodiment, the function 1432 may include a direct execution function in which the first source register contains a micro-operation (see, for example, micro-operation 3418 of FIG. 34). The direct execution function instructs the neural network unit 121 to directly execute the specified micro-operation. In this manner, an architectural program may directly control the neural network unit 121 to perform operations, rather than writing instructions to the program memory 129 and then instructing the neural network unit 121 to execute the instructions in the program memory 129, or by executing an MTNN instruction 1400 (or an MFNN instruction 1500 of FIG. 15). FIG. 14 illustrates an example of the function 1432 of writing to the data random access memory 122.

The gpr field 1408 specifies one of the general purpose registers in the general purpose register file 116. In one embodiment, each general purpose register is 64 bits. The general purpose register file 116 provides the value of the selected general purpose register to the neural network unit 121, as shown, which uses the value as an address 1422. The address 1422 selects a row of the memory specified in the function 1432. In the case of the data random access memory 122 or the weight random access memory 124, the address 1422 additionally selects a chunk within the selected row that is twice the size of a media register (e.g., 512 bits). Preferably, the chunk is located on a 512-bit boundary. In one embodiment, a multiplexer selects either the address 1422 (or the address 1522 in the case of the MFNN instruction 1500 described below) or the address 123/125/131 from the sequencer 128 for provision to the data random access memory 122/weight random access memory 124/program memory 129. In one embodiment, the data random access memory 122 is dual-ported, which allows the neural processing units 126 to read/write the data random access memory 122 concurrently with the media registers 118 reading/writing it. In one embodiment, the weight random access memory 124 is also dual-ported for a similar purpose.

The src1 field 1404 and the src2 field 1406 each specify a media register in the media register file 118. In one embodiment, each media register 118 is 256 bits. The media register file 118 provides the concatenated data (e.g., 512 bits) from the selected media registers to the data random access memory 122 (or the weight random access memory 124 or the program memory 129) for writing into the selected row 1428 specified by the address 1422 and into the location within the selected row 1428 specified by the address 1422, as shown. By executing a series of MTNN instructions 1400 (and MFNN instructions 1500, described below), an architectural program executing on the processor 100 can populate rows of the data random access memory 122 and rows of the weight random access memory 124 and write a program to the program memory 129, such as the programs described herein (e.g., those of FIGS. 4 and 9), to cause the neural network unit 121 to perform operations on the data and weights at very high speed to accomplish an artificial neural network. In one embodiment, the architectural program directly controls the neural network unit 121 rather than writing a program into the program memory 129.

In one embodiment, rather than specifying two source registers (e.g., as specified by fields 1404 and 1406), the MTNN instruction 1400 specifies a start source register and a number of source registers, Q. This form of the MTNN instruction 1400 instructs the processor 100 to write the media register 118 specified as the start source register, as well as the next Q-1 sequential media registers 118, to the neural network unit 121, i.e., to the specified data random access memory 122 or weight random access memory 124. Preferably, the instruction translator 104 translates the MTNN instruction 1400 into as many microinstructions as are needed to write all Q of the specified media registers 118. For example, in one embodiment, when the MTNN instruction 1400 specifies register MR4 as the start source register and Q is 8, the instruction translator 104 translates the MTNN instruction 1400 into four microinstructions, the first of which writes MR4 and MR5, the second MR6 and MR7, the third MR8 and MR9, and the fourth MR10 and MR11. In an alternate embodiment in which the data path from the media registers 118 to the neural network unit 121 is 1024 bits rather than 512 bits, the instruction translator 104 translates the MTNN instruction 1400 into two microinstructions, the first of which writes MR4 through MR7 and the second of which writes MR8 through MR11. A similar embodiment is contemplated in which the MFNN instruction 1500 specifies a start destination register and a number of destination registers, so that a single MFNN instruction 1500 can read a chunk of a row of the data random access memory 122 or the weight random access memory 124 that is larger than a single media register 118.
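The grouping of Q source registers into microinstructions can be illustrated with a short Python sketch. It assumes 256-bit media registers and models each microinstruction simply as the list of registers it writes; the function name `translate_mtnn` is hypothetical and does not describe the actual instruction translator 104.

```python
MEDIA_REG_BITS = 256  # width of one media register in this embodiment

def translate_mtnn(start_reg: int, q: int, datapath_bits: int = 512) -> list[list[str]]:
    """Group Q sequential media registers into microinstructions.

    Each microinstruction writes as many registers as fit in the data path.
    """
    regs_per_uop = datapath_bits // MEDIA_REG_BITS
    regs = [f"MR{start_reg + i}" for i in range(q)]
    return [regs[i:i + regs_per_uop] for i in range(0, q, regs_per_uop)]

# Start register MR4 with Q = 8 on a 512-bit data path: four microinstructions
uops_512 = translate_mtnn(4, 8)
# The same request on a 1024-bit data path: two microinstructions
uops_1024 = translate_mtnn(4, 8, datapath_bits=1024)
```

With these inputs, `uops_512` pairs MR4/MR5 through MR10/MR11, and `uops_1024` groups MR4 through MR7 and MR8 through MR11, as in the two examples above.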

FIG. 15 is a block diagram illustrating a move from neural network (MFNN) architectural instruction 1500 and its operation with respect to portions of the neural network unit 121 of FIG. 1. The MFNN instruction 1500 includes an opcode field 1502, a dst field 1504, a gpr field 1508, and an immediate field 1512. The MFNN instruction 1500 is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. Preferably, the instruction set architecture associates a predetermined value of the opcode field 1502 with the MFNN instruction 1500 to distinguish it from other instructions in the instruction set architecture. The opcode 1502 of the MFNN instruction 1500 may or may not include prefixes, such as are common, for example, in the x86 architecture.

The immediate field 1512 provides a value that specifies a function 1532 to the control logic 1434 of the neural network unit 121. Preferably, the function 1532 is provided as an immediate operand of a microinstruction 105 of FIG. 1. The functions 1532 that may be performed by the neural network unit 121 include, but are not limited to, reading from the data random access memory 122, reading from the weight random access memory 124, reading from the program memory 129, and reading from the status register 127. The example of FIG. 15 illustrates the function 1532 of reading from the data random access memory 122.

The gpr field 1508 specifies one of the general purpose registers in the general purpose register file 116. The general purpose register file 116 provides the value of the selected general purpose register to the neural network unit 121, as shown, which uses the value as an address 1522 that operates in a manner similar to the address 1422 of FIG. 14 to select a row of the memory specified in the function 1532. In the case of the data random access memory 122 or the weight random access memory 124, the address 1522 additionally selects a chunk within the selected row that is the size of a media register (e.g., 256 bits). Preferably, the chunk is located on a 256-bit boundary.

The dst field 1504 specifies a media register in the media register file 118. As shown, the media register file 118 receives the data (e.g., 256 bits) into the selected media register from the data random access memory 122 (or the weight random access memory 124 or the program memory 129); the data is read from the selected row 1528 specified by the address 1522 and from the location within the selected row 1528 specified by the address 1522.
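The chunk-level data movement of the MTNN and MFNN instructions can be pictured with a toy software model. This is a sketch under stated assumptions, not the described hardware: the text does not say how the address 1422/1522 encodes the row and the chunk, so the model takes a row index and a bit offset as separate hypothetical arguments, and placing src1 in the low half of the 512-bit chunk is likewise an assumption. The class name `ToyDataRam` is hypothetical.

```python
ROW_BITS = 8192        # row width of the data RAM in the 64 KB embodiment
MEDIA_REG_BITS = 256   # width of one media register

class ToyDataRam:
    """Toy model: each row is held as one Python integer of ROW_BITS bits."""

    def __init__(self, num_rows: int = 64) -> None:
        self.rows = [0] * num_rows

    def _check(self, row: int, bit_offset: int, width: int) -> None:
        if not 0 <= row < len(self.rows):
            raise IndexError("row out of range")
        if bit_offset % width or not 0 <= bit_offset <= ROW_BITS - width:
            raise ValueError("chunk must be aligned and lie inside the row")

    def mtnn_write(self, row: int, bit_offset: int, src1: int, src2: int) -> None:
        """Write two concatenated media registers (512 bits, 512-bit aligned)."""
        width = 2 * MEDIA_REG_BITS
        self._check(row, bit_offset, width)
        reg_mask = (1 << MEDIA_REG_BITS) - 1
        # Assumption: src1 occupies the low half of the chunk, src2 the high half.
        chunk = ((src2 & reg_mask) << MEDIA_REG_BITS) | (src1 & reg_mask)
        field = ((1 << width) - 1) << bit_offset
        self.rows[row] = (self.rows[row] & ~field) | (chunk << bit_offset)

    def mfnn_read(self, row: int, bit_offset: int) -> int:
        """Read one media-register-sized chunk (256 bits, 256-bit aligned)."""
        self._check(row, bit_offset, MEDIA_REG_BITS)
        return (self.rows[row] >> bit_offset) & ((1 << MEDIA_REG_BITS) - 1)
```

In this model one MTNN write fills a 512-bit chunk, and two MFNN reads at offsets 256 bits apart are needed to read it back.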

Port configuration of the random access memories internal to the neural network unit

FIG. 16 is a block diagram illustrating an embodiment of the data random access memory 122 of FIG. 1. The data random access memory 122 includes a memory array 1606, a read port 1602, and a write port 1604. The memory array 1606 holds the data words, which are preferably arranged as D rows of N words, as described above. In one embodiment, the memory array 1606 comprises an array of 64 horizontally arranged static random access memory cells, each 128 bits wide and 64 bits tall, which provides a 64 KB data random access memory 122 that is 8192 bits wide and has 64 rows, and that occupies approximately 0.2 square millimeters of die area. However, the invention is not limited to this arrangement.
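The stated dimensions can be cross-checked with simple arithmetic. The sketch below uses only the figures given for this embodiment plus the 16-bit word size used as an example earlier in the description; the function and constant names are hypothetical.

```python
CELLS = 64             # horizontally arranged SRAM cells
CELL_WIDTH_BITS = 128  # width of each cell
CELL_HEIGHT_ROWS = 64  # height of each cell, i.e. number of rows
WORD_BITS = 16         # example data word width

def row_width_bits() -> int:
    return CELLS * CELL_WIDTH_BITS                       # 64 * 128

def capacity_bytes() -> int:
    return row_width_bits() * CELL_HEIGHT_ROWS // 8      # bits per row * rows / 8

def words_per_row() -> int:
    return row_width_bits() // WORD_BITS
```

These evaluate to 8192 bits per row, 65536 bytes (64 KB), and 512 words per row, consistent with the 512 neural processing units each receiving one word per row.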

In a preferred embodiment, the read port 1602 is coupled, in a multiplexed fashion, to the neural processing units 126 and to the media registers 118. (More precisely, the media registers 118 may be coupled to the read port 1602 through a result bus, which is also used to provide data to the reorder buffer and/or to the result forwarding bus that feeds the other execution units 112.) The neural processing units 126 and the media registers 118 share the read port 1602 to read the data random access memory 122. Likewise, in a preferred embodiment, the write port 1604 is coupled in a multiplexed fashion to the neural processing units 126 and to the media registers 118, which share the write port 1604 to write the data random access memory 122. Consequently, the media registers 118 can write to the data random access memory 122 while the neural processing units 126 are reading from it, and the neural processing units 126 can write to the data random access memory 122 while the media registers 118 are reading from it. This can improve performance. For example, the neural processing units 126 can read the data random access memory 122 (e.g., to continue performing calculations) while the media registers 118 write more data words into it. In another example, the neural processing units 126 can write calculation results to the data random access memory 122 while the media registers 118 read calculation results from it. In one embodiment, the neural processing units 126 can write a row of calculation results to the data random access memory 122 while also reading a row of data words from it. In one embodiment, the memory array 1606 is organized into banks. When the neural processing units 126 access the data random access memory 122, all of the banks are activated to access a complete row of the memory array 1606; when the media registers 118 access the data random access memory 122, however, only the specified banks are activated. In one embodiment, each bank is 128 bits wide and the media registers 118 are 256 bits wide, so that, for example, two banks are activated for each media register 118 access. In one embodiment, one of the ports 1602/1604 is a read/write port. In one embodiment, both of the ports 1602/1604 are read/write ports.
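The bank activation described above can be illustrated with a small sketch. It assumes banks are numbered 0 to 63 from one side of the row and that a media register access is identified by a hypothetical `chunk_index` aligned on a 256-bit boundary; neither the numbering nor the function names come from the specification.

```python
BANK_WIDTH_BITS = 128    # width of one bank (from the embodiment above)
MEDIA_REG_BITS = 256     # width of one media register 118
ROW_WIDTH_BITS = 8192    # width of one row of the data RAM 122


def banks_for_media_access(chunk_index):
    """Return the bank numbers activated by one media-register-sized access.

    chunk_index is a hypothetical 0-based index of the 256-bit chunk in the row.
    """
    chunks_per_row = ROW_WIDTH_BITS // MEDIA_REG_BITS
    if not 0 <= chunk_index < chunks_per_row:
        raise ValueError("chunk index out of range")
    banks_per_chunk = MEDIA_REG_BITS // BANK_WIDTH_BITS  # two banks per access
    first_bank = chunk_index * banks_per_chunk
    return list(range(first_bank, first_bank + banks_per_chunk))


def banks_for_npu_access():
    """A neural processing unit access reads a full row, so every bank is activated."""
    return list(range(ROW_WIDTH_BITS // BANK_WIDTH_BITS))
```

Under these assumptions, a media register access activates 2 of the 64 banks, while a neural processing unit access activates all 64.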

An advantage of giving the neural processing units 126 the rotator capability described herein is that it allows the memory array 1606 of the data random access memory 122 to have fewer rows, and therefore to be smaller, than the memory array that would otherwise be needed to keep the neural processing units 126 fully utilized, since full utilization requires the architectural program (through the media registers 118) to keep supplying data to the data random access memory 122 and to retrieve results from it while the neural processing units 126 are performing calculations.

Internal random access memory buffer

Figure 17 is a block diagram showing an embodiment of the weight random access memory 124 of Figure 1 and a buffer 1704. The weight random access memory 124 includes a memory array 1706 and a port 1702. The memory array 1706 holds the weight words, which in a preferred embodiment are arranged as an array of W rows of N words each, as described above. In one embodiment, the memory array 1706 comprises an array of 128 horizontally arranged static random access memory cells, each 64 bits wide and 2048 bits tall, which provides a 2 MB weight random access memory 124 that is 8192 bits wide and has 2048 rows; this weight random access memory 124 occupies approximately 2.4 square millimeters of die area. However, the present invention is not limited to this.
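The capacities quoted for the two memories follow directly from the cell arrangements given in the embodiments of Figures 16 and 17. The short check below is only an arithmetic sketch; the function name `ram_geometry` is illustrative.

```python
def ram_geometry(cells_across, cell_width_bits, cell_height_rows):
    """Return (row width in bits, number of rows, capacity in bytes)."""
    row_width_bits = cells_across * cell_width_bits
    capacity_bytes = row_width_bits * cell_height_rows // 8
    return row_width_bits, cell_height_rows, capacity_bytes


# Data RAM 122: 64 cells across, each 128 bits wide and 64 rows tall.
data_ram = ram_geometry(64, 128, 64)

# Weight RAM 124: 128 cells across, each 64 bits wide and 2048 rows tall.
weight_ram = ram_geometry(128, 64, 2048)
```

Both memories come out 8192 bits wide; the data memory holds 65,536 bytes (64 KB) and the weight memory holds 2,097,152 bytes (2 MB), matching the figures in the text.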

In a preferred embodiment, the port 1702 is coupled, in a multiplexed fashion, to the neural processing units 126 and to the buffer 1704. The neural processing units 126 and the buffer 1704 read and write the weight random access memory 124 through the port 1702. The buffer 1704 is also coupled to the media registers 118 of Figure 1, so that the media registers 118 read and write the weight random access memory 124 through the buffer 1704. The advantage of this arrangement is that, while the neural processing units 126 are reading or writing the weight random access memory 124, the media registers 118 can still write to or read from the buffer 1704 (although, if the neural processing units 126 are currently executing, they are preferably stalled so that they do not access the weight random access memory 124 while the buffer 1704 is accessing it). This can improve performance, particularly because the reads and writes of the weight random access memory 124 by the media registers 118 are much smaller than those by the neural processing units 126. For example, in one embodiment, the neural processing units 126 read or write 8192 bits (one row) at a time, whereas a media register 118 is only 256 bits wide and each MTNN instruction 1400 writes only two media registers 118, i.e., 512 bits. Thus, where the architectural program executes sixteen MTNN instructions 1400 to fill the buffer 1704, a conflict between the neural processing units 126 and the architectural program for access to the weight random access memory 124 occurs less than approximately six percent of the time. In another embodiment, the instruction translator 104 translates one MTNN instruction 1400 into two microinstructions 105, each of which writes a single media register 118 to the buffer 1704, which further reduces the frequency of conflicts between the neural processing units 126 and the architectural program when accessing the weight random access memory 124.
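One way to arrive at the "less than approximately six percent" figure is sketched below. The interpretation is an assumption, not a statement from the specification: it counts the sixteen buffer-filling writes plus one further operation that actually transfers the buffer to the weight random access memory 124, and treats only that last operation as a potential conflict.

```python
ROW_BITS = 8192          # one row of the weight RAM 124
MEDIA_REG_BITS = 256     # width of one media register 118


def conflict_fraction(bits_per_buffer_write):
    """Fraction of architectural-program accesses that touch the weight RAM itself.

    Assumption: filling the buffer never conflicts with the neural processing
    units; only the single transfer between buffer and weight RAM does.
    """
    buffer_writes = ROW_BITS // bits_per_buffer_write
    total_operations = buffer_writes + 1  # plus the buffer-to-RAM transfer
    return 1 / total_operations


# Each MTNN instruction 1400 writes two media registers (512 bits): 16 writes.
per_instruction = conflict_fraction(2 * MEDIA_REG_BITS)

# Alternate embodiment: each microinstruction writes one register (256 bits).
per_microinstruction = conflict_fraction(MEDIA_REG_BITS)
```

Under this reading, 1 access in 17 (about 5.9 percent) can conflict in the first embodiment, and 1 in 33 (about 3 percent) in the microinstruction embodiment.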

In an embodiment that includes the buffer 1704, writing to the weight random access memory 124 from the architectural program requires multiple MTNN instructions 1400. One or more MTNN instructions 1400 specify a function 1432 that writes to specified data chunks of the buffer 1704, followed by an MTNN instruction 1400 that specifies a function 1432 instructing the neural network unit 121 to write the contents of the buffer 1704 to a selected row of the weight random access memory 124. A single data chunk is twice the number of bits of a media register 118, and the chunks are naturally aligned within the buffer 1704. In one embodiment, each MTNN instruction 1400 that specifies a function 1432 to write specified chunks of the buffer 1704 includes a bitmask that has one bit for each chunk of the buffer 1704. The data from the two specified source registers 118 is written to every chunk of the buffer 1704 whose corresponding bit in the bitmask is set. This embodiment is useful when a row of the weight random access memory 124 contains repeated data values. For example, to zero out the buffer 1704 (and subsequently a row of the weight random access memory 124), the programmer can load the source registers with zero and set all bits of the bitmask. In addition, the bitmask allows the programmer to write only selected chunks of the buffer 1704 while the other chunks retain their previous data.
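The masked buffer write can be modeled in a few lines. This is a minimal sketch, assuming a 8192-bit buffer divided into sixteen 512-bit chunks and assuming that bit i of the bitmask corresponds to chunk i; the class and method names are hypothetical and the bit ordering is not specified in the text.

```python
CHUNK_BITS = 512                        # two 256-bit media registers per chunk
BUFFER_BITS = 8192                      # one row of the weight RAM 124
NUM_CHUNKS = BUFFER_BITS // CHUNK_BITS  # 16 chunks, hence a 16-bit mask


class WeightRamBufferModel:
    """Toy model of buffer 1704; each chunk is held as a Python integer."""

    def __init__(self):
        self.chunks = [0] * NUM_CHUNKS

    def masked_write(self, chunk_value, bitmask):
        """Write chunk_value to every chunk whose mask bit is set."""
        if not 0 <= chunk_value < (1 << CHUNK_BITS):
            raise ValueError("chunk value does not fit in 512 bits")
        for i in range(NUM_CHUNKS):
            if (bitmask >> i) & 1:      # assumed: bit i selects chunk i
                self.chunks[i] = chunk_value

    def commit(self, weight_ram, row):
        """Model the final MTNN that copies the buffer into a selected row."""
        weight_ram[row] = list(self.chunks)


buf = WeightRamBufferModel()
buf.masked_write(0xABCD, 0b0101)        # only chunks 0 and 2 are updated
snapshot = list(buf.chunks)
buf.masked_write(0, 0xFFFF)             # all mask bits set: zero the buffer
weight_ram = {}
buf.commit(weight_ram, 7)
```

The second call shows the zeroing case from the text: one write with all mask bits set clears the whole buffer before it is committed to a row.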

In an embodiment that includes the buffer 1704, reading from the weight random access memory 124 with the architectural program requires multiple MFNN instructions 1500. An initial MFNN instruction 1500 specifies a function 1532 that loads a specified row of the weight random access memory 124 into the buffer 1704, followed by one or more MFNN instructions 1500 that specify a function 1532 to read a specified data chunk of the buffer 1704 into the destination register. A single data chunk is the number of bits of a media register 118, and the chunks are naturally aligned within the buffer 1704. The technical features of the present invention also apply to other embodiments in which the weight random access memory 124 has multiple buffers 1704. Multiple buffers increase the number of accesses the architectural program can make while the neural processing units 126 are executing, which further reduces conflicts between the neural processing units 126 and the architectural program for access to the weight random access memory 124 and increases the likelihood that the buffer 1704 accesses can be performed during clock cycles in which the neural processing units 126 do not need to access the weight random access memory 124.
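The read sequence can be sketched in the same spirit. The model below treats a row as one large Python integer and assumes chunk 0 holds the least significant 256 bits; that ordering, and the function name, are assumptions made only for illustration.

```python
READ_CHUNK_BITS = 256                         # one media register 118
ROW_BITS = 8192                               # one row of the weight RAM 124
READ_CHUNKS = ROW_BITS // READ_CHUNK_BITS     # 32 chunks per row


def read_row_via_buffer(weight_ram, row):
    """Model one row read: a buffer load followed by 32 chunk reads.

    weight_ram maps a row number to an integer holding that row's 8192 bits.
    """
    buffer = weight_ram[row]                  # initial MFNN: load row into buffer
    mask = (1 << READ_CHUNK_BITS) - 1
    # Each following MFNN reads one 256-bit chunk into a destination register.
    return [(buffer >> (i * READ_CHUNK_BITS)) & mask for i in range(READ_CHUNKS)]


example_ram = {5: (1 << 8191) | 0x1234}
chunks = read_row_via_buffer(example_ram, 5)
```

Under these assumptions, reading a full row takes one buffer load plus 32 chunk reads, and only the buffer load touches the weight random access memory itself.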

Figure 16 describes a dual-ported data random access memory 122, but the present invention is not limited to this. The technical features of the present invention also apply to other embodiments in which the weight random access memory 124 is dual-ported as well. In addition, Figure 17 describes a buffer used with the weight random access memory 124, but the present invention is not limited to this. The technical features of the present invention also apply to embodiments in which the data random access memory 122 has a corresponding buffer similar to the buffer 1704.

Dynamically configurable neural processing unit

Figure 18 is a block diagram showing a dynamically configurable neural processing unit 126 of Figure 1. The neural processing unit 126 of Figure 18 is similar to the neural processing unit 126 of Figure 2. However, the neural processing unit 126 of Figure 18 is dynamically configurable to operate in one of two different configurations. In the first configuration, the neural processing unit 126 of Figure 18 operates similarly to the neural processing unit 126 of Figure 2. That is, in the first configuration, referred to herein as the "wide" or "single" configuration, the arithmetic logic unit 204 of the neural processing unit 126 performs operations on a single wide data word and a single wide weight word (e.g., 16 bits) to produce a single wide result. In contrast, in the second configuration, referred to herein as the "narrow" or "dual" configuration, the neural processing unit 126 performs operations on two narrow data words and two respective narrow weight words (e.g., 8 bits) to produce two respective narrow results. In one embodiment, the configuration (wide or narrow) of the neural processing unit 126 is set by an initialize neural processing unit instruction (e.g., the instruction at address 0 in Figure 20). Alternatively, the configuration can be set by an MTNN instruction whose function 1432 specifies setting the neural processing unit 126 to the configuration (wide or narrow). In a preferred embodiment, configuration registers are populated by the program memory 129 instruction or the MTNN instruction that determines the configuration (wide or narrow). For example, the outputs of the configuration registers are provided to the arithmetic logic unit 204, to the activation function unit 212, and to the logic that generates the multiplexing register control signal 213. Generally speaking, the elements of the neural processing unit 126 of Figure 18 perform functions similar to their like-numbered elements of Figure 2, to which reference should be made for an understanding of the embodiment of Figure 18. The embodiment of Figure 18, including its differences from Figure 2, is described below.

The neural processing unit 126 of Figure 18 includes two registers 205A and 205B, two three-input multiplexing registers 208A and 208B, an arithmetic logic unit 204, two accumulators 202A and 202B, and two activation function units 212A and 212B. Each of the registers 205A/205B is half the width (e.g., 8 bits) of the register 205 of Figure 2. Each of the registers 205A/205B receives a respective narrow weight word 206A/206B (e.g., 8 bits) from the weight random access memory 124 and provides its output 203A/203B on a subsequent clock cycle to the operand selection logic 1898 of the arithmetic logic unit 204. When the neural processing unit 126 is in the wide configuration, the registers 205A/205B effectively operate together to receive a wide weight word 206A/206B (e.g., 16 bits) from the weight random access memory 124, similar to the register 205 of the embodiment of Figure 2; when the neural processing unit 126 is in the narrow configuration, the registers 205A/205B effectively operate independently, each receiving a narrow weight word 206A/206B (e.g., 8 bits) from the weight random access memory 124, so that the neural processing unit 126 is effectively two separate narrow neural processing units. Nevertheless, the same output bits of the weight random access memory 124 are coupled to and provided to the registers 205A/205B regardless of the configuration of the neural processing unit 126. For example, the register 205A of neural processing unit 0 receives byte 0, the register 205B of neural processing unit 0 receives byte 1, the register 205A of neural processing unit 1 receives byte 2, the register 205B of neural processing unit 1 receives byte 3, and so on, until the register 205B of neural processing unit 511 receives byte 1023.
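The byte assignment in the example above follows a simple rule, sketched here for the 512-unit case used in the text. The function name is illustrative only.

```python
NUM_NPUS = 512   # neural processing units 0 through 511, as in the example


def weight_byte_index(npu, half):
    """Return which byte of the weight RAM row feeds register 205A or 205B.

    half is "A" for register 205A or "B" for register 205B.
    """
    if not 0 <= npu < NUM_NPUS:
        raise ValueError("neural processing unit index out of range")
    if half not in ("A", "B"):
        raise ValueError("half must be 'A' or 'B'")
    return 2 * npu + (0 if half == "A" else 1)
```

The same rule describes the data bytes delivered to the multiplexing registers 208A/208B, since the text gives the identical byte sequence for them.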

Each of the multiplexing registers 208A/208B is half the width (e.g., 8 bits) of the multiplexing register 208 of Figure 2. The multiplexing register 208A selects one of its inputs 207A, 211A, and 1811A to store in its register and then provides it on its output 209A on a subsequent clock cycle, and the multiplexing register 208B selects one of its inputs 207B, 211B, and 1811B to store in its register and then provides it on its output 209B on a subsequent clock cycle; both outputs are provided to the operand selection logic 1898. The input 207A receives a narrow data word (e.g., 8 bits) from the data random access memory 122, and the input 207B receives a narrow data word from the data random access memory 122. When the neural processing unit 126 is in the wide configuration, the multiplexing registers 208A/208B effectively operate together to receive a wide data word 207A/207B (e.g., 16 bits) from the data random access memory 122, similar to the multiplexing register 208 of the embodiment of Figure 2; when the neural processing unit 126 is in the narrow configuration, the multiplexing registers 208A/208B effectively operate independently, each receiving a narrow data word 207A/207B (e.g., 8 bits) from the data random access memory 122, so that the neural processing unit 126 is effectively two separate narrow neural processing units. Nevertheless, the same output bits of the data random access memory 122 are coupled to and provided to the multiplexing registers 208A/208B regardless of the configuration of the neural processing unit 126. For example, the multiplexing register 208A of neural processing unit 0 receives byte 0, the multiplexing register 208B of neural processing unit 0 receives byte 1, the multiplexing register 208A of neural processing unit 1 receives byte 2, the multiplexing register 208B of neural processing unit 1 receives byte 3, and so on, until the multiplexing register 208B of neural processing unit 511 receives byte 1023.

The input 211A receives the output 209A of the multiplexing register 208A of the adjacent neural processing unit 126, and the input 211B receives the output 209B of the multiplexing register 208B of the adjacent neural processing unit 126. The input 1811A receives the output 209B of the multiplexing register 208B of the adjacent neural processing unit 126, and the input 1811B receives the output 209A of the multiplexing register 208A of the instant neural processing unit 126, as detailed below. The neural processing unit 126 shown in Figure 18 is one of the N neural processing units 126 of Figure 1 and is denoted neural processing unit J. That is, neural processing unit J is a representative instance of the N neural processing units. In a preferred embodiment, the input 211A of the multiplexing register 208A of neural processing unit J receives the output 209A of the multiplexing register 208A of neural processing unit 126 instance J-1, the input 1811A of the multiplexing register 208A of neural processing unit J receives the output 209B of the multiplexing register 208B of neural processing unit 126 instance J-1, and the output 209A of the multiplexing register 208A of neural processing unit J is provided both to the input 211A of the multiplexing register 208A of neural processing unit 126 instance J+1 and to the input 1811B of the multiplexing register 208B of neural processing unit 126 instance J. The input 211B of the multiplexing register 208B of neural processing unit J receives the output 209B of the multiplexing register 208B of neural processing unit 126 instance J-1, the input 1811B of the multiplexing register 208B of neural processing unit J receives the output 209A of the multiplexing register 208A of neural processing unit 126 instance J, and the output 209B of the multiplexing register 208B of neural processing unit J is provided both to the input 1811A of the multiplexing register 208A of neural processing unit 126 instance J+1 and to the input 211B of the multiplexing register 208B of neural processing unit 126 instance J+1.

The control input 213 controls each of the multiplexing registers 208A/208B to select one of its three inputs to store in its respective register and subsequently provide on its respective output 209A/209B. When the neural processing unit 126 is instructed to load a row from the data random access memory 122 (e.g., by the multiply-accumulate instruction at address 1 in Figure 20, described below), regardless of whether the neural processing unit 126 is in the wide or the narrow configuration, the control input 213 controls each of the multiplexing registers 208A/208B to select its respective narrow data word 207A/207B (e.g., 8 bits) from the corresponding narrow word of the selected row of the data random access memory 122.

When the neural processing unit 126 is instructed to rotate the previously received data row values (e.g., by the multiply-accumulate rotate instruction at address 2 in Figure 20, described below), and the neural processing unit 126 is in the narrow configuration, the control input 213 controls each of the multiplexing registers 208A/208B to select its respective input 1811A/1811B. In this case, the multiplexing registers 208A/208B effectively operate independently, so that the neural processing unit 126 is effectively two separate narrow neural processing units. In this manner, the multiplexing registers 208A and 208B of the N neural processing units 126 collectively operate as a rotator of 2N narrow words, as described in more detail below with respect to Figure 19.

When the neural processing unit 126 is instructed to rotate the previously received data row values and the neural processing unit 126 is in the wide configuration, the control input 213 controls each of the multiplexing registers 208A/208B to select its respective input 211A/211B. In this case, the multiplexing registers 208A/208B operate together, effectively as if the neural processing unit 126 were a single wide neural processing unit 126. In this manner, the multiplexing registers 208A and 208B of the N neural processing units 126 collectively operate as a rotator of N wide words, similar to the manner described with respect to Figure 3.
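The two rotation modes can be checked with a small simulation of the multiplexing register wiring. This is a sketch under stated assumptions: the neighbor of neural processing unit 0 is taken to be unit N-1 (the wrap-around that makes the chain a rotator), and the function names are illustrative.

```python
def rotate_step(regs_a, regs_b, narrow):
    """Advance all multiplexing registers 208A/208B by one rotate step.

    regs_a[j] and regs_b[j] model outputs 209A and 209B of unit J.
    """
    n = len(regs_a)
    new_a, new_b = [None] * n, [None] * n
    for j in range(n):
        prev = (j - 1) % n              # assumed wrap-around from unit 0 to N-1
        if narrow:
            new_a[j] = regs_b[prev]     # input 1811A: output 209B of unit J-1
            new_b[j] = regs_a[j]        # input 1811B: output 209A of unit J
        else:
            new_a[j] = regs_a[prev]     # input 211A: output 209A of unit J-1
            new_b[j] = regs_b[prev]     # input 211B: output 209B of unit J-1
    return new_a, new_b


def interleave(regs_a, regs_b):
    """List the narrow words in row order: A0, B0, A1, B1, ..."""
    return [w for pair in zip(regs_a, regs_b) for w in pair]


a, b = [0, 2, 4, 6], [1, 3, 5, 7]       # four units holding narrow words 0..7
narrow_result = interleave(*rotate_step(a, b, narrow=True))
wide_result = interleave(*rotate_step(a, b, narrow=False))
```

With four units, the narrow step moves every narrow word by one position across all eight slots, while the wide step moves each A/B pair together by one unit, i.e., by one wide word.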

The arithmetic logic unit 204 includes the operand selection logic 1898, a wide multiplier 242A, a narrow multiplier 242B, a wide two-input multiplexer 1896A, a narrow two-input multiplexer 1896B, a wide adder 244A, and a narrow adder 244B. In effect, the arithmetic logic unit 204 can be understood as comprising the operand selection logic 1898, a wide arithmetic logic unit 204A (comprising the wide multiplier 242A, the wide multiplexer 1896A, and the wide adder 244A), and a narrow arithmetic logic unit 204B (comprising the narrow multiplier 242B, the narrow multiplexer 1896B, and the narrow adder 244B). In a preferred embodiment, the wide multiplier 242A multiplies two wide words and is similar to the multiplier 242 of Figure 2, e.g., a 16-bit by 16-bit multiplier. The narrow multiplier 242B multiplies two narrow words, e.g., an 8-bit by 8-bit multiplier that produces a 16-bit result. When the neural processing unit 126 is in the narrow configuration, the wide multiplier 242A is, with the help of the operand selection logic 1898, put to use as a narrow multiplier that multiplies two narrow words, so that the neural processing unit 126 effectively functions as two narrow neural processing units. In a preferred embodiment, the wide adder 244A adds the output of the wide multiplexer 1896A and the output 217A of the wide accumulator 202A to generate a sum 215A for the wide accumulator 202A, and operates similarly to the adder 244 of Figure 2. The narrow adder 244B adds the output of the narrow multiplexer 1896B and the output 217B of the narrow accumulator 202B to generate a sum 215B for the narrow accumulator 202B. In one embodiment, the narrow accumulator 202B is 28 bits wide to avoid loss of precision when accumulating up to 1024 16-bit products. When the neural processing unit 126 is in the wide configuration, the narrow multiplier 242B, the narrow accumulator 202B, and the narrow activation function unit 212B are preferably inactive to reduce power consumption.
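The 28-bit accumulator width can be sanity-checked with a short calculation. The sketch below is conservative: it treats each product as an arbitrary signed 16-bit value, which is an assumption made for the bound rather than a statement about the actual product range.

```python
def bits_needed_signed(max_magnitude):
    """Two's-complement width needed to hold any value in [-m, +m]."""
    return max_magnitude.bit_length() + 1


MAX_PRODUCT_MAGNITUDE = 1 << 15   # largest magnitude of a signed 16-bit value
NUM_PRODUCTS = 1024               # accumulation length quoted in the text

worst_case_sum = NUM_PRODUCTS * MAX_PRODUCT_MAGNITUDE   # 2**25
required_bits = bits_needed_signed(worst_case_sum)
```

Under this bound, 27 bits are enough for the worst-case sum, so a 28-bit narrow accumulator holds 1024 such products without overflow.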

The operand selection logic 1898 selects operands from 209A, 209B, 203A, and 203B to provide to the other elements of the arithmetic logic unit 204, as described in more detail below. In a preferred embodiment, the operand selection logic 1898 also performs other functions, such as sign extension of signed data words and weight words. For example, if the neural processing unit 126 is in the narrow configuration, the operand selection logic 1898 sign-extends the narrow data word and weight word to the width of a wide word before providing them to the wide multiplier 242A. Similarly, if the arithmetic logic unit 204 is instructed to pass through a narrow data/weight word (bypassing the wide multiplier 242A via the wide multiplexer 1896A), the operand selection logic 1898 sign-extends the narrow data word or weight word to the width of a wide word before providing it to the wide adder 244A. In a preferred embodiment, logic that performs this sign extension function is also present in the arithmetic logic unit 204 of the neural processing unit 126 of Figure 2.

The wide multiplexer 1896A receives the output of the wide multiplier 242A and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the wide adder 244A. The narrow multiplexer 1896B receives the output of the narrow multiplier 242B and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the narrow adder 244B.

The operand selection logic 1898 provides operands according to the configuration of the neural processing unit 126 and the arithmetic and/or logical operation to be performed by the arithmetic logic unit 204, which is determined by the function specified by the instruction that the neural processing unit 126 is executing. For example, if the instruction directs the arithmetic logic unit 204 to perform a multiply-accumulate operation and the neural processing unit 126 is in the wide configuration, the operand selection logic 1898 provides a wide word formed by concatenating the outputs 209A and 209B to one input of the wide multiplier 242A and a wide word formed by concatenating the outputs 203A and 203B to the other input, and the narrow multiplier 242B is inactive. The neural processing unit 126 then operates as a single wide neural processing unit 126 similar to the neural processing unit 126 of Figure 2. However, if the instruction directs the arithmetic logic unit 204 to perform a multiply-accumulate operation and the neural processing unit 126 is in the narrow configuration, the operand selection logic 1898 provides an extended, or widened, version of the narrow data word 209A to one input of the wide multiplier 242A and an extended version of the narrow weight word 203A to the other input; in addition, the operand selection logic 1898 provides the narrow data word 209B to one input of the narrow multiplier 242B and the narrow weight word 203B to the other input. To extend a narrow word, the operand selection logic 1898 sign-extends the word if it is signed, and pads the word with zero-valued upper bits if it is unsigned.
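By way of illustration only, the extension of a narrow word performed by the operand selection logic 1898 might be modeled as follows. The function name `extend_narrow_word`, its parameters, and the 8-bit and 16-bit widths are assumptions made for this sketch (the widths are the examples the description gives elsewhere), not part of the described hardware.

```python
def extend_narrow_word(word, signed, narrow_bits=8, wide_bits=16):
    """Widen a narrow bit pattern to a wide bit pattern (hypothetical helper).

    Signed words are sign-extended; unsigned words get zero-valued upper bits.
    """
    narrow_mask = (1 << narrow_bits) - 1
    word &= narrow_mask                      # keep only the narrow bits
    sign_bit = 1 << (narrow_bits - 1)
    if signed and (word & sign_bit):
        # Sign extension: copy the sign bit into every added upper bit.
        upper_bits = ((1 << wide_bits) - 1) & ~narrow_mask
        return word | upper_bits
    # Zero extension (also correct for non-negative signed words).
    return word


print(hex(extend_narrow_word(0x80, signed=True)))
print(hex(extend_narrow_word(0x80, signed=False)))
```

The same 8-bit pattern widens to different 16-bit patterns depending on whether it is treated as signed, which is why the operand selection logic needs to know the signedness of the narrow word.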

In another example, if the neural processing unit 126 is in the wide configuration and the instruction directs the arithmetic logic unit 204 to accumulate a weight word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides the concatenation of the outputs 203A and 203B to the wide multiplexer 1896A for provision to the wide adder 244A. However, if the neural processing unit 126 is in the narrow configuration and the instruction directs the arithmetic logic unit 204 to accumulate a weight word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides an extended version of the output 203A to the wide multiplexer 1896A for provision to the wide adder 244A; in addition, the narrow multiplier 242B is bypassed and the operand selection logic 1898 provides an extended version of the output 203B to the narrow multiplexer 1896B for provision to the narrow adder 244B.

In another example, if the neural processing unit 126 is in the wide configuration and the instruction directs the arithmetic logic unit 204 to accumulate a data word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides the concatenation of the outputs 209A and 209B to the wide multiplexer 1896A for provision to the wide adder 244A. However, if the neural processing unit 126 is in the narrow configuration and the instruction directs the arithmetic logic unit 204 to accumulate a data word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides an extended version of the output 209A to the wide multiplexer 1896A for provision to the wide adder 244A; in addition, the narrow multiplier 242B is bypassed and the operand selection logic 1898 provides an extended version of the output 209B to the narrow multiplexer 1896B for provision to the narrow adder 244B. Accumulating weight or data words is useful for averaging operations, which are used in the pooling layers of some artificial neural network applications, such as image processing.

Preferably, the neural processing unit 126 also includes a second wide multiplexer (not shown) that bypasses the wide adder 244A, so that a wide data/weight word in the wide configuration, or an extended narrow data/weight word in the narrow configuration, can be loaded into the wide accumulator 202A, and a second narrow multiplexer (not shown) that bypasses the narrow adder 244B, so that a narrow data/weight word in the narrow configuration can be loaded into the narrow accumulator 202B. Preferably, the arithmetic logic unit 204 also includes wide and narrow comparator/multiplexer combinations (not shown) that receive the respective accumulator value 217A/217B and the respective multiplexer 1896A/1896B output in order to select the maximum of the accumulator value 217A/217B and a data/weight word 209A/209B/203A/203B. This operation is used in the pooling layers of some artificial neural network applications and is described in more detail below, for example with respect to Figures 27 and 28. In addition, the operand selection logic 1898 can provide a zero-valued operand (for adding zero or for clearing the accumulator) and a one-valued operand (for multiplying by one).
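As a rough software analogy, and not a description of the actual circuitry, the accumulate and maximum operations described above can be sketched as two pooling helpers. The names `pool_average` and `pool_max` are hypothetical, and the final division in `pool_average` is an assumption: the description says only that accumulation is useful for averaging and does not say where the division is performed.

```python
def pool_average(words):
    """Average pooling: accumulate the words with the multiplier bypassed."""
    acc = 0                      # zero operand clears the accumulator
    for w in words:
        acc += w                 # word goes straight to the adder
    return acc / len(words)      # assumed post-processing step


def pool_max(words):
    """Max pooling: keep the larger of the accumulator and each new word."""
    acc = words[0]               # adder bypassed: first word loaded directly
    for w in words[1:]:
        acc = max(acc, w)        # comparator/multiplexer selects the maximum
    return acc


window = [3, 9, 2, 6]
print(pool_average(window), pool_max(window))
```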

The narrow activation function unit 212B receives the output 217B of the narrow accumulator 202B and performs an activation function on it to generate a narrow result 133B, and the wide activation function unit 212A receives the output 217A of the wide accumulator 202A and performs an activation function on it to generate a wide result 133A. When the neural processing unit 126 is in the narrow configuration, the wide activation function unit 212A interprets the output 217A of the accumulator 202A accordingly and performs an activation function on it to generate a narrow result, e.g., 8 bits, as described in more detail below, for example with respect to Figures 29A through 30.

As described above, a single neural processing unit 126 in the narrow configuration effectively operates as two narrow neural processing units and therefore provides, for smaller words, up to roughly twice the throughput of the wide configuration. For example, assume a neural network layer has 1024 neurons, each receiving 1024 narrow inputs from the previous layer (and having narrow weight words), for a total of one million connections. Compared with the wide configuration, a neural network unit 121 having 512 neural processing units 126 in the narrow configuration (equivalent to 1024 narrow neural processing units) can process four times the number of connections (one million connections versus 256K connections) in approximately twice the time (about 1026 clock cycles versus 514 clock cycles), albeit for narrow words rather than wide words.
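The figures quoted above can be checked with a short calculation. The sketch below assumes a fully connected layer in which the number of inputs per neuron equals the number of neurons, and it models the last clock cycle as n + 2 (one initialization cycle, n multiply-accumulate cycles, one activation cycle, and one write-back cycle, numbered from 0). The function name `layer_cost` and this cycle model are assumptions chosen to reproduce the quoted numbers.

```python
def layer_cost(neurons):
    """Return (connections, last clock cycle) for a square fully connected layer."""
    connections = neurons * neurons   # every neuron connects to every input
    last_clock = neurons + 2          # assumed model: init, n MACs, activation, write
    return connections, last_clock


wide_connections, wide_clock = layer_cost(512)       # 512 wide neurons
narrow_connections, narrow_clock = layer_cost(1024)  # 1024 narrow neurons

print(wide_connections, wide_clock)
print(narrow_connections, narrow_clock)
print(narrow_connections / wide_connections, narrow_clock / wide_clock)
```

Under these assumptions the narrow configuration handles four times as many connections while the clock count roughly doubles, so the throughput in connections per clock cycle is about twice that of the wide configuration.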

In one embodiment, the dynamically configurable neural processing unit 126 of Figure 18 includes three-input multiplexed registers, similar to the multiplexed registers 208A and 208B, in place of the registers 205A and 205B, to form a rotator for a row of weight words received from the weight random access memory 124, in a manner similar to that described with respect to the embodiment of Figure 7 but applied to the dynamic configuration of Figure 18.

Figure 19 is a block diagram illustrating, according to the embodiment of Figure 18, the arrangement of the 2N multiplexed registers 208A/208B of the N neural processing units 126 of the neural network unit 121 of Figure 1 operating as a rotator for a row of data words 207 received from the data random access memory 122 of Figure 1. In the embodiment of Figure 19, N is 512, so the neural network unit 121 has 1024 multiplexed registers 208A/208B, denoted 0 through 511, corresponding to the 512 neural processing units 126 and, effectively, 1024 narrow neural processing units. The two narrow neural processing units within a neural processing unit 126 are denoted A and B, and each multiplexed register 208 is labeled with its corresponding narrow neural processing unit. Specifically, the multiplexed register 208A of the neural processing unit 126 denoted 0 is labeled 0-A, the multiplexed register 208B of the neural processing unit 126 denoted 0 is labeled 0-B, the multiplexed register 208A of the neural processing unit 126 denoted 1 is labeled 1-A, the multiplexed register 208B of the neural processing unit 126 denoted 1 is labeled 1-B, the multiplexed register 208A of the neural processing unit 126 denoted 511 is labeled 511-A, and the multiplexed register 208B of the neural processing unit 126 denoted 511 is labeled 511-B. These labels also correspond to the narrow neural processing units of Figure 21, described below.

Each multiplexed register 208A receives its corresponding narrow data word 207A from one of the D rows of the data random access memory 122, and each multiplexed register 208B receives its corresponding narrow data word 207B from one of the D rows of the data random access memory 122. That is, multiplexed register 0-A receives narrow data word 0 of the data random access memory 122 row, multiplexed register 0-B receives narrow data word 1, multiplexed register 1-A receives narrow data word 2, multiplexed register 1-B receives narrow data word 3, and so on, until multiplexed register 511-A receives narrow data word 1022 and multiplexed register 511-B receives narrow data word 1023. In addition, multiplexed register 1-A receives on its input 211A the output 209A of multiplexed register 0-A, multiplexed register 1-B receives on its input 211B the output 209B of multiplexed register 0-B, and so on, until multiplexed register 511-A receives on its input 211A the output 209A of multiplexed register 510-A and multiplexed register 511-B receives on its input 211B the output 209B of multiplexed register 510-B; multiplexed register 0-A receives on its input 211A the output 209A of multiplexed register 511-A, and multiplexed register 0-B receives on its input 211B the output 209B of multiplexed register 511-B. Finally, multiplexed register 1-A receives on its input 1811A the output 209B of multiplexed register 0-B, multiplexed register 1-B receives on its input 1811B the output 209A of multiplexed register 1-A, and so on, until multiplexed register 511-A receives on its input 1811A the output 209B of multiplexed register 510-B and multiplexed register 511-B receives on its input 1811B the output 209A of multiplexed register 511-A; multiplexed register 0-A receives on its input 1811A the output 209B of multiplexed register 511-B, and multiplexed register 0-B receives on its input 1811B the output 209A of multiplexed register 0-A. Each multiplexed register 208A/208B receives the control input 213, which controls whether it selects the data word 207A/207B, the rotated input 211A/211B, or the rotated input 1811A/1811B. In one mode of operation, during a first clock cycle the control input 213 controls each multiplexed register 208A/208B to select the data word 207A/207B for storage in the register and subsequent provision to the arithmetic logic unit 204; during subsequent clock cycles (e.g., the M-1 clock cycles described above), the control input 213 controls each multiplexed register 208A/208B to select the rotated input 1811A/1811B for storage in the register and subsequent provision to the arithmetic logic unit 204, as described in more detail below.
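The wiring described above amounts to two rotation distances over the 1024 narrow words: the inputs 1811A/1811B rotate by one narrow word, and the inputs 211A/211B rotate by two narrow words. A minimal sketch follows, in which the narrow words are held in a flat list ordered 0-A, 0-B, 1-A, 1-B, ..., 511-B. The function names `narrow_index` and `rotate` are hypothetical.

```python
def narrow_index(npu, half):
    """Position of multiplexed register <npu>-<half> in the flat list of narrow words."""
    return 2 * npu + (0 if half == "A" else 1)


def rotate(words, step):
    """Each register takes the word held `step` narrow positions before it."""
    n = len(words)
    return [words[(i - step) % n] for i in range(n)]


row = list(range(1024))        # narrow data words 0..1023 loaded from the row

by_one = rotate(row, 1)        # models selecting inputs 1811A/1811B
by_two = rotate(row, 2)        # models selecting inputs 211A/211B

# 0-A receives the word from 511-B; 1-A receives the word from 0-B.
print(by_one[narrow_index(0, "A")], by_one[narrow_index(1, "A")])
# 0-A receives the word from 511-A; 1-A receives the word from 0-A.
print(by_two[narrow_index(0, "A")], by_two[narrow_index(1, "A")])
```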

Figure 20 is a table illustrating a program stored in the program memory 129 of, and executed by, the neural network unit 121 of Figure 1 having neural processing units 126 according to the embodiment of Figure 18. The example program of Figure 20 is similar to the program of Figure 4; the differences are described here. The initialize neural processing unit instruction at address 0 specifies that the neural processing units 126 are to enter the narrow configuration. In addition, as shown, the multiply-accumulate rotate instruction at address 2 specifies a count of 1023 and requires 1023 clock cycles. This is because the example of Figure 20 assumes a layer of effectively 1024 narrow (e.g., 8-bit) neurons (i.e., narrow neural processing units), each having 1024 connection inputs from the 1024 neurons of the previous layer, for a total of 1024K connections. Each neuron receives an 8-bit data value from each connection input and multiplies that value by an appropriate 8-bit weight value.

Figure 21 is a timing diagram illustrating the execution of the program of Figure 20 by a neural network unit 121 having neural processing units 126 of Figure 18 operating in the narrow configuration. The timing diagram of Figure 21 is similar to the timing diagram of Figure 5; the differences are described here.

In the timing diagram of Figure 21, the neural processing units 126 are in the narrow configuration because the initialize neural processing unit instruction at address 0 initializes them to that configuration. Consequently, the 512 neural processing units 126 effectively operate as 1024 narrow neural processing units (or neurons), which are designated in the columns as neural processing unit 0-A and neural processing unit 0-B (the two narrow neural processing units of the neural processing unit 126 denoted 0), neural processing unit 1-A and neural processing unit 1-B (the two narrow neural processing units of the neural processing unit 126 denoted 1), and so on through neural processing unit 511-A and neural processing unit 511-B (the two narrow neural processing units of the neural processing unit 126 denoted 511). For simplicity, only the operations of narrow neural processing units 0-A, 0-B and 511-B are shown. Because the multiply-accumulate rotate instruction at address 2 specifies a count of 1023 and therefore requires 1023 clock cycles, the rows of the timing diagram of Figure 21 extend through clock cycle 1026.

At clock cycle 0, each of the 1024 narrow neural processing units performs the initialization instruction of Figure 4, shown in Figure 5 as the assignment of a zero value to the accumulator 202.

At clock cycle 1, each of the 1024 narrow neural processing units performs the multiply-accumulate instruction at address 1 of Figure 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value (i.e., zero) the product of narrow word 0 of row 17 of the data random access memory 122 and narrow word 0 of row 0 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value (i.e., zero) the product of narrow word 1 of row 17 of the data random access memory 122 and narrow word 1 of row 0 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value (i.e., zero) the product of narrow word 1023 of row 17 of the data random access memory 122 and narrow word 1023 of row 0 of the weight random access memory 124.

At clock cycle 2, each of the 1024 narrow neural processing units performs the first iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the multiplexed register 208B output 209B of narrow neural processing unit 511-B (i.e., narrow data word 1023 received from the data random access memory 122) and narrow word 0 of row 1 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 0-A (i.e., narrow data word 0 received from the data random access memory 122) and narrow word 1 of row 1 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 511-A (i.e., narrow data word 1022 received from the data random access memory 122) and narrow word 1023 of row 1 of the weight random access memory 124.

At clock cycle 3, each of the 1024 narrow neural processing units performs the second iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the multiplexed register 208B output 209B of narrow neural processing unit 511-B (i.e., narrow data word 1022 received from the data random access memory 122) and narrow word 0 of row 2 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 0-A (i.e., narrow data word 1023 received from the data random access memory 122) and narrow word 1 of row 2 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 511-A (i.e., narrow data word 1021 received from the data random access memory 122) and narrow word 1023 of row 2 of the weight random access memory 124. As shown in Figure 21, this continues for each of the following 1021 clock cycles, through clock cycle 1024, described next.

At clock cycle 1024, each of the 1024 narrow neural processing units performs the 1023rd iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the multiplexed register 208B output 209B of narrow neural processing unit 511-B (i.e., narrow data word 1 received from the data random access memory 122) and narrow word 0 of row 1023 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 0-A (i.e., narrow data word 2 received from the data random access memory 122) and narrow word 1 of row 1023 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 511-A (i.e., narrow data word 0 received from the data random access memory 122) and narrow word 1023 of row 1023 of the weight random access memory 124.

At clock cycle 1025, the activation function unit 212A/212B of each of the 1024 narrow neural processing units performs the activation function instruction at address 3 of Figure 20. Finally, at clock cycle 1026, each of the 1024 narrow neural processing units performs the write activation function unit instruction at address 4 of Figure 20 by writing back its narrow result 133A/133B to its corresponding narrow word of row 16 of the data random access memory 122. That is, the narrow result 133A of neural processing unit 0-A is written to narrow word 0 of the data random access memory 122, the narrow result 133B of neural processing unit 0-B is written to narrow word 1 of the data random access memory 122, and so on, until the narrow result 133B of neural processing unit 511-B is written to narrow word 1023 of the data random access memory 122. Figure 22 shows, in block diagram form, the operation described above with respect to Figure 21.
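The clock-by-clock sequence above can be summarized in a small behavioral model. This is a sketch under stated assumptions, not the hardware: `run_narrow_layer` and `seen_word` are hypothetical names, the activation function defaults to the identity because the description does not fix one here, and the weight layout (row k, narrow word j feeds narrow neural processing unit j on the k-th multiply-accumulate) is read off the clock-cycle descriptions.

```python
def seen_word(npu, rotations, n):
    """Index of the data word that narrow unit `npu` holds after `rotations` rotations."""
    return (npu - rotations) % n


def run_narrow_layer(data_row, weight_rows, activation=lambda x: x):
    """Model initialize, multiply-accumulate, n-1 rotate iterations, then activation."""
    n = len(data_row)
    acc = [0] * n                                   # clock 0: clear accumulators
    regs = list(data_row)                           # clock 1: load the data row
    for j in range(n):
        acc[j] += regs[j] * weight_rows[0][j]       # clock 1: first multiply-accumulate
    for k in range(1, n):                           # count = n - 1 rotate iterations
        regs = [regs[(j - 1) % n] for j in range(n)]  # rotate by one narrow word
        for j in range(n):
            acc[j] += regs[j] * weight_rows[k][j]
    return [activation(a) for a in acc]             # activation, then write-back


data = [1, 2, 3, 4]
ones = [[1] * 4 for _ in range(4)]
print(run_narrow_layer(data, ones))
print(seen_word(0, 1023, 1024), seen_word(1, 1023, 1024), seen_word(1023, 1023, 1024))
```

With n = 1024 the loop performs 1023 rotate iterations, matching the count in the program of Figure 20, and `seen_word` reproduces the data word indices quoted for clock cycles 2, 3 and 1024.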

Figure 22 is a block diagram illustrating the neural network unit 121 of Figure 1 having the neural processing units 126 of Figure 18 to execute the program of Figure 20. The neural network unit 121 includes the 512 neural processing units 126, i.e., 1024 narrow neural processing units, the data random access memory 122, which receives its address input 123, and the weight random access memory 124, which receives its address input 125. Although not shown, at clock cycle 0 the 1024 narrow neural processing units perform the initialization instruction of Figure 20. As shown, at clock cycle 1, the 1024 8-bit data words of row 17 are read out of the data random access memory 122 and provided to the 1024 narrow neural processing units. At clock cycles 1 through 1024, the 1024 8-bit weight words of rows 0 through 1023, respectively, are read out of the weight random access memory 124 and provided to the 1024 narrow neural processing units. Although not shown, at clock cycle 1 the 1024 narrow neural processing units perform their respective multiply-accumulate operations on the loaded data words and weight words. At clock cycles 2 through 1024, the multiplexed registers 208A/208B of the 1024 narrow neural processing units operate as a rotator of 1024 8-bit words, rotating the previously loaded data words of row 17 of the data random access memory 122 to the adjacent narrow neural processing unit, and the narrow neural processing units perform the multiply-accumulate operation on the respective rotated data word and the respective narrow weight word loaded from the weight random access memory 124. Although not shown, at clock cycle 1025 the 1024 narrow activation function units 212A/212B perform the activation instruction. At clock cycle 1026, the 1024 narrow neural processing units write back their respective 1024 8-bit results 133A/133B to row 16 of the data random access memory 122.

As may be observed, compared with the embodiment of Figure 2, the embodiment of Figure 18 gives the programmer the flexibility to perform computations using either wide data and weight words (e.g., 16 bits) or narrow data and weight words (e.g., 8 bits), according to the accuracy the particular application requires. From one perspective, for narrow data applications the embodiment of Figure 18 provides twice the throughput of the embodiment of Figure 2, at the cost of the additional narrow elements (e.g., multiplexed register 208B, register 205B, narrow arithmetic logic unit 204B, narrow accumulator 202B, narrow activation function unit 212B), which increase the area of the neural processing unit 126 by approximately 50%.

Tri-mode neural processing unit

Figure 23 is a block diagram illustrating another embodiment of a dynamically configurable neural processing unit 126 of Figure 1. The neural processing unit 126 of Figure 23 is configurable not only in the wide and narrow configurations but also in a third configuration, referred to here as the "funnel" configuration. The neural processing unit 126 of Figure 23 is similar to the neural processing unit 126 of Figure 18. However, the wide adder 244A of Figure 18 is replaced in the neural processing unit 126 of Figure 23 by a three-input wide adder 2344A that receives a third addend 2399, which is an extended version of the output of the narrow multiplexer 1896B. A program executed by a neural network unit having the neural processing units of Figure 23 is similar to the program of Figure 20. However, the initialize neural processing unit instruction at address 0 initializes the neural processing units 126 to the funnel configuration rather than the narrow configuration, and the count of the multiply-accumulate rotate instruction at address 2 is 511 rather than 1023.

In the funnel configuration, the neural processing unit 126 operates much as it does in the narrow configuration. When executing a multiply-accumulate instruction such as the one at address 1 of FIG. 20, the neural processing unit 126 receives two narrow data words 207A/207B and two narrow weight words 206A/206B; the wide multiplier 242A multiplies data word 209A by weight word 203A to produce product 246A, which wide multiplexer 1896A selects; and the narrow multiplier 242B multiplies data word 209B by weight word 203B to produce product 246B, which narrow multiplexer 1896B selects. However, the wide adder 2344A adds both product 246A (selected by wide multiplexer 1896A) and product 246B/2399 (selected by narrow multiplexer 1896B) to the wide accumulator 202A output 217A, while the narrow adder 244B and the narrow accumulator 202B are inactive.

Furthermore, when executing a multiply-accumulate rotate instruction such as the one at address 2 of FIG. 20 in the funnel configuration, the control input 213 causes the multiplexed registers 208A/208B to rotate by two narrow words (e.g., 16 bits). That is, the multiplexed registers 208A/208B select their respective inputs 211A/211B, just as in the wide configuration. The multiplications and the three-input addition then proceed exactly as described above for the instruction at address 1: products 246A and 246B/2399 are both added to the wide accumulator 202A output 217A, and the narrow adder 244B and narrow accumulator 202B remain inactive.

Finally, when executing an activation function instruction such as the one at address 3 of FIG. 20 in the funnel configuration, the wide activation function unit 212A performs the activation function on the resulting sum 215A to produce a narrow result 133A, and the narrow activation function unit 212B is inactive. Thus, only the narrow neural processing units labeled A produce a valid narrow result 133A; the narrow results 133B produced by the narrow neural processing units labeled B are invalid. Consequently, the row of results written back (e.g., row 16, as specified by the instruction at address 4 of FIG. 20) contains holes, because only the narrow results 133A are valid and the narrow results 133B are not.

Conceptually, therefore, in each clock cycle each neuron (neural processing unit of FIG. 23) processes two connection data inputs: it multiplies two narrow data words by their respective weights and adds the two products. By comparison, the embodiments of FIG. 2 and FIG. 18 process only one connection data input per clock cycle.
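The difference between the narrow and funnel configurations can be illustrated with a minimal sketch. It models only the arithmetic described above (two products summed into one wide accumulator); the function names are hypothetical, and word widths, saturation, and the sign extension that produces addend 2399 are deliberately ignored.

```python
def narrow_mac(acc_a, acc_b, data_a, weight_a, data_b, weight_b):
    """Narrow configuration: the A and B halves accumulate independently."""
    return acc_a + data_a * weight_a, acc_b + data_b * weight_b


def funnel_mac(acc_a, data_a, weight_a, data_b, weight_b):
    """Funnel configuration: both products feed the wide accumulator.

    Assumption: plain Python integers stand in for the hardware words.
    """
    product_a = data_a * weight_a        # product 246A (wide multiplier 242A)
    product_b = data_b * weight_b        # product 246B, extended to addend 2399
    return acc_a + product_a + product_b  # three-input wide adder 2344A


# One clock cycle with the same operands in each configuration.
narrow_result = narrow_mac(0, 0, 2, 3, 4, 5)
funnel_result = funnel_mac(10, 2, 3, 4, 5)
```

In the funnel case the B half contributes an addend but never holds a valid result of its own, which is why the written-back row has holes.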

As may be observed for the embodiment of FIG. 23, the number of result words (neuron outputs) produced and written back to the data random access memory 122 or weight random access memory 124 is half the square root of the number of data inputs (connections) received, and the written-back row of results has holes: every other narrow word result is invalid. More precisely, the results of the narrow neural processing units labeled B are not meaningful. The embodiment of FIG. 23 is therefore particularly efficient for neural networks with two successive layers in which the first layer has twice as many neurons as the second (e.g., a first layer of 1024 neurons fully connected to a second layer of 512 neurons). In addition, the other execution units 112 (e.g., media units, such as x86 Advanced Vector Extensions units) may, when necessary, perform a pack operation on a sparse row of results (i.e., one having holes) to make it compact (i.e., without holes). The packed row can then be used in subsequent computations when the neural network unit 121 operates on other rows of the data random access memory 122 and/or weight random access memory 124.
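The pack operation mentioned above can be sketched as follows. This is an illustration only, assuming that the valid A results occupy the even word positions of the row and the invalid B results the odd positions; the source does not specify the layout, and `pack_row` is a hypothetical name, not an instruction of any real unit.

```python
def pack_row(result_row):
    """Drop the holes from a funnel-configuration result row.

    Assumption: valid 'A' words sit at even indices, invalid 'B' words
    (the holes) at odd indices.
    """
    return result_row[0::2]


# A sparse row of eight narrow words; None marks an invalid 'B' result.
sparse_row = [10, None, 11, None, 12, None, 13, None]
packed_row = pack_row(sparse_row)
```

The packed row is half the length of the sparse one, so two packed rows fit in the space of one written-back row.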

Hybrid Neural Network Unit Operation: Convolution and Pooling Capabilities

An advantage of the neural network unit 121 according to the embodiments described herein is that it can operate concurrently in a manner that resembles a coprocessor, in that it executes its own internal program, and in a manner that resembles an execution unit of a processor, in that it executes architectural instructions (or microinstructions translated from them) issued to it. The architectural instructions belong to an architectural program executed by the processor that includes the neural network unit 121. In this way the neural network unit 121 operates in a hybrid fashion, which allows high utilization of the neural network unit 121 to be sustained. For example, FIGS. 24 through 26 illustrate the operation of the neural network unit 121 performing a convolution operation in which the neural network unit is fully utilized, and FIGS. 27 through 28 illustrate the operation of the neural network unit 121 performing a pooling operation. These operations are needed by convolution layers, pooling layers, and other digital data computing applications such as image processing (e.g., edge detection, sharpening, blurring, recognition/classification).

However, the hybrid operation of the neural network unit 121 is not limited to convolution or pooling operations; the hybrid feature can also be used for other operations, such as the conventional neural network multiply-accumulate and activation function operations described with respect to FIGS. 4 through 13. That is, the processor 100 (more precisely, the reservation stations 108) issues MTNN instructions 1400 and MFNN instructions 1500 to the neural network unit 121. In response, the neural network unit 121 writes data to the memories 122/124/129 and reads results from the memories 122/124 that the neural network unit 121 has written. Concurrently, in order to execute the program that the processor 100 wrote to the program memory 129 (via MTNN instructions 1400), the neural network unit 121 reads and writes the memories 122/124/129.

FIG. 24 is a block diagram showing an example of the data structures used by the neural network unit 121 of FIG. 1 to perform a convolution operation. The block diagram includes a convolution kernel 2402, a data array 2404, and the data random access memory 122 and weight random access memory 124 of FIG. 1. Preferably, the data array 2404 (e.g., of image pixels) is held in system memory (not shown) attached to the processor 100 and is loaded into the weight random access memory 124 of the neural network unit 121 by the processor 100 executing MTNN instructions 1400. A convolution operation convolves a first array with a second array; the second array is the convolution kernel described herein. As used herein, a convolution kernel is a matrix of coefficients, which may also be referred to as weights, parameters, elements, or values. Preferably, the convolution kernel 2402 is static data of the architectural program executed by the processor 100.

The data array 2404 is a two-dimensional array of data values, and each data value (e.g., an image pixel value) is the size of a word of the data random access memory 122 or weight random access memory 124 (e.g., 16 bits or 8 bits). In this example the data values are 16-bit words, and the neural network unit 121 is configured as 512 neural processing units 126 in the wide configuration. In addition, in this embodiment the neural processing units 126 include multiplexed registers that receive the weight words 206 from the weight random access memory 124, such as the multiplexed register 705 of FIG. 7, so that they can perform the collective rotater operation on a row of data values received from the weight random access memory 124, as described in more detail below. In this example, the data array 2404 is a pixel array of 2560 columns by 1600 rows. As shown, when the architectural program convolves the data array 2404 with the convolution kernel 2402, it divides the data array 2404 into 20 chunks, each chunk being a 512 x 400 data matrix 2406.
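The chunk arithmetic above (2560 / 512 = 5 chunks across, 1600 / 400 = 4 chunks down, 20 in total) can be checked with a short sketch. The row-by-row traversal order and the name `chunk_origins` are assumptions for illustration; the source does not say in which order the architectural program walks the chunks.

```python
ARRAY_COLS, ARRAY_ROWS = 2560, 1600   # data array 2404
CHUNK_COLS, CHUNK_ROWS = 512, 400     # one data matrix 2406


def chunk_origins():
    """Return the (row, column) of the top-left element of every chunk.

    Assumption: chunks are enumerated left to right, then top to bottom.
    """
    return [(row, col)
            for row in range(0, ARRAY_ROWS, CHUNK_ROWS)
            for col in range(0, ARRAY_COLS, CHUNK_COLS)]


origins = chunk_origins()
```

Each chunk is 512 words wide so that one row of the chunk fills one row of the weight random access memory, one word per neural processing unit.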

In this example, the convolution kernel 2402 is a 3x3 array of coefficients (also called weights, parameters, or elements). The first row of coefficients is denoted C0,0; C0,1; and C0,2; the second row is denoted C1,0; C1,1; and C1,2; and the third row is denoted C2,0; C2,1; and C2,2. For example, a convolution kernel with the following coefficients may be used to perform edge detection: 0, 1, 0, 1, -4, 1, 0, 1, 0. As another example, a convolution kernel with the following coefficients may be used to perform a Gaussian blur: 1, 2, 1, 2, 4, 2, 1, 2, 1. In this case, a division is typically performed on the final accumulated value, where the divisor is the sum of the absolute values of the elements of the convolution kernel 2402, which is 16 in this example. In another example, the divisor is the number of elements of the convolution kernel 2402. In yet another example, the divisor is a value that compresses the convolution results back into a desired range of values, and is determined from the element values of the convolution kernel 2402, the desired range, and the range of the input values of the array being convolved.
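The first two divisor choices can be worked through on the two example kernels. The sketch below is only an illustration of the arithmetic; the helper names are hypothetical and integer division is an assumption, since the source does not state how the division is rounded.

```python
GAUSSIAN_BLUR = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
EDGE_DETECT = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def abs_sum_divisor(kernel):
    """Divisor equal to the sum of the absolute values of the coefficients."""
    return sum(abs(c) for row in kernel for c in row)


def element_count_divisor(kernel):
    """Divisor equal to the number of kernel elements."""
    return sum(len(row) for row in kernel)


def convolve_patch(kernel, patch):
    """Sum of the nine element-wise products of a kernel and a 3x3 patch."""
    return sum(k * p
               for k_row, p_row in zip(kernel, patch)
               for k, p in zip(k_row, p_row))


# A flat patch of identical pixels: blurring must leave the value unchanged.
flat_patch = [[100] * 3 for _ in range(3)]
blurred = convolve_patch(GAUSSIAN_BLUR, flat_patch) // abs_sum_divisor(GAUSSIAN_BLUR)
```

Dividing by 16 keeps a uniformly lit region at its original brightness, which is the reason the absolute-value sum is the usual choice for the Gaussian kernel.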

As shown in FIG. 24 and described in more detail with respect to FIG. 25, the architectural program writes the coefficients of the convolution kernel 2402 to the data random access memory 122. Preferably, all the words of each of nine consecutive rows of the data random access memory 122 (nine being the number of elements in the convolution kernel 2402) are written with a different element of the convolution kernel 2402, in row-major order. That is, as shown, every word of one row is written with the first coefficient C0,0; the next row with the second coefficient C0,1; the next row with the third coefficient C0,2; the next row with the fourth coefficient C1,0; and so on, until every word of the ninth row is written with the ninth coefficient C2,2. To convolve a data matrix 2406 of a chunk of the data array 2404, the neural processing units 126 repeatedly read, in order, the nine rows of the data random access memory 122 that hold the coefficients of the convolution kernel 2402, as described in more detail below, particularly with respect to FIG. 26A.
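The layout described above, nine rows each filled with a single coefficient, can be sketched as a list of lists. This is a model of the memory contents only; `kernel_rows` is a hypothetical helper, and the row width of 512 words follows the 512-unit example in the text.

```python
NUM_WORDS = 512  # one word per neural processing unit in the example


def kernel_rows(kernel, num_words=NUM_WORDS):
    """Expand a 3x3 kernel into nine rows, one coefficient broadcast per row.

    Rows are emitted in row-major order: C0,0, C0,1, C0,2, C1,0, ... C2,2.
    """
    return [[coeff] * num_words for row in kernel for coeff in row]


rows = kernel_rows([[1, 2, 1], [2, 4, 2], [1, 2, 1]])
```

Because every word of a row holds the same coefficient, every neural processing unit multiplies by the same kernel element in a given clock cycle, regardless of which word position it reads.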

As shown in FIG. 24 and described in more detail with respect to FIG. 25, the architectural program writes the values of a data matrix 2406 to the weight random access memory 124. As the neural network unit program performs the convolution, it writes the result array back to the weight random access memory 124. Preferably, the architectural program writes a first data matrix 2406 to the weight random access memory 124 and starts the neural network unit 121; then, while the neural network unit 121 is convolving the first data matrix 2406 with the convolution kernel 2402, the architectural program writes a second data matrix 2406 to the weight random access memory 124, so that as soon as the neural network unit 121 completes the convolution of the first data matrix 2406 it can start convolving the second data matrix 2406, as described in more detail with respect to FIG. 25. In this manner, the architectural program alternates between two regions of the weight random access memory 124 in order to keep the neural network unit 121 fully utilized.

Thus, the example of FIG. 24 shows a first data matrix 2406A and a second data matrix 2406B. The first data matrix 2406A corresponds to a first chunk occupying rows 0 through 399 of the weight random access memory 124, and the second data matrix 2406B corresponds to a second chunk occupying rows 500 through 899 of the weight random access memory 124. Furthermore, as shown, the neural network unit 121 writes the results of the convolutions back to rows 900-1299 and rows 1300-1699 of the weight random access memory 124, from which the architectural program subsequently reads them. The data values of a data matrix 2406 held in the weight random access memory 124 are denoted "Dx,y", where "x" is the weight random access memory 124 row number and "y" is the word, or column, number within the row. Thus, for example, data word 511 in row 399 is denoted D399,511 in FIG. 24, and it is received by the multiplexed register 705 of neural processing unit 511.

FIG. 25 is a flowchart illustrating the operation of the processor 100 of FIG. 1 executing an architectural program that uses the neural network unit 121 to convolve the data array 2404 of FIG. 24 with the convolution kernel 2402. Flow begins at step 2502.

In step 2502, the processor 100, i.e., the architectural program running on the processor 100, writes the convolution kernel 2402 of FIG. 24 to the data random access memory 122 in the manner shown and described with respect to FIG. 24. In addition, the architectural program initializes a variable N to a value of 1. The variable N denotes the chunk of the data array 2404 currently being processed by the neural network unit 121. The architectural program also initializes a variable NUM_CHUNKS to a value of 20. Flow proceeds to step 2504.

In step 2504, as shown in FIG. 24, the processor 100 writes the data matrix 2406 for chunk 1 to the weight random access memory 124 (e.g., data matrix 2406A of chunk 1). Flow proceeds to step 2506.

In step 2506, the processor 100 writes a convolution program to the program memory 129 of the neural network unit 121, using MTNN instructions 1400 that specify a function 1432 to write the program memory 129. The processor 100 then starts the neural network unit convolution program using an MTNN instruction 1400 that specifies a function 1432 to start execution of the program. An example of the neural network unit convolution program is described in more detail with respect to FIG. 26A. Flow proceeds to step 2508.

In decision step 2508, the architectural program determines whether the value of the variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2512; otherwise, flow proceeds to step 2514.

In step 2512, as shown in FIG. 24, the processor 100 writes the data matrix 2406 for chunk N+1 to the weight random access memory 124 (e.g., data matrix 2406B of chunk 2). Thus, while the neural network unit 121 is convolving the current chunk, the architectural program writes the data matrix 2406 for the next chunk to the weight random access memory 124, so that the neural network unit 121 can begin convolving the next chunk immediately once the convolution of the current chunk is complete, i.e., once its results have been written to the weight random access memory 124.

In step 2514, the processor 100 determines whether the currently running neural network unit program (started at step 2506 in the case of chunk 1, and at step 2518 in the case of chunks 2 through 20) has completed. Preferably, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the neural network unit 121. In an alternate embodiment, the neural network unit 121 generates an interrupt to indicate that it has completed the convolution program. Flow proceeds to decision step 2516.

In decision step 2516, the architectural program determines whether the value of the variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2518; otherwise, flow proceeds to step 2522.

In step 2518, the processor 100 updates the convolution program so that it operates on chunk N+1. More precisely, the processor 100 updates the weight random access memory 124 row value of the initialize neural processing unit instruction at address 0 to the first row of the data matrix 2406 (e.g., to row 0 for data matrix 2406A or to row 500 for data matrix 2406B), and updates the output row (e.g., to row 900 or 1300). The processor 100 then starts the updated neural network unit convolution program. Flow proceeds to step 2522.

In step 2522, the processor 100 reads from the weight random access memory 124 the results produced by the neural network unit convolution program for chunk N. Flow proceeds to decision step 2524.

In decision step 2524, the architectural program determines whether the value of the variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2526; otherwise, flow ends.

In step 2526, the architectural program increments N by one. Flow returns to decision step 2508.
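The control flow of steps 2502 through 2526 can be sketched as a loop that only records which action happens when. No hardware is modelled: the log strings stand in for MTNN/MFNN instruction sequences, `run_convolution` is a hypothetical name, and the strict alternation of odd chunks to rows 0/900 and even chunks to rows 500/1300 is an assumption drawn from the two-region example of FIG. 24.

```python
NUM_CHUNKS = 20
INPUT_ROWS = (0, 500)      # first rows of data matrices 2406A and 2406B
OUTPUT_ROWS = (900, 1300)  # first rows of the two result regions


def run_convolution(num_chunks=NUM_CHUNKS):
    """Replay the flowchart of FIG. 25 and return the ordered list of actions."""
    log = ["write kernel to data RAM"]                              # step 2502
    n = 1
    log.append(f"write chunk {n} at row {INPUT_ROWS[0]}")           # step 2504
    log.append(f"start program on chunk {n}")                       # step 2506
    while True:
        if n < num_chunks:                                          # step 2508
            # Chunk n+1 goes to the region not used by chunk n.
            log.append(f"write chunk {n + 1} at row {INPUT_ROWS[n % 2]}")  # 2512
        log.append(f"wait for chunk {n}")                           # step 2514
        if n < num_chunks:                                          # step 2516
            log.append(f"start program on chunk {n + 1}")           # step 2518
        log.append(
            f"read results of chunk {n} from row {OUTPUT_ROWS[(n - 1) % 2]}"
        )                                                           # step 2522
        if n < num_chunks:                                          # step 2524
            n += 1                                                  # step 2526
        else:
            return log


actions = run_convolution()
```

Reading the log shows the overlap the flowchart is designed for: the write of chunk N+1 is issued before the wait for chunk N, and the results of chunk N are read after the program for chunk N+1 has already been started.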

FIG. 26A is a program listing of a neural network unit program that convolves a data matrix 2406 with the convolution kernel 2402 of FIG. 24 and writes the result back to the weight random access memory 124. The program loops a number of times through a loop body made up of the instructions at addresses 1 through 9. The initialize neural processing unit instruction at address 0 specifies the number of times each neural processing unit 126 executes the loop body; in the example of FIG. 26A the loop count is 400, corresponding to the number of rows in a data matrix 2406 of FIG. 24. The loop instruction at the end of the loop (at address 10) decrements the current loop count and, if the result is non-zero, returns control to the top of the loop body (i.e., to the instruction at address 1). The initialize neural processing unit instruction also clears the accumulator 202 to zero. Preferably, the loop instruction at address 10 also clears the accumulator 202 to zero. Alternatively, as described above, the multiply-accumulate instruction at address 1 may specify that the accumulator 202 be cleared to zero.

For each execution of the loop body of the program, the 512 neural processing units 126 concurrently perform 512 convolutions of the 3x3 convolution kernel with 512 respective 3x3 sub-matrices of the data matrix 2406. A convolution is the sum of the nine products of the elements of the convolution kernel 2402 and the corresponding elements of the respective sub-matrix. In the embodiment of FIG. 26A, the origin (center element) of each of the 512 respective 3x3 sub-matrices is the data word Dx+1,y+1 of FIG. 24, where y (the column number) is the neural processing unit 126 number and x (the row number) is the current weight random access memory 124 row number read by the multiply-accumulate instruction at address 1 of the program of FIG. 26A. (This row number is initialized by the initialize neural processing unit instruction at address 0, incremented by the multiply-accumulate instructions at addresses 3 and 5, and updated by the decrement instruction at address 9.) Thus, on each pass through the loop, the 512 neural processing units 126 compute 512 convolutions and write the 512 results back to a specified row of the weight random access memory 124.

Edge handling is omitted from this description for simplicity. It should be noted, however, that using the collective rotation feature of the neural processing units 126 causes wrapping for two of the columns of the data matrix 2406 (e.g., of the image, in the case of image processing) from one vertical edge to the other (e.g., from the left edge to the right edge, or vice versa). The loop body will now be described.

At address 1 is a multiply-accumulate instruction that specifies row 0 of the data random access memory 122 and implicitly uses the current weight random access memory 124 row, which is preferably held in the sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the loop body). That is, the instruction at address 1 causes each neural processing unit 126 to read its respective word from row 0 of the data random access memory 122, read its respective word from the current weight random access memory 124 row, and perform a multiply-accumulate operation on the two words. Thus, for example, neural processing unit 5 multiplies C0,0 by Dx,5 (where "x" is the current weight random access memory 124 row), adds the product to the accumulator 202 value 217, and writes the sum back to the accumulator 202.

At address 2 is a multiply-accumulate instruction that specifies that the data random access memory 122 row be incremented (i.e., to row 1) and that the row at the incremented address then be read from the data random access memory 122. The instruction also specifies that the value in the multiplexed register 705 of each neural processing unit 126 be rotated to the adjacent neural processing unit 126; in this case that value belongs to the row of data matrix 2406 values just read from the weight random access memory 124 in response to the instruction at address 1. In the embodiment of FIGS. 24 through 26, the neural processing units 126 are configured to rotate the multiplexed register 705 values to the left, i.e., from neural processing unit J to neural processing unit J-1, rather than from neural processing unit J to neural processing unit J+1 as described above with respect to FIGS. 3, 7, and 19.

It should be noted that in an embodiment in which the neural processing units 126 rotate to the right, the architectural program may write the convolution kernel 2402 coefficient values to the data random access memory 122 in a different order (e.g., rotated around its central column) to achieve a similar convolution result. Furthermore, the architectural program may perform additional pre-processing of the convolution kernel (e.g., transposition) as needed.

The instruction also specifies a count value of 2. Thus, the instruction at address 2 causes each neural processing unit 126 to read its respective word from row 1 of the data random access memory 122, receive the rotated word into the multiplexed register 705, and perform a multiply-accumulate operation on the two words. Because the count value is 2, the instruction also causes each neural processing unit 126 to repeat the operation just described. That is, the sequencer 128 increments the data random access memory 122 row address 123 (i.e., to row 2), and each neural processing unit 126 reads its respective word from row 2 of the data random access memory 122, receives the rotated word into the multiplexed register 705, and performs a multiply-accumulate operation on the two words. Thus, for example, assuming the current weight random access memory 124 row is 27, after executing the instruction at address 2, neural processing unit 5 will have accumulated into its accumulator 202 the product of C0,1 and D27,6 and the product of C0,2 and D27,7. Hence, after completion of the instructions at addresses 1 and 2, the product of C0,0 and D27,5, the product of C0,1 and D27,6, and the product of C0,2 and D27,7 will have been accumulated into the accumulator 202, along with all other accumulated values from previous passes through the loop body.

The instructions at addresses 3 and 4 perform operations similar to those of the instructions at addresses 1 and 2. By virtue of the weight random access memory 124 row increment indicator, however, they operate on the next row of the weight random access memory 124, and they operate on the next three rows of the data random access memory 122, i.e., rows 3 through 5. That is, taking neural processing unit 5 as an example, after the instructions at addresses 1 through 4 complete, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, and the product of C1,2 and D28,7 will have been accumulated into the accumulator 202, along with all the accumulated values from previous passes through the instruction loop.

The instructions at addresses 5 and 6 perform operations similar to those of the instructions at addresses 3 and 4, but on the next row of the weight random access memory 124 and on the next three rows of the data random access memory 122, i.e., rows 6 through 8. That is, taking neural processing unit 5 as an example, after the instructions at addresses 1 through 6 complete, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, the product of C1,2 and D28,7, the product of C2,0 and D29,5, the product of C2,1 and D29,6, and the product of C2,2 and D29,7 will have been accumulated into the accumulator 202, along with all the accumulated values from previous passes through the instruction loop. In other words, assuming the weight random access memory 124 row at the beginning of the instruction loop is 27, after the instructions at addresses 1 through 6 complete, neural processing unit 5 will have used the convolution kernel 2042 to convolve the following 3x3 sub-matrix:

    D27,5  D27,6  D27,7
    D28,5  D28,6  D28,7
    D29,5  D29,6  D29,7

More generally, after the instructions at addresses 1 through 6 complete, each of the 512 neural processing units 126 will have used the convolution kernel 2042 to convolve the following 3x3 sub-matrix:

    Dr,n    Dr,n+1    Dr,n+2
    Dr+1,n  Dr+1,n+1  Dr+1,n+2
    Dr+2,n  Dr+2,n+1  Dr+2,n+2

where r is the weight random access memory 124 row address value at the beginning of the instruction loop, and n is the number of the neural processing unit 126.
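
By way of illustration only, the sequence at addresses 1 through 6 can be modeled in a few lines of Python. The sketch below is not the program of Figure 26A; the function name `convolve_pass`, the eight-unit array, and the toy data values are assumptions introduced here. It models only the behavior described above: a weight-memory row is loaded into the multiplexing registers, the registers are rotated left twice, and each unit multiplies by the next kernel coefficient and accumulates.

```python
def convolve_pass(weight_rows, kernel):
    """Model one pass of the loop body (addresses 1-6) for a 3x3 kernel.

    weight_rows: three consecutive weight-memory rows, each a list of N words.
    kernel: 3x3 list of coefficients C[i][j] (data-memory rows 0 through 8).
    Returns the accumulator of every processing unit after the pass.
    """
    n = len(weight_rows[0])
    acc = [0] * n  # accumulators assumed cleared before the pass
    for i, row in enumerate(weight_rows):
        mux_reg = list(row)  # addresses 1, 3, 5: load a weight-memory row
        for j in range(3):
            if j > 0:
                # addresses 2, 4, 6: rotate left, unit J hands its word to unit J-1
                mux_reg = mux_reg[1:] + mux_reg[:1]
            for npu in range(n):
                acc[npu] += kernel[i][j] * mux_reg[npu]
    return acc


# Toy data (hypothetical): 8 units, D[r][c] = 10*r + c, rows 27 through 29.
N = 8
rows = [[10 * r + c for c in range(N)] for r in (27, 28, 29)]
gaussian = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # assumed example coefficients
acc = convolve_pass(rows, gaussian)
```

In this model `acc[5]` holds the sum of the nine products listed above for neural processing unit 5. Units near the right edge of the array receive words that wrap around from unit 0, which a real program would have to account for.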

The instruction at address 7 passes the accumulator 202 value 217 through the activation function unit 212. The pass-through function passes a word whose size (in bits) equals that of the words read from the data random access memory 122 and the weight random access memory 124 (16 bits in this example). Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail in later sections. Alternatively, rather than specifying a pass-through activation function, the instruction may specify a divide activation function that divides the accumulator 202 value 217 by a divisor, as described herein with respect to Figures 29A and 30, e.g., using one of the "dividers" 3014/3016 of Figure 30. For example, for a convolution kernel 2042 with fractional coefficients, such as the one-sixteenth coefficients of the Gaussian blur kernel described above, the instruction at address 7 may specify a divide activation function (e.g., divide by 16) rather than a pass-through function. Alternatively, the architectural program may perform the divide-by-16 on the convolution kernel 2042 coefficients before writing them to the data random access memory 122 and adjust the position of the binary point of the convolution kernel 2042 values accordingly, e.g., using the data binary point 2922 of Figure 29 described below.
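
The two alternatives just described, dividing the accumulated sum once or treating the coefficients as fixed-point fractions, can be compared with a minimal Python sketch. The function names, the coefficient values, and the truncating shift are assumptions made for illustration; the rounding actually applied by the activation function unit 212 is not specified in this passage.

```python
def blur_divide_after(pixels, coeffs, divisor=16):
    """Accumulate with integer coefficients, then divide once at the end."""
    acc = sum(c * p for c, p in zip(coeffs, pixels))
    return acc // divisor


def blur_prescaled(pixels, coeffs, frac_bits=4):
    """Treat each integer coefficient c as the fixed-point value c / 2**frac_bits.

    The accumulator then carries frac_bits extra fractional bits, which are
    dropped (truncated here, as an assumption) when the output word is formed.
    """
    acc = sum(c * p for c, p in zip(coeffs, pixels))
    return acc >> frac_bits


gaussian = [1, 2, 1, 2, 4, 2, 1, 2, 1]          # assumed coefficients, sum 16
window = [10, 20, 30, 40, 50, 60, 70, 80, 90]   # hypothetical 3x3 pixel window
same = blur_divide_after(window, gaussian) == blur_prescaled(window, gaussian)
```

Both routes perform the same integer arithmetic; the difference lies only in where the binary point is considered to sit, which is why moving the binary point can stand in for an explicit division by a power of two.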

The instruction at address 8 writes the output of the activation function unit 212 to the row of the weight random access memory 124 specified by the current value of the output row register. This value is initialized by the instruction at address 0 and is incremented on each pass through the loop by virtue of the increment indicator in the instruction.

As the example of Figures 24 through 26 with a 3x3 convolution kernel 2402 shows, the neural processing units 126 read the weight random access memory 124 approximately every three clock cycles to read a row of the data matrix 2406, and write the convolution result matrix to the weight random access memory 124 approximately every twelve clock cycles. In addition, assuming an embodiment that includes a write and read buffer such as the buffer 1704 of Figure 17, the processor 100 may read and write the weight random access memory 124 concurrently with the reads and writes of the neural processing units 126; the buffer 1704 performs approximately one write and one read of the weight random access memory 124 every sixteen clock cycles, to write the data matrices and to read the convolution result matrices, respectively. Thus, approximately half the bandwidth of the weight random access memory 124 is consumed by the hybrid manner in which the neural network unit 121 performs the convolution operation. Although this example includes a 3x3 convolution kernel 2042, the invention is not limited thereto; convolution kernels of other sizes, such as 2x2, 4x4, 5x5, 6x6, 7x7 or 8x8, may be employed with a correspondingly different neural network unit program. With a larger convolution kernel, the rotating versions of the multiply-accumulate instruction (such as the instructions at addresses 2, 4 and 6 of Figure 26A, and the additional such instructions a larger kernel would require) have a larger count value, so the neural processing units 126 spend a smaller fraction of the time reading the weight random access memory 124, and the fraction of the weight random access memory 124 bandwidth consumed decreases accordingly.
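
The "approximately half" figure can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that the weight random access memory 124 can service one row access per clock cycle; the access rates are the approximate ones stated above.

```python
from fractions import Fraction

# Approximate access rates taken from the paragraph above.
npu_reads = Fraction(1, 3)         # one data-matrix row read about every 3 clocks
npu_writes = Fraction(1, 12)       # one result row written about every 12 clocks
buffer_accesses = Fraction(2, 16)  # one write plus one read about every 16 clocks

# Assumption: one row access per clock is the full bandwidth of the memory.
utilization = npu_reads + npu_writes + buffer_accesses
print(utilization, float(utilization))
```

Under this assumption the total comes to 13/24 of the available accesses, a little over one half, which is consistent with the statement above.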

Alternatively, rather than writing the convolution results back to different rows of the weight random access memory 124 (e.g., rows 900-1299 and 1300-1699), the architectural program may configure the neural network unit program to overwrite rows of the input data matrix 2406 that are no longer needed. For example, for a 3x3 convolution kernel, the architectural program may write the data matrix 2406 into rows 2-401 of the weight random access memory 124 rather than rows 0-399, and the neural network unit program writes the convolution results beginning at row 0 of the weight random access memory 124, incrementing the row number on each pass through the instruction loop. In this way, the neural network unit program overwrites only rows that are no longer needed. For example, after the first pass through the instruction loop (or more precisely, after the execution of the instruction at address 1, which loads row 0 of the weight random access memory 124), the data in row 0 may be overwritten, whereas the data in rows 1-3 are needed for the second pass through the instruction loop and must not be overwritten; similarly, after the second pass through the instruction loop, the data in row 1 may be overwritten, whereas the data in rows 2-4 are needed for the third pass through the instruction loop and must not be overwritten; and so forth. In such an embodiment, the height of each data matrix 2406 (data block) may be made larger (e.g., 800 rows), so that fewer data blocks are needed.

Alternatively, rather than writing the convolution results back to the weight random access memory 124, the architectural program may configure the neural network unit program to write the convolution results back to rows of the data random access memory 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program may read the results from the data random access memory 122 as the neural network unit 121 writes them (e.g., using the address of the most recently written data random access memory 122 row 2606 of Figure 26). This arrangement is suited to an embodiment in which the weight random access memory 124 is single-ported and the data random access memory 122 is dual-ported.

From the operation of the neural network unit 121 in the embodiment of Figures 24 through 26A, it may be observed that each execution of the program of Figure 26A takes approximately 5000 clock cycles, so that convolving the entire 2560x1600 data array 2404 of Figure 24 takes approximately 100,000 clock cycles, considerably fewer than the number of clock cycles required to perform the same task by conventional methods.
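
The relationship between the two cycle counts follows from the number of data blocks. The short Python calculation below assumes the data array 2404 is processed as 512x400 data blocks, one program execution per block, as in the example of Figures 24 through 26A; the variable names are illustrative only.

```python
array_cols, array_rows = 2560, 1600   # size of the data array 2404
block_cols, block_rows = 512, 400     # size of one data block (data matrix 2406)
cycles_per_run = 5000                 # approximate clocks per program execution

# One program execution per data block.
blocks = (array_cols // block_cols) * (array_rows // block_rows)
total_cycles = blocks * cycles_per_run
```

Five blocks across and four down give 20 blocks, and 20 executions of about 5000 clock cycles each give the figure of about 100,000 clock cycles.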

Figure 26B is a block diagram showing one embodiment of certain fields of the control register 127 of the neural network unit 121 of Figure 1. The status register 127 includes a field 2602 that indicates the address of the weight random access memory 124 row most recently written by the neural processing units 126; a field 2606 that indicates the address of the data random access memory 122 row most recently written by the neural processing units 126; a field 2604 that indicates the address of the weight random access memory 124 row most recently read by the neural processing units 126; and a field 2608 that indicates the address of the data random access memory 122 row most recently read by the neural processing units 126. This enables the architectural program executing on the processor 100 to determine the progress of the neural network unit 121 as it reads and/or writes the data random access memory 122 and/or the weight random access memory 124. Using this capability, together with the choice to overwrite the input data matrix as described above (or to write the results to the data random access memory 122 as described above), the data array 2404 of Figure 24 may be processed as 5 data blocks of 512x1600 rather than 20 data blocks of 512x400, as in the following example. The processor 100 writes the first 512x1600 data block into the weight random access memory 124 beginning at row 2 and starts the neural network unit program (which has a loop count of 1600 and an initialized weight random access memory 124 output row of 0). While the neural network unit 121 executes the neural network unit program, the processor 100 monitors the output location/address of the weight random access memory 124 in order to (1) read (using MFNN instructions 1500) the rows of the weight random access memory 124 that hold valid convolution results written by the neural network unit 121 (beginning at row 0); and (2) write the second 512x1600 data matrix 2406 (beginning at row 2) over the valid convolution results once they have been read. Thus, when the neural network unit 121 completes the neural network unit program on the first 512x1600 data block, the processor 100 can, as needed, immediately update the neural network unit program and start it again to process the second 512x1600 data block. This process is repeated three more times for the remaining three 512x1600 data blocks, so that the neural network unit 121 is kept highly utilized.
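
The monitoring scheme can be pictured with a deliberately simplified Python model. Everything in it is an assumption made for illustration: the function `run_pipelined` is hypothetical, the "neural network unit" merely copies input row p+2 to output row p instead of convolving, and the concurrent activity of the two agents is modeled by interleaving them one step at a time. The variable `last_written` plays the role of field 2602.

```python
def run_pipelined(block_a, block_b, data_base=2):
    """Toy model of draining results and refilling the memory while a run proceeds."""
    rows = len(block_a)
    ram = [None] * data_base + list(block_a)  # first block starts at row 2
    last_written = -1                         # stands in for field 2602
    next_unread = 0
    results = []
    for p in range(rows):
        # "Neural network unit": one pass writes output row p from input row p+2.
        ram[p] = ("result", ram[p + data_base])
        last_written = p
        # "Processor": read every valid result row, then reuse it for block B.
        while next_unread <= last_written:
            results.append(ram[next_unread][1])
            k = next_unread - data_base
            if k >= 0:
                ram[next_unread] = block_b[k]  # overwrite a row already read
            next_unread += 1
    # Rows still holding the tail of block A are refilled after the run ends.
    for k in range(rows - data_base, rows):
        ram[k + data_base] = block_b[k]
    return results, ram


block_a = [f"A{i}" for i in range(6)]
block_b = [f"B{i}" for i in range(6)]
results, ram = run_pipelined(block_a, block_b)
```

In this model a row is overwritten with new input only after its result has been read, and by the time the run finishes most of the second block is already in place, which is the property the paragraph above relies on.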

In one embodiment, the activation function unit 212 is able to efficiently perform an effective division of the accumulator 202 value 217, as described in more detail in later sections, particularly with respect to Figures 29A, 29B and 30. For example, an activation function neural network unit instruction that divides the accumulator 202 value by 16 may be used with the Gaussian blur matrix described above.

Although the convolution kernel 2402 used in the example of Figure 24 is a small static convolution kernel applied to the entire data array 2404, the invention is not limited thereto; the convolution kernel may instead be a large matrix having distinct weights associated with the different data values of the data array 2404, as is common in convolutional neural networks. When the neural network unit 121 is used in this manner, the architectural program may swap the locations of the data matrix and the convolution kernel, i.e., place the data matrix in the data random access memory 122 and the convolution kernel in the weight random access memory 124, and the number of rows processed by a given execution of the neural network unit program may be relatively smaller.

Figure 27 is a block diagram showing an example of the weight random access memory 124 of Figure 1 populated with input data on which the neural network unit 121 of Figure 1 performs a pooling operation. A pooling operation, performed by a pooling layer of an artificial neural network, reduces the dimensions of an input data matrix (e.g., an image or a convolved image) by taking sub-regions, or sub-matrices, of the input matrix and computing the maximum or the average value of each sub-matrix; these values form a result matrix, i.e., the pooled matrix. In the example of Figures 27 and 28, the pooling operation computes the maximum value of each sub-matrix. Pooling operations are particularly useful in artificial neural networks that perform, for example, object classification or detection. In general, a pooling operation reduces the size of the input matrix by a factor equal to the number of elements in the sub-matrix examined; in particular, it reduces the input matrix in each dimension by the number of elements in the corresponding dimension of the sub-matrix. In the example of Figure 27, the input data is a 512x1600 matrix of wide words (e.g., 16 bits) stored in rows 0 through 1599 of the weight random access memory 124. In Figure 27, the words are denoted by their row and column location: the word in row 0, column 0 is denoted D0,0; the word in row 0, column 1 is denoted D0,1; the word in row 0, column 2 is denoted D0,2; and so forth to the word in row 0, column 511, which is denoted D0,511. Similarly, the word in row 1, column 0 is denoted D1,0; the word in row 1, column 1 is denoted D1,1; the word in row 1, column 2 is denoted D1,2; and so forth to the word in row 1, column 511, which is denoted D1,511; and so forth to the word in row 1599, column 0, which is denoted D1599,0; the word in row 1599, column 1 is denoted D1599,1; the word in row 1599, column 2 is denoted D1599,2; and so forth to the word in row 1599, column 511, which is denoted D1599,511.
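
As a point of reference for the hardware-oriented description that follows, maximum pooling over non-overlapping sub-matrices can be written directly in Python. This is a plain software sketch, not the neural network unit program; the function name `max_pool` and the toy 8x8 input are assumptions, and the input dimensions are assumed to be multiples of the sub-matrix size.

```python
def max_pool(matrix, size=4):
    """Return the maximum of each non-overlapping size x size sub-matrix."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [
            max(matrix[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, cols, size)
        ]
        for r in range(0, rows, size)
    ]


# Hypothetical 8x8 input; pooling with 4x4 sub-matrices yields a 2x2 result.
data = [[(7 * r + 3 * c) % 11 for c in range(8)] for r in range(8)]
pooled = max_pool(data)
```

Applied to the input of Figure 27, which has 1600 rows of 512 words, 4x4 pooling would produce 400 rows of 128 values, i.e., a sixteen-fold reduction in the number of elements.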

Figure 28 is a program listing of a neural network unit program that performs a pooling operation on the input data matrix of Figure 27 and writes the result back to the weight random access memory 124. In the example of Figure 28, the pooling operation computes the maximum value of each 4x4 sub-matrix of the input data matrix. The program executes the instruction loop formed by the instructions at addresses 1 through 10 a number of times. The initialize neural processing unit instruction at address 0 specifies the number of times each neural processing unit 126 executes the instruction loop, which in the example of Figure 28 is a loop count value of 400, and the loop instruction at the end of the loop (at address 11) decrements the current loop count value and, if the result is non-zero, returns control to the top of the instruction loop (i.e., to the instruction at address 1). The neural network unit program effectively treats the input data matrix in the weight random access memory 124 as 400 mutually exclusive groups of four adjacent rows, namely rows 0-3, rows 4-7, rows 8-11, and so forth to rows 1596-1599. Each group of four adjacent rows includes 128 4x4 sub-matrices, each formed by the elements at the intersection of the four rows of the group and four adjacent columns, namely columns 0-3, columns 4-7, columns 8-11, and so forth to columns 508-511. Of the 512 neural processing units 126, every fourth neural processing unit 126 (128 in total) performs a pooling operation on a corresponding 4x4 sub-matrix, and the other three of every four neural processing units 126 are unused. More specifically, neural processing units 0, 4, 8, and so forth to neural processing unit 508 each perform a pooling operation on their corresponding 4x4 sub-matrix, whose left-most column number corresponds to the neural processing unit number and whose lower row corresponds to the current weight random access memory 124 row value, which is initialized to zero by the initialize instruction at address 0 and is incremented by 4 upon each iteration of the instruction loop, as described in more detail below. The 400 iterations of the instruction loop correspond to the number of groups of 4x4 sub-matrices in the input data matrix of Figure 27 (i.e., the 1600 rows of the input data matrix divided by 4). The initialize neural processing unit instruction also clears the accumulator 202 to zero. Preferably, the loop instruction at address 11 also clears the accumulator 202 to zero. Alternatively, the maxwacc instruction at address 1 specifies that the accumulator 202 is cleared to zero.

On each iteration of the instruction loop of the program, the 128 used neural processing units 126 concurrently perform 128 pooling operations on the 128 respective 4x4 sub-matrices of the current four-row group of the input data matrix. More specifically, the pooling operation determines the maximum-valued element of the 16 elements of the 4x4 sub-matrix. In the embodiment of Figure 28, for each neural processing unit y of the 128 used neural processing units 126, the lower-left element of the 4x4 sub-matrix is element Dx,y of Figure 27, where x is the current weight random access memory 124 row number at the beginning of the instruction loop; this row is read by the maxwacc instruction at address 1 of the program of Figure 28 (the row number is also initialized by the initialize neural processing unit instruction at address 0 and is incremented upon each execution of the maxwacc instructions at addresses 3, 5 and 7). Thus, for each loop of the program, the 128 used neural processing units 126 write the maximum-valued elements of the corresponding 128 4x4 sub-matrices of the current row group back to the specified row of the weight random access memory 124. The instruction loop is described below.

The maxwacc instruction at address 1 implicitly uses the current weight random access memory 124 row, which is preferably held in the sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the instruction loop). The instruction at address 1 causes each neural processing unit 126 to read its corresponding word from the current row of the weight random access memory 124, compare the word with the accumulator 202 value 217, and store the larger of the two values in the accumulator 202. Thus, for example, neural processing unit 8 determines the maximum of the accumulator 202 value 217 and the data word Dx,8 (where "x" is the current weight random access memory 124 row) and writes it back to the accumulator 202.

The instruction at address 2 is a maxwacc instruction that specifies that the value in the multiplexing register 705 of each neural processing unit 126 is rotated to the adjacent neural processing unit 126; here, these values are the row of input data matrix values just read from the weight random access memory 124 in response to the instruction at address 1. In the embodiment of Figures 27 and 28, the neural processing units 126 rotate the multiplexing register 705 values to the left, i.e., from neural processing unit J to neural processing unit J-1, as described above in the sections corresponding to Figures 24 through 26. In addition, the instruction specifies a count value of 3. Thus, the instruction at address 2 causes each neural processing unit 126 to receive the rotated word into the multiplexing register 705 and determine the maximum of the rotated word and the accumulator 202 value, and then to repeat this operation two more times. That is, each neural processing unit 126 three times receives the rotated word into the multiplexing register 705 and determines the maximum of the rotated word and the accumulator 202 value. Thus, for example, assuming the current weight random access memory 124 row at the beginning of the instruction loop is 36, after the instructions at addresses 1 and 2 execute, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the beginning of the loop and the four weight random access memory 124 words D36,8, D36,9, D36,10 and D36,11.

The maxwacc instructions at addresses 3 and 4 perform operations similar to those of the instructions at addresses 1 and 2; by virtue of the weight random access memory 124 row increment indicator, however, they operate on the next row of the weight random access memory 124. That is, assuming the current weight random access memory 124 row at the beginning of the instruction loop is 36, after the instructions at addresses 1 through 4 complete, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the beginning of the loop and the eight weight random access memory 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10 and D37,11.

The maxwacc instructions at addresses 5 through 8 perform operations similar to those of the instructions at addresses 1 through 4, but on the next two rows of the weight random access memory 124. That is, assuming the current weight random access memory 124 row at the beginning of the instruction loop is 36, after the instructions at addresses 1 through 8 complete, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the beginning of the loop and the sixteen weight random access memory 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10, D37,11, D38,8, D38,9, D38,10, D38,11, D39,8, D39,9, D39,10 and D39,11. In other words, neural processing unit 8 will have determined the maximum of the following 4x4 sub-matrix:

    D36,8  D36,9  D36,10  D36,11
    D37,8  D37,9  D37,10  D37,11
    D38,8  D38,9  D38,10  D38,11
    D39,8  D39,9  D39,10  D39,11

More generally, after the instructions at addresses 1 through 8 complete, each of the 128 used neural processing units 126 will have determined the maximum of the following 4x4 sub-matrix:

    Dr,n    Dr,n+1    Dr,n+2    Dr,n+3
    Dr+1,n  Dr+1,n+1  Dr+1,n+2  Dr+1,n+3
    Dr+2,n  Dr+2,n+1  Dr+2,n+2  Dr+2,n+3
    Dr+3,n  Dr+3,n+1  Dr+3,n+2  Dr+3,n+3

where r is the current weight random access memory 124 row address value at the beginning of the instruction loop, and n is the number of the neural processing unit 126.
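
The sequence at addresses 1 through 8 can likewise be modeled in Python. The sketch below is an illustration under stated assumptions, not the program of Figure 28: the function name `pool_pass`, the sixteen-unit array, and the non-negative toy data are introduced here, and the accumulators are taken to be zero at the start of the pass, as the text describes.

```python
def pool_pass(weight_rows):
    """Model one pass of the loop body (addresses 1-8) over four weight-memory rows."""
    n = len(weight_rows[0])
    acc = [0] * n  # accumulators cleared to zero before the pass, per the text
    for row in weight_rows:
        # addresses 1, 3, 5, 7: read a row and keep the larger of word and accumulator
        mux_reg = list(row)
        acc = [max(a, w) for a, w in zip(acc, mux_reg)]
        # addresses 2, 4, 6, 8: rotate left (unit J to unit J-1), count of 3
        for _ in range(3):
            mux_reg = mux_reg[1:] + mux_reg[:1]
            acc = [max(a, w) for a, w in zip(acc, mux_reg)]
    return acc


# Hypothetical non-negative data: 16 units, rows 36 through 39.
N = 16
rows = [[(7 * r * c + 3 * r + c) % 23 for c in range(N)] for r in range(36, 40)]
acc = pool_pass(rows)
valid = acc[0::4]  # only every fourth unit holds a sub-matrix maximum
```

In this model every unit computes a maximum, but only units 0, 4, 8 and so forth cover the non-overlapping 4x4 sub-matrices; the remaining accumulator values correspond to the unused units. Because the accumulators start at zero, the result equals the sub-matrix maximum only when the data words are non-negative.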

The instruction at address 9 passes the accumulator 202 value 217 through the activation function unit 212. The pass-through function passes a word whose size (in bits) equals that of the words read from the weight random access memory 124 (16 bits in this example). Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail in later sections.

The instruction at address 10 writes the accumulator 202 value 217 to the row of the weight RAM 124 specified by the current value of the output row register, which was initialized by the instruction at address 0 and which is incremented on each pass through the loop by virtue of the increment indicator in the instruction. More specifically, the instruction at address 10 writes a wide word (e.g., 16 bits) of the accumulator 202 to the weight RAM 124. Preferably, the instruction writes the 16 bits as specified by the output binary point 2916, as described in more detail below with respect to Figures 29A and 29B.

As described above, each row written to the weight RAM 124 by an iteration of the instruction loop contains holes that hold invalid values. That is, wide words 1 through 3, 5 through 7, 9 through 11, and so forth up to wide words 509 through 511 of the result 133 are invalid or unused. In one embodiment, the activation function unit 212 includes a multiplexer that enables the results to be packed into adjacent words of a row buffer, such as the row buffer 1104 of Figure 11, for writing back to the output weight RAM 124 row. Preferably, the activation function instruction specifies the number of words in each hole, and the number of words in the hole controls the multiplexer that packs the results. In one embodiment, the number of holes may be specified as a value from 2 to 6 in order to pack the output of pooling 3x3, 4x4, 5x5, 6x6, or 7x7 sub-matrices. Alternatively, the architectural program executing on the processor 100 reads the resulting sparse (i.e., containing holes) result rows from the weight RAM 124 and performs the packing function using other execution units 112, such as a media unit using architectural pack instructions, e.g., x86 Streaming SIMD Extensions (SSE) instructions. In a concurrent manner similar to those described above, and exploiting the hybrid nature of the neural network unit 121, the architectural program executing on the processor 100 may read the status register 127 to monitor the most recently written row of the weight RAM 124 (e.g., field 2602 of Figure 26B) in order to read a resulting sparse row, pack it, and write it back to the same row of the weight RAM 124, so that it is ready to be used as an input data matrix for the next layer of the neural network, such as a convolution layer or a classic neural network layer (i.e., a multiply-accumulate layer). Furthermore, although the embodiment described herein performs pooling operations on 4x4 sub-matrices, the invention is not so limited; the NNU program of Figure 28 may be modified to perform pooling operations on sub-matrices of other sizes, such as 3x3, 5x5, 6x6, or 7x7.
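The packing of a sparse result row may be sketched as follows. This is a minimal Python illustration, assuming that valid results sit at words 0, 4, 8, and so on of a 512-word row, as in the 4x4 example above; the function name `pack_row` and the marker value -1 for hole words are hypothetical.

```python
def pack_row(sparse_row, hole_words=3):
    """Keep every (hole_words + 1)-th word, dropping the hole words
    that follow each valid result."""
    stride = hole_words + 1
    return sparse_row[::stride]

# Hypothetical 512-word result row: valid words at indices 0, 4, 8, ...
# and -1 standing in for the invalid hole words.
sparse = [i if i % 4 == 0 else -1 for i in range(512)]
packed = pack_row(sparse)   # 128 adjacent valid words
```

With `hole_words=3` the sketch corresponds to 4x4 pooling; values from 2 to 6 would correspond to the 3x3 through 7x7 cases mentioned above.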

As may be observed from the foregoing, the number of result rows written to the weight RAM 124 is one-fourth the number of rows of the input data matrix. Finally, the data RAM 122 is not used in this example. However, the data RAM 122, rather than the weight RAM 124, may alternatively be used to perform the pooling operation.

In the embodiment of Figures 27 and 28, the pooling operation computes the maximum value of each sub-region. However, the program of Figure 28 may be modified to compute the average value of each sub-region, for example by replacing the maxwacc instructions with sumwacc instructions (which sum the weight word with the accumulator 202 value 217) and changing the activation function instruction at address 9 to divide the accumulated result by the number of elements in each sub-region (preferably via reciprocal multiplication, as described below), which is sixteen in this example.
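As a hedged illustration of average pooling by reciprocal multiplication, the following Python sketch sums the words of a sub-region and multiplies the sum by a fixed-point reciprocal instead of dividing. The 16 fractional bits chosen for the reciprocal, the truncating final shift, and the function name are assumptions of this sketch, not details taken from the embodiments.

```python
RECIP_FRAC_BITS = 16  # assumed precision of the fixed-point reciprocal

def average_pool(words):
    """Average the words by multiplying their sum by 1/len(words),
    with the reciprocal held as a fixed-point integer."""
    total = sum(words)  # corresponds to the sumwacc accumulation
    recip = round((1 << RECIP_FRAC_BITS) / len(words))  # 1/16 -> 4096
    # The multiply replaces the divide; the shift drops the reciprocal's
    # fractional bits (truncating toward negative infinity).
    return (total * recip) >> RECIP_FRAC_BITS

avg = average_pool(list(range(16)))  # sum is 120, 120/16 = 7.5, truncated
```

For a sixteen-element sub-region the reciprocal is exactly representable, so the result equals the truncated quotient.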

As may be observed from the operation of the neural network unit 121 according to Figures 27 and 28, each execution of the program of Figure 28 takes approximately 6000 clock cycles to perform a pooling operation on the entire 512x1600 data matrix of Figure 27, which is considerably fewer clock cycles than are required to perform a similar task by conventional methods.

Alternatively, rather than writing the results of the pooling operation back to the weight RAM 124, the architectural program may configure the NNU program to write the results back to rows of the data RAM 122, and the architectural program reads the results from the data RAM 122 as the neural network unit 121 writes them (e.g., using the address of the most recently written data RAM 122 row 2606 of Figure 26B). This arrangement is suited to an embodiment having a single-ported weight RAM 124 and a dual-ported data RAM 122.

Fixed-Point Arithmetic with User-Supplied Binary Points, Full-Precision Fixed-Point Accumulation, User-Specified Reciprocal Value, Stochastic Rounding of Accumulator Value, and Selectable Activation/Output Functions

Generally speaking, hardware units that perform arithmetic in digital computing devices are commonly divided into "integer" units and "floating-point" units, according to whether they perform arithmetic on integers or floating-point numbers. A floating-point number has a magnitude (or mantissa) and an exponent, and typically a sign. The exponent is an indication of the location of the radix point (typically a binary point) with respect to the magnitude. In contrast, an integer has no exponent, but only a magnitude, and typically a sign. A floating-point unit enables a programmer to work with numbers drawn from a very large range of values, and the hardware takes care of adjusting the exponent values as needed without the programmer having to do so. For example, assume the two floating-point numbers 0.111 x 10²⁹ and 0.81 x 10³¹ are multiplied. (A decimal, or base 10, example is used here, although floating-point units most commonly work with base 2 floating-point numbers.) The floating-point unit automatically takes care of multiplying the mantissas, adding the exponents, and then normalizing the result to a value of .8991 x 10⁵⁹. As another example, assume the same two floating-point numbers are added. The floating-point unit automatically takes care of aligning the radix points of the mantissas before adding them to generate a sum of .81111 x 10³¹.
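The two decimal examples can be checked with a small Python sketch that mimics what a floating-point unit does: multiply mantissas and add exponents, or align radix points and add, then normalize. The (mantissa, exponent) tuple representation and the helper names are assumptions of this sketch; exact rational arithmetic from the standard `fractions` module is used so that no binary rounding intrudes.

```python
from fractions import Fraction

def normalize(mant, exp):
    """Scale a nonzero mantissa into [0.1, 1) by adjusting the base-10 exponent."""
    while mant != 0 and abs(mant) >= 1:
        mant /= 10
        exp += 1
    while mant != 0 and abs(mant) < Fraction(1, 10):
        mant *= 10
        exp -= 1
    return mant, exp

def fmul(a, b):
    # Multiply the mantissas, add the exponents, then normalize.
    return normalize(a[0] * b[0], a[1] + b[1])

def fadd(a, b):
    # Align the smaller-exponent operand to the larger exponent, then add.
    if a[1] < b[1]:
        a, b = b, a
    shift = a[1] - b[1]
    return normalize(a[0] + b[0] / Fraction(10) ** shift, a[1])

x = (Fraction("0.111"), 29)
y = (Fraction("0.81"), 31)
product = fmul(x, y)  # mantissa 0.8991, exponent 59
total = fadd(x, y)    # mantissa 0.81111, exponent 31
```

The unnormalized product is 0.08991 x 10⁶⁰, which normalization rewrites as 0.8991 x 10⁵⁹.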

However, the complexity associated with floating-point units is well known to increase their size, power consumption, and clocks per instruction, and/or to lengthen cycle times. For this reason, many devices (e.g., embedded processors, microcontrollers, and relatively low-cost and/or low-power microprocessors) do not include a floating-point unit. As may be observed from the examples above, the complex structure of a floating-point unit includes logic that performs the exponent calculations associated with floating-point addition and multiplication/division (adders that add/subtract the exponents of the operands to produce the resulting exponent for floating-point multiplication/division, and subtracters that subtract the operand exponents to determine the binary point alignment shift amount for floating-point addition), shifters that accomplish binary point alignment of the mantissas for floating-point addition, and shifters that normalize floating-point results. Additionally, such units typically require logic to perform rounding of floating-point results, logic to convert between integer and floating-point formats and between different floating-point precision formats (e.g., extended precision, double precision, single precision, half precision), leading zero and leading one detectors, and logic to handle special floating-point numbers such as denormal numbers, NaNs, and infinity.

Additionally, verifying the correctness of a floating-point unit becomes substantially more complex because of the larger number space over which the design must be verified, which may lengthen the product development cycle and time to market. Still further, as described above, floating-point arithmetic requires separate mantissa and exponent fields to be stored and used for each floating-point number involved in the computation, which may increase the amount of storage required and/or reduce precision given an equal amount of storage for storing integers. Many of these disadvantages are avoided by performing arithmetic operations with integer units.

Programmers frequently need to write programs that process fractional numbers, i.e., numbers that are not whole numbers. Such programs may need to run on processors that lack a floating-point unit, or on processors that have one but on which integer instructions executed by the integer units are faster. To take advantage of the performance advantages of integer processors, programmers employ what is commonly known as fixed-point arithmetic on fixed-point numbers. Such programs include instructions that execute on integer units to process integer data. The software is aware that the data is fractional and includes instructions that perform operations on the integer data to account for the fact that the data is actually fractional, e.g., alignment shifts. Essentially, the fixed-point software manually performs some or all of the functionality that a floating-point unit performs.

As used herein, a "fixed-point" number (or value or operand or input or output) is a number whose bits of storage are understood to include bits that represent a fractional portion of the fixed-point number, referred to herein as "fractional bits." The bits of storage of the fixed-point number are held in a memory or register, e.g., as an 8-bit or 16-bit word in a memory or register. Furthermore, the bits of storage of the fixed-point number are all used to express a magnitude, and in some cases one of the bits is used to express a sign, but none of the storage bits of the fixed-point number is used to express an exponent of the number. Furthermore, the number of fractional bits, or binary point location, of the fixed-point number is specified in storage that is distinct from the storage bits of the fixed-point number, and that indicates the number of fractional bits, or binary point location, in a shared or global fashion for a set of fixed-point numbers to which the fixed-point number belongs, such as the set of input operands, accumulated values, or output results of an array of processing units.
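A minimal Python sketch of this definition follows: the words themselves are plain integers, and a single shared value, stored apart from the words, says how many of their low-order bits are fractional. The helper names and the choice of five fractional bits are assumptions made for illustration.

```python
def to_fixed(value, frac_bits):
    """Encode a real value as an integer word with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))

def from_fixed(word, frac_bits):
    """Interpret an integer word using the shared number of fractional bits."""
    return word / (1 << frac_bits)

# One shared binary point location for the whole set of words (assumed: 5).
SHARED_FRAC_BITS = 5
words = [to_fixed(v, SHARED_FRAC_BITS) for v in (1.5, -0.25, 3.0)]
values = [from_fixed(w, SHARED_FRAC_BITS) for w in words]
```

No word carries an exponent; changing `SHARED_FRAC_BITS` reinterprets every word in the set at once.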

In the embodiments described herein, the arithmetic logic units are integer units, whereas the activation function units include floating-point arithmetic hardware assist, or acceleration. This enables the arithmetic logic unit portions to be smaller and faster, which facilitates placing more arithmetic logic units in a given amount of die space. This in turn means more neurons per unit of die space, which is particularly advantageous in a neural network unit.

Furthermore, in contrast to floating-point numbers, each of which requires exponent storage bits, the fixed-point numbers of the embodiments described herein are expressed with an indication of the number of storage bits that are fractional bits for an entire set of numbers; the indication resides in a single, shared storage that globally indicates the number of fractional bits for all the numbers of the entire set, e.g., the set of inputs to a series of operations, the set of accumulated values of the series, or the set of outputs. Preferably, the user of the neural network unit is able to specify the number of fractional storage bits for the set of numbers. Thus, it should be understood that although in many contexts (e.g., common mathematics) the term "integer" refers to a signed whole number, i.e., a number without a fractional portion, in the present context the term "integer" may refer to numbers that have a fractional portion. Furthermore, in the present context the term "integer" is intended to distinguish from floating-point numbers, for which a portion of the bits of their individual storage is used to express an exponent of the floating-point number. Similarly, an integer arithmetic operation, such as an integer multiply or add or compare performed by an integer unit, assumes the operands do not have an exponent; therefore, the integer elements of the integer unit, e.g., integer multiplier, integer adder, integer comparator, do not need to include logic to deal with exponents, e.g., they do not shift mantissas to align binary points for addition or compare operations and do not add exponents for multiply operations.

Additionally, the embodiments described herein include a large hardware integer accumulator to accumulate a large series of integer operations (e.g., on the order of 1000 multiply-accumulates) without loss of precision. This enables the neural network unit to avoid dealing with floating-point numbers while retaining full precision in the accumulated values, without having to saturate them or incur inaccurate results due to overflows. Once the series of integer operations has accumulated a result into the full-precision accumulator, the fixed-point hardware assist performs the necessary scaling and saturating to convert the full-precision accumulated value to an output value, using the user-specified indications of the number of fractional bits of the accumulated value and the desired number of fractional bits of the output value, as described in more detail below.

When the accumulated value needs to be compressed from its full-precision form for use as an input to an activation function or for being passed through, preferably the activation function unit may selectively perform stochastic rounding on the accumulated value, as described in more detail below. Finally, the neural processing units may be selectively instructed to apply different activation functions and/or to output a variety of different forms of the accumulated value, as dictated by the different needs of a given layer of a neural network.

Figure 29A is a block diagram illustrating an embodiment of the control register 127 of Figure 1. The control register 127 may include a plurality of control registers 127. As shown, the control register 127 includes the following fields: configuration 2902, signed data 2912, signed weight 2914, data binary point 2922, weight binary point 2924, arithmetic logic unit function 2926, round control 2932, activation function 2934, reciprocal 2942, shift amount 2944, output RAM 2952, output binary point 2954, and output command 2956. The control register 127 values may be written both by an MTNN instruction 1400 and by an instruction of an NNU program, such as an initiate instruction.

The configuration 2902 value specifies whether the neural network unit 121 is in a narrow configuration, a wide configuration, or a funnel configuration, as described above. The configuration 2902 also determines the size of the input words received from the data RAM 122 and the weight RAM 124. In the narrow and funnel configurations, the size of the input words is narrow (e.g., 8 bits or 9 bits), whereas in the wide configuration the size of the input words is wide (e.g., 12 bits or 16 bits). Furthermore, the configuration 2902 determines the size of the output result 133, which is the same as the input word size.

The signed data 2912 value, if true, indicates that the data words received from the data RAM 122 are signed values, and if false, indicates that they are unsigned values. The signed weight 2914 value, if true, indicates that the weight words received from the weight RAM 124 are signed values, and if false, indicates that they are unsigned values.

The data binary point 2922 value indicates the location of the binary point for the data words received from the data RAM 122. Preferably, the data binary point 2922 value indicates the number of bit positions from the right for the location of the binary point. Stated alternatively, the data binary point 2922 indicates how many of the least significant bits of the data word are fractional bits, i.e., to the right of the binary point. Similarly, the weight binary point 2924 value indicates the location of the binary point for the weight words received from the weight RAM 124. Preferably, when the arithmetic logic unit function 2926 is a multiply and accumulate or output accumulator, the neural processing unit 126 determines the number of bits to the right of the binary point for the value held in the accumulator 202 as the sum of the data binary point 2922 and the weight binary point 2924. Thus, for example, if the value of the data binary point 2922 is 5 and the value of the weight binary point 2924 is 3, then the value in the accumulator 202 has 8 bits to the right of the binary point. When the arithmetic logic unit function 2926 is a sum/maximum of the accumulator and a data/weight word or a pass-through of a data/weight word, the neural processing unit 126 determines the number of bits to the right of the binary point for the value held in the accumulator 202 as the data/weight binary point 2922/2924, respectively. In an alternate embodiment, a single accumulator binary point 2923 is specified, rather than individual data binary point 2922 and weight binary point 2924 values, as described in more detail below with respect to Figure 29B.
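The rule that the accumulator's fractional bit count is the sum of the data and weight fractional bit counts can be illustrated with a short Python sketch. The operand values and variable names are assumptions chosen so that every quantity is exactly representable; only the binary point values 5 and 3 come from the example above.

```python
DATA_BP = 5    # data binary point: 5 fractional bits
WEIGHT_BP = 3  # weight binary point: 3 fractional bits

# Hypothetical operands, stored as plain integers.
data_words = [48, -8]    # 1.5 and -0.25 with 5 fractional bits
weight_words = [4, 16]   # 0.5 and 2.0 with 3 fractional bits

# Integer multiply-accumulate: no exponent handling is needed.
acc = sum(d * w for d, w in zip(data_words, weight_words))

# Each product has DATA_BP + WEIGHT_BP fractional bits, so the sum does too.
ACC_BP = DATA_BP + WEIGHT_BP
acc_value = acc / (1 << ACC_BP)  # 1.5*0.5 + (-0.25)*2.0 = 0.25
```

The integer accumulator holds 64, which reads as 0.25 once the 8 fractional bits are taken into account.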

The arithmetic logic unit function 2926 specifies the function performed by the arithmetic logic unit 204 of the neural processing unit 126. As described above, the arithmetic logic unit functions 2926 may include, but are not limited to: multiply the data word 209 and the weight word 203 and accumulate the product with the accumulator 202; sum the accumulator 202 and the weight word 203; sum the accumulator 202 and the data word 209; maximum of the accumulator 202 and the data word 209; maximum of the accumulator 202 and the weight word 203; output the accumulator 202; pass through the data word 209; pass through the weight word 203; and output zero. In one embodiment, the arithmetic logic unit function 2926 is specified by an NNU initialize instruction and used by the arithmetic logic unit 204 in response to an execute instruction (not shown). In one embodiment, the arithmetic logic unit function 2926 is specified by individual NNU instructions, such as the multiply-accumulate and maxwacc instructions described above.

The round control 2932 specifies which form of rounding is used by the rounder 3004 (of Figure 30). In one embodiment, the rounding modes that may be specified include, but are not limited to: no rounding, round to nearest, and stochastic rounding. Preferably, the processor 100 includes a random bit source 3003 (see Figure 30) that generates random bits 3005, which are sampled and used to perform the stochastic rounding in order to reduce the likelihood of a rounding bias. In one embodiment, when the round bit 3005 is one and the sticky bit is zero, the neural processing unit 126 rounds up if the sampled random bit 3005 is true and does not round up if the sampled random bit 3005 is false. In one embodiment, the random bit source 3003 generates the random bits 3005 by sampling random electrical characteristics of the processor 100, such as thermal noise across a semiconductor diode or resistor, although the invention is not so limited.
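The stochastic rounding embodiment can be sketched in Python as follows. The source states only the tie case (round bit one, sticky bit zero); the handling of the other cases here (round up when both bits are one, otherwise truncate) is an assumption that follows ordinary round-to-nearest, and the function name and the restriction to non-negative values are likewise assumptions.

```python
import random

def stochastic_round_shift(value, shift, random_bit):
    """Drop the low `shift` bits of a non-negative integer.

    round bit  = most significant dropped bit
    sticky bit = OR of all remaining dropped bits
    """
    if shift == 0:
        return value
    kept = value >> shift
    round_bit = (value >> (shift - 1)) & 1
    sticky_bit = int(value & ((1 << (shift - 1)) - 1) != 0)
    if round_bit and sticky_bit:
        return kept + 1                      # above the halfway point (assumed)
    if round_bit and not sticky_bit:
        return kept + 1 if random_bit else kept  # exact tie: random bit decides
    return kept                              # below the halfway point (assumed)

# Usage: 0b1010 >> 2 is an exact tie (2.5), resolved by one random bit.
rounded = stochastic_round_shift(0b1010, 2, bool(random.getrandbits(1)))
```

Over many ties the result rounds up about half the time, which is what reduces the systematic bias that a fixed tie-breaking rule would introduce.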

The activation function 2934 specifies the function applied to the accumulator 202 value 217 to generate the output 133 of the neural processing unit 126. As described herein, the activation functions 2934 include, but are not limited to: sigmoid; hyperbolic tangent; softplus; rectify; divide by a specified power of two; multiply by a user-specified reciprocal value to accomplish an effective division; pass through the full accumulator; and pass through the accumulator as a canonical size, as described in more detail below. In one embodiment, the activation function is specified by an NNU activation function instruction. Alternatively, the activation function is specified by the initialize instruction and applied in response to an output instruction, e.g., the activation function unit output instruction at address 4 of Figure 4, in which embodiment the activation function instruction at address 3 of Figure 4 is subsumed by the output instruction.

The reciprocal 2942 value specifies a value that is multiplied by the accumulator 202 value 217 to accomplish a division of the accumulator 202 value 217. That is, the user specifies the reciprocal 2942 value as the reciprocal of the actual desired divisor. This is useful, for example, in conjunction with convolution and pooling operations, as described herein. Preferably, the user specifies the reciprocal 2942 value in two parts, as described in more detail below with respect to Figure 29C. In one embodiment, the control register 127 includes a field (not shown) that enables the user to specify division by one of a plurality of built-in divisor values that are the sizes of commonly used convolution kernels, e.g., 9, 25, 36, or 49. In such an embodiment, the activation function unit 212 stores the reciprocals of the built-in divisors for multiplication by the accumulator 202 value 217.

The shift amount 2944 specifies the number of bits by which a shifter of the activation function unit 212 shifts the accumulator 202 value 217 to the right in order to accomplish a division by a power of two. This is useful in conjunction with convolution kernels whose size is a power of two.

The output RAM 2952 value specifies which of the data RAM 122 and the weight RAM 124 is to receive the output result 133.

The output binary point 2954 value indicates the location of the binary point for the output result 133. Preferably, the output binary point 2954 value indicates the number of bit positions from the right for the location of the binary point of the output result 133. Stated alternatively, the output binary point 2954 indicates how many of the least significant bits of the output result 133 are fractional bits, i.e., to the right of the binary point. The activation function unit 212 performs rounding, compression, saturation, and size conversion based on the value of the output binary point 2954 (as well as, in most cases, based on the values of the data binary point 2922, the weight binary point 2924, the activation function 2934, and/or the configuration 2902).
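A simplified Python sketch of scaling an accumulated value to the output binary point and saturating it to the output word size is given below. It is an illustration under stated assumptions, not the behavior of the activation function unit 212: rounding is omitted (the shift simply truncates), the output is assumed to be a signed 16-bit word, and the function name is hypothetical.

```python
def to_output(acc, acc_frac_bits, out_frac_bits, out_bits=16):
    """Rescale acc from acc_frac_bits to out_frac_bits fractional bits,
    then saturate to a signed out_bits-wide word (rounding omitted)."""
    shift = acc_frac_bits - out_frac_bits
    scaled = acc >> shift if shift >= 0 else acc << -shift
    lowest = -(1 << (out_bits - 1))
    highest = (1 << (out_bits - 1)) - 1
    return max(lowest, min(highest, scaled))

# Accumulator holds 64 with 8 fractional bits (0.25); output uses 4.
out_word = to_output(64, acc_frac_bits=8, out_frac_bits=4)  # 4, i.e. 4/16 = 0.25
```

Values too large for the output word clamp to the largest or smallest representable word instead of wrapping around.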

The output command 2956 controls various aspects of the output result 133. In one embodiment, the activation function unit 212 employs the notion of a canonical size, which is twice the width (in bits) specified by the configuration 2902. Thus, for example, if the configuration 2902 sets the size of the input words received from the data random access memory 122 and the weight random access memory 124 to 8 bits, the canonical size is 16 bits; in another example, if the configuration 2902 sets the size of the input words to 16 bits, the canonical size is 32 bits. As described herein, the accumulator 202 is larger (for example, the narrow accumulator 202B is 28 bits and the wide accumulator 202A is 41 bits) in order to preserve the full precision of intermediate computations, such as 1024 and 512 neural network unit multiply-accumulate instructions. Consequently, the accumulator 202 value 217 is larger (in bits) than the canonical size, and for most values of the activation function 2934 (all except pass-through of the full accumulator), the activation function unit 212 (for example, the canonical size compressor 3008 described below with respect to Figure 30) compresses the accumulator 202 value 217 down to the canonical size.

A first predetermined value of the output command 2956 instructs the activation function unit 212 to perform the specified activation function 2934 to generate an internal result that is the same size as the original input words, that is, half the canonical size, and to output the internal result as the output result 133. A second predetermined value of the output command 2956 instructs the activation function unit 212 to perform the specified activation function 2934 to generate an internal result that is twice the size of the original input words, that is, the canonical size, and to output the lower half of the internal result as the output result 133; a third predetermined value instructs the activation function unit 212 to output the upper half of the canonical-size internal result as the output result 133. A fourth predetermined value of the output command 2956 instructs the activation function unit 212 to output the raw least significant word of the accumulator 202 as the output result 133; a fifth predetermined value instructs it to output the raw middle significant word of the accumulator 202; and a sixth predetermined value instructs it to output the raw most significant word of the accumulator 202 (whose width is specified by the configuration 2902), as described in more detail above with respect to Figures 8 through 10. As noted above, outputting the full accumulator 202 size or the canonical-size internal result is advantageous because it enables other execution units 112 of the processor 100 to perform activation functions, such as the softmax activation function.

Although the fields of Figure 29A (and Figures 29B and 29C) are described as residing in the control register 127, the invention is not so limited; one or more of the fields may reside in other parts of the neural network unit 121. In a preferred embodiment, many of the fields may be included in the neural network unit instructions themselves and decoded by the sequencer 128 to generate a micro-instruction 3416 (see Figure 34) that controls the arithmetic logic units 204 and/or the activation function units 212. Additionally, the fields may be included in a micro-operation 3414 (see Figure 34) stored in a media register 118 in order to control the arithmetic logic units 204 and/or the activation function units 212. Such embodiments reduce the use of the initialize neural network unit instruction, and in other embodiments the initialize neural network unit instruction may be eliminated.

As described above, a neural network unit instruction can specify that an arithmetic logic operation be performed on memory operands (for example, words from the data random access memory 122 and/or the weight random access memory 124) or on a rotated operand (for example, from the multiplexed registers 208/705). In one embodiment, a neural network unit instruction may also specify an operand as the registered output of an activation function (for example, the output of the register 3038 of Figure 30). Additionally, as described above, a neural network unit instruction can specify that a current row address of the data random access memory 122 or the weight random access memory 124 be incremented. In one embodiment, the neural network unit instruction may specify an immediate signed integer delta value that is added to the current row, in order to increment or decrement by a value other than one.

Figure 29B is a block diagram illustrating another embodiment of the control register 127 of Figure 1. The control register 127 of Figure 29B is similar to the control register 127 of Figure 29A; however, it additionally includes an accumulator binary point 2923. The accumulator binary point 2923 indicates the location of the binary point of the accumulator 202. In a preferred embodiment, the accumulator binary point 2923 value indicates the number of bit positions, counted from the right, of the binary point. In other words, the accumulator binary point 2923 indicates how many of the least significant bits of the accumulator 202 are fractional bits, that is, the bits to the right of the binary point. In this embodiment, the accumulator binary point 2923 is specified explicitly, rather than being determined implicitly as in the embodiment of Figure 29A.

Figure 29C is a block diagram illustrating an embodiment in which the reciprocal 2942 of Figure 29A is stored in two parts. The first part 2962 is a shift value that indicates the number of suppressed leading zeros 2962 in the true reciprocal value by which the user wants to multiply the accumulator 202 value 217. The number of leading zeros is the number of consecutive zeros immediately to the right of the binary point. The second part 2964 is the leading-zero-suppressed reciprocal value, that is, the true reciprocal value with all leading zeros removed. In one embodiment, the number of suppressed leading zeros 2962 is stored as 4 bits, and the leading-zero-suppressed reciprocal value 2964 is stored as an 8-bit unsigned value.

For example, assume the user wants to multiply the accumulator 202 value 217 by the reciprocal of 49. The reciprocal of 49, represented in binary with 13 fractional bits, is 0.0000010100111, which has five leading zeros. In this case, the user populates the number of suppressed leading zeros 2962 with the value 5 and the leading-zero-suppressed reciprocal value 2964 with the value 10100111. After the reciprocal multiplier ("divider A") 3014 (see Figure 30) multiplies the accumulator 202 value 217 by the leading-zero-suppressed reciprocal value 2964, the resulting product is shifted right according to the number of suppressed leading zeros 2962. Such an embodiment can express the reciprocal 2942 value with high precision using a relatively small number of bits.
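The two-part encoding can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hardware described here: the function names `encode_reciprocal` and `apply_reciprocal` are hypothetical, and the total right shift used when applying the reciprocal (the suppressed leading zeros plus the 8 fractional bits of the stored value) is an assumption about how the product is renormalized.

```python
def encode_reciprocal(divisor, value_bits=8, max_shift=15):
    """Split 1/divisor into (suppressed_leading_zeros, stored_value).

    stored_value is the reciprocal with its leading zeros removed,
    truncated to value_bits bits (hypothetical helper, divisor >= 2).
    """
    if divisor < 2:
        raise ValueError("divisor must be at least 2")
    for leading_zeros in range(max_shift + 1):
        # Reciprocal scaled to (leading_zeros + value_bits) fractional bits.
        scaled = (1 << (leading_zeros + value_bits)) // divisor
        # Stop once the top bit of the value_bits-wide field is set.
        if scaled >> (value_bits - 1):
            return leading_zeros, scaled
    raise ValueError("reciprocal needs more than max_shift leading zeros")


def apply_reciprocal(acc, leading_zeros, stored_value, value_bits=8):
    """Multiply acc by the encoded reciprocal (assumed renormalization)."""
    return (acc * stored_value) >> (leading_zeros + value_bits)


shift, value = encode_reciprocal(49)
print(shift, bin(value))                      # 5 0b10100111
print(apply_reciprocal(4900, shift, value))   # 99
```

For a divisor of 49 the sketch yields a shift of 5 and the stored value 10100111, matching the example above. Because the stored value is truncated to 8 bits, the computed quotient (99) falls slightly below the exact quotient (100).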

Figure 30 is a block diagram illustrating an embodiment of the activation function unit 212 of Figure 2. The activation function unit 212 includes: the control register 127 of Figure 1; a positive form converter (PFC) and output binary point aligner (OBPA) 3002 that receives the accumulator 202 value 217; a rounder 3004 that receives the accumulator 202 value 217 and an indication of the number of bits shifted out by the output binary point aligner 3002; a random bit source 3003, as described above, that generates random bits 3005; a first multiplexer 3006 that receives the output of the positive form converter and output binary point aligner 3002 and the output of the rounder 3004; a canonical size compressor (CCS) and saturator 3008 that receives the output of the first multiplexer 3006; a bit selector and saturator 3012 that receives the output of the canonical size compressor and saturator 3008; a rectifier 3018 that receives the output of the canonical size compressor and saturator 3008; a reciprocal multiplier 3014 that receives the output of the canonical size compressor and saturator 3008; a right shifter 3016 that receives the output of the canonical size compressor and saturator 3008; a hyperbolic tangent (tanh) module 3022 that receives the output of the bit selector and saturator 3012; a sigmoid module 3024 that receives the output of the bit selector and saturator 3012; a softplus module 3026 that receives the output of the bit selector and saturator 3012; a second multiplexer 3032 that receives the outputs of the tanh module 3022, the sigmoid module 3024, the softplus module 3026, the rectifier 3018, the reciprocal multiplier 3014, and the right shifter 3016, as well as the passed-through canonical-size output 3028 of the canonical size compressor and saturator 3008; a sign restorer 3034 that receives the output of the second multiplexer 3032; a size converter and saturator 3036 that receives the output of the sign restorer 3034; a third multiplexer 3037 that receives the output of the size converter and saturator 3036 and the accumulator output 217; and an output register 3038 that receives the output of the multiplexer 3037 and whose output is the result 133 of Figure 1.

The positive form converter and output binary point aligner 3002 receives the accumulator 202 value 217. In a preferred embodiment, as described above, the accumulator 202 value 217 is a full-precision value. That is, the accumulator 202 has enough storage bits to hold an accumulated value that is the sum, generated by the integer adder 244, of a series of products generated by the integer multiplier 242, without discarding any bits of the individual products of the multiplier 242 or of the sums of the adder, so that no precision is lost. In a preferred embodiment, the accumulator 202 has at least enough bits to hold the maximum number of product accumulations that the neural network unit 121 can be programmed to perform. For example, referring to the program of Figure 4, in the wide configuration the maximum number of product accumulations that the neural network unit 121 can be programmed to perform is 512, and the accumulator 202 is 41 bits wide. In another example, referring to the program of Figure 20, in the narrow configuration the maximum number of product accumulations is 1024, and the accumulator 202 is 28 bits wide. In general, the full-precision accumulator 202 has at least Q bits, where Q is the sum of M and log₂P, M is the bit width of the integer product of the multiplier 242 (for example, 16 bits for a narrow multiplier 242 or 32 bits for a wide multiplier 242), and P is the maximum permissible number of products that may be accumulated into the accumulator 202. In a preferred embodiment, the maximum number of product accumulations is specified by a programming specification for programmers of the neural network unit 121. In one embodiment, the sequencer 128 enforces a maximum count value, for example 511, for a multiply-accumulate neural network unit instruction (such as the instruction at address 2 of Figure 4), on the assumption that one previous multiply-accumulate instruction (such as the instruction at address 1 of Figure 4) loaded the row of data/weight words 206/207 from the data/weight random access memory 122/124.
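The relationship Q = M + log₂P can be checked with a short calculation. The helper name `min_accumulator_bits` is hypothetical, and rounding log₂P up to a whole number of bits is an assumption for values of P that are not powers of two.

```python
def min_accumulator_bits(product_bits, max_products):
    """Return Q = M + ceil(log2(P)) for an M-bit product and P accumulations."""
    # (P - 1).bit_length() equals ceil(log2(P)) for any integer P >= 1.
    return product_bits + (max_products - 1).bit_length()


print(min_accumulator_bits(16, 1024))  # narrow: 26
print(min_accumulator_bits(32, 512))   # wide: 41
```

Under these assumptions the narrow configuration needs at least 26 bits, which the 28-bit narrow accumulator satisfies, and the wide configuration needs 41 bits, which equals the width of the wide accumulator.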

Using an accumulator 202 whose bit width is large enough to accumulate a full-precision value for the maximum number of permissible accumulations simplifies the design of the arithmetic logic unit 204 of the neural processing unit 126. In particular, it alleviates the need for logic to saturate the sums generated by the integer adder 244, which would overflow a smaller accumulator, and for logic to keep track of the binary point location of the accumulator in order to determine whether an overflow has occurred and therefore whether saturation is needed. To illustrate the problem with a design that has a non-full-precision accumulator and relies instead on saturating logic to handle its overflows, assume the following.

(1) The range of the data word values is between 0 and 1, and all of the storage bits are used to store fractional bits. The range of the weight word values is between -8 and +8, and all but three of the storage bits are used to store fractional bits. The range of the accumulated values that serve as input to a hyperbolic tangent activation function is between -8 and +8, and all but three of the storage bits are used to store fractional bits.

(2) The bit width of the accumulator is non-full precision (for example, only the bit width of the products).

(3) If the accumulator were full precision, the final accumulated value would be somewhere between -8 and +8 (for example, +4.2); however, the products before a "point A" in the series tend to be positive much more frequently, whereas the products after point A tend to be negative much more frequently. In this situation, an inaccurate result (that is, a result other than +4.2) might be obtained. This is because at some point before point A the accumulator needs to reach a value above its saturation maximum of +8, for example +8.2, and the excess 0.2 is lost. The accumulator may even remain at the saturated value over further product accumulations, losing still more positive value. Thus, the final value of the accumulator could be smaller than the value that an accumulator with a full-precision bit width would have produced (that is, less than +4.2).
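The effect described in items (1) through (3) can be reproduced with a toy accumulation. The product sequence below is invented purely for illustration (values are in units of 0.1 so that the arithmetic stays exact), and the function name `accumulate` is hypothetical.

```python
def accumulate(products, limit=None):
    """Sum products; if limit is given, clamp the running total to [-limit, +limit]."""
    total = 0
    for product in products:
        total += product
        if limit is not None:
            # Saturating accumulator: anything beyond the limit is lost.
            total = max(-limit, min(limit, total))
    return total


# Hypothetical products in units of 0.1: positive before "point A", negative after.
products = [30, 30, 22, -20, -20]

full_precision = accumulate(products)        # running total peaks at 82 (+8.2)
saturating = accumulate(products, limit=80)  # running total is clamped at 80 (+8.0)

print(full_precision, saturating)  # 42 40
```

The saturating accumulator ends at +4.0 rather than +4.2 because the 0.2 that exceeded +8 was discarded before the negative products arrived.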

The positive form converter 3002 converts the accumulator 202 value 217 to a positive form when the value is negative, and generates an additional bit that indicates whether the original value was positive or negative; this bit is passed down the activation function unit 212 pipeline along with the value. Converting negative values to positive form simplifies subsequent operations in the activation function unit 212. For example, only positive values are input to the tanh module 3022 and the sigmoid module 3024, which simplifies the design of those modules. It also simplifies the rounder 3004 and the saturator 3008.

The output binary point aligner 3002 shifts the positive-form value to the right, that is, scales it, to align it with the output binary point 2954 specified in the control register 127. In a preferred embodiment, the output binary point aligner 3002 calculates the shift amount as the number of fractional bits of the accumulator 202 value 217 (for example, as specified by the accumulator binary point 2923, or by the sum of the data binary point 2922 and the weight binary point 2924) minus the number of fractional bits of the output (for example, as specified by the output binary point 2954). Thus, for example, if the accumulator 202 binary point 2923 is 8 (as in the embodiment above) and the output binary point 2954 is 3, the output binary point aligner 3002 shifts the positive-form value right by 5 bits to generate the result provided to the multiplexer 3006 and the rounder 3004.

The rounder 3004 rounds the accumulator 202 value 217. In a preferred embodiment, the rounder 3004 generates a rounded version of the positive-form value generated by the positive form converter and output binary point aligner 3002 and provides the rounded version to the multiplexer 3006. The rounder 3004 rounds according to the rounding control 2932 described above, which, as described herein, may include stochastic rounding using the random bit 3005. Based on the rounding control 2932 (which may include stochastic rounding, as described herein), the multiplexer 3006 selects one of its inputs, namely either the positive-form value from the positive form converter and output binary point aligner 3002 or the rounded version from the rounder 3004, and provides the selected value to the canonical size compressor and saturator 3008. In a preferred embodiment, if the rounding control 2932 specifies no rounding, the multiplexer 3006 selects the output of the positive form converter and output binary point aligner 3002; otherwise, it selects the output of the rounder 3004. In other embodiments, the activation function unit 212 may perform additional rounding. For example, in one embodiment, the bit selector 3012 rounds based on the lost low-order bits when it compresses the bits of the output of the canonical size compressor and saturator 3008 (described below). In another example, the product of the reciprocal multiplier 3014 (described below) is rounded. In yet another example, the size converter 3036 rounds when it converts to the proper output size (described below), since the conversion may involve discarding low-order bits that are used in the rounding determination.

The canonical size compressor 3008 compresses the multiplexer 3006 output value to the canonical size. Thus, for example, if the neural processing unit 126 is in a narrow or funnel configuration 2902, the canonical size compressor 3008 compresses the 28-bit multiplexer 3006 output value to 16 bits; if the neural processing unit 126 is in a wide configuration 2902, the canonical size compressor 3008 compresses the 41-bit multiplexer 3006 output value to 32 bits. However, before compressing to the canonical size, if the pre-compressed value is greater than the maximum value expressible in the canonical form, the saturator 3008 saturates the pre-compressed value to the maximum value expressible in the canonical form. For example, if any of the bits of the pre-compressed value to the left of the most significant canonical-form bit has the value 1, the saturator 3008 saturates to the maximum value (for example, to all 1s).
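The saturate-then-compress step can be sketched as follows. This is a simplified model under assumptions: the input is taken to be a non-negative (positive-form) integer, all canonical-size bits are treated as magnitude bits, and the name `compress_to_canonical` is hypothetical.

```python
def compress_to_canonical(value, canonical_bits):
    """Compress a positive-form value to canonical_bits, saturating on overflow."""
    max_value = (1 << canonical_bits) - 1  # all 1s in the canonical form
    # Any 1 bit to the left of the most significant canonical-form bit
    # means the value cannot be represented, so saturate.
    if value >> canonical_bits:
        return max_value
    return value


print(hex(compress_to_canonical(0x0001234, 16)))  # 0x1234 (fits, unchanged)
print(hex(compress_to_canonical(0x0051234, 16)))  # 0xffff (saturated)
```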

In a preferred embodiment, the tanh module 3022, the sigmoid module 3024, and the softplus module 3026 each comprise lookup tables, for example programmable logic arrays (PLA), read-only memories (ROM), combinational logic gates, and so forth. In one embodiment, in order to simplify and reduce the size of the modules 3022/3024/3026, the input value provided to them has a 3.4 form, that is, three integer bits and four fractional bits; in other words, the input value has four bits to the right of the binary point and three bits to the left of it. These values are chosen because at the extremes of the input value range (-8, +8) of the 3.4 form, the output values asymptotically approach their minimum/maximum values. However, the invention is not so limited and also applies to embodiments that place the binary point at a different location, for example a 4.3 form or a 2.5 form. The bit selector 3012 selects, from the bits output by the canonical size compressor and saturator 3008, the bits that satisfy the 3.4 form criteria. This involves compression, that is, some bits are lost, because the canonical form has a larger number of bits. However, before selecting/compressing the output value of the canonical size compressor and saturator 3008, if the pre-compressed value is greater than the maximum value expressible in the 3.4 form, the saturator 3012 saturates the pre-compressed value to the maximum value expressible in the 3.4 form. For example, if any of the bits of the pre-compressed value to the left of the most significant 3.4-form bit has the value 1, the saturator 3012 saturates to the maximum value (for example, to all 1s).

The tanh module 3022, the sigmoid module 3024, and the softplus module 3026 perform their respective activation functions (described above) on the 3.4-form value output by the canonical size compressor and saturator 3008 to generate a result. In a preferred embodiment, the result of the tanh module 3022 and of the sigmoid module 3024 is a 7-bit result in a 0.7 form, that is, zero integer bits and seven fractional bits, so that the value has seven bits to the right of the binary point. In a preferred embodiment, the result of the softplus module 3026 is a 7-bit result in a 3.4 form, that is, in the same form as the input to the module 3026. In a preferred embodiment, the outputs of the tanh module 3022, the sigmoid module 3024, and the softplus module 3026 are extended to the canonical form (for example, by adding leading zeros as necessary) and aligned so that the binary point is at the location specified by the output binary point 2954 value.

The rectifier 3018 generates a rectified version of the output value of the canonical size compressor and saturator 3008. That is, if the output value of the canonical size compressor and saturator 3008 (whose sign is piped down the pipeline as described above) is negative, the rectifier 3018 outputs a value of zero; otherwise, the rectifier 3018 outputs its input value. In a preferred embodiment, the output of the rectifier 3018 is in canonical form and has its binary point specified by the output binary point 2954 value.

The reciprocal multiplier 3014 multiplies the output of the canonical size compressor and saturator 3008 by the user-specified reciprocal value specified in the reciprocal value 2942 to generate a canonical-size product, which is effectively the quotient of the output of the canonical size compressor and saturator 3008 and the divisor whose reciprocal is the reciprocal value 2942. In a preferred embodiment, the output of the reciprocal multiplier 3014 is in canonical form and has its binary point specified by the output binary point 2954 value.

The right shifter 3016 shifts the output of the canonical size compressor and saturator 3008 by the user-specified number of bits specified in the shift amount value 2944 to generate a canonical-size quotient. In a preferred embodiment, the output of the right shifter 3016 is in canonical form and has its binary point specified by the output binary point 2954 value.

The multiplexer 3032 selects the appropriate input as specified by the activation function 2934 value and provides the selection to the sign restorer 3034. If the original accumulator 202 value 217 was negative, the sign restorer 3034 converts the positive-form output of the multiplexer 3032 to a negative form, for example to two's-complement form.
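The pairing of the positive form converter at the head of the pipeline with the sign restorer at its tail can be modeled as a round trip. The sketch below assumes a two's-complement input of a given bit width; the function names `to_positive_form` and `restore_sign` are hypothetical, and the intermediate processing stages are omitted.

```python
def to_positive_form(raw, width):
    """Split a width-bit two's-complement value into (magnitude, sign_bit)."""
    mask = (1 << width) - 1
    sign = (raw >> (width - 1)) & 1
    # Negate negative values so that later stages only see magnitudes.
    magnitude = (-raw) & mask if sign else raw
    return magnitude, sign


def restore_sign(magnitude, sign, width):
    """Convert a magnitude back to two's-complement if the original was negative."""
    mask = (1 << width) - 1
    return (-magnitude) & mask if sign else magnitude


raw = (1 << 28) - 5                        # -5 as a 28-bit two's-complement value
magnitude, sign = to_positive_form(raw, 28)
print(magnitude, sign)                     # 5 1
print(restore_sign(magnitude, sign, 28) == raw)  # True
```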

The size converter 3036 converts the output of the sign restorer 3034 to the proper size based on the value of the output command 2956, as described above with respect to Figure 29A. In a preferred embodiment, the output of the sign restorer 3034 has a binary point specified by the output binary point 2954 value. In a preferred embodiment, for the first predetermined value of the output command, the size converter 3036 discards the bits of the upper half of the sign restorer 3034 output. Furthermore, if the output of the sign restorer 3034 is positive and exceeds the maximum value expressible in the word size specified by the configuration 2902, or is negative and is less than the minimum value expressible in the word size, the saturator 3036 saturates its output to the maximum or minimum value, respectively, expressible in the word size. For the second and third predetermined values, the size converter 3036 passes the sign restorer 3034 output through.

The multiplexer 3037 selects either the size converter and saturator 3036 output or the accumulator 202 output 217, based on the output command 2956, and provides the selection to the output register 3038. More specifically, for the first and second predetermined values of the output command 2956, the multiplexer 3037 selects the lower word (whose size is specified by the configuration 2902) of the output of the size converter and saturator 3036. For the third predetermined value, the multiplexer 3037 selects the upper word of the output of the size converter and saturator 3036. For the fourth predetermined value, the multiplexer 3037 selects the lower word of the raw accumulator 202 value 217; for the fifth predetermined value, it selects the middle word of the raw accumulator 202 value 217; and for the sixth predetermined value, it selects the upper word of the raw accumulator 202 value 217. As described above, in a preferred embodiment the activation function unit 212 pads the upper bits of the upper word of the raw accumulator 202 value 217 with zeros.

Figure 31 illustrates an example of the operation of the activation function unit 212 of Figure 30. As shown, the configuration 2902 of the neural processing units 126 is set to the narrow configuration. Additionally, the signed data 2912 and signed weight 2914 values are true. The data binary point 2922 value indicates that the binary point of the data random access memory 122 words is located such that there are 7 bits to its right; an example value of the first data word received by one of the neural processing units 126 is shown as 0.1001110. The weight binary point 2924 value indicates that the binary point of the weight random access memory 124 words is located such that there are 3 bits to its right; an example value of the first weight word received by the neural processing unit 126 is shown as 00001.010.

The 16-bit product of the first data and weight words (which is added to the initial zero value of the accumulator 202) is shown as 000000.1100001100. Because the data binary point 2922 is 7 and the weight binary point 2924 is 3, the implied accumulator 202 binary point has 10 bits to its right. In the narrow configuration, the accumulator 202 is 28 bits wide in this embodiment. In the example, after all the arithmetic logic operations have been performed (for example, all 1024 multiply-accumulate operations of Figure 20), the accumulator 202 value 217 is 000000000000000001.1101010100.
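The product arithmetic above can be checked with a short Python sketch. It holds each fixed-point word as a plain non-negative integer and tracks the number of fractional bits separately; the function names are hypothetical and are not taken from the patent, and sign handling is omitted because both example operands are positive.

```python
def fixed_point_product(data_word, data_frac_bits, weight_word, weight_frac_bits):
    """Multiply two fixed-point words held as plain integers.

    Returns the integer product and the number of fractional bits it implies,
    which is the sum of the operands' fractional bit counts.
    """
    return data_word * weight_word, data_frac_bits + weight_frac_bits


def to_binary_string(value, total_bits, frac_bits):
    """Render a non-negative fixed-point integer with an explicit binary point."""
    bits = format(value, "0{}b".format(total_bits))
    return bits[:total_bits - frac_bits] + "." + bits[total_bits - frac_bits:]


data_word = 0b01001110    # 0.1001110 with 7 fractional bits
weight_word = 0b00001010  # 00001.010 with 3 fractional bits
product, frac_bits = fixed_point_product(data_word, 7, weight_word, 3)
print(to_binary_string(product, 16, frac_bits))  # prints 000000.1100001100
```

The printed string matches the 16-bit product in the example, with the binary point 10 bits from the right.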

The output binary point 2954 value indicates that the output binary point has 7 bits to its right. Therefore, after passing through the output binary point aligner 3002 and the canonical size compressor 3008, the accumulator 202 value 217 is scaled, rounded, and compressed to the canonical form value 000000001.1101011. In this example, the output binary point location indicates 7 fractional bits and the accumulator 202 binary point location indicates 10 fractional bits. The output binary point aligner 3002 therefore calculates a difference of 3 and scales the accumulator 202 value 217 by shifting it right 3 bits. Figure 31 shows this as the loss of the 3 least significant bits (binary 100) of the accumulator 202 value 217. Further, in this example the round control 2932 value indicates stochastic rounding, and the sampled random bit 3005 is assumed to be true. Consequently, the least significant bit is rounded up, as described above, because the round bit of the accumulator 202 value 217 (the most significant of the 3 bits shifted out by the scaling) is one and the sticky bit (the Boolean OR of the 2 least significant of those 3 shifted-out bits) is zero.
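The alignment and rounding step can be sketched in Python as follows. This passage only states the outcome for a round bit of one, a sticky bit of zero, and a true random bit. The handling of the other cases here (round up when both round and sticky bits are set, truncate when the round bit is clear) is an assumption made for illustration, and `align_and_round` is a hypothetical name.

```python
def align_and_round(acc, acc_frac_bits, out_frac_bits, random_bit):
    """Shift a non-negative accumulator value right to the output binary point.

    Assumed rule: round up when the round bit is set and either the sticky
    bit or the sampled random bit is set; otherwise truncate.
    """
    shift = acc_frac_bits - out_frac_bits
    if shift <= 0:
        return acc << -shift  # nothing is shifted out, so nothing to round
    shifted_out = acc & ((1 << shift) - 1)
    round_bit = (shifted_out >> (shift - 1)) & 1          # MSB of the lost bits
    sticky_bit = int(shifted_out & ((1 << (shift - 1)) - 1) != 0)  # OR of the rest
    result = acc >> shift
    if round_bit and (sticky_bit or random_bit):
        result += 1
    return result


acc = 0b11101010100  # 1.1101010100 with 10 fractional bits
print(format(align_and_round(acc, 10, 7, random_bit=True), "b"))  # prints 11101011
```

With the random bit true the result is 1.1101011, the value shown for Figure 31; with the random bit false this sketch would leave 1.1101010.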

In this example, the activation function 2934 indicates that a sigmoid function is used. Consequently, the bit selector 3012 selects the bits of the canonical form value such that the input to the sigmoid module 3024 has three integer bits and four fractional bits, as described above, namely the value 001.1101 shown. The sigmoid module 3024 outputs a value that is placed in canonical form, shown as 000000000.1101110.
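As a numerical cross-check of the sigmoid example, the following Python sketch evaluates the function in floating point and quantizes the result to 7 fractional bits. It is only a reference model: this passage does not say how the sigmoid module 3024 computes its result, and the function name and default parameters are assumptions.

```python
import math


def sigmoid_fixed(input_bits, in_frac_bits=4, out_frac_bits=7):
    """Reference model: fixed-point in, floating-point sigmoid, fixed-point out."""
    x = input_bits / (1 << in_frac_bits)    # 001.1101 -> 1.8125
    y = 1.0 / (1.0 + math.exp(-x))          # about 0.8597
    return round(y * (1 << out_frac_bits))  # quantize to 7 fractional bits


result = sigmoid_fixed(0b0011101)
print(format(result, "08b"))  # prints 01101110
```

The 8-bit pattern agrees with the fractional bits of the canonical value 000000000.1101110.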

The output command 2956 in this example specifies the first predetermined value, namely to output the word size indicated by the configuration 2902, which here is a narrow word (8 bits). Consequently, the size converter 3036 converts the canonical sigmoid output value to an 8-bit quantity with an implied binary point that has 7 bits to its right, yielding the output value 01101110, as shown.

Figure 32 shows a second example of the operation of the activation function unit 212 of Figure 30. The example of Figure 32 illustrates the operation of the activation function unit 212 when the activation function 2934 indicates that the accumulator 202 value 217 is to be passed through in the canonical size. As shown, the configuration 2902 is set to the narrow configuration of the neural processing unit 126.

In this example, the accumulator 202 is 28 bits wide and the accumulator 202 binary point has 10 bits to its right (either because the sum of the data binary point 2922 and the weight binary point 2924 is 10 in one embodiment, or because the accumulator binary point 2923 is explicitly specified as 10 in another embodiment). In the example, Figure 32 shows an accumulator 202 value 217 of 000001100000011011.1101111010 after all the arithmetic logic operations have been performed.

In this example, the output binary point 2954 value indicates that the output binary point has 4 bits to its right. Therefore, after passing through the output binary point aligner 3002 and the canonical size compressor 3008, the accumulator 202 value 217 is saturated and compressed to the canonical form value 111111111111.1111 shown, which the multiplexer 3032 receives as the canonical size pass-through value 3028.

Two output commands 2956 are shown in this example. The first output command 2956 specifies the second predetermined value, namely to output the lower word of the canonical form size. Because the size indicated by the configuration 2902 is a narrow word (8 bits), which implies a canonical size of 16 bits, the size converter 3036 selects the lower 8 bits of the canonical size pass-through value 3028 to yield the 8-bit value 11111111, as shown. The second output command 2956 specifies the third predetermined value, namely to output the upper word of the canonical form size. Consequently, the size converter 3036 selects the upper 8 bits of the canonical size pass-through value 3028 to yield the 8-bit value 11111111, as shown.
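The pass-through example of Figure 32 can be approximated with the Python sketch below. It treats the accumulator value as an unsigned magnitude and clamps it to an all-ones 16-bit canonical value. That unsigned treatment is an assumption, chosen only because it reproduces the value shown in the figure. Rounding is omitted, and both function names are hypothetical.

```python
def pass_through_canonical(acc, acc_frac_bits, out_frac_bits, canonical_bits=16):
    """Align to the output binary point, then clamp to the canonical width.

    Assumes acc_frac_bits >= out_frac_bits and a non-negative accumulator.
    """
    aligned = acc >> (acc_frac_bits - out_frac_bits)
    return min(aligned, (1 << canonical_bits) - 1)


def split_canonical(value, word_bits=8):
    """Return the (lower, upper) words of a canonical-size value."""
    mask = (1 << word_bits) - 1
    return value & mask, (value >> word_bits) & mask


acc = int("000001100000011011" + "1101111010", 2)  # 28-bit value from Figure 32
canonical = pass_through_canonical(acc, 10, 4)
lower, upper = split_canonical(canonical)
print(format(canonical, "016b"), format(lower, "08b"), format(upper, "08b"))
```

The aligned value needs more than 16 bits, so it saturates, and both the lower and the upper 8-bit words come out as 11111111.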

Figure 33 shows a third example of the operation of the activation function unit 212 of Figure 30. The example of Figure 33 illustrates the operation of the activation function unit 212 when the activation function 2934 indicates that the full raw accumulator 202 value 217 is to be passed through. As shown, the configuration 2902 is set to the wide configuration of the neural processing unit 126 (for example, 16-bit input words).

In this example, the accumulator 202 is 41 bits wide and the accumulator 202 binary point has 8 bits to its right (either because the sum of the data binary point 2922 and the weight binary point 2924 is 8 in one embodiment, or because the accumulator binary point 2923 is explicitly specified as 8 in another embodiment). In the example, Figure 33 shows an accumulator 202 value 217 of 001000000000000000001100000011011.11011110 after all the arithmetic logic operations have been performed.

Three output commands 2956 are shown in this example. The first specifies the fourth predetermined value, namely to output the lower word of the raw accumulator 202 value; the second specifies the fifth predetermined value, namely to output the middle word of the raw accumulator 202 value; and the third specifies the sixth predetermined value, namely to output the upper word of the raw accumulator 202 value. Because the size indicated by the configuration 2902 is a wide word (16 bits), Figure 33 shows that the multiplexer 3037 selects the 16-bit value 0001101111011110 in response to the first output command 2956, the 16-bit value 0000000000011000 in response to the second output command 2956, and the 16-bit value 0000000001000000 in response to the third output command 2956.
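The three raw-accumulator selections can be reproduced by slicing the 41-bit value into 16-bit words, with the upper word zero-extended. The Python sketch below illustrates the bit arithmetic only; the function name is hypothetical.

```python
def raw_accumulator_words(acc, word_bits=16):
    """Return the (lower, middle, upper) words of a raw accumulator value.

    The upper word is implicitly zero-extended when the accumulator is
    narrower than three full words.
    """
    mask = (1 << word_bits) - 1
    lower = acc & mask
    middle = (acc >> word_bits) & mask
    upper = (acc >> (2 * word_bits)) & mask
    return lower, middle, upper


# 41-bit accumulator value from Figure 33: integer bits followed by fractional bits.
acc = int("001000000000000000001100000011011" + "11011110", 2)
for word in raw_accumulator_words(acc):
    print(format(word, "016b"))
```

The three printed words are 0001101111011110, 0000000000011000, and 0000000001000000, matching the values selected for the fourth, fifth, and sixth predetermined output commands.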

As described above, the neural network unit 121 operates on integer data rather than floating-point data. This helps simplify each neural processing unit 126, or at least its arithmetic logic unit 204 portion. For example, the arithmetic logic unit 204 need not include the adders that a floating-point implementation would need to add the exponents of the multiplicands for the multiplier 242. Similarly, the arithmetic logic unit 204 need not include the shifters that a floating-point implementation would need to align the binary points of the addends for the adder 234. As one skilled in the art will appreciate, floating-point units are generally very complex; these are therefore only examples of simplifications to the arithmetic logic unit 204, and the integer embodiments with hardware fixed-point assist, which let the user specify the relevant binary points, allow other simplifications as well. Using integer units for the arithmetic logic units 204 can yield a smaller (and faster) neural processing unit 126 than a floating-point embodiment, which makes it easier to integrate a large array of neural processing units 126 into the neural network unit 121. The activation function unit 212 portion handles scaling and saturating of the accumulator 202 value 217 based on the number of fractional bits desired in the accumulated value and in the output value, preferably as specified by the user. Any additional complexity and accompanying increase in size, power, and/or time consumed by the fixed-point hardware assist in the activation function units 212 may be amortized by sharing the activation function units 212 among the arithmetic logic units 204, because a shared embodiment, such as that of Figure 11, can reduce the number of activation function units 1112.

The embodiments described herein enjoy many of the benefits of the reduced hardware complexity of integer arithmetic units compared with floating-point arithmetic units, while still providing arithmetic operations on fractional numbers, that is, numbers with a binary point. An advantage of floating-point arithmetic is that it accommodates arithmetic operations on data whose individual values may fall anywhere within a very wide range (limited in practice only by the size of the exponent range, which may be very large). That is, each floating-point number has its own potentially unique exponent value. However, the embodiments described herein recognize and take advantage of the fact that in certain applications the input data is highly parallelized and its values fall within a relatively narrow range, such that the "exponent" can be the same for all the parallelized values. These embodiments therefore let the user specify the binary point location once for all the input values and/or accumulated values. Similarly, by recognizing and exploiting similar range characteristics of the parallelized outputs, the embodiments let the user specify the binary point location once for all the output values. An artificial neural network is one example of such an application, although the embodiments may also be used to perform computations for other applications. By specifying the binary point location once for multiple inputs rather than for each individual input number, the embodiments use memory more efficiently than a floating-point implementation (for example, they require less memory) and/or provide greater precision for a similar amount of memory, because the bits that would hold the exponent in a floating-point implementation can instead be used to increase numerical precision.

Furthermore, the embodiments recognize that precision may be lost while accumulating a large series of integer operations (for example, through overflow or loss of less significant fractional bits) and provide a solution, primarily in the form of an accumulator large enough to avoid the loss of precision.

Direct execution of micro-operations in neural network units

Figure 34 is a block diagram showing the processor 100 of Figure 1 and portions of the neural network unit 121 in more detail. The neural network unit 121 includes pipeline stages 3401 of the neural processing units 126. The pipeline stages 3401, separated by staging registers, include combinational logic that accomplishes the operations of the neural processing units 126 described herein, such as Boolean logic gates, multiplexers, adders, multipliers, comparators, and so forth. The pipeline stages 3401 receive a micro-operation 3418 from a multiplexer 3402. The micro-operation 3418 flows down the pipeline stages 3401 and controls their combinational logic. The micro-operation 3418 is a collection of bits. In a preferred embodiment, the micro-operation 3418 includes the bits of the data random access memory 122 memory address 123, the bits of the weight random access memory 124 memory address 125, the bits of the program memory 129 memory address 131, the multiplexing register 208/705 control signals 213/713, and many of the fields of the control register 127 (for example, the control register of Figures 29A through 29C). In one embodiment, the micro-operation 3418 comprises approximately 120 bits. The multiplexer 3402 receives a micro-operation from three different sources and selects one of them as the micro-operation 3418 provided to the pipeline stages 3401.

One micro-operation source to the multiplexer 3402 is the sequencer 128 of Figure 1. The sequencer 128 decodes the neural network unit instructions received from the program memory 129 and in response generates a micro-operation 3416 that is provided to the first input of the multiplexer 3402.

A second micro-operation source to the multiplexer 3402 is a decoder 3404 that receives microinstructions 105 from the reservation station 108 of Figure 1 and operands from the general-purpose registers 116 and the media registers 118. In a preferred embodiment, as described above, the microinstructions 105 are generated by the instruction translator 104 in response to translating MTNN instructions 1400 and MFNN instructions 1500. A microinstruction 105 may include an immediate field that specifies a particular function (specified by an MTNN instruction 1400 or an MFNN instruction 1500), such as starting or stopping execution of a program in the program memory 129, directly executing a micro-operation from the media registers 118, or reading or writing a memory of the neural network unit, as described above. The decoder 3404 decodes the microinstruction 105 and in response generates a micro-operation 3412 that is provided to the second input of the multiplexer 3402. In a preferred embodiment, for some functions 1432/1532 of an MTNN instruction 1400 or MFNN instruction 1500, the decoder 3404 need not generate a micro-operation 3412 to send down the pipeline 3401; examples are writing to the control register 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, waiting for a program in the program memory 129 to complete, reading from the status register 127, and resetting the neural network unit 121.

A third micro-operation source to the multiplexer 3402 is the media registers 118 themselves. In a preferred embodiment, as described above with respect to Figure 14, an MTNN instruction 1400 may specify a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided from the media registers 118 to the third input of the multiplexer 3402. Directly executing a micro-operation 3414 provided by the architectural media registers 118 is useful for testing, such as built-in self-test (BIST), and for debugging of the neural network unit 121.

In a preferred embodiment, the decoder 3404 generates a mode indicator 3422 that controls the selection made by the multiplexer 3402. When an MTNN instruction 1400 specifies a function to start running a program from the program memory 129, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3416 from the sequencer 128, until either an error occurs or the decoder 3404 encounters an MTNN instruction 1400 that specifies a function to stop running the program from the program memory 129. When an MTNN instruction 1400 specifies a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided from a media register 118, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3414 from the specified media register 118. Otherwise, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3412 from the decoder 3404.
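The selection among the three micro-operation sources can be modeled as a three-way multiplexer driven by the mode indicator. The Python sketch below is a behavioral illustration only; the `Mode` enumeration, its member names, and the function name are hypothetical, and the actual encoding of the mode indicator 3422 is not given in this passage.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical encoding of the mode indicator 3422."""
    SEQUENCER = 0       # a program from the program memory is running
    MEDIA_REGISTER = 1  # direct execution of a micro-operation was requested
    DECODER = 2         # default: micro-operation generated by the decoder


def select_micro_op(mode, sequencer_op, decoder_op, media_register_op):
    """Model of the multiplexer 3402: pick one micro-operation source."""
    if mode is Mode.SEQUENCER:
        return sequencer_op       # micro-operation 3416
    if mode is Mode.MEDIA_REGISTER:
        return media_register_op  # micro-operation 3414
    return decoder_op             # micro-operation 3412


print(select_micro_op(Mode.MEDIA_REGISTER, "uop3416", "uop3412", "uop3414"))
```

In this model the decoder path is the fall-through case, mirroring the "otherwise" clause of the description.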

Variable-rate neural network unit

In many situations, after running a program the neural network unit 121 sits idle, waiting for the processor 100 to do things it must do before the next program can run. For example, assume a situation similar to that described with respect to Figures 3 through 6A, in which the neural network unit 121 runs a multiply-accumulate-activation-function program (which may also be referred to as a feed-forward neural network layer program) two or more times in succession. It may take the processor 100 significantly longer to write 512 KB of weight values into the weight random access memory 124 for the next run of the neural network unit program than it takes the neural network unit 121 to run the program. In other words, the neural network unit 121 runs the program in a short time and then sits idle until the processor 100 finishes writing the next weight values into the weight random access memory 124 for the next run. This situation is illustrated in Figure 36A, described in more detail below. In such situations, the neural network unit 121 may run at a lower clock rate to stretch out the time it takes to run the program, thereby spreading the energy consumed by the program over a longer period and keeping the neural network unit 121, and perhaps the entire processor 100, at a lower temperature. This is referred to as relaxed mode and is illustrated in Figure 36B, described in more detail below.

Figure 35 is a block diagram showing a processor 100 with a variable-rate neural network unit 121. The processor 100 is similar to the processor 100 of Figure 1, and like-numbered elements are similar. The processor 100 of Figure 35 also includes clock generation logic 3502 coupled to the functional units of the processor 100, namely the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106, the reservation station 108, the neural network unit 121, the other execution units 112, the memory subsystem 114, the general-purpose registers 116, and the media registers 118. The clock generation logic 3502 includes a clock generator, such as a phase-locked loop (PLL), that generates a clock signal having a primary clock rate, or clock frequency. For example, the primary clock rate may be 1 GHz, 1.5 GHz, 2 GHz, and so forth. The clock rate is the number of cycles per second, that is, the number of oscillations of the clock signal between its high and low states. Preferably, the clock signal has a balanced duty cycle, meaning it is high for half of each cycle and low for the other half; alternatively, the clock signal may have an unbalanced duty cycle, in which it spends longer in the high state than in the low state, or vice versa. Preferably, the PLL can generate the primary clock signal at multiple clock rates. Preferably, the processor 100 includes a power management module that automatically adjusts the primary clock rate based on various factors, including the dynamically sensed operating temperature of the processor 100, its utilization, and commands from system software (for example, the operating system or BIOS) indicating desired performance and/or power-saving targets. In one embodiment, the power management module includes microcode of the processor 100.

The clock generation logic 3502 also includes a clock distribution network, or clock tree. The clock tree distributes the primary clock signal to the functional units of the processor 100. As shown in Figure 35, clock signal 3506-1 goes to the instruction fetch unit 101, clock signal 3506-2 to the instruction cache 102, clock signal 3506-10 to the instruction translator 104, clock signal 3506-9 to the rename unit 106, clock signal 3506-8 to the reservation station 108, clock signal 3506-7 to the neural network unit 121, clock signal 3506-4 to the other execution units 112, clock signal 3506-3 to the memory subsystem 114, clock signal 3506-5 to the general-purpose registers 116, and clock signal 3506-6 to the media registers 118; these are referred to collectively as clock signals 3506. The clock tree includes nodes, or wires, that carry the primary clock signals 3506 to their respective functional units. Additionally, the clock generation logic 3502 preferably includes clock buffers that regenerate the primary clock signal where a cleaner clock signal is needed and/or the voltage level of the primary clock signal must be boosted, particularly for distant nodes. Each functional unit may also include its own sub-clock tree that regenerates and/or boosts the primary clock signal 3506 it receives, as needed.

The neural network unit 121 includes clock reduction logic 3504, which receives a relax indicator 3512 and the primary clock signal 3506-7 and in response generates a secondary clock signal. The secondary clock signal has a clock rate that is either the same as the primary clock rate or, in relaxed mode, reduced relative to the primary clock rate by an amount programmed into the relax indicator 3512, which reduces heat generation. Like the clock generation logic 3502, the clock reduction logic 3504 includes a clock distribution network, or clock tree, that distributes the secondary clock signal to the functional blocks of the neural network unit 121: clock signal 3508-1 goes to the array of neural processing units 126, clock signal 3508-2 to the sequencer 128, and clock signal 3508-3 to the interface logic 3514; these are referred to collectively as the secondary clock signal 3508. Preferably, the neural processing units 126 include a plurality of pipeline stages 3401, as described with respect to Figure 34, which include pipeline staging registers that receive the secondary clock signal 3508-1 from the clock reduction logic 3504.

The neural network unit 121 also includes interface logic 3514 that receives the primary clock signal 3506-7 and the secondary clock signal 3508-3. The interface logic 3514 is coupled between the lower portions of the front end of the processor 100 (for example, the reservation station 108, the media registers 118, and the general-purpose registers 116) and the functional blocks of the neural network unit 121, namely the clock reduction logic 3504, the data random access memory 122, the weight random access memory 124, the program memory 129, and the sequencer 128. The interface logic 3514 includes a data random access memory buffer 3522, a weight random access memory buffer 3524, the decoder 3404 of Figure 34, and the relax indicator 3512. The relax indicator 3512 holds a value that specifies how slowly the array of neural processing units 126 executes neural network unit program instructions. Preferably, the relax indicator 3512 specifies a divisor N by which the clock reduction logic 3504 divides the primary clock signal 3506-7 to generate the secondary clock signal 3508, so that the secondary clock rate is 1/N of the primary clock rate. Preferably, N may be programmed to any one of a plurality of predetermined values, causing the clock reduction logic 3504 to generate the secondary clock signal 3508 at a corresponding one of a plurality of different rates that are lower than the primary clock rate.

In one embodiment, the clock reduction logic 3504 comprises a clock divider circuit that divides the primary clock signal 3506-7 by the relax indicator 3512 value. In one embodiment, the clock reduction logic 3504 comprises clock gates (for example, AND gates) that gate the primary clock signal 3506-7 with an enable signal that is true only once every N cycles of the primary clock signal. For example, the enable signal may be generated by a circuit that includes a counter that counts up to N. When accompanying logic detects that the counter output matches N, the logic generates a true pulse on the secondary clock signal 3508 and resets the counter. Preferably, the relax indicator 3512 value is programmable by an architectural instruction, such as the MTNN instruction 1400 of Figure 14. Preferably, the architectural program running on the processor 100 programs the relax value into the relax indicator 3512 before it instructs the neural network unit 121 to start running the neural network unit program, as described in more detail below with respect to Figure 37.
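The counter-based gating scheme can be illustrated with a cycle-level Python sketch. It models only the enable signal, one Boolean per primary clock cycle, under the assumption that the counter increments on every primary cycle and resets when it reaches N; the function name is hypothetical.

```python
def gated_clock_enables(n, primary_cycles):
    """Return one enable value per primary clock cycle.

    The enable is true once every n primary cycles, so the gated (secondary)
    clock pulses at 1/n of the primary clock rate.
    """
    enables = []
    count = 0
    for _ in range(primary_cycles):
        count += 1
        if count == n:       # counter output matches N
            enables.append(True)
            count = 0        # reset the counter
        else:
            enables.append(False)
    return enables


pulses = gated_clock_enables(4, 16)
print(sum(pulses))  # prints 4
```

With N equal to 4, sixteen primary cycles yield four secondary pulses; with N equal to 1 every primary cycle passes through, which corresponds to normal mode.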

The weight RAM buffer 3524 is coupled between the weight RAM 124 and the media registers 118 to buffer data transfers between them. Preferably, the weight RAM buffer 3524 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the weight RAM buffer 3524 that receives data from the media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock frequency, and the portion that receives data from the weight RAM 124 is clocked by the secondary clock signal 3508-3 at the secondary clock frequency, which may or may not be reduced from the primary clock frequency depending on the value programmed into the relax indicator 3512, i.e., depending on whether the neural network unit 121 is running in relaxed or normal mode. In one embodiment, the weight RAM 124 is single-ported, as described above with respect to Figure 17, and is accessible in an arbitrated fashion both by the media registers 118 via the weight RAM buffer 3524 and by the neural processing units 126 or the row buffer 1104 of Figure 11. In an alternate embodiment, the weight RAM 124 is dual-ported, as described above with respect to Figure 16, and each port is accessible in a concurrent fashion both by the media registers 118 via the weight RAM buffer 3524 and by the neural processing units 126 or the row buffer 1104.

Like the weight RAM buffer 3524, the data RAM buffer 3522 is coupled between the data RAM 122 and the media registers 118 to buffer data transfers between them. Preferably, the data RAM buffer 3522 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the data RAM buffer 3522 that receives data from the media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock frequency, and the portion that receives data from the data RAM 122 is clocked by the secondary clock signal 3508-3 at the secondary clock frequency, which may or may not be reduced from the primary clock frequency depending on the value programmed into the relax indicator 3512, i.e., depending on whether the neural network unit 121 is running in relaxed or normal mode. In one embodiment, the data RAM 122 is single-ported, as described above with respect to Figure 17, and is accessible in an arbitrated fashion both by the media registers 118 via the data RAM buffer 3522 and by the neural processing units 126 or the row buffer 1104 of Figure 11. In an alternate embodiment, the data RAM 122 is dual-ported, as described above with respect to Figure 16, and each port is accessible in a concurrent fashion both by the media registers 118 via the data RAM buffer 3522 and by the neural processing units 126 or the row buffer 1104.

Preferably, regardless of whether the data RAM 122 and/or the weight RAM 124 is single-ported or dual-ported, the interface logic 3514 includes the data RAM buffer 3522 and the weight RAM buffer 3524 to synchronize the primary clock domain and the secondary clock domain. Preferably, each of the data RAM 122, the weight RAM 124, and the program memory 129 comprises a static random access memory (SRAM) with its own read enable, write enable, and memory select signals.

As described above, the neural network unit 121 is an execution unit of the processor 100. An execution unit is a functional unit of a processor that executes the microinstructions into which architectural instructions are translated, or that executes the architectural instructions themselves, for example the microinstructions 105 into which the architectural instructions 103 of Figure 1 are translated, or the architectural instructions 103 themselves. An execution unit receives operands from the registers of the processor, such as the general purpose registers 116 and the media registers 118. After executing a microinstruction or architectural instruction, an execution unit produces a result that is written to the general purpose registers. The MTNN instruction 1400 and the MFNN instruction 1500 described with respect to Figures 14 and 15 are examples of architectural instructions 103. Microinstructions implement architectural instructions. More precisely, the collective execution by the execution unit of the one or more microinstructions into which an architectural instruction is translated performs the operation specified by the architectural instruction on the inputs specified by the architectural instruction, producing the result defined by the architectural instruction.

Figure 36A is a timing diagram illustrating an example of operation of the processor 100 with the neural network unit 121 operating in normal mode, i.e., at the primary clock frequency. Time progresses from left to right in the timing diagram. The processor 100 executes the architectural program at the primary clock frequency. More specifically, the front end of the processor 100 (e.g., the instruction fetch unit 101, instruction cache 102, instruction translator 104, rename unit 106, and reservation stations 108) fetches, decodes, and issues architectural instructions to the neural network unit 121 and the other execution units 112 at the primary clock frequency.

Initially, the architectural program executes an architectural instruction (e.g., an MTNN instruction 1400) that the front end of the processor 100 issues to the neural network unit 121 to instruct it to start executing the neural network unit program in its program memory 129. Before that, the architectural program executed an architectural instruction to write the relax indicator 3512 with a value that specifies the primary clock frequency, i.e., to put the neural network unit 121 in normal mode. More specifically, the value programmed into the relax indicator 3512 causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at the primary clock frequency of the primary clock signal 3506. Preferably, in this case the clock buffers of the clock reduction logic 3504 simply boost the voltage level of the primary clock signal 3506. Also before that, the architectural program executed architectural instructions to write the data RAM 122 and the weight RAM 124 and to write the neural network unit program into the program memory 129. In response to the MTNN instruction 1400 that starts the neural network unit program, the neural network unit 121 begins executing the program at the primary clock frequency, because the relax indicator 3512 was programmed with the primary frequency value. After starting the neural network unit 121, the architectural program continues to execute architectural instructions at the primary clock frequency, mainly MTNN instructions 1400 that write and/or read the data RAM 122 and the weight RAM 124, in preparation for the next instance (also called an invocation or run) of a neural network unit program.

In the example of Figure 36A, the neural network unit 121 is able to finish executing the neural network unit program in significantly less time (e.g., one-fourth the time) than the architectural program takes to finish writing/reading the data RAM 122 and the weight RAM 124. For example, running at the primary clock frequency, the neural network unit 121 takes approximately 1000 clock cycles to execute the neural network unit program, whereas the architectural program takes approximately 4000 clock cycles. Consequently, the neural network unit 121 sits idle for the remainder of the time, which in this example is a considerable period of approximately 3000 primary clock cycles. As shown in the example of Figure 36A, this pattern repeats, possibly many times, depending on the size and configuration of the neural network. Because the neural network unit 121 is a relatively large and transistor-dense functional unit of the processor 100, its operation generates a significant amount of heat, particularly when running at the primary clock frequency.

Figure 36B is a timing diagram illustrating an example of operation of the processor 100 with the neural network unit 121 operating in relaxed mode, i.e., at a clock frequency lower than the primary clock frequency. The timing diagram of Figure 36B is similar to that of Figure 36A in that the processor 100 executes an architectural program at the primary clock frequency. The example assumes that the architectural program and the neural network unit program of Figure 36B are the same as those of Figure 36A. However, before starting the neural network unit program, the architectural program executes an MTNN instruction 1400 to program the relax indicator 3512 with a value that causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at a secondary clock frequency lower than the primary clock frequency. That is, the architectural program puts the neural network unit 121 in the relaxed mode of Figure 36B rather than the normal mode of Figure 36A. Consequently, the neural processing units 126 execute the neural network unit program at the secondary clock frequency, which in relaxed mode is lower than the primary clock frequency. The example assumes that the relax indicator 3512 is programmed with a value that specifies a secondary clock frequency one-fourth the primary clock frequency. As a result, the neural network unit 121 takes about four times as long to execute the neural network unit program in relaxed mode as in normal mode, and, as a comparison of Figures 36A and 36B shows, the time the neural network unit 121 sits idle is significantly shortened. Accordingly, the energy consumed by the neural network unit 121 in executing the program in Figure 36B is spread over a period roughly four times as long as in the normal mode of Figure 36A. The neural network unit 121 therefore generates heat while executing the program in Figure 36B at about one-fourth the rate of Figure 36A, which has the benefits described herein.
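The arithmetic behind Figures 36A and 36B can be checked with a short calculation. The sketch below uses the approximate cycle counts given in the text (1000 and 4000 primary clock cycles); the function name `nnu_timing` is hypothetical, and the heat-rate figure rests on the simplifying assumption that the same energy is spread evenly over the longer run time.

```python
def nnu_timing(nnu_cycles_at_primary, arch_cycles, divisor):
    """Return (busy, idle, relative_heat_rate) measured in primary clock cycles.

    Assumptions (illustrative only): the program needs the same number of
    NNU clock cycles in either mode, and the energy per run is unchanged,
    so the heat generated per unit time scales as 1 / divisor.
    """
    busy = nnu_cycles_at_primary * divisor   # run time stretches by N
    idle = max(arch_cycles - busy, 0)        # time spent waiting on the architectural program
    relative_heat_rate = 1 / divisor
    return busy, idle, relative_heat_rate


print(nnu_timing(1000, 4000, 1))   # normal mode, Figure 36A
print(nnu_timing(1000, 4000, 4))   # relaxed mode, Figure 36B
```

Under these assumptions the normal-mode run leaves about 3000 idle primary cycles, while a divisor of 4 removes the idle time and lowers the heat rate to about one quarter.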

Figure 37 is a flowchart illustrating operation of the processor 100 of Figure 35. The operation described by the flowchart is similar to the operation described above with respect to Figures 35, 36A, and 36B. Flow begins at step 3702.

In step 3702, the processor 100 executes MTNN instructions 1400 to write the weights into the weight RAM 124 and the data into the data RAM 122. Flow proceeds to step 3704.

In step 3704, the processor 100 executes an MTNN instruction 1400 to program the relax indicator 3512 with a value that specifies a clock frequency lower than the primary clock frequency, i.e., to put the neural network unit 121 in relaxed mode. Flow proceeds to step 3706.

In step 3706, the processor 100 executes an MTNN instruction 1400 to instruct the neural network unit 121 to start executing the neural network unit program, in a manner similar to that shown in Figure 36B. Flow proceeds to step 3708.

In step 3708, the neural network unit 121 begins executing the neural network unit program. Concurrently, the processor 100 executes MTNN instructions 1400 to write new weights into the weight RAM 124 (and possibly new data into the data RAM 122), and/or executes MFNN instructions 1500 to read results from the data RAM 122 (and possibly from the weight RAM 124). Flow proceeds to step 3712.

In step 3712, the processor 100 executes an MFNN instruction 1500 (e.g., to read the status register 127) to detect that the neural network unit 121 has finished executing its program. Assuming the architectural program selected a good value for the relax indicator 3512, the time the neural network unit 121 takes to execute the neural network unit program will be about the same as the time the processor 100 takes to execute the portion of the architectural program that accesses the weight RAM 124 and/or the data RAM 122, as shown in Figure 36B. Flow proceeds to step 3714.

In step 3714, the processor 100 executes an MTNN instruction 1400 to program the relax indicator 3512 with a value that specifies the primary clock frequency, i.e., to put the neural network unit 121 in normal mode. Flow proceeds to step 3716.

In step 3716, the processor 100 executes an MTNN instruction 1400 to instruct the neural network unit 121 to start executing the neural network unit program, in a manner similar to that shown in Figure 36A. Flow proceeds to step 3718.

In step 3718, the neural network unit 121 begins executing the neural network unit program in normal mode. Flow ends at step 3718.
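The flow of steps 3702 through 3718 can be sketched as a small simulation. Everything in it is a hypothetical stand-in: `MockNNU`, its methods, and `run_flow` are invented for illustration and do not correspond to any real interface; the MTNN/MFNN instructions are modeled as plain method calls, and time is counted in primary clock cycles.

```python
class MockNNU:
    """Hypothetical stand-in for the neural network unit 121 (not a real API)."""

    def __init__(self, program_cycles):
        self.program_cycles = program_cycles  # NNU clock cycles per program run
        self.divisor = 1                      # models the relax indicator 3512
        self.remaining = 0                    # primary cycles left in the current run
        self.log = []

    def set_relax(self, divisor):             # models an MTNN write of the relax value
        self.divisor = divisor

    def start(self):                          # models the MTNN "start program" instruction
        self.remaining = self.program_cycles * self.divisor
        self.log.append(("start", self.divisor))

    def tick(self, primary_cycles):           # let simulated time pass
        self.remaining = max(self.remaining - primary_cycles, 0)

    def done(self):                           # models an MFNN read of the status register
        return self.remaining == 0


def run_flow(nnu, arch_cycles, relax_divisor):
    steps = [3702]                  # write weights and data (not modeled here)
    nnu.set_relax(relax_divisor)    # relaxed mode
    steps.append(3704)
    nnu.start()
    steps.append(3706)
    nnu.tick(arch_cycles)           # architectural program accesses the RAMs meanwhile
    steps.append(3708)
    while not nnu.done():           # poll until the program has finished
        nnu.tick(1)
    steps.append(3712)
    nnu.set_relax(1)                # back to normal mode
    steps.append(3714)
    nnu.start()
    steps.append(3716)
    steps.append(3718)              # program now runs in normal mode
    return steps


nnu = MockNNU(program_cycles=1000)
steps = run_flow(nnu, arch_cycles=4000, relax_divisor=4)
print(steps)
print(nnu.log)
```

With a divisor of 4 and the cycle counts from the Figure 36 example, the polling loop in step 3712 exits immediately, which corresponds to the "good relax value" case described for that step.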

As described above, executing the neural network unit program in relaxed mode, rather than in normal mode (i.e., at the primary clock frequency of the processor), spreads out the execution time and can avoid high temperatures. More specifically, when the neural network unit executes a program in relaxed mode, it generates heat at a lower rate, and that heat can be dissipated more readily through the neural network unit (e.g., the semiconductor devices, metal layers, and underlying substrate) and the surrounding package and cooling mechanism (e.g., heat sink, fan). As a result, the devices in the neural network unit (e.g., transistors, capacitors, wires) are more likely to operate at lower temperatures. Overall, operating in relaxed mode also helps lower device temperatures in other portions of the processor die. Lower operating temperatures, particularly junction temperatures, can reduce leakage current. Furthermore, because less current is drawn per unit time, inductive noise and IR-drop noise are also reduced. Lower temperatures also have a positive effect on the negative-bias temperature instability (NBTI) and positive-bias temperature instability (PBTI) of the MOSFETs in the processor, which can improve the reliability and/or lifetime of the devices and of the processor part. Lower temperatures can also reduce Joule heating and electromigration in the metal layers of the processor.

Communication Mechanism Between the Architectural Program and the Non-Architectural Program Regarding Shared Resources of the Neural Network Unit

As described above, in the examples of Figures 24 through 28 and 35 through 37, the data RAM 122 and the weight RAM 124 are shared resources. The neural processing units 126 and the front end of the processor 100 share the data RAM 122 and the weight RAM 124. More specifically, both the neural processing units 126 and the front end of the processor 100, e.g., the media registers 118, read and write the data RAM 122 and the weight RAM 124. In other words, the architectural program executing on the processor 100 and the neural network unit program executing on the neural network unit 121 share the data RAM 122 and the weight RAM 124, and in some situations, as described above, this requires flow control between the architectural program and the neural network unit program. The program memory 129 is also a shared resource to some degree, because the architectural program writes it and the sequencer 128 reads it. The embodiments described herein provide a high-performance solution for controlling the flow of access to the shared resources between the architectural program and the neural network unit program.

In the embodiments described herein, neural network unit programs are also referred to as non-architectural programs, neural network unit instructions as non-architectural instructions, and the neural network unit instruction set (also referred to above as the neural processing unit instruction set) as the non-architectural instruction set. The non-architectural instruction set is distinct from the architectural instruction set. In embodiments in which the processor 100 includes an instruction translator 104 that translates architectural instructions into microinstructions, the non-architectural instruction set is also distinct from the microinstruction set.

Figure 38 is a block diagram illustrating the sequencer 128 of the neural network unit 121 in more detail. The sequencer 128 provides a memory address to the program memory 129 to select the non-architectural instruction that is provided to the sequencer 128, as described above. As shown in Figure 38, the memory address is held in a program counter 3802 of the sequencer 128. The sequencer 128 generally increments through sequential addresses of the program memory 129 unless it encounters a non-architectural control instruction, such as a loop or branch instruction, in which case the sequencer 128 updates the program counter 3802 to the target address of the control instruction, i.e., to the address of the non-architectural instruction at the target of the control instruction. Thus, the address 131 held in the program counter 3802 specifies the address in the program memory 129 of the non-architectural instruction of the non-architectural program currently being fetched for execution by the neural processing units 126. The value of the program counter 3802 can be obtained by the architectural program via the neural network unit program counter field 3912 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to decide, based on the progress of the non-architectural program, where to read/write data from/to the data RAM 122 and/or the weight RAM 124.

The sequencer 128 also includes a loop counter 3804 that operates in conjunction with a non-architectural loop instruction, such as the loop-to-1 instruction at address 10 of Figure 26A and the loop-to-1 instruction at address 11 of Figure 28. In the examples of Figures 26A and 28, the loop counter 3804 is loaded with the value specified by the non-architectural initialize instruction at address 0, e.g., the value 400. Each time the sequencer 128 encounters the loop instruction and jumps to the target instruction (e.g., the multiply-accumulate instruction at address 1 of Figure 26A or the maxwacc instruction at address 1 of Figure 28), the sequencer 128 decrements the loop counter 3804. Once the loop counter 3804 reaches zero, the sequencer 128 proceeds to the next sequential non-architectural instruction. In an alternate embodiment, when a loop instruction is first encountered, the loop counter is loaded with a loop count value specified in the loop instruction, which removes the need to initialize the loop counter 3804 with a non-architectural initialize instruction. Thus, the value of the loop counter 3804 indicates how many more times the loop body of the non-architectural program remains to be executed. The value of the loop counter 3804 can be obtained by the architectural program via the loop count field 3914 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to decide, based on the progress of the non-architectural program, where to read/write data from/to the data RAM 122 and/or the weight RAM 124. In one embodiment, the sequencer includes three additional loop counters to accommodate nested loops in the non-architectural program, and the values of these three loop counters are also readable via the status register 127. A bit in the loop instruction indicates which of the four loop counters is used by the current loop instruction.
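A minimal sketch of how a program counter and a single loop counter could interact is given below. The instruction encoding (tuples such as `("init", 3)` and `("loop", 1)`) and the function `run_sequencer` are hypothetical, and the sketch assumes one reading of the text: the loop instruction decrements the counter and jumps to its target only while the counter is still non-zero.

```python
def run_sequencer(program, max_steps=10_000):
    """Return the sequence of program-memory addresses that were fetched.

    Assumed semantics (illustrative): "init" loads the loop counter,
    "loop" decrements it and jumps to the target address while it is
    non-zero, and every other instruction falls through to pc + 1.
    """
    pc = 0
    loop_counter = 0
    trace = []
    steps = 0
    while pc < len(program) and steps < max_steps:
        op, arg = program[pc]
        trace.append(pc)
        if op == "init":
            loop_counter = arg
            pc += 1
        elif op == "loop":
            loop_counter -= 1
            pc = arg if loop_counter > 0 else pc + 1
        else:
            pc += 1
        steps += 1
    return trace


program = [
    ("init", 3),      # address 0: load the loop counter with 3
    ("work", None),   # address 1: loop target
    ("work", None),   # address 2
    ("loop", 1),      # address 3: loop back to address 1
    ("store", None),  # address 4: first instruction after the loop
]
trace = run_sequencer(program)
print(trace)
```

In this model the loop body runs as many times as the initial count, so the counter value at any moment corresponds to the number of body executions still outstanding, which is the progress information the architectural program can read back.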

The sequencer 128 also includes an iteration counter 3806. The iteration counter 3806 operates in conjunction with non-architectural instructions such as the multiply-accumulate instruction at address 2 of Figures 4, 9, 20, and 26A and the maxwacc instruction at address 2 of Figure 28, which are referred to hereafter as "execute" instructions. In the examples above, the execute instructions specify iteration counts of 511, 511, 1023, 2, and 3, respectively. When the sequencer 128 encounters an execute instruction that specifies a non-zero iteration count, it loads the iteration counter 3806 with the specified value. The sequencer 128 then generates an appropriate micro-operation 3418 to control the logic in the pipeline stages 3401 of the neural processing units 126 of Figure 34 and decrements the iteration counter 3806. If the iteration counter 3806 is still greater than zero, the sequencer 128 again generates an appropriate micro-operation 3418 to control the logic in the neural processing units 126 and decrements the iteration counter 3806. The sequencer 128 continues in this fashion until the iteration counter 3806 reaches zero. Thus, the value of the iteration counter 3806 is the number of operations specified by the non-architectural execute instruction that remain to be performed (e.g., multiply-accumulate, maximum, or sum of the accumulator and a data/weight word). The value of the iteration counter 3806 can be obtained by the architectural program via the iteration count field 3916 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to decide, based on the progress of the non-architectural program, where to read/write data from/to the data RAM 122 and/or the weight RAM 124.
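The handling of an execute instruction can be modeled the same way. In this sketch the function name `issue_execute` is hypothetical, and each loop pass stands for one generated micro-operation followed by one decrement of the iteration counter.

```python
def issue_execute(iteration_count):
    """Return (micro_ops_issued, counter_values_after_each_decrement).

    Illustrative model only: the counter is loaded with the count from the
    execute instruction, and one micro-operation is issued per decrement
    until the counter reaches zero.
    """
    counter = iteration_count
    micro_ops = 0
    remaining = []
    while counter > 0:
        micro_ops += 1          # generate one micro-operation
        counter -= 1            # decrement the iteration counter
        remaining.append(counter)
    return micro_ops, remaining


ops, remaining = issue_execute(3)
print(ops, remaining)
```

An iteration count of 511, as in the multiply-accumulate examples cited above, would issue 511 micro-operations in this model, with the counter value at any point giving the number still to be performed.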

Figure 39 is a block diagram illustrating certain fields of the control and status register 127 of the neural network unit 121. The fields include the address 2602 of the weight RAM row most recently written by the neural processing units 126 executing the non-architectural program, the address 2604 of the weight RAM row most recently read by the neural processing units 126 executing the non-architectural program, the address 2606 of the data RAM row most recently written by the neural processing units 126 executing the non-architectural program, and the address 2608 of the data RAM row most recently read by the neural processing units 126 executing the non-architectural program, as described above with respect to Figure 26B. The fields also include a neural network unit program counter field 3912, a loop count field 3914, and an iteration count field 3916. As described above, the architectural program can read the contents of the status register 127 into the media registers 118 and/or the general purpose registers 116, e.g., by using MFNN instructions 1500 to read the values of the neural network unit program counter field 3912, loop count field 3914, and iteration count field 3916. The value of the program counter field 3912 reflects the value of the program counter 3802 of Figure 38. The value of the loop count field 3914 reflects the value of the loop counter 3804. The value of the iteration count field 3916 reflects the value of the iteration counter 3806. In one embodiment, the sequencer 128 updates the program counter field 3912, loop count field 3914, and iteration count field 3916 each time it modifies the program counter 3802, loop counter 3804, or iteration counter 3806, so that the field values are current when the architectural program reads them. In an alternate embodiment, when the neural network unit 121 executes an architectural instruction that reads the status register 127, the neural network unit 121 simply obtains the values of the program counter 3802, loop counter 3804, and iteration counter 3806 and provides them back to the architectural instruction (e.g., into a media register 118 or a general purpose register 116).

As may be observed from the foregoing, the values of the fields of the status register 127 of Figure 39 can be understood as information about the progress of the non-architectural instructions as they are executed by the neural network unit. Specific aspects of the non-architectural program's progress have been described above: the program counter 3802 value, the loop counter 3804 value, the iteration counter 3806 value, the fields 2602/2604 holding the most recently written/read weight RAM 124 address 125, and the fields 2606/2608 holding the most recently written/read data RAM 122 address 123. The architectural program executing on the processor 100 can read the non-architectural program progress values of Figure 39 from the status register 127 and use the information to make decisions, e.g., by means of architectural instructions such as compare and branch instructions. For example, the architectural program decides which rows of the data RAM 122 and/or weight RAM 124 to read/write data/weights from/to, in order to control the flow of data into and out of the data RAM 122 or weight RAM 124, particularly for large data sets and/or for overlapping execution of different non-architectural instructions. Examples of such decisions made by the architectural program are described in the sections above and below.

For example, as described above with respect to Figure 26A, the architectural program configures the non-architectural program to write the results of the convolutions back to rows of the data RAM 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program reads the results from the data RAM 122 as the neural network unit 121 writes them, using the address of the most recently written data RAM 122 row 2606.

For another example, as described above with respect to Figure 26B, the architectural program uses information from the status register 127 fields of Figure 38 to determine the progress of a non-architectural program that performs a convolution of the data array 2404 of Figure 24 in 5 chunks of 512 x 1600. The architectural program writes the first 512 x 1600 chunk of the 2560 x 1600 data array into the weight RAM 124 and starts the non-architectural program, which has a loop count of 1600 and an initialized weight RAM 124 output row of 0. While the neural network unit 121 executes the non-architectural program, the architectural program reads the status register 127 to determine the most recently written row 2602 of the weight RAM 124. It can then read the valid convolution results written by the non-architectural program and, after reading them, overwrite them with the next 512 x 1600 chunk. In this way, once the neural network unit 121 completes the non-architectural program on the first 512 x 1600 chunk, the processor 100 can immediately update the non-architectural program as needed and start it again to process the next 512 x 1600 chunk.

For another example, assume the architectural program has the neural network unit 121 perform a series of classic neural network multiply-accumulate-activation function operations in which the weights are stored in the weight RAM 124 and the results are written back to the data RAM 122. In this case, once the non-architectural program has read a row of the weight RAM 124, it will not read that row again. Thus, once the current weights have been read/used by the non-architectural program, the architectural program can begin overwriting the weights in the weight RAM 124 with new weights for the next execution instance of the non-architectural program (e.g., for the next neural network layer). In this case, the architectural program reads the status register 127 to obtain the address of the most recently read weight RAM row 2604 and uses it to decide where it may write the new set of weights into the weight RAM 124.
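The overwrite decision described above can be illustrated with a short sketch. It assumes, purely for illustration, that the non-architectural program reads weight RAM rows in ascending order and never rereads a row; the function name and its arguments are hypothetical and are not part of the described hardware or instruction set.

```python
def rows_safe_to_overwrite(last_read_row, next_row_to_write):
    """Return the weight RAM rows that may be overwritten right now.

    Assumption (not from the source): rows are read in ascending order and
    never reread, so every row up to and including last_read_row (the value
    of field 2604 in the status register) has already been consumed.
    next_row_to_write is the first row not yet refilled with new weights.
    """
    return range(next_row_to_write, last_read_row + 1)
```

An architectural program following this pattern would poll the status register, refill the returned rows, and advance its own `next_row_to_write` past the last row it refilled.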

For another example, assume the architectural program knows that the non-architectural program includes an execute instruction with a large iteration count, such as the non-architectural multiply-accumulate instruction at address 2 of Figure 20. In this case, the architectural program needs to know the iteration count 3916 in order to know approximately how many more clock cycles it will take to complete the non-architectural instruction, so that it can decide which of two or more actions to take next. For example, if completion will take a long time, the architectural program may relinquish control to another architectural program, such as the operating system. Similarly, assume the architectural program knows that the non-architectural program includes a loop body with a relatively large loop count, such as the non-architectural program of Figure 28. In this case, the architectural program needs to know the loop count 3914 in order to know approximately how many more clock cycles it will take to complete this non-architectural instruction, so that it can decide which of two or more actions to take next.

For another example, assume the architectural program has the neural network unit 121 perform a pooling operation similar to that described with respect to Figures 27 and 28, in which the data to be pooled is stored in the weight RAM 124 and the results are written back to the weight RAM 124. However, unlike the example of Figures 27 and 28, assume the results are written back to the top 400 rows of the weight RAM 124, e.g., rows 1600 to 1999. In this case, once the non-architectural program has read the four rows of weight RAM 124 data that it pools, it will not read them again. Thus, once the current four rows have been read/used by the non-architectural program, the architectural program can begin overwriting the data in the weight RAM 124 with new data (e.g., the weights for the next execution instance of a non-architectural program, such as one that performs classic multiply-accumulate-activation function operations on the pooled data). In this case, the architectural program reads the status register 127 to obtain the address of the most recently read weight RAM row 2604 and uses it to decide where it may write the new set of weights into the weight RAM 124.

Recurrent neural network acceleration

A traditional feed-forward neural network has no memory of previous inputs to the network. Feed-forward neural networks are generally used for tasks in which the inputs presented to the network over time are independent of one another, as are the outputs. In contrast, recurrent neural networks (RNNs) are generally helpful for tasks in which the order of the inputs presented to the network over time is significant. (The positions in the sequence are commonly referred to as time steps.) Consequently, an RNN includes a notional memory, or internal state, that holds information produced by the calculations the network performed in response to previous inputs in the sequence, and the output of the RNN depends on this internal state as well as on the input of the next time step. Speech recognition, language modeling, text generation, language translation, image description generation, and certain forms of handwriting recognition are examples of tasks that RNNs tend to perform well.

Three well-known examples of RNNs are the Elman RNN, the Jordan RNN, and the long short-term memory (LSTM) network. An Elman RNN includes context nodes that remember the state of the hidden layer of the RNN for the current time step; this state is provided as an input to the hidden layer in the next time step. A Jordan RNN is similar to an Elman RNN, except that its context nodes remember the state of the output layer of the RNN rather than the state of the hidden layer. An LSTM network includes an LSTM layer made up of LSTM cells. Each LSTM cell has a current state and a current output for the current time step, and a new state and a new output for a new, or subsequent, time step. An LSTM cell includes an input gate, an output gate, and a forget gate, which enables the cell to forget its remembered state. These three types of RNN are described in more detail below.

As described herein, with respect to an RNN such as an Elman or Jordan RNN, the neural network unit performs a time step each time it takes a set of input layer node values and performs the computations necessary to propagate them through the RNN, generating the output layer node values as well as the hidden layer and context layer node values. Thus, input layer node values are associated with the time step in which they are used to compute the hidden, output, and context layer node values; and the hidden, output, and context layer node values are associated with the time step in which they are generated. Input layer node values are sampled values of the system being modeled by the RNN, e.g., an image, a speech sample, or a snapshot of financial market data. With respect to an LSTM network, the neural network unit performs a time step each time it takes a set of memory cell input values and performs the computations necessary to generate the memory cell output values (as well as the cell state and the input gate, forget gate, and output gate values), which may also be understood as propagating the cell input values through the LSTM layer cells. Thus, cell input values are associated with the time step in which they are used to compute the cell state and the input gate, forget gate, and output gate values; and the cell state and the input gate, forget gate, and output gate values are associated with the time step in which they are generated.

A context layer node value, also referred to as a state node, is state of the neural network. The state is based on the input layer node values associated with previous time steps, not just the input layer node values associated with the current time step. The computations performed by the neural network unit for a time step (e.g., the hidden layer node value computations for an Elman or Jordan RNN) are a function of the context layer node values generated in the previous time step. Therefore, the state of the network (the context node values) at the beginning of a time step influences the output layer node values generated during that time step. Furthermore, the state of the network at the end of the time step is affected by both the input node values of the time step and the state of the network at the beginning of the time step. Similarly, for an LSTM cell, a cell state value is based on the cell input values associated with previous time steps, not just the cell input value associated with the current time step. Because the computations performed by the neural network unit for a time step (e.g., the computation of the next cell state) are a function of the cell state values generated in the previous time step, the state of the network (the cell state values) at the beginning of the time step influences the cell output values generated during the time step, and the state of the network at the end of the time step is affected by both the cell input values of the time step and the previous state of the network.

Figure 40 is a block diagram showing an example of an Elman RNN. The Elman RNN of Figure 40 includes input layer nodes, or neurons, denoted D0, D1 through Dn, referred to collectively as input layer nodes D and individually as input layer node D; hidden layer nodes/neurons, denoted Z0, Z1 through Zn, referred to collectively as hidden layer nodes Z and individually as hidden layer node Z; output layer nodes/neurons, denoted Y0, Y1 through Yn, referred to collectively as output layer nodes Y and individually as output layer node Y; and context layer nodes/neurons, denoted C0, C1 through Cn, referred to collectively as context layer nodes C and individually as context layer node C. In the example Elman RNN of Figure 40, each hidden layer node Z has an input connection to the output of each input layer node D and an input connection to the output of each context layer node C; each output layer node Y has an input connection to the output of each hidden layer node Z; and each context layer node C has an input connection to the output of a corresponding hidden layer node Z.

In many ways, an Elman RNN operates similarly to a traditional feed-forward artificial neural network. That is, for a given node, there is a weight associated with each input connection to the node; the value received by the node on an input connection is multiplied by its associated weight to generate a product; the node adds the products associated with all of its input connections to generate a sum (a bias term may also be included in the sum); and, typically, an activation function is performed on the sum to generate the output value of the node, sometimes referred to as the node's activation. In a traditional feed-forward network, data always flows in the direction from the input layer to the output layer. That is, the input layer provides a value to the hidden layer (typically there are multiple hidden layers), the hidden layer generates its output value and provides it to the output layer, and the output layer generates an output that can be captured.
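The per-node computation described above (weighted sum, optional bias, activation) can be written as a minimal sketch. The function name, the default `tanh` activation, and the argument names are illustrative assumptions; the text does not prescribe a particular activation function.

```python
import math


def node_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Compute one node's output value (its activation).

    Each input value is multiplied by the weight of its connection, the
    products are summed together with an optional bias term, and the
    activation function is applied to the sum.
    """
    if len(inputs) != len(weights):
        raise ValueError("one weight is needed per input connection")
    total = bias + sum(x * w for x, w in zip(inputs, weights))
    return activation(total)
```

Passing `activation=lambda s: s` returns the raw accumulated sum, which corresponds to a node that skips the activation function.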

However, unlike a traditional feed-forward network, an Elman RNN also includes some feedback connections, namely the connections from the hidden layer nodes Z to the context layer nodes C in Figure 40. The Elman RNN operates as follows: when the input layer nodes D provide an input value to the hidden layer nodes Z at a new time step, the context nodes C provide to the hidden layer Z a value that was the output value of the hidden layer nodes Z in response to the previous input, referred to as the current time step. In this sense, the context nodes C of the Elman RNN are a memory based on the input values of previous time steps. Figures 41 and 42 describe an embodiment of the operation of the neural network unit 121 to perform the computations associated with the Elman RNN of Figure 40.

For purposes of the present disclosure, an Elman RNN is a recurrent neural network comprising at least an input node layer, a hidden node layer, an output node layer, and a context node layer. For a given time step, the context node layer stores the results that the hidden node layer generated in the previous time step and fed back to the context node layer. The results fed back to the context layer may be the results of an activation function, or they may be the results of the accumulations performed by the hidden node layer without an activation function having been applied.
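One time step of a network with this structure can be sketched in a few lines. This is an illustrative model only, not the hardware implementation: the function name, the list-of-lists weight representation, and the default `tanh` activation are assumptions, and the sketch feeds back the activated hidden values (the text also allows feeding back the raw accumulations).

```python
import math


def elman_step(d, c, w_dz, w_cz, w_zy, act=math.tanh):
    """Perform one time step of a small Elman network.

    d: input node values; c: context node values from the previous step.
    w_dz[i][j]: weight from input node i to hidden node j.
    w_cz[k][j]: weight from context node k to hidden node j.
    w_zy[j][m]: weight from hidden node j to output node m.
    Returns (z, y, new_c), where new_c is a copy of z fed back as context.
    """
    n_hidden = len(w_dz[0])
    z = []
    for j in range(n_hidden):
        total = sum(d[i] * w_dz[i][j] for i in range(len(d)))
        total += sum(c[k] * w_cz[k][j] for k in range(len(c)))
        z.append(act(total))
    n_out = len(w_zy[0])
    y = [act(sum(z[j] * w_zy[j][m] for j in range(n_hidden)))
         for m in range(n_out)]
    return z, y, list(z)
```

Calling the function repeatedly, passing each returned `new_c` as the next call's `c`, carries the hidden state from one time step to the next.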

Figure 41 is a block diagram showing an example of the layout of data within the data RAM 122 and the weight RAM 124 of the neural network unit 121 as it performs the calculations associated with the Elman RNN of Figure 40. The example of Figure 41 assumes the Elman RNN of Figure 40 has 512 input nodes D, 512 hidden nodes Z, 512 context nodes C, and 512 output nodes Y. It further assumes the Elman RNN is fully connected, i.e., all 512 input nodes D are connected as inputs to each hidden node Z, all 512 context nodes C are connected as inputs to each hidden node Z, and all 512 hidden nodes Z are connected as inputs to each output node Y. In addition, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. Finally, the example assumes the weights associated with the connections from the context nodes C to the hidden nodes Z all have a value of 1, so there is no need to store these unity weight values.

As shown, the lower 512 rows of the weight RAM 124 (rows 0 through 511) hold the weight values associated with the connections between the input nodes D and the hidden nodes Z. More specifically, as shown, row 0 holds the weights associated with the input connections from input node D0 to the hidden nodes Z, i.e., word 0 holds the weight associated with the connection between input node D0 and hidden node Z0, word 1 holds the weight associated with the connection between input node D0 and hidden node Z1, word 2 holds the weight associated with the connection between input node D0 and hidden node Z2, and so forth to word 511, which holds the weight associated with the connection between input node D0 and hidden node Z511. Row 1 holds the weights associated with the input connections from input node D1 to the hidden nodes Z, i.e., word 0 holds the weight associated with the connection between input node D1 and hidden node Z0, word 1 holds the weight associated with the connection between input node D1 and hidden node Z1, word 2 holds the weight associated with the connection between input node D1 and hidden node Z2, and so forth to word 511, which holds the weight associated with the connection between input node D1 and hidden node Z511. This continues through row 511, which holds the weights associated with the input connections from input node D511 to the hidden nodes Z, i.e., word 0 holds the weight associated with the connection between input node D511 and hidden node Z0, word 1 holds the weight associated with the connection between input node D511 and hidden node Z1, word 2 holds the weight associated with the connection between input node D511 and hidden node Z2, and so forth to word 511, which holds the weight associated with the connection between input node D511 and hidden node Z511. This layout and use is similar to the embodiments described above with respect to Figures 4 through 6A.

As shown, the next 512 rows of the weight RAM 124 (rows 512 through 1023) hold the weights associated with the connections between the hidden nodes Z and the output nodes Y in a similar manner.
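The weight RAM layout described above amounts to a simple address mapping, sketched below. The function name and the layer labels "DZ" and "ZY" are hypothetical. The mapping for rows 512 through 1023 is an assumption that reads "in a similar manner" as meaning row 512 + i holds the weights from hidden node Zi to each output node.

```python
N = 512  # nodes per layer in the Figure 41 example


def weight_ram_location(layer, src, dst):
    """Map a connection to its (row, word) position in the weight RAM.

    layer "DZ": input node D[src] -> hidden node Z[dst], rows 0..511.
    layer "ZY": hidden node Z[src] -> output node Y[dst], rows 512..1023
    (assumed to mirror the "DZ" layout).
    The word index selects the destination node, so each neural processing
    unit finds the weights of its own node in its own word position.
    """
    if not (0 <= src < N and 0 <= dst < N):
        raise ValueError("node index out of range")
    base = {"DZ": 0, "ZY": N}[layer]
    return base + src, dst
```

For instance, the weight for the connection from D511 to Z2 sits in row 511, word 2, matching the enumeration in the text.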

The data RAM 122 holds the Elman RNN node values for a sequence of time steps. More specifically, a group of three rows holds the node values for a given time step. As shown, in an example in which the data RAM 122 has 64 rows, the data RAM 122 can hold the node values for 20 different time steps. In the example of Figure 41, rows 0 through 2 hold the node values for time step 0, rows 3 through 5 hold the node values for time step 1, and so forth to rows 57 through 59, which hold the node values for time step 19. The first row of each group holds the input node D values of the time step. The second row of each group holds the hidden node Z values of the time step. The third row of each group holds the output node Y values of the time step. As shown, each column of the data RAM 122 holds the node values for its corresponding neuron, or neural processing unit 126. That is, column 0 holds the node values associated with nodes D0, Z0, and Y0, whose computations are performed by neural processing unit 0; column 1 holds the node values associated with nodes D1, Z1, and Y1, whose computations are performed by neural processing unit 1; and so forth to column 511, which holds the node values associated with nodes D511, Z511, and Y511, whose computations are performed by neural processing unit 511, as described in more detail below with respect to Figure 42.
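The three-rows-per-time-step layout can be expressed as a small helper, shown here as a sketch. The function name and the `num_steps` parameter are illustrative; the default of 20 time steps follows the Figure 41 example.

```python
ROWS_PER_STEP = 3  # one row each for the D, Z, and Y node values


def data_ram_rows(time_step, num_steps=20):
    """Return the data RAM rows holding the D, Z, and Y values of a time step."""
    if not 0 <= time_step < num_steps:
        raise ValueError("time step out of range")
    first = ROWS_PER_STEP * time_step
    return {"D": first, "Z": first + 1, "Y": first + 2}
```

With the defaults, time step 19 maps to rows 57, 58, and 59, which matches the last group of rows in the example.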

As indicated in Figure 41, for a given time step, the hidden node Z values in the second row of a three-row group are the context node C values of the next time step. That is, the node Z value that a neural processing unit 126 computes and writes during one time step becomes the node C value that the same neural processing unit 126 uses (along with the input node D value of the next time step) to compute the node Z value during the next time step. The initial value of the context nodes C (the node C value used in time step 0 to compute the node Z values in row 1) is assumed to be zero. This is described in more detail below with respect to the non-architectural program of Figure 42.

Preferably, the input node D values (rows 0, 3, and so forth to row 57 in the example of Figure 41) are written/populated into the data RAM 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 42. Conversely, the hidden/output node Z/Y values (rows 1 and 2, 4 and 5, and so forth to rows 58 and 59 in the example of Figure 41) are written/populated into the data RAM 122 by the non-architectural program running on the neural network unit 121, and are read/used by the architectural program running on the processor 100 via MFNN instructions 1500. The example of Figure 41 assumes the architectural program performs the following steps: (1) populates the data RAM 122 with the input node D values for 20 different time steps (rows 0, 3, and so forth to row 57); (2) starts the non-architectural program of Figure 42; (3) detects that the non-architectural program has completed; (4) reads the output node Y values out of the data RAM 122 (rows 2, 5, and so forth to row 59); and (5) repeats steps (1) through (4) as many times as needed to complete a task, e.g., the computations required to recognize a statement made by a user of a mobile phone.
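Steps (1) through (4) above can be illustrated with a small software simulation. Everything here is a hypothetical stand-in: the data RAM is modeled as a list of rows, `run_elman_program` plays the role of the non-architectural program, and neither function corresponds to the MTNN/MFNN instructions or to the actual program of Figure 42. The sketch assumes unity context-to-hidden weights and a zero initial context, as in the Figure 41 example, and uses an identity activation by default.

```python
def run_elman_program(data_ram, w_dz, w_zy, steps, act=lambda s: s):
    """Hypothetical stand-in for the non-architectural program.

    Reads D from row 3*t, writes Z to row 3*t+1 and Y to row 3*t+2.
    Context-to-hidden weights are all 1 and the initial context is zero.
    """
    n = len(w_dz)
    context = [0.0] * n
    for t in range(steps):
        d = data_ram[3 * t]
        c_sum = sum(context)  # every context node feeds every hidden node
        z = [act(sum(d[i] * w_dz[i][j] for i in range(n)) + c_sum)
             for j in range(n)]
        y = [act(sum(z[j] * w_zy[j][m] for j in range(n)))
             for m in range(n)]
        data_ram[3 * t + 1] = z
        data_ram[3 * t + 2] = y
        context = z  # Z of this time step is C of the next time step


def architectural_program(inputs, w_dz, w_zy):
    """Hypothetical driver following steps (1) through (4)."""
    steps = len(inputs)
    n = len(w_dz)
    data_ram = [[0.0] * n for _ in range(3 * steps)]
    for t, d in enumerate(inputs):  # step (1): populate the D rows
        data_ram[3 * t] = list(d)
    # Steps (2) and (3): start the program; this call returns on completion.
    run_elman_program(data_ram, w_dz, w_zy, steps)
    return [data_ram[3 * t + 2] for t in range(steps)]  # step (4): read Y
```

Step (5) would simply call `architectural_program` again with the next batch of input samples.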

In an alternative approach, the architectural program performs the following steps: (1) populates the data RAM 122 with the input node D values for a single time step (e.g., row 0); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 42 that does not loop and accesses only a single group of three data RAM 122 rows); (3) detects that the non-architectural program has completed; (4) reads the output node Y values out of the data RAM 122 (e.g., row 2); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable may depend on the manner in which the input values to the RNN are sampled. For example, if the task tolerates sampling the input over multiple time steps (e.g., on the order of 20) and then performing the computations, the first approach is preferable, since it is likely to be more efficient in computational resources and/or to perform better. If the task tolerates sampling at only a single time step, the second approach is required.

A third embodiment is similar to the second approach, except that rather than using a single triplet of data RAM 122 rows, the non-architectural program uses multiple triplets of rows, that is, a different triplet for each time step, as in the first approach. In this third embodiment, the architectural program preferably includes a step before step (2) in which it updates the non-architectural program before starting it, for example by updating the data RAM 122 row in the instruction at address 1 to point to the next triplet of rows.

FIG. 42 is a table showing a program that is stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121, and that uses the data and weights according to the arrangement of FIG. 41 to accomplish an Elman recurrent neural network. Several of the instructions in the non-architectural program of FIG. 42 (and of FIGS. 45, 48, 51, 54 and 57) are described in detail above (e.g., the MULT-ACCUM, LOOP and INITIALIZE instructions), and the following paragraphs assume those descriptions apply unless otherwise noted.

The example program of FIG. 42 includes 13 non-architectural instructions at addresses 0 through 12. The instruction at address 0 (INITIALIZE NPU, LOOPCNT=20) clears the accumulator 202 and initializes the loop counter 3804 to a value of 20, so that the loop body (the instructions at addresses 4 through 11) is performed 20 times. Preferably, the initialize instruction also puts the neural network unit 121 in a wide configuration, so that it is configured as 512 neural processing units 126. As described below, the 512 neural processing units 126 operate as the 512 corresponding hidden layer nodes Z during execution of the instructions at addresses 1 through 3 and 7 through 11, and as the 512 corresponding output layer nodes Y during execution of the instructions at addresses 4 through 6.

The instructions at addresses 1 through 3 are outside the program loop and execute only once. They compute the initial values of the hidden layer nodes Z and write them to row 1 of the data RAM 122 for use by the first execution of the instructions at addresses 4 through 6, which computes the output layer nodes Y of the first time step (time step 0). In addition, the hidden layer node Z values computed and written to row 1 of the data RAM 122 by the instructions at addresses 1 through 3 become the context layer node C values used by the first execution of the instructions at addresses 7 and 8 in computing the hidden layer node Z values for the second time step (time step 1).

During execution of the instructions at addresses 1 and 2, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in row 0 of the data RAM 122 by the weights in that neural processing unit's column of rows 0 through 511 of the weight RAM 124, to generate 512 products that are accumulated in the accumulator 202 of that neural processing unit 126. During execution of the instruction at address 3, the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to row 1 of the data RAM 122. That is, the output instruction at address 3 writes the accumulator 202 value of each of the 512 neural processing units 126, which is the initial hidden layer Z value, to row 1 of the data RAM 122 and then clears the accumulator 202.

The operations performed by the instructions at addresses 1 and 2 of the non-architectural program of FIG. 42 are similar to those performed by the instructions at addresses 1 and 2 of the non-architectural program of FIG. 4. More specifically, the instruction at address 1 (MULT_ACCUM DR ROW 0) instructs each of the 512 neural processing units 126 to read the corresponding word of row 0 of the data RAM 122 into its multiplexed register 208, read the corresponding word of row 0 of the weight RAM 124 into its multiplexed register 705, multiply the data word by the weight word to generate a product, and add the product to the accumulator 202. The instruction at address 2 (MULT-ACCUM ROTATE, WR ROW+1, COUNT=511) instructs each of the 512 neural processing units 126 to rotate the word from the adjacent neural processing unit 126 into its multiplexed register 208 (using the 512-word rotator formed by the collective operation of the 512 multiplexed registers 208 of the neural network unit 121, which are the registers into which the instruction at address 1 read the data RAM 122 row), read the corresponding word of the next row of the weight RAM 124 into its multiplexed register 705, multiply the data word by the weight word to generate a product, add the product to the accumulator 202, and perform this operation 511 times.
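The effect of one load followed by 511 rotate-and-accumulate steps can be illustrated with a small software model. The Python sketch below is a simplified illustration under stated assumptions, not the hardware behavior as specified: it assumes each neural processing unit j receives the word of unit j-1 on a rotation (the direction is not stated in this passage), and it assumes the weight rows are pre-arranged so that row k, column j holds the weight for the data word that unit j holds at step k. The names `rotating_mac` and `skew_weights` are hypothetical, and a small N stands in for 512.

```python
def rotating_mac(data_row, weight_rows):
    """Model N units that each accumulate N products using a word rotator."""
    n = len(data_row)
    acc = [0] * n
    mux = list(data_row)  # address 1: every unit loads its own data word
    for k in range(n):
        if k > 0:
            # address 2: rotate by one unit (direction is an assumption)
            mux = [mux[(j - 1) % n] for j in range(n)]
        for j in range(n):
            acc[j] += mux[j] * weight_rows[k][j]
    return acc


def skew_weights(w):
    """Arrange w[i][j] (weight from input i to node j) for the rotator above."""
    n = len(w)
    return [[w[(j - k) % n][j] for j in range(n)] for k in range(n)]


w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
d = [1, 10, 100]
result = rotating_mac(d, skew_weights(w))
```

Under these assumptions each accumulator ends up holding the full weighted sum of all inputs for its node, i.e. one column of a matrix-vector product, without any unit reading more than one data word from memory.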

In addition, the single non-architectural output instruction at address 3 of FIG. 42 (OUTPUT PASSTHRU, DR OUT ROW 1, CLR ACC) combines the operations of the activation function instruction and the write output instruction at addresses 3 and 4 of FIG. 4 (although the program of FIG. 42 passes through the accumulator 202 value, whereas the program of FIG. 4 performs an activation function on it). That is, in the program of FIG. 42, the activation function performed on the accumulator 202 value, if any, is specified in the output instruction (as it is in the output instructions at addresses 6 and 11) rather than in a distinct non-architectural activation function instruction as in the program of FIG. 4. Alternative embodiments of the non-architectural program of FIG. 4 (and of FIGS. 20, 26A and 28), in which the operations of the activation function instruction and the write output instruction (e.g., addresses 3 and 4 of FIG. 4) are combined into a single non-architectural output instruction as in FIG. 42, are also within the scope of the invention. The example of FIG. 42 assumes the hidden layer (Z) nodes perform no activation function on the accumulator values. However, embodiments in which the hidden layer (Z) performs an activation function on the accumulator values are also within the scope of the invention; in such embodiments the instructions at addresses 3 and 11 perform the function, e.g., sigmoid, hyperbolic tangent or rectify.

Whereas the instructions at addresses 1 through 3 execute only once, the instructions at addresses 4 through 11 are inside the program loop and execute the number of times specified by the loop count (e.g., 20). The first nineteen executions of the instructions at addresses 7 through 11 compute the hidden layer node Z values and write them to the data RAM 122 for use by the second through twentieth executions of the instructions at addresses 4 through 6, which compute the output layer nodes Y of the remaining time steps (time steps 1 through 19). (The last, twentieth, execution of the instructions at addresses 7 through 11 computes hidden layer node Z values and writes them to row 61 of the data RAM 122, but these values are not used.)

During the first execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+1, WR ROW 512 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), which corresponds to time step 0, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in row 1 of the data RAM 122 (generated and written by the single execution of the instructions at addresses 1 through 3) by the weights in that neural processing unit's column of rows 512 through 1023 of the weight RAM 124, to generate 512 products that are accumulated in the accumulator 202 of that neural processing unit 126. During the first execution of the instruction at address 6 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+1, CLR ACC), an activation function (e.g., sigmoid, hyperbolic tangent or rectify) is performed on the 512 accumulated values to compute the output layer node Y values, and the results are written to row 2 of the data RAM 122.

During the second execution of the instructions at addresses 4 and 5 (corresponding to time step 1), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in row 4 of the data RAM 122 (generated and written by the first execution of the instructions at addresses 7 through 11) by the weights in that neural processing unit's column of rows 512 through 1023 of the weight RAM 124, to generate 512 products that are accumulated in the accumulator 202 of that neural processing unit 126; during the second execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, which are written to row 5 of the data RAM 122. During the third execution of the instructions at addresses 4 and 5 (corresponding to time step 2), the same multiply-accumulate operations are performed on the 512 hidden node Z values in row 7 of the data RAM 122 (generated and written by the second execution of the instructions at addresses 7 through 11), and during the third execution of the instruction at address 6 the activation function results are written to row 8 of the data RAM 122. And so forth, until during the twentieth execution of the instructions at addresses 4 and 5 (corresponding to time step 19), the same multiply-accumulate operations are performed on the 512 hidden node Z values in row 58 of the data RAM 122 (generated and written by the nineteenth execution of the instructions at addresses 7 through 11), and during the twentieth execution of the instruction at address 6 the activation function results are written to row 59 of the data RAM 122.

During the first execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values in row 1 of the data RAM 122, which were generated by the single execution of the instructions at addresses 1 through 3. More specifically, the instruction at address 7 (ADD_D_ACC DR ROW+0) instructs each of the 512 neural processing units 126 to read the corresponding word of the current row of the data RAM 122 (row 1 during the first execution) into its multiplexed register 208 and add the word to the accumulator 202. The instruction at address 8 (ADD_D_ACC ROTATE, COUNT=511) instructs each of the 512 neural processing units 126 to rotate the word from the adjacent neural processing unit 126 into its multiplexed register 208 (using the 512-word rotator formed by the collective operation of the 512 multiplexed registers 208 of the neural network unit 121, which are the registers into which the instruction at address 7 read the data RAM 122 row), add the word to the accumulator 202, and perform this operation 511 times.

During the second execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values in row 4 of the data RAM 122, which were generated and written by the first execution of the instructions at addresses 9 through 11; during the third execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values in row 7 of the data RAM 122, which were generated and written by the second execution of the instructions at addresses 9 through 11; and so forth, until during the twentieth execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values in row 58 of the data RAM 122, which were generated and written by the nineteenth execution of the instructions at addresses 9 through 11.

As stated above, the example of FIG. 42 assumes the weights associated with the connections from the context nodes C to the hidden layer nodes Z have a value of one. However, in an alternative embodiment, these connections of the Elman recurrent neural network have non-zero weight values that are placed in the weight RAM 124 (e.g., rows 1024 through 1535) before the program of FIG. 42 runs; the program instruction at address 7 is then MULT-ACCUM DR ROW+0, WR ROW 1024, and the program instruction at address 8 is MULT-ACCUM ROTATE, WR ROW+1, COUNT=511. Preferably, the instruction at address 8 does not access the weight RAM 124 but instead rotates the values that the instruction at address 7 read from the weight RAM 124 into the multiplexed registers 705. Not accessing the weight RAM 124 during the 511 clock cycles in which the instruction at address 8 executes leaves more bandwidth for the architectural program to access the weight RAM 124.

During the first execution of the instructions at addresses 9 and 10 (MULT-ACCUM DR ROW+2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), which corresponds to time step 1, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in row 3 of the data RAM 122 by the weights in that neural processing unit's column of rows 0 through 511 of the weight RAM 124, to generate 512 products that, together with the accumulation of the 512 context node C values performed by the instructions at addresses 7 and 8, are accumulated in the accumulator 202 of that neural processing unit 126 to compute the hidden layer node Z values; during the first execution of the instruction at address 11 (OUTPUT PASSTHRU, DR OUT ROW+2, CLR ACC), the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to row 4 of the data RAM 122, and the accumulators 202 are cleared. During the second execution of the instructions at addresses 9 and 10 (corresponding to time step 2), the same operations are performed on the 512 input node D values in row 6 of the data RAM 122, and during the second execution of the instruction at address 11 the 512 accumulator 202 values are passed through and written to row 7 of the data RAM 122, and the accumulators 202 are cleared. And so forth, until during the nineteenth execution of the instructions at addresses 9 and 10 (corresponding to time step 19), the same operations are performed on the 512 input node D values in row 57 of the data RAM 122, and during the nineteenth execution of the instruction at address 11 the 512 accumulator 202 values are passed through and written to row 58 of the data RAM 122, and the accumulators 202 are cleared. As stated above, the hidden layer node Z values generated and written during the twentieth execution of the instructions at addresses 9 and 10 are not used.
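The arithmetic that the program of FIG. 42 carries out over the time steps can be summarized in a short Python sketch. It is an illustration only, not the program itself. It follows the assumptions stated in the text (no activation function on the hidden layer, unity weights from the context nodes C to the hidden nodes Z, and every hidden node accumulating all context values), while the function names `matvec` and `elman_run`, the list-of-lists weight layout (`w[i][j]` is the weight from node i to node j) and the default sigmoid are assumptions made for the example.

```python
import math


def matvec(w, x):
    """Weighted sums: out[j] = sum over i of w[i][j] * x[i]."""
    return [sum(w[i][j] * x[i] for i in range(len(x))) for j in range(len(w[0]))]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def elman_run(inputs, w_dz, w_zy, act=sigmoid):
    """Return the output Y vector for each time step of an Elman network."""
    outputs = []
    z = matvec(w_dz, inputs[0])                 # addresses 1-3: initial Z
    for t in range(len(inputs)):
        acc = matvec(w_zy, z)                   # addresses 4-5: Z times weights
        outputs.append([act(a) for a in acc])   # address 6: activation, write Y
        if t + 1 < len(inputs):
            context_sum = sum(z)                # addresses 7-8: add all C (= old Z)
            z = [context_sum + a                # addresses 9-11: add D times weights
                 for a in matvec(w_dz, inputs[t + 1])]
    return outputs


demo = elman_run([[1, 2], [3, 4]], [[1, 0], [0, 1]], [[1, 1], [1, 1]], act=lambda a: a)
```

The sketch skips the final, unused hidden layer computation that the hardware loop performs on its twentieth pass.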

The instruction at address 12 (LOOP 4) decrements the loop counter 3804 and loops back to the instruction at address 4 if the new loop counter 3804 value is greater than zero.
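The data RAM 122 rows touched on each pass of the loop can be checked with a small trace. The Python sketch below rests on an assumed addressing model that this passage does not spell out: the neural network unit is taken to keep one current read row and one current output row, with `DR ROW+n` adding n to the read row and `DR OUT ROW+n` adding n to the output row. The function name `trace_elman_rows` is hypothetical.

```python
def trace_elman_rows(loop_count=20):
    """Return, per loop pass, (Z row read, Y row written, C row read,
    D row read, Z row written) under the assumed row-pointer model."""
    read_row = 0   # address 1: MULT-ACCUM DR ROW 0
    out_row = 1    # address 3: OUTPUT PASSTHRU, DR OUT ROW 1
    trace = []
    for _ in range(loop_count):
        read_row += 1            # address 4: DR ROW+1 reads hidden Z
        z_read = read_row
        out_row += 1             # address 6: DR OUT ROW+1 writes output Y
        y_written = out_row
        c_read = read_row        # address 7: DR ROW+0 reads context C
        read_row += 2            # address 9: DR ROW+2 reads input D
        d_read = read_row
        out_row += 2             # address 11: DR OUT ROW+2 writes hidden Z
        z_written = out_row
        trace.append((z_read, y_written, c_read, d_read, z_written))
    return trace


rows = trace_elman_rows()
```

Under this model the first pass touches rows 1, 2, 1, 3 and 4, and the twentieth pass writes its output Y values to row 59 and its unused hidden Z values to row 61, matching the rows named in the text.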

FIG. 43 is a block diagram showing an example of a Jordan recurrent neural network. The Jordan recurrent neural network of FIG. 43 is similar to the Elman recurrent neural network of FIG. 40 in that it has input layer nodes/neurons D, hidden layer nodes/neurons Z, output layer nodes/neurons Y, and context layer nodes/neurons C. However, in the Jordan recurrent neural network of FIG. 43, the input connections of the context layer nodes C are fed back from the outputs of their corresponding output layer nodes Y, rather than from the outputs of the hidden layer nodes Z as in the Elman recurrent neural network of FIG. 40.

For purposes of the present disclosure, a Jordan recurrent neural network is a recurrent neural network that includes at least an input node layer, a hidden node layer, an output node layer and a context node layer. At the beginning of a given time step, the context node layer holds the results that the output node layer generated in the previous time step and fed back to the context node layer. The results fed back to the context layer may be the results of an activation function, or they may be the results of the accumulations performed by the output node layer without an activation function having been performed.

FIG. 44 is a block diagram showing an example of the layout of data in the data RAM 122 and the weight RAM 124 of the neural network unit 121 as it performs the computations associated with the Jordan recurrent neural network of FIG. 43. The example of FIG. 44 assumes the Jordan recurrent neural network of FIG. 43 has 512 input nodes D, 512 hidden nodes Z, 512 context nodes C and 512 output nodes Y. It further assumes the network is fully connected, i.e., all 512 input nodes D are connected as inputs to each hidden node Z, all 512 context nodes C are connected as inputs to each hidden node Z, and all 512 hidden nodes Z are connected as inputs to each output node Y. Although the example Jordan recurrent neural network of FIG. 44 applies an activation function to the accumulator 202 values to generate the output layer node Y values, the example assumes that the accumulator 202 values prior to application of the activation function, rather than the actual output layer node Y values, are passed to the context layer nodes C. In addition, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. Finally, the example assumes the weights associated with the connections from the context nodes C to the hidden nodes Z all have a value of one; consequently, there is no need to store these unity weight values.
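The difference from the Elman case, namely that the context receives the output layer accumulations before the activation function is applied, can be seen in a short Python sketch. It is illustrative only. The unity context weights and the pre-activation feedback follow the example's stated assumptions; taking the initial context as zero, applying no activation function to the hidden layer, and the names `matvec` and `jordan_run` are assumptions made for this sketch.

```python
def matvec(w, x):
    """Weighted sums: out[j] = sum over i of w[i][j] * x[i]."""
    return [sum(w[i][j] * x[i] for i in range(len(x))) for j in range(len(w[0]))]


def jordan_run(inputs, w_dz, w_zy, act):
    """Return the output Y vector for each time step of a Jordan network."""
    context = [0] * len(w_zy[0])      # initial context taken as zero (assumption)
    outputs = []
    for d in inputs:
        context_sum = sum(context)    # unity weights: every Z adds all C values
        z = [context_sum + a for a in matvec(w_dz, d)]
        y_acc = matvec(w_zy, z)       # output layer accumulations
        outputs.append([act(a) for a in y_acc])
        context = y_acc               # feed back the pre-activation values
    return outputs


# A doubling "activation" makes pre- and post-activation values easy to tell apart.
demo = jordan_run([[1, 2], [3, 4]], [[1, 0], [0, 1]], [[1, 1], [1, 1]], act=lambda a: 2 * a)
```

In the demo the second step adds 3 + 3 = 6 (the pre-activation accumulations) to each hidden node rather than 6 + 6 = 12 (the activated outputs).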

As in the example of FIG. 41, the lower 512 rows of the weight RAM 124 (rows 0 through 511) hold the weight values associated with the connections between the input nodes D and the hidden nodes Z, and the next 512 rows (rows 512 through 1023) hold the weight values associated with the connections between the hidden nodes Z and the output nodes Y, as shown.

The data RAM 122 holds the Jordan recurrent neural network node values for a sequence of time steps, similar to the example of FIG. 41; however, in the example of FIG. 44 a quadruplet of four rows holds the node values for a given time step. In an embodiment in which the data RAM 122 has 64 rows, the data RAM 122 can hold the node values for 15 different time steps, as shown. In the example of FIG. 44, rows 0 through 3 hold the node values for time step 0, rows 4 through 7 hold the node values for time step 1, and so forth to rows 60 through 63, which hold the node values for time step 15. The first row of a quadruplet holds the input node D values of the time step, the second row holds the hidden node Z values, the third row holds the context node C values, and the fourth row holds the output node Y values. As shown, each column of the data RAM 122 holds the node values for its corresponding neuron, or neural processing unit 126. That is, column 0 holds the node values associated with nodes D0, Z0, C0 and Y0, whose computations are performed by neural processing unit 0; column 1 holds the node values associated with nodes D1, Z1, C1 and Y1, whose computations are performed by neural processing unit 1; and so forth to column 511, which holds the node values associated with nodes D511, Z511, C511 and Y511, whose computations are performed by neural processing unit 511. This is described in more detail below with respect to FIG. 44.
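The four-row grouping of FIG. 44 can be expressed as an index calculation in the same way as the three-row grouping of FIG. 41. The Python sketch below is illustrative only, and the name `jordan_rows` is hypothetical.

```python
def jordan_rows(t):
    """Return the data RAM rows holding D, Z, C and Y for time step t."""
    base = 4 * t  # four rows per time step in the FIG. 44 layout
    return {"D": base, "Z": base + 1, "C": base + 2, "Y": base + 3}


first = jordan_rows(0)
last = jordan_rows(15)
```

For the time step numbered 15 in the text this gives rows 60 through 63, the last four rows of a 64-row data RAM.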

The context node C values of a given time step in FIG. 44 are generated in that time step and used as inputs in the next time step. That is, the node C value that a neural processing unit 126 computes and writes during a time step becomes the node C value that the neural processing unit 126 uses (along with the input node D value of the next time step) to compute the node Z value during the next time step. The initial value of the context nodes C (i.e., the node C values used to compute the row 1 node Z values in time step 0) is assumed to be zero. This is described in more detail below in the section on the non-architectural program of FIG. 45.

As described above with respect to Figure 41, the input node D values (rows 0, 4, and so on to row 60 in the example of Figure 44) are preferably written/populated into the data RAM 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 45. Conversely, the hidden node Z, context node C, and output node Y values (rows 1/2/3, 5/6/7, and so on to rows 61/62/63, respectively, in the example of Figure 44) are written/populated into the data RAM 122 by the non-architectural program running on the neural network unit 121, and are read/used by the architectural program running on the processor 100 via MFNN instructions 1500. The example of Figure 44 assumes that the architectural program performs the following steps: (1) populates the data RAM 122 with the input node D values for 15 different time steps (rows 0, 4, and so on to row 60); (2) starts the non-architectural program of Figure 45; (3) detects that the non-architectural program has completed; (4) reads the output node Y values out of the data RAM 122 (rows 3, 7, and so on to row 63); and (5) repeats steps (1) through (4) as many times as needed to complete the task, for example the computations used to recognize an utterance by a mobile phone user.
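The four-row grouping described above can be sketched as a small row-index helper. This is only an illustration: the function name `jordan_row`, the constant names, and the assumption of a 64-row data RAM 122 (implied by the row numbers 0 through 63 cited above) are hypothetical and not part of the disclosed hardware or its instruction set.

```python
# Illustrative sketch only; names are hypothetical.
ROWS_PER_GROUP = 4          # one row each for D, Z, C and Y
DATA_RAM_ROWS = 64          # assumed from the row numbers 0..63 cited in the text
NODES = ("D", "Z", "C", "Y")


def jordan_row(node, group):
    """Return the data RAM row holding `node` values in four-row group `group`."""
    num_groups = DATA_RAM_ROWS // ROWS_PER_GROUP
    if not 0 <= group < num_groups:
        raise ValueError("group does not fit in the data RAM")
    return ROWS_PER_GROUP * group + NODES.index(node)


# Rows the architectural program would populate (D) and read back (Y)
# for the first two groups.
for g in range(2):
    print(g, jordan_row("D", g), jordan_row("Y", g))
```

Under these assumptions the D rows fall on multiples of four and the Y rows on the last row of each group, which matches the rows 0, 4, ... and 3, 7, ... listed in the steps above.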

In an alternative approach, the architectural program performs the following steps: (1) populates the data RAM 122 with the input node D values for a single time step (e.g., row 0); (2) starts the non-architectural program (a modified version of the Figure 45 program that does not loop and accesses only a single group of four data RAM 122 rows); (3) detects that the non-architectural program has completed; (4) reads the output node Y values out of the data RAM 122 (e.g., row 3); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable depends on how the input values to the recurrent neural network are sampled. For example, if the task tolerates sampling the inputs over multiple time steps (e.g., on the order of 15) and then performing the computations, the first approach is preferable because it is likely more efficient in computational resources and/or higher performing; if the task tolerates sampling only at a single time step, the second approach is required.

A third embodiment is similar to the second approach, except that rather than using a single group of four data RAM 122 rows, the non-architectural program uses multiple groups of four rows, that is, a different group of four rows for each time step, which is similar to the first approach. In this third embodiment, the architectural program preferably includes a step before step (2) in which it updates the non-architectural program before starting it, for example by updating the data RAM 122 row in the instruction at address 1 to point to the next group of four rows.

Figure 45 is a table showing a program stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121, which uses data and weights according to the arrangement of Figure 44 to implement a Jordan recurrent neural network. The non-architectural program of Figure 45 is similar to that of Figure 42; the differences between the two are described in the relevant passages herein.

The example program of Figure 45 includes 14 non-architectural instructions at addresses 0 through 13. The instruction at address 0 is an initialize instruction that clears the accumulator 202 and initializes the loop counter 3804 to 15, so that the loop body (the instructions at addresses 4 through 12) is executed 15 times. Preferably, the initialize instruction also places the neural network unit 121 in a wide configuration, so that it is configured as 512 neural processing units 126. As described herein, the 512 neural processing units 126 correspond to and operate as the 512 hidden layer nodes Z during execution of the instructions at addresses 1 through 3 and 8 through 12, and correspond to and operate as the 512 output layer nodes Y during execution of the instructions at addresses 4, 5, and 7.

The instructions at addresses 1 through 5 and at address 7 are the same as the instructions at addresses 1 through 6 of Figure 42 and perform the same functions. The instructions at addresses 1 through 3 compute the initial values of the hidden layer nodes Z and write them to row 1 of the data RAM 122 for use by the first execution of the instructions at addresses 4, 5, and 7, which computes the output layer nodes Y of the first time step (time step 0).

During the first execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to row 2 of the data RAM 122; these are the context layer node C values produced in the first time step (time step 0) and used in the second time step (time step 1). During the second execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to row 6 of the data RAM 122; these are the context layer node C values produced in the second time step (time step 1) and used in the third time step (time step 2). And so on, until during the fifteenth execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to row 58 of the data RAM 122; these are the context layer node C values produced in the fifteenth time step (time step 14), which are read by the instruction at address 8 but not used.

The instructions at addresses 8 through 12 are largely the same as the instructions at addresses 7 through 11 of Figure 42 and perform the same functions, with one difference: the instruction at address 8 of Figure 45 (ADD_D_ACC DR ROW+1) increments the data RAM 122 row by one, whereas the instruction at address 7 of Figure 42 (ADD_D_ACC DR ROW+0) increments the data RAM 122 row by zero. The difference arises from the different data layouts in the data RAM 122. Specifically, the four-row groups of Figure 44 include a separate row for the context layer node C values (e.g., rows 2, 6, 10, and so on), whereas the three-row groups of Figure 41 have no such row; instead, the context layer node C values share a row with the hidden layer node Z values (e.g., rows 1, 4, 7, and so on). The fifteen executions of the instructions at addresses 8 through 12 compute the hidden layer node Z values and write them to the data RAM 122 (rows 5, 9, 13, and so on to row 57) for use by the second through fifteenth executions of the instructions at addresses 4, 5, and 7, which compute the output layer nodes Y of the second through fifteenth time steps (time steps 1 through 14). (The last, fifteenth execution of the instructions at addresses 8 through 12 computes hidden layer node Z values and writes them to row 61 of the data RAM 122, but those values are not used.)

The loop instruction at address 13 decrements the loop counter 3804 and loops back to the instruction at address 4 if the new loop counter 3804 value is greater than zero.
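The loop behavior and the rows written on each pass can be traced with a short sketch. It is a simplified model under stated assumptions: only the row bookkeeping is modeled (no arithmetic), the function name `trace_jordan_loop` is hypothetical, and the row formulas are taken from the rows cited in the text (context rows 2, 6, ... 58 and hidden rows 5, 9, ... 61).

```python
def trace_jordan_loop(loop_count=15):
    """Trace the data RAM rows written by each pass of the Figure 45 loop body.

    Returns (c_rows, z_rows): rows written by the output instruction at
    address 6 and rows written by the instructions at addresses 8-12.
    """
    c_rows, z_rows = [], []
    counter = loop_count      # set by the initialize instruction at address 0
    iteration = 0
    while True:
        # Address 6: accumulator values go to this time step's context row.
        c_rows.append(2 + 4 * iteration)
        # Addresses 8-12: hidden node values go to the next group's Z row.
        z_rows.append(5 + 4 * iteration)
        iteration += 1
        # Address 13: decrement, then loop back to address 4 only if > 0.
        counter -= 1
        if counter <= 0:
            break
    return c_rows, z_rows


c_rows, z_rows = trace_jordan_loop()
print(len(c_rows), c_rows[:3], c_rows[-1], z_rows[-1])
```

Because the counter is tested after the decrement, a counter initialized to 15 yields exactly 15 passes through the loop body, with the last pass writing hidden node values that are never consumed.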

In another embodiment, the Jordan recurrent neural network is designed so that the context nodes C hold the activation function values of the output nodes Y, that is, the accumulated values after the activation function has been applied. In this embodiment, the non-architectural instruction at address 6 is not included in the non-architectural program, because the output node Y values are the same as the context node C values. Consequently, fewer rows of the data RAM 122 are used. More precisely, the rows of Figure 44 that hold context node C values (e.g., rows 2, 6, and so on) are not present in this embodiment. In addition, each time step requires only three rows of the data RAM 122, so that 20 time steps are accommodated rather than 15, and the addresses of the instructions of the non-architectural program of Figure 45 are adjusted accordingly.

Long Short-Term Memory Cells

The use of long short-term memory (LSTM) cells in recurrent neural networks is well known in the art. See, for example, Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, Neural Computation, November 15, 1997, Vol. 9, No. 8, Pages 1735-1780; and Learning to Forget: Continual Prediction with LSTM, Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins, Neural Computation, October 2000, Vol. 12, No. 10, Pages 2451-2471; both are available from MIT Press Journals. LSTM cells may be constructed in various forms. The LSTM cell 4600 of Figure 46 described below is modeled after the LSTM cell described in the tutorial titled LSTM Networks for Sentiment Analysis at http://deeplearning.net/tutorial/lstm.html, a copy of which was downloaded on October 19, 2015 (hereinafter the "LSTM tutorial") and is provided in an Information Disclosure Statement of the corresponding US application. The LSTM cell 4600 is used to illustrate generally the ability of the neural network unit 121 embodiments described herein to efficiently perform computations associated with LSTMs. It should be noted that the neural network unit 121 embodiments, including the embodiment of Figure 49, can also efficiently perform computations associated with LSTM cells other than the one described in Figure 46.

Preferably, the neural network unit 121 may be used to perform computations for a recurrent neural network that has a layer of LSTM cells connected to other layers. For example, in the LSTM tutorial, the network includes a mean pooling layer that receives the outputs (H) of the LSTM cells of the LSTM layer, and a logistic regression layer that receives the output of the mean pooling layer.

Figure 46 is a block diagram showing an embodiment of a long short-term memory (LSTM) cell 4600.

As shown, the LSTM cell 4600 includes a cell input (X), a cell output (H), an input gate (I), an output gate (O), a forget gate (F), a cell state (C), and a candidate cell state (C'). The input gate (I) gates the passing of the cell input (X) to the cell state (C), and the output gate (O) gates the passing of the cell state (C) to the cell output (H). The cell state (C) is fed back to the candidate cell state (C') of a time step. The forget gate (F) gates the candidate cell state (C'), which is fed back and becomes the cell state (C) of the next time step.

The embodiment of Figure 46 uses the following equations to compute the various values described above:

(1) I = SIGMOID(Wi * X + Ui * H + Bi)

(2) F = SIGMOID(Wf * X + Uf * H + Bf)

(3) C' = TANH(Wc * X + Uc * H + Bc)

(4) C = I * C' + F * C

(5) O = SIGMOID(Wo * X + Uo * H + Bo)

(6) H = O * TANH(C)

Wi and Ui are the weight values associated with the input gate (I), and Bi is the bias value associated with the input gate (I). Wf and Uf are the weight values associated with the forget gate (F), and Bf is the bias value associated with the forget gate (F). Wo and Uo are the weight values associated with the output gate (O), and Bo is the bias value associated with the output gate (O). As shown, equations (1), (2), and (5) compute the input gate (I), the forget gate (F), and the output gate (O), respectively. Equation (3) computes the candidate cell state (C'), and equation (4) computes the new cell state (C) using the current cell state (C), that is, the cell state (C) of the current time step, as an input. Equation (6) computes the cell output (H). However, the invention is not limited to this; embodiments of LSTM cells that compute the input gate, forget gate, output gate, candidate cell state, cell state, and cell output in other ways are also contemplated.
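Equations (1) through (6) can be illustrated for a single cell with a short floating-point sketch. This is not the hardware implementation: the function names `sigmoid` and `lstm_cell_step` and the dictionary-of-scalars weight format are assumptions made for illustration, and each weight is treated as one scalar per cell.

```python
import math


def sigmoid(v):
    """Logistic function used for the three gates."""
    return 1.0 / (1.0 + math.exp(-v))


def lstm_cell_step(x, h_prev, c_prev, w):
    """Apply equations (1)-(6) once for a single cell.

    x      : cell input X of the current time step
    h_prev : cell output H of the previous time step
    c_prev : cell state C of the previous time step
    w      : dict of scalars Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo, Bo
    Returns the new (H, C).
    """
    i = sigmoid(w["Wi"] * x + w["Ui"] * h_prev + w["Bi"])          # (1)
    f = sigmoid(w["Wf"] * x + w["Uf"] * h_prev + w["Bf"])          # (2)
    c_cand = math.tanh(w["Wc"] * x + w["Uc"] * h_prev + w["Bc"])   # (3)
    c = i * c_cand + f * c_prev                                    # (4)
    o = sigmoid(w["Wo"] * x + w["Uo"] * h_prev + w["Bo"])          # (5)
    h = o * math.tanh(c)                                           # (6)
    return h, c


# Usage: carry H and C from one time step to the next.
weights = {name: 0.1 for name in
           ("Wi", "Ui", "Bi", "Wf", "Uf", "Bf", "Wc", "Uc", "Bc", "Wo", "Uo", "Bo")}
h, c = 0.0, 0.0
for x in (1.0, 0.5, -0.25):
    h, c = lstm_cell_step(x, h, c, weights)
```

Note that the C on the right-hand side of equation (4) is the state carried in from the previous step, which is why the sketch passes `c_prev` in and returns the updated state separately.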

For purposes of describing the invention, an LSTM cell includes a cell input, a cell output, a cell state, a candidate cell state, an input gate, an output gate, and a forget gate. For each time step, the input gate, output gate, forget gate, and candidate cell state are functions of the cell input of the current time step, the cell output of the previous time step, and the associated weights. The cell state of the time step is a function of the cell state of the previous time step, the candidate cell state, the input gate, and the forget gate. In this sense, the cell state is fed back and used to compute the cell state of the next time step. The cell output of the time step is a function of the cell state computed for the time step and the output gate. An LSTM neural network is a neural network that has a layer of LSTM cells.

Figure 47 is a block diagram showing an example of the layout of data in the data RAM 122 and the weight RAM 124 of the neural network unit 121 as it performs the computations associated with a layer of LSTM cells 4600 of Figure 46 in an LSTM neural network. In the example of Figure 47, the neural network unit 121 is configured as 512 neural processing units 126, or neurons (e.g., in a wide configuration); however, only the values produced by 128 of the neural processing units 126 (e.g., neural processing units 0 through 127) are used, because the LSTM layer of this example has only 128 LSTM cells 4600.

As shown, the weight RAM 124 holds weight, bias, and intermediate values for the corresponding neural processing units 0 through 127 of the neural network unit 121; columns 0 through 127 of the weight RAM 124 hold the values for the corresponding neural processing units 0 through 127. Rows 0 through 14 each hold 128 of the following respective values, corresponding to equations (1) through (6) above, for provision to neural processing units 0 through 127: Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, C', TANH(C), C, Wo, Uo, Bo. Preferably, the weight and bias values Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo, and Bo (in rows 0 through 8 and rows 12 through 14) are written/populated into the weight RAM 124 by the architectural program running on the processor 100 via MTNN instructions 1400 and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the intermediate values C', TANH(C), and C (in rows 9 through 11) are written/populated into the weight RAM 124, and also read/used, by the non-architectural program running on the neural network unit 121, as described in more detail below.
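The weight RAM 124 row assignment listed above can be written down as a lookup table. This is a minimal sketch for illustration; the names `WEIGHT_RAM_ROWS`, `ROW_OF`, and `INTERMEDIATE` are hypothetical, and the ordering is taken directly from the list of values for rows 0 through 14.

```python
# Row order taken from the list above: row 0 holds Wi, row 14 holds Bo.
WEIGHT_RAM_ROWS = ["Wi", "Ui", "Bi", "Wf", "Uf", "Bf", "Wc", "Uc", "Bc",
                   "C'", "TANH(C)", "C", "Wo", "Uo", "Bo"]
ROW_OF = {name: row for row, name in enumerate(WEIGHT_RAM_ROWS)}

# Intermediate values are produced and consumed by the non-architectural
# program; all other rows are populated by the architectural program.
INTERMEDIATE = {"C'", "TANH(C)", "C"}
host_rows = sorted(r for name, r in ROW_OF.items() if name not in INTERMEDIATE)
nnu_rows = sorted(r for name, r in ROW_OF.items() if name in INTERMEDIATE)

print(host_rows)
print(nnu_rows)
```

Splitting the table this way reproduces the two groups named in the text: rows 0 through 8 plus 12 through 14 for weights and biases, and rows 9 through 11 for intermediate values.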

As shown, the data RAM 122 holds the input (X), output (H), input gate (I), forget gate (F), and output gate (O) values for a sequence of time steps. More specifically, a group of five rows holds the X, H, I, F, and O values for a given time step. In an embodiment in which the data RAM 122 has 64 rows, the data RAM 122 can hold the cell values for 12 different time steps, as shown. In the example of Figure 47, rows 0 through 4 hold the cell values for time step 0, rows 5 through 9 hold the cell values for time step 1, and so on to rows 55 through 59, which hold the cell values for time step 11. The first row of a five-row group holds the X values of the time step, the second row holds the H values, the third row holds the I values, the fourth row holds the F values, and the fifth row holds the O values. As shown, each column of the data RAM 122 holds the values used by its corresponding neuron, or neural processing unit 126. That is, column 0 holds the values associated with LSTM cell 0, whose computations are performed by neural processing unit 0; column 1 holds the values associated with LSTM cell 1, whose computations are performed by neural processing unit 1; and so on to column 127, which holds the values associated with LSTM cell 127, whose computations are performed by neural processing unit 127, as described in more detail below with respect to Figure 48.
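The five-row grouping of the data RAM 122 can be sketched as a row-index helper. The names `lstm_row`, `FIELDS`, and `DATA_RAM_ROWS` are hypothetical and used only for illustration; the 64-row size and the X, H, I, F, O ordering come from the text.

```python
DATA_RAM_ROWS = 64                      # size used in the Figure 47 example
FIELDS = ("X", "H", "I", "F", "O")      # order of rows within a group
MAX_TIME_STEPS = DATA_RAM_ROWS // len(FIELDS)   # 64 // 5 = 12 full groups


def lstm_row(field, t):
    """Return the data RAM row holding `field` values for time step t."""
    if not 0 <= t < MAX_TIME_STEPS:
        raise ValueError("time step does not fit in the data RAM")
    return len(FIELDS) * t + FIELDS.index(field)


# Example: the rows used by time step 1.
print([lstm_row(f, 1) for f in FIELDS])
```

Integer division of 64 rows by five rows per group gives the 12 time steps stated in the text, leaving rows 60 through 63 unused in this layout.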

Preferably, the X values (in rows 0, 5, 10, and so on to row 55) are written/populated into the data RAM 122 by the architectural program running on the processor 100 via MTNN instructions 1400 and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the I, F, and O values (in rows 2/3/4, 7/8/9, 12/13/14, and so on to rows 57/58/59) are written/populated into the data RAM 122 by the non-architectural program running on the neural network unit 121, as described below. Preferably, the H values (in rows 1, 6, 11, and so on to row 56) are written/populated into the data RAM 122, and also read/used, by the non-architectural program running on the neural network unit 121, and are read by the architectural program running on the processor 100 via MFNN instructions 1500.

The example of Figure 47 assumes that the architectural program performs the following steps: (1) populates the data RAM 122 with the input X values for 12 different time steps (rows 0, 5, and so on to row 55); (2) starts the non-architectural program of Figure 48; (3) detects that the non-architectural program has completed; (4) reads the output H values out of the data RAM 122 (rows 1, 6, and so on to row 56); and (5) repeats steps (1) through (4) as many times as needed to complete the task, for example the computations used to recognize an utterance by a mobile phone user.

In an alternative approach, the architectural program performs the following steps: (1) populates the data RAM 122 with the input X values for a single time step (e.g., row 0); (2) starts the non-architectural program (a modified version of the Figure 48 program that does not loop and accesses only a single group of five data RAM 122 rows); (3) detects that the non-architectural program has completed; (4) reads the output H values out of the data RAM 122 (e.g., row 1); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable depends on how the input X values to the LSTM layer are sampled. For example, if the task tolerates sampling the inputs over multiple time steps (e.g., on the order of 12) and then performing the computations, the first approach is preferable because it is likely more efficient in computational resources and/or higher performing; if the task tolerates sampling only at a single time step, the second approach is required.

A third embodiment is similar to the second approach, except that rather than using a single group of five data RAM 122 rows, the non-architectural program uses multiple groups of five rows, that is, a different group of five rows for each time step, which is similar to the first approach. In this third embodiment, the architectural program preferably includes a step before step (2) in which it updates the non-architectural program before starting it, for example by updating the data RAM 122 row in the instruction at address 0 to point to the next group of five rows.

Figure 48 is a table showing a program stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121, which uses data and weights according to the arrangement of Figure 47 to perform the computations associated with an LSTM cell layer. The example program of Figure 48 includes 24 non-architectural instructions at addresses 0 through 23. The instruction at address 0 (INITIALIZE NPU, CLR ACC, LOOPCNT=12, DR IN ROW=-1, DR OUT ROW=2) clears the accumulator 202 and initializes the loop counter 3804 to 12, so that the loop body (the instructions at addresses 1 through 22) is executed 12 times. The initialize instruction also initializes the data RAM 122 row to be read to -1, which is incremented to zero by the first execution of the instruction at address 1. The initialize instruction also initializes the data RAM 122 row to be written (e.g., register 2606 of Figures 26 and 39) to row 2. Preferably, the initialize instruction also places the neural network unit 121 in a wide configuration, so that it is configured as 512 neural processing units 126. As described below, 128 of the 512 neural processing units 126 correspond to and operate as the 128 LSTM cells 4600 during execution of the instructions at addresses 0 through 23.

During the first execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 (i.e., neural processing units 0 through 127) computes the input gate (I) value of its corresponding LSTM cell 4600 for the first time step (time step 0) and writes the I value to the corresponding word of row 2 of the data RAM 122; during the second execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 computes the I value of its corresponding LSTM cell 4600 for the second time step (time step 1) and writes it to the corresponding word of row 7 of the data RAM 122; and so on, until during the twelfth execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 computes the I value of its corresponding LSTM cell 4600 for the twelfth time step (time step 11) and writes it to the corresponding word of row 57 of the data RAM 122, as shown in Figure 47.

More specifically, the multiply-accumulate instruction at address 1 reads the next row after the current row of the data RAM 122 (row 0 in the first execution, row 5 in the second execution, and so on to row 55 in the twelfth execution), which contains the cell input (X) values associated with the current time step, reads row 0 of the weight RAM 124, which contains the Wi values, and multiplies them to generate a first product that is accumulated into the accumulator 202, which was just cleared by the initialize instruction at address 0 or by the instruction at address 22. Next, the multiply-accumulate instruction at address 2 reads the next row of the data RAM 122 (row 1 in the first execution, row 6 in the second execution, and so on to row 56 in the twelfth execution), which contains the cell output (H) values associated with the current time step, reads row 1 of the weight RAM 124, which contains the Ui values, and multiplies them to generate a second product that is accumulated into the accumulator 202. The H values associated with the current time step, which the instruction at address 2 (and the instructions at addresses 6, 10 and 18) reads from the data RAM 122, were generated during the previous time step and written to the data RAM 122 by the output instruction at address 22; in the first execution, however, the instruction at address 2 uses as the H values an initial value written to row 1 of the data RAM 122. Preferably, the architectural program writes the initial H values to row 1 of the data RAM 122 (e.g., using MTNN instructions 1400) before starting the non-architectural program of FIG. 48; however, the invention is not limited to this, and other embodiments in which the non-architectural program includes initialize instructions that write the initial H values to row 1 of the data RAM 122 are also within the scope of the invention. In one embodiment, the initial H value is zero. Next, the add-weight-word-to-accumulator instruction at address 3 (ADD_W_ACC WR ROW 2) reads row 2 of the weight RAM 124, which contains the Bi values, and adds them to the accumulator 202. Finally, the output instruction at address 4 (OUTPUT SIGMOID, DR OUT ROW +0, CLR ACC) performs a sigmoid activation function on the accumulator 202 value, writes the result to the current output row of the data RAM 122 (row 2 in the first execution, row 7 in the second execution, and so on to row 57 in the twelfth execution), and clears the accumulator 202.
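
The four instructions at addresses 1 through 4 amount to one gate computation per neural processing unit. The sketch below models a single neural processing unit in plain Python under the assumption that each operand is a scalar word; the function names are hypothetical and the floating-point arithmetic stands in for whatever word format the hardware uses.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def input_gate(x, h, wi, ui, bi):
    """Model addresses 1-4 for one neural processing unit."""
    acc = 0.0            # accumulator already cleared (address 0 or address 22)
    acc += wi * x        # address 1: multiply-accumulate Wi * X
    acc += ui * h        # address 2: multiply-accumulate Ui * H
    acc += bi            # address 3: add the bias word Bi
    return sigmoid(acc)  # address 4: sigmoid output (the hardware then clears acc)
```

The forget gate (addresses 5 through 8) and the output gate (addresses 17 through 20) follow the same pattern with their own weight rows.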

In the first execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the forget gate (F) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the F value to the corresponding word of row 3 of the data RAM 122. In the second execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the F value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes it to the corresponding word of row 8 of the data RAM 122. And so on, until in the twelfth execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the F value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes it to the corresponding word of row 58 of the data RAM 122, as shown in FIG. 47. The instructions at addresses 5 through 8 compute the F value in a manner similar to the instructions at addresses 1 through 4 described above, except that the instructions at addresses 5 through 7 read the Wf, Uf and Bf values from rows 3, 4 and 5, respectively, of the weight RAM 124 to perform the multiply and/or add operations.

In each of the twelve executions of the instructions at addresses 9 through 12, each of the 128 neural processing units 126 computes the candidate cell state (C') value for the corresponding time step of its corresponding LSTM cell 4600 and writes the C' value to the corresponding word of row 9 of the weight RAM 124. The instructions at addresses 9 through 12 compute the C' value in a manner similar to the instructions at addresses 1 through 4 described above, except that the instructions at addresses 9 through 11 read the Wc, Uc and Bc values from rows 6, 7 and 8, respectively, of the weight RAM 124 to perform the multiply and/or add operations. Additionally, the output instruction at address 12 performs a hyperbolic tangent activation function rather than the sigmoid activation function performed by the output instruction at address 4.

More specifically, the multiply-accumulate instruction at address 9 reads the current row of the data RAM 122 (row 0 in the first execution, row 5 in the second execution, and so on to row 55 in the twelfth execution), which contains the cell input (X) values associated with the current time step, reads row 6 of the weight RAM 124, which contains the Wc values, and multiplies them to generate a first product that is accumulated into the accumulator 202, which was just cleared by the instruction at address 8. Next, the multiply-accumulate instruction at address 10 reads the next row of the data RAM 122 (row 1 in the first execution, row 6 in the second execution, and so on to row 56 in the twelfth execution), which contains the cell output (H) values associated with the current time step, reads row 7 of the weight RAM 124, which contains the Uc values, and multiplies them to generate a second product that is accumulated into the accumulator 202. Next, the add-weight-word-to-accumulator instruction at address 11 reads row 8 of the weight RAM 124, which contains the Bc values, and adds them to the accumulator 202. Finally, the output instruction at address 12 (OUTPUT TANH, WR OUT ROW 9, CLR ACC) performs a hyperbolic tangent activation function on the accumulator 202 value, writes the result to row 9 of the weight RAM 124, and clears the accumulator 202.

In each of the twelve executions of the instructions at addresses 13 through 16, each of the 128 neural processing units 126 computes the new cell state (C) value for the corresponding time step of its corresponding LSTM cell 4600 and writes the new C value to the corresponding word of row 11 of the weight RAM 124; each neural processing unit 126 also computes tanh(C) and writes it to the corresponding word of row 10 of the weight RAM 124. More specifically, the multiply-accumulate instruction at address 13 reads the next row after the current row of the data RAM 122 (row 2 in the first execution, row 7 in the second execution, and so on to row 57 in the twelfth execution), which contains the input gate (I) values associated with the current time step, reads row 9 of the weight RAM 124, which contains the candidate cell state (C') values (just written by the instruction at address 12), and multiplies them to generate a first product that is accumulated into the accumulator 202, which was just cleared by the instruction at address 12. Next, the multiply-accumulate instruction at address 14 reads the next row of the data RAM 122 (row 3 in the first execution, row 8 in the second execution, and so on to row 58 in the twelfth execution), which contains the forget gate (F) values associated with the current time step, reads row 11 of the weight RAM 124, which contains the current cell state (C) values computed during the previous time step (written by the most recent execution of the instruction at address 15), and multiplies them to generate a second product that is added to the accumulator 202. Next, the output instruction at address 15 (OUTPUT PASSTHRU, WR OUT ROW 11) passes through the accumulator 202 value and writes it to row 11 of the weight RAM 124. It should be understood that the C values read by the instruction at address 14 from row 11 of the weight RAM 124 are the C values generated and written by the most recent execution of the instructions at addresses 13 through 15. The output instruction at address 15 does not clear the accumulator 202, so that its value can be used by the instruction at address 16. Finally, the output instruction at address 16 (OUTPUT TANH, WR OUT ROW 10, CLR ACC) performs a hyperbolic tangent activation function on the accumulator 202 value and writes the result to row 10 of the weight RAM 124 for use by the instruction at address 21, which computes the cell output (H) values. The instruction at address 16 clears the accumulator 202.
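
The cell state update at addresses 13 through 16 can be summarized as C = I * C' + F * C_prev followed by tanh(C). The sketch below models it for one neural processing unit, assuming scalar operands; the function name `update_cell_state` is hypothetical.

```python
import math


def update_cell_state(i_gate, f_gate, c_candidate, c_prev):
    """Model addresses 13-16 for one neural processing unit."""
    acc = 0.0                    # accumulator cleared by the instruction at address 12
    acc += i_gate * c_candidate  # address 13: I (data RAM) times C' (weight RAM row 9)
    acc += f_gate * c_prev       # address 14: F (data RAM) times C (weight RAM row 11)
    c_new = acc                  # address 15: pass-through to weight RAM row 11, acc kept
    tanh_c = math.tanh(acc)      # address 16: tanh to weight RAM row 10, then acc cleared
    return c_new, tanh_c
```

Keeping the accumulator intact at address 15 is what lets address 16 reuse the same sum without reading row 11 back.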

In the first execution of the instructions at addresses 17 through 20, each of the 128 neural processing units 126 computes the output gate (O) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the O value to the corresponding word of row 4 of the data RAM 122. In the second execution of the instructions at addresses 17 through 20, each of the 128 neural processing units 126 computes the O value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes it to the corresponding word of row 9 of the data RAM 122. And so on, until in the twelfth execution of the instructions at addresses 17 through 20, each of the 128 neural processing units 126 computes the O value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes it to the corresponding word of row 59 of the data RAM 122, as shown in FIG. 47. The instructions at addresses 17 through 20 compute the O value in a manner similar to the instructions at addresses 1 through 4 described above, except that the instructions at addresses 17 through 19 read the Wo, Uo and Bo values from rows 12, 13 and 14, respectively, of the weight RAM 124 to perform the multiply and/or add operations.

In the first execution of the instructions at addresses 21 through 22, each of the 128 neural processing units 126 computes the cell output (H) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the H value to the corresponding word of row 6 of the data RAM 122. In the second execution of the instructions at addresses 21 through 22, each of the 128 neural processing units 126 computes the H value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes it to the corresponding word of row 11 of the data RAM 122. And so on, until in the twelfth execution of the instructions at addresses 21 through 22, each of the 128 neural processing units 126 computes the H value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes it to the corresponding word of row 61 of the data RAM 122, as shown in FIG. 47.

More specifically, the multiply-accumulate instruction at address 21 reads the third row after the current row of the data RAM 122 (row 4 in the first execution, row 9 in the second execution, and so on to row 59 in the twelfth execution), which contains the output gate (O) values associated with the current time step, reads row 10 of the weight RAM 124, which contains the tanh(C) values (written by the instruction at address 16), and multiplies them to generate a product that is accumulated into the accumulator 202, which was just cleared by the instruction at address 20. Then the output instruction at address 22 passes through the accumulator 202 value, writes it to the second output row after the current output row of the data RAM 122 (row 6 in the first execution, row 11 in the second execution, and so on to row 61 in the twelfth execution), and clears the accumulator 202. It should be understood that the H values written to a row of the data RAM 122 by the instruction at address 22 (row 6 in the first execution, row 11 in the second execution, and so on to row 61 in the twelfth execution) are the H values consumed/read by the subsequent execution of the instructions at addresses 2, 6, 10 and 18. However, the H values written to row 61 in the twelfth execution are not consumed/read by an execution of the instructions at addresses 2, 6, 10 and 18; rather, in a preferred embodiment, they are consumed/read by the architectural program.

The instruction at address 23 (LOOP 1) decrements the loop counter 3804 and loops back to the instruction at address 1 if the new loop counter 3804 value is greater than zero.
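
Putting the loop body together, the following is a rough software model of what one neural processing unit computes across the whole program of FIG. 48. It is a sketch under several assumptions that the text does not state: operands are Python floats, the initial cell state `c0` and the initial loop count (taken here as the number of inputs, which must be at least one) are parameters, and the data RAM is modeled as a dictionary keyed by row number. The names `run_lstm_cell`, `w`, `h0` and `c0` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_lstm_cell(xs, w, h0=0.0, c0=0.0):
    """Simulate one cell for len(xs) time steps; return the data RAM as {row: value}."""
    data_ram = {1: h0}                  # row 1: initial H value
    for t, x in enumerate(xs):
        data_ram[5 * t] = x             # X values in rows 0, 5, 10, ...
    c = c0                              # assumed initial cell state (weight RAM row 11)
    base = 0
    loop_counter = len(xs)              # assumed initial loop count
    while True:
        x, h = data_ram[base], data_ram[base + 1]
        i = sigmoid(w["Wi"] * x + w["Ui"] * h + w["Bi"])          # addresses 1-4
        f = sigmoid(w["Wf"] * x + w["Uf"] * h + w["Bf"])          # addresses 5-8
        c_cand = math.tanh(w["Wc"] * x + w["Uc"] * h + w["Bc"])   # addresses 9-12
        c = i * c_cand + f * c                                    # addresses 13-15
        o = sigmoid(w["Wo"] * x + w["Uo"] * h + w["Bo"])          # addresses 17-20
        data_ram[base + 2], data_ram[base + 3], data_ram[base + 4] = i, f, o
        data_ram[base + 6] = o * math.tanh(c)                     # addresses 16, 21-22
        base += 5
        loop_counter -= 1                                         # address 23 (LOOP 1)
        if loop_counter <= 0:
            break
    return data_ram
```

With twelve inputs the highest row written is 61, which holds the final H value that the architectural program would read.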

FIG. 49 is a block diagram illustrating an embodiment of a neural network unit 121 having output buffer masking and feedback capability within neural processing unit groups. FIG. 49 shows a single neural processing unit group 4901 made up of four neural processing units 126. Although FIG. 49 shows only a single neural processing unit group 4901, it should be understood that every neural processing unit 126 of the neural network unit 121 belongs to a neural processing unit group 4901, so that there are N/J neural processing unit groups 4901 in total, where N is the number of neural processing units 126 (e.g., 512 in a wide configuration or 1024 in a narrow configuration) and J is the number of neural processing units 126 in a single group 4901 (e.g., four in the embodiment of FIG. 49). In FIG. 49, the four neural processing units 126 of the neural processing unit group 4901 are referred to as neural processing unit 0, neural processing unit 1, neural processing unit 2 and neural processing unit 3.

Each neural processing unit in the embodiment of FIG. 49 is similar to the neural processing unit 126 of FIG. 7 described above, and like-numbered elements are similar. However, the multiplexed register 208 is modified to include four additional inputs 4905, the multiplexed register 705 is modified to include four additional inputs 4907, the select input 213 is modified to select among the original inputs 211 and 207 and the additional inputs 4905 for provision on the output 209, and the select input 713 is modified to select among the original inputs 711 and 206 and the additional inputs 4907 for provision on the output 203.

As shown, the row buffer 1104 of FIG. 11 is referred to as the output buffer 1104 in FIG. 49. More specifically, words 0, 1, 2 and 3 of the output buffer 1104 are shown, which receive the respective outputs of the four activation function units 212 associated with neural processing units 0, 1, 2 and 3. The portion of the output buffer 1104 comprising the J words that correspond to a neural processing unit group 4901 is referred to as an output buffer word group. In the embodiment of FIG. 49, J is four. The four words of the output buffer 1104 are fed back to the multiplexed registers 208 and 705, where they are received as the four additional inputs 4905 by the multiplexed register 208 and as the four additional inputs 4907 by the multiplexed register 705. Feeding an output buffer word group back to its corresponding neural processing unit group 4901 enables an arithmetic instruction of a non-architectural program to select as its inputs one or two of the words of the output buffer 1104 associated with the neural processing unit group 4901 (i.e., of the output buffer word group); see, for example, the instructions at addresses 4, 8, 11, 12 and 15 of the non-architectural program of FIG. 51 below. That is, the output buffer 1104 word specified in the non-architectural instruction determines the value generated on the select inputs 213/713. This capability effectively enables the output buffer 1104 to serve as a kind of scratch pad memory, which lets a non-architectural program reduce the number of writes to the data RAM 122 and/or weight RAM 124 and the number of subsequent reads from them, e.g., of intermediately generated and used values. Preferably, the output buffer 1104, or row buffer 1104, comprises a one-dimensional array of registers configured to store either 1024 narrow words or 512 wide words. Preferably, the output buffer 1104 can be read in a single clock cycle and written in a single clock cycle. Unlike the data RAM 122 and the weight RAM 124, which are accessible by both the architectural program and the non-architectural program, the output buffer 1104 is not accessible by the architectural program but only by the non-architectural program.

The output buffer 1104 is modified to receive a mask input 4903. Preferably, the mask input 4903 includes four bits corresponding to the four words of the output buffer 1104 that are associated with the four neural processing units 126 of the neural processing unit group 4901. Preferably, if the mask input 4903 bit corresponding to a word of the output buffer 1104 is true, the word of the output buffer 1104 retains its current value; otherwise, the word of the output buffer 1104 is updated with the output of the activation function unit 212. That is, if the mask input 4903 bit corresponding to a word of the output buffer 1104 is false, the output of the activation function unit 212 is written to the word of the output buffer 1104. In this way, an output instruction of a non-architectural program can selectively write the output of the activation function units 212 to some words of the output buffer 1104 while leaving the current values of the other words of the output buffer 1104 unchanged; see, for example, the instructions at addresses 6, 10, 13 and 14 of the non-architectural program of FIG. 51 below. That is, the output buffer 1104 words specified in the non-architectural instruction determine the value generated on the mask input 4903.
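
The masking rule can be stated in a few lines of software. The sketch below is illustrative only: it models one output buffer word group as a Python list and the mask input 4903 as a list of booleans, with the function name `update_output_buffer` chosen for this example.

```python
def update_output_buffer(words, afu_outputs, mask):
    """Return the new word group: a true mask bit keeps the old word,
    a false mask bit takes the activation function unit output."""
    return [old if keep else new
            for old, new, keep in zip(words, afu_outputs, mask)]
```

For example, with a mask of `[True, False, False, True]` only the two middle words are overwritten, which is how a single output instruction can refresh part of the word group while preserving intermediate values held in the rest.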

For simplicity, the inputs 1811 of the multiplexed registers 208/705 (shown in FIGS. 18, 19 and 23) are not shown in FIG. 49. However, embodiments that support both dynamically configurable neural processing units 126 and feedback/masking of the output buffer 1104 are also within the scope of the invention. Preferably, in such embodiments the output buffer word groups are correspondingly dynamically configurable.

It should be understood that although the number of neural processing units 126 in a neural processing unit group 4901 is four in this embodiment, the invention is not limited to this, and embodiments with more or fewer neural processing units 126 per group are within the scope of the invention. Furthermore, in an embodiment with shared activation function units 1112, such as shown in FIG. 52, there may be a synergistic relationship between the number of neural processing units 126 in a neural processing unit group 4901 and the number of neural processing units 126 in an activation function unit 212 group. The masking and feedback capability of the output buffer 1104 within neural processing unit groups is particularly helpful for improving the efficiency of computations associated with LSTM cells 4600, as described in more detail with respect to FIGS. 50 and 51 below.

FIG. 50 is a block diagram illustrating an example of the layout of data within the data RAM 122, weight RAM 124 and output buffer 1104 of the neural network unit 121 of FIG. 49 as it performs computations associated with a layer of 128 LSTM cells 4600 of FIG. 46. In the example of FIG. 50, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. As in the example of FIGS. 47 and 48, the LSTM layer in the example of FIGS. 50 and 51 has only 128 LSTM cells 4600. However, in the example of FIG. 50, the values generated by all 512 neural processing units 126 (i.e., neural processing units 0 through 511) are used. When executing the non-architectural program of FIG. 51, each neural processing unit group 4901 operates collectively as one LSTM cell 4600.

As shown, the data RAM 122 holds cell input (X) and output (H) values for a sequence of time steps. More specifically, for a given time step, a pair of two rows holds the X values and the H values, respectively. In an embodiment in which the data RAM 122 has 64 rows, the data RAM 122 can hold the cell values for 31 different time steps, as shown. In the example of FIG. 50, rows 2 and 3 hold the values for time step 0, rows 4 and 5 hold the values for time step 1, and so on to rows 62 and 63, which hold the values for time step 30. The first row of a pair holds the X values of the time step, and the second row holds the H values of the time step. As shown, each group of four columns of the data RAM 122 that corresponds to a neural processing unit group 4901 holds the values for its corresponding LSTM cell 4600. That is, columns 0 through 3 hold the values associated with LSTM cell 0, whose computations are performed by neural processing units 0-3, i.e., by neural processing unit group 0; columns 4 through 7 hold the values associated with LSTM cell 1, whose computations are performed by neural processing units 4-7, i.e., by neural processing unit group 1; and so on to columns 508 through 511, which hold the values associated with LSTM cell 127, whose computations are performed by neural processing units 508-511, i.e., by neural processing unit group 127, as described in more detail with respect to FIG. 51 below. As shown, row 1 is unused, and row 0 holds the initial cell output (H) values, which in a preferred embodiment are populated with zero by the architectural program; however, the invention is not limited to this, and populating the initial cell output (H) values of row 0 by non-architectural program instructions is also within the scope of the invention.
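
The data RAM layout of FIG. 50 reduces to two small index calculations. The helper names below (`fig50_data_rows`, `group_columns`) are hypothetical and the sketch only restates the row and column numbers given in the text.

```python
def fig50_data_rows(t):
    """Return the (X row, H row) pair of the data RAM for time step t (0-based)."""
    return 2 + 2 * t, 3 + 2 * t


def group_columns(g, j=4):
    """Return the data RAM columns (and neural processing unit indices)
    that serve LSTM cell g when each group has j units."""
    return list(range(j * g, j * g + j))
```

With 64 rows and rows 0 and 1 set aside, 62 rows remain for 31 X/H pairs, which matches the 31 time steps stated above.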

Preferably, the X values (in rows 2, 4, 6 and so on to row 62) are written/populated into the data RAM 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of FIG. 51. Preferably, the H values (in rows 3, 5, 7 and so on to row 63) are written/populated into the data RAM 122 and read/used by the non-architectural program running on the neural network unit 121, as described below. Preferably, the H values are also read by the architectural program running on the processor 100 via MFNN instructions 1500. It should be noted that the non-architectural program of FIG. 51 assumes that, within each group of four columns corresponding to a neural processing unit group 4901 (e.g., columns 0-3, columns 4-7, columns 8-11 and so on to columns 508-511), the four X values in a given row are populated with the same value (e.g., by the architectural program). Similarly, the non-architectural program of FIG. 51 computes and writes the same value to the four H values in a given row within each group of four columns corresponding to a neural processing unit group 4901.

As shown, the weight random access memory 124 holds the weight, bias and cell state (C) values needed by the neural processing units of the neural network unit 121. Within each group of four columns corresponding to a neural processing unit group 4901 (e.g., columns 0-3, columns 4-7, columns 8-11 and so forth to columns 508-511): (1) the column whose index modulo 4 equals 3 holds the Wc, Uc, Bc and C values in rows 0, 1, 2 and 6, respectively; (2) the column whose index modulo 4 equals 2 holds the Wo, Uo and Bo values in rows 3, 4 and 5, respectively; (3) the column whose index modulo 4 equals 1 holds the Wf, Uf and Bf values in rows 3, 4 and 5, respectively; and (4) the column whose index modulo 4 equals 0 holds the Wi, Ui and Bi values in rows 3, 4 and 5, respectively. Preferably, the weight and bias values Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo and Bo (in rows 0 through 5) are written/populated into the weight random access memory 124 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 51. Preferably, the intermediate C values are written/populated into the weight random access memory 124 and read/used by the non-architectural program running on the neural network unit 121, as described below.
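The column-to-contents layout just described can be summarized in a short Python sketch. This is only an illustration of the stated layout; the function name `weight_ram_contents` and the string labels are assumptions made for this example and do not appear in the specification.

```python
# Hypothetical helper: maps a weight RAM column index to the values that the
# layout described above places in that column, keyed by row number.
def weight_ram_contents(column: int) -> dict[int, str]:
    lane = column % 4  # position of the column within its group of four
    if lane == 3:
        # Candidate-state weights and bias in rows 0-2, cell state C in row 6.
        return {0: "Wc", 1: "Uc", 2: "Bc", 6: "C"}
    # Lanes 0, 1 and 2 hold the input, forget and output gate parameters.
    suffix = {0: "i", 1: "f", 2: "o"}[lane]
    return {3: "W" + suffix, 4: "U" + suffix, 5: "B" + suffix}
```

For instance, column 511 (the last column of the last group) falls in lane 3 and therefore holds Wc, Uc, Bc and C.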

The example of Figure 50 assumes that the architectural program: (1) populates the data random access memory 122 with the input X values for 31 different time steps (rows 2, 4 and so forth to row 62); (2) starts the non-architectural program of Figure 51; (3) detects that the non-architectural program has completed; (4) reads the output H values from the data random access memory 122 (rows 3, 5 and so forth to row 63); and (5) repeats steps (1) through (4) as many times as needed to complete a task, e.g., the computations needed to recognize an utterance made by a mobile phone user.
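The row numbering used in steps (1) and (4) follows a simple pattern, sketched below in Python. The helper names `x_row` and `h_row` and the 0-based time step index are assumptions introduced for illustration only.

```python
NUM_TIME_STEPS = 31  # time steps handled per run in the Figure 50 example

def x_row(t: int) -> int:
    """Data RAM row holding the input X values for time step t (0-based)."""
    return 2 + 2 * t

def h_row(t: int) -> int:
    """Data RAM row receiving the output H values for time step t (0-based)."""
    return 3 + 2 * t

x_rows = [x_row(t) for t in range(NUM_TIME_STEPS)]  # rows 2, 4, ..., 62
h_rows = [h_row(t) for t in range(NUM_TIME_STEPS)]  # rows 3, 5, ..., 63
```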

In an alternative approach, the architectural program: (1) populates the data random access memory 122 with the input X values for a single time step (e.g., row 2); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 51 that does not loop and accesses only a single pair of rows of the data random access memory 122); (3) detects that the non-architectural program has completed; (4) reads the output H values from the data random access memory 122 (e.g., row 3); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable may depend on how the input X values to the LSTM layer are sampled. For example, if the task tolerates sampling the input over multiple time steps (e.g., on the order of 31 time steps) and then performing the computations, the first approach is preferable because it is likely to use computing resources more efficiently and/or perform better. If, however, the task tolerates sampling at only a single time step, the second approach is required.

A third embodiment is similar to the second approach. However, rather than using a single pair of rows of the data random access memory 122, the non-architectural program uses multiple pairs of rows, i.e., a different pair for each time step, which resembles the first approach. Preferably, the architectural program of the third embodiment includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data random access memory 122 row in the instruction at address 1 to point to the next pair of rows.

As shown, for neural processing units 0 through 511 of the neural network unit 121, the output buffer 1104 holds intermediate values of the cell output (H), candidate cell state (C'), input gate (I), forget gate (F), output gate (O), cell state (C) and tanh(C) after the execution of instructions at different addresses of the non-architectural program of Figure 51. Within each output buffer word group (e.g., the group of four words of the output buffer 1104 corresponding to a neural processing unit group 4901, such as words 0-3, 4-7, 8-11 and so forth to 508-511), the word whose index modulo 4 equals 3 is denoted OUTBUF[3], the word whose index modulo 4 equals 2 is denoted OUTBUF[2], the word whose index modulo 4 equals 1 is denoted OUTBUF[1], and the word whose index modulo 4 equals 0 is denoted OUTBUF[0].

As shown, after execution of the instruction at address 2 of the non-architectural program of Figure 51, for each neural processing unit group 4901, all four words of the output buffer 1104 are written with the initial cell output (H) value of the corresponding long short-term memory (LSTM) cell 4600. After execution of the instruction at address 6, for each neural processing unit group 4901, the OUTBUF[3] word of the output buffer 1104 is written with the candidate cell state (C') value of the corresponding LSTM cell 4600, and the other three words of the output buffer 1104 retain their previous values. After execution of the instruction at address 10, for each neural processing unit group 4901, the OUTBUF[0] word of the output buffer 1104 is written with the input gate (I) value of the corresponding LSTM cell 4600, the OUTBUF[1] word is written with its forget gate (F) value, the OUTBUF[2] word is written with its output gate (O) value, and the OUTBUF[3] word retains its previous value. After execution of the instruction at address 13, for each neural processing unit group 4901, the OUTBUF[3] word of the output buffer 1104 is written with the new cell state (C) value of the corresponding LSTM cell 4600 (the output buffer 1104, including the C value in slot 3, is also written to row 6 of the weight random access memory 124, as described below with respect to Figure 51), and the other three words of the output buffer 1104 retain their previous values. After execution of the instruction at address 14, for each neural processing unit group 4901, the OUTBUF[3] word of the output buffer 1104 is written with the tanh(C) value of the corresponding LSTM cell 4600, and the other three words retain their previous values. After execution of the instruction at address 16, for each neural processing unit group 4901, all four words of the output buffer 1104 are written with the new cell output (H) value of the corresponding LSTM cell 4600. The flow described for addresses 6 through 16 (i.e., excluding the execution at address 2, since address 2 is not part of the program loop) is repeated thirty more times as the loop instruction at address 17 returns to address 3.
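The masked update sequence above can be traced symbolically for a single output buffer word group. The following Python sketch is a simplified model, not the hardware behavior itself: the function `masked_write`, the mask sets, and the placeholder strings (such as "-" for values that are masked off and "H'" for the new cell output) are all assumptions introduced for illustration.

```python
def masked_write(group, new_values, masked):
    """Update one four-word output buffer group.

    Words whose slot index is in `masked` keep their old value; all other
    words take the corresponding entry of `new_values`.
    """
    return [old if slot in masked else new
            for slot, (old, new) in enumerate(zip(group, new_values))]

MASK_0_2 = {0, 1, 2}  # written MASK[0:2] in the program; the range is inclusive
MASK_3 = {3}          # written MASK[3] in the program
NO_MASK = set()

group = ["?", "?", "?", "?"]
group = masked_write(group, ["H"] * 4, NO_MASK)                    # address 2
group = masked_write(group, ["-", "-", "-", "C'"], MASK_0_2)       # address 6
after_6 = list(group)
group = masked_write(group, ["I", "F", "O", "-"], MASK_3)          # address 10
after_10 = list(group)
group = masked_write(group, ["-", "-", "-", "C"], MASK_0_2)        # address 13
group = masked_write(group, ["-", "-", "-", "tanh(C)"], MASK_0_2)  # address 14
after_14 = list(group)
group = masked_write(group, ["H'"] * 4, NO_MASK)                   # address 16
```

Note that the inclusive range in `MASK[0:2]` differs from Python slice notation, which is why the mask is modeled here as an explicit set of slot indices.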

Figure 51 is a table showing a program stored in the program memory 129 of the neural network unit 121. The program is executed by the neural network unit 121 of Figure 49 and uses the data and weights according to the arrangement of Figure 50 to accomplish the computations associated with an LSTM cell layer. The example program of Figure 51 includes 18 non-architectural instructions at addresses 0 through 17. The instruction at address 0 is an initialize instruction that clears the accumulator 202 and initializes the loop counter 3804 to a value of 31 so that the loop body (the instructions at addresses 3 through 17) is executed 31 times. The initialize instruction also initializes the data random access memory 122 row to be written (e.g., register 2606 of Figures 26/39) to a value of 1, which increases to 3 after the first execution of the instruction at address 16. Preferably, the initialize instruction also places the neural network unit 121 in a wide configuration, so that the neural network unit 121 is configured with 512 neural processing units 126. As described below, during execution of the instructions at addresses 0 through 17, the 128 neural processing unit groups 4901 formed by the 512 neural processing units 126 operate as 128 corresponding LSTM cells 4600.

The instructions at addresses 1 and 2 are outside the loop body and execute only once. They generate the initial cell output (H) value (e.g., 0) and write it to all words of the output buffer 1104. The instruction at address 1 reads the initial H values from row 0 of the data random access memory 122 and places them in the accumulator 202, which was cleared by the instruction at address 0. The instruction at address 2 (OUTPUT PASSTHRU, NOP, CLR ACC) passes the accumulator 202 value through to the output buffer 1104, as shown in Figure 50. The "NOP" designation in the output instruction at address 2 (and in other output instructions of Figure 51) indicates that the output value is written only to the output buffer 1104 and not to memory, i.e., neither to the data random access memory 122 nor to the weight random access memory 124. The instruction at address 2 also clears the accumulator 202.

The instructions at addresses 3 through 17 form the loop body and execute the number of times given by the loop count (e.g., 31).

Each execution of the instructions at addresses 3 through 6 computes the tanh(C') value for the current time step and writes it to word OUTBUF[3], which will be used by the instruction at address 11. More specifically, the multiply-accumulate instruction at address 3 reads the cell input (X) value associated with the time step from the current read row of the data random access memory 122 (e.g., rows 2, 4, 6 and so forth to row 62), reads the Wc values from row 0 of the weight random access memory 124, and multiplies them to generate a product that is added to the accumulator 202, which was cleared by the instruction at address 2.

The multiply-accumulate instruction at address 4 (MULT-ACCUM OUTBUF[0], WR ROW 1) reads the H value from word OUTBUF[0] (this is done by all four neural processing units 126 of the neural processing unit group 4901), reads the Uc values from row 1 of the weight random access memory 124, and multiplies them to generate a second product that is added to the accumulator 202.

The add-weight-word-to-accumulator instruction at address 5 (ADD_W_ACC WR ROW 2) reads the Bc values from row 2 of the weight random access memory 124 and adds them to the accumulator 202.

The output instruction at address 6 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) performs a hyperbolic tangent activation function on the accumulator 202 value and writes the result only to word OUTBUF[3] (i.e., only the neural processing unit 126 whose index modulo 4 equals 3 within the neural processing unit group 4901 writes its result), and the accumulator 202 is cleared. That is, the output instruction at address 6 masks words OUTBUF[0], OUTBUF[1] and OUTBUF[2] (as indicated by the MASK[0:2] nomenclature) so that they retain their current values, as shown in Figure 50. Additionally, the output instruction at address 6 does not write to memory (as indicated by the NOP nomenclature).
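The accumulator activity at addresses 3 through 6 can be mirrored step by step for the one neural processing unit of a group whose result is kept. The Python sketch below uses scalar stand-ins for the words read from memory; the function name `candidate_cell_state` and its parameter names are assumptions for illustration, and fixed-point details of the hardware are ignored.

```python
import math

def candidate_cell_state(x, h, wc, uc, bc):
    """Model addresses 3-6 for the NPU whose index modulo 4 equals 3."""
    acc = 0.0          # accumulator, previously cleared
    acc += x * wc      # address 3: X (data RAM) times Wc (weight RAM row 0)
    acc += h * uc      # address 4: H (OUTBUF[0]) times Uc (weight RAM row 1)
    acc += bc          # address 5: add Bc (weight RAM row 2)
    return math.tanh(acc)  # address 6: tanh result written to OUTBUF[3]
```

The other three neural processing units of the group perform the same arithmetic on their own weight words, but their results are discarded because MASK[0:2] leaves OUTBUF[0] through OUTBUF[2] unchanged.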

Each execution of the instructions at addresses 7 through 10 computes the input gate (I), forget gate (F) and output gate (O) values for the current time step and writes them to words OUTBUF[0], OUTBUF[1] and OUTBUF[2], respectively, where they will be used by the instructions at addresses 11, 12 and 15. More specifically, the multiply-accumulate instruction at address 7 reads the cell input (X) value associated with the time step from the current read row of the data random access memory 122 (e.g., rows 2, 4, 6 and so forth to row 62), reads the Wi, Wf and Wo values from row 3 of the weight random access memory 124, and multiplies them to generate a product that is added to the accumulator 202, which was cleared by the instruction at address 6. That is, within a neural processing unit group 4901, the neural processing unit 126 whose index modulo 4 equals 0 computes the product of X and Wi, the neural processing unit 126 whose index modulo 4 equals 1 computes the product of X and Wf, and the neural processing unit 126 whose index modulo 4 equals 2 computes the product of X and Wo.

The multiply-accumulate instruction at address 8 reads the H value from word OUTBUF[0] (this is done by all four neural processing units 126 of the neural processing unit group 4901), reads the Ui, Uf and Uo values from row 4 of the weight random access memory 124, and multiplies them to generate a second product that is added to the accumulator 202. That is, within a neural processing unit group 4901, the neural processing unit 126 whose index modulo 4 equals 0 computes the product of H and Ui, the neural processing unit 126 whose index modulo 4 equals 1 computes the product of H and Uf, and the neural processing unit 126 whose index modulo 4 equals 2 computes the product of H and Uo.

The add-weight-word-to-accumulator instruction at address 9 (ADD_W_ACC WR ROW 5) reads the Bi, Bf and Bo values from row 5 of the weight random access memory 124 and adds them to the accumulator 202. That is, within a neural processing unit group 4901, the neural processing unit 126 whose index modulo 4 equals 0 adds the Bi value, the neural processing unit 126 whose index modulo 4 equals 1 adds the Bf value, and the neural processing unit 126 whose index modulo 4 equals 2 adds the Bo value.

The output instruction at address 10 (OUTPUT SIGMOID, NOP, MASK[3], CLR ACC) performs a sigmoid activation function on the accumulator 202 value and writes the computed I, F and O values to words OUTBUF[0], OUTBUF[1] and OUTBUF[2], respectively. The instruction also clears the accumulator 202 and does not write to memory. That is, the output instruction at address 10 masks word OUTBUF[3] (as indicated by the MASK[3] nomenclature) so that it retains its current value (i.e., C'), as shown in Figure 50.
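Because each neural processing unit in a group reads a different weight word from the same row, addresses 7 through 10 compute the three gates in parallel. A lane-by-lane Python sketch of this is given below; the function `gate_values`, its argument names, and the use of four-element lists to stand for the words a group reads from rows 3, 4 and 5 are assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_values(x, h, w_row3, u_row4, b_row5):
    """Return (I, F, O) for one NPU group.

    Each list holds the four words the group reads from weight RAM rows
    3, 4 and 5. Lane 3 is a don't-care here because MASK[3] leaves
    OUTBUF[3] unchanged, so it is not computed in this sketch.
    """
    gates = []
    for lane in range(3):             # lanes 0, 1, 2 produce I, F, O
        acc = x * w_row3[lane]        # address 7
        acc += h * u_row4[lane]       # address 8
        acc += b_row5[lane]           # address 9
        gates.append(sigmoid(acc))    # address 10
    return tuple(gates)
```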

Each execution of the instructions at addresses 11 through 13 computes the new cell state (C) value generated by the current time step and writes it to row 6 of the weight random access memory 124 for use in the next time step (i.e., by the instruction at address 12 during the next loop iteration). More specifically, the value is written to the word whose index modulo 4 equals 3 among the four columns of row 6 corresponding to the neural processing unit group 4901. Additionally, each execution of the instruction at address 14 writes the tanh(C) value to OUTBUF[3] for use by the instruction at address 15.

More specifically, the multiply-accumulate instruction at address 11 (MULT-ACCUM OUTBUF[0], OUTBUF[3]) reads the input gate (I) value from word OUTBUF[0], reads the candidate cell state (C') value from word OUTBUF[3], and multiplies them to generate a first product that is added to the accumulator 202, which was cleared by the instruction at address 10. That is, each of the four neural processing units 126 of a neural processing unit group 4901 computes the first product of the I value and the C' value.

The multiply-accumulate instruction at address 12 (MULT-ACCUM OUTBUF[1], WR ROW 6) instructs the neural processing units 126 to read the forget gate (F) value from word OUTBUF[1], read their respective word from row 6 of the weight random access memory 124, and multiply them to generate a second product that is added to the first product placed in the accumulator 202 by the instruction at address 11. More specifically, for the neural processing unit 126 whose index modulo 4 equals 3 within the neural processing unit group 4901, the word read from row 6 is the current cell state (C) value computed in the previous time step, and the sum of the first and second products is the new cell state (C). For the other three neural processing units 126 of the neural processing unit group 4901, however, the words read from row 6 are don't-care values, because the accumulated values they produce will not be used: they are not placed in the output buffer 1104 by the instructions at addresses 13 and 14, and they are cleared by the instruction at address 14. That is, only the new cell state (C) value generated by the neural processing unit 126 whose index modulo 4 equals 3 within the neural processing unit group 4901 is used, namely by the instructions at addresses 13 and 14. For the second through thirty-first executions of the instruction at address 12, the C value read from row 6 of the weight random access memory 124 is the value written by the instruction at address 13 during the previous iteration of the loop body. For the first execution of the instruction at address 12, however, the C value in row 6 is an initial value written either by the architectural program before it starts the non-architectural program of Figure 51 or by a modified version of the non-architectural program.

The output instruction at address 13 (OUTPUT PASSTHRU, WR ROW 6, MASK[0:2]) passes the accumulator 202 value, i.e., the computed C value, through only to word OUTBUF[3] (that is, only the neural processing unit 126 whose index modulo 4 equals 3 within the neural processing unit group 4901 writes its computed C value to the output buffer 1104), and row 6 of the weight random access memory 124 is written with the updated output buffer 1104, as shown in Figure 50. That is, the output instruction at address 13 masks words OUTBUF[0], OUTBUF[1] and OUTBUF[2] so that they retain their current values (i.e., the I, F and O values). As described above, only the C value in the word whose index modulo 4 equals 3 among the four columns of row 6 corresponding to each neural processing unit group 4901 is used, namely by the instruction at address 12. Consequently, the non-architectural program ignores the values in columns 0-2, columns 4-6 and so forth to columns 508-510 of row 6 of the weight random access memory 124 (i.e., the I, F and O values), as shown in Figure 50.

The output instruction at address 14 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) performs a hyperbolic tangent activation function on the accumulator 202 value and writes the computed tanh(C) value to word OUTBUF[3]. The instruction also clears the accumulator 202 and does not write to memory. Like the output instruction at address 13, the output instruction at address 14 masks words OUTBUF[0], OUTBUF[1] and OUTBUF[2] so that they retain their original values, as shown in Figure 50.

Each execution of the instructions at addresses 15 through 16 computes the cell output (H) value generated by the current time step and writes it to the row of the data random access memory 122 that is two rows past the current output row. That value will be read by the architectural program and used in the next time step (i.e., by the instructions at addresses 3 and 7 during the next loop iteration). More specifically, the multiply-accumulate instruction at address 15 reads the output gate (O) value from word OUTBUF[2], reads the tanh(C) value from word OUTBUF[3], and multiplies them to generate a product that is added to the accumulator 202, which was cleared by the instruction at address 14. That is, each of the four neural processing units 126 of a neural processing unit group 4901 computes the product of O and tanh(C).

The output instruction at address 16 passes the accumulator 202 value through and writes the computed H values to row 3 on its first execution, to row 5 on its second execution, and so forth to row 63 on its thirty-first execution, as shown in Figure 50; these values are subsequently used by the instructions at addresses 4 and 8. Additionally, as shown in Figure 50, the computed H values are placed in the output buffer 1104 for subsequent use by the instructions at addresses 4 and 8. The output instruction at address 16 also clears the accumulator 202. In one embodiment, the LSTM cell 4600 is designed so that the output instruction at address 16 (and/or the output instruction at address 22 of Figure 48) applies an activation function, e.g., sigmoid or hyperbolic tangent, rather than passing the accumulator 202 value through.
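Taken together, one pass through addresses 3 to 16 performs one LSTM time step for each cell. The Python sketch below collects the arithmetic described above into a single function for one cell with scalar inputs. It is a floating-point illustration under stated assumptions, not the hardware data path: the function names `lstm_step` and `run_layer` and the dictionary of parameters are introduced only for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One loop iteration (addresses 3-16) for a single LSTM cell.

    `p` maps the weight and bias names used in the text (Wc, Uc, Bc,
    Wi, Ui, Bi, Wf, Uf, Bf, Wo, Uo, Bo) to scalar values.
    """
    c_cand = math.tanh(p["Wc"] * x + p["Uc"] * h_prev + p["Bc"])  # addresses 3-6
    i = sigmoid(p["Wi"] * x + p["Ui"] * h_prev + p["Bi"])         # addresses 7-10
    f = sigmoid(p["Wf"] * x + p["Uf"] * h_prev + p["Bf"])
    o = sigmoid(p["Wo"] * x + p["Uo"] * h_prev + p["Bo"])
    c = i * c_cand + f * c_prev                                   # addresses 11-13
    h = o * math.tanh(c)                                          # addresses 14-16
    return h, c

def run_layer(xs, h0, c0, p):
    """Apply lstm_step once per input, as the program loop does per time step."""
    h, c, hs = h0, c0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        hs.append(h)
    return hs, c
```

Calling `run_layer` with 31 inputs corresponds to one run of the looped program, with each returned H value standing for one of the rows 3, 5, and so forth to 63.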

The loop instruction at address 17 decrements the loop counter 3804 and returns to the instruction at address 3 if the new loop counter 3804 value is greater than zero.
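The decrement-then-test behavior of the loop instruction determines how many times the loop body runs. A minimal Python model follows; the function name `count_loop_iterations` is an assumption for illustration, and the loop body itself is reduced to a counter.

```python
def count_loop_iterations(initial_count):
    """Model the loop instruction: decrement, then branch back if the new count > 0."""
    counter = initial_count
    iterations = 0
    while True:
        iterations += 1   # loop body: instructions at addresses 3 through 16
        counter -= 1      # address 17 decrements the loop counter
        if counter <= 0:  # branch back to address 3 only while the new value is > 0
            break
    return iterations
```

With the loop counter initialized to 31 by the instruction at address 0, this model runs the body 31 times, which matches the 31 time steps of the example.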

As may be observed, owing to the feedback and masking capability of the output buffer 1104 in the neural network unit 121 embodiment of Figure 49, the number of instructions in the loop body of the non-architectural program of Figure 51 is approximately 34% less than that of the non-architectural program of Figure 48. Furthermore, owing to the same feedback and masking capability, the memory layout in the data random access memory 122 for the non-architectural program of Figure 51 accommodates approximately three times the number of time steps as that of Figure 48. These improvements may benefit certain architectural program applications that use the neural network unit 121 to perform LSTM cell layer computations, particularly applications in which the number of LSTM cells 4600 in the LSTM cell layer is less than or equal to 128.

The embodiments of Figures 47 through 51 assume that the weight and bias values remain the same across time steps. However, the invention is not limited thereto. Other embodiments in which the weight and bias values vary across time steps are also contemplated, in which the weight random access memory 124 is not populated with a single set of weight and bias values as shown in Figures 47 through 50, but is instead populated with a different set of weight and bias values for each time step, and the weight random access memory 124 addresses of the non-architectural programs of Figures 48 through 51 are modified accordingly.

Generally speaking, in the embodiments of Figures 47 through 51 described above, the weight, bias and intermediate values (e.g., the C and C' values) are stored in the weight random access memory 124, and the input and output values (e.g., the X and H values) are stored in the data random access memory 122. This arrangement is advantageous in embodiments in which the data random access memory 122 is dual-ported and the weight random access memory 124 is single-ported, because there is more traffic to the data random access memory 122 from the non-architectural and architectural programs. However, because the weight random access memory 124 is larger, in another embodiment of the invention the memories to which the non-architectural and architectural programs write their values are swapped (i.e., the data random access memory 122 and the weight random access memory 124 are swapped). That is, the W, U, B, C', tanh(C) and C values are stored in the data random access memory 122 and the X, H, I, F and O values are stored in the weight random access memory 124 (a modified embodiment of Figure 47); and the W, U, B and C values are stored in the data random access memory 122 and the X and H values are stored in the weight random access memory 124 (a modified embodiment of Figure 50). Because the weight random access memory 124 is larger, these embodiments can process more time steps in a batch. For architectural program applications that use the neural network unit 121 to perform computations, this may benefit applications that gain from a larger number of time steps and for which a single-ported memory (e.g., the weight random access memory 124) provides sufficient bandwidth.

Figure 52 is a block diagram showing an embodiment of the neural network unit 121 that has output buffer masking and feedback capability within neural processing unit groups and that shares activation function units 1112. The neural network unit 121 of Figure 52 is similar to the neural network unit 121 of Figure 47, and like-numbered elements are similar. However, the four activation function units 212 of Figure 49 are replaced in this embodiment by a single shared activation function unit 1112, which receives the four outputs 217 of the four accumulators 202 and generates four outputs to words OUTBUF[0], OUTBUF[1], OUTBUF[2] and OUTBUF[3]. The neural network unit 121 of Figure 52 operates in a manner similar to the embodiments described above with respect to Figures 49 through 51, and it operates the shared activation function unit 1112 in a manner similar to the embodiments described above with respect to Figures 11 through 13.

Figure 53 is a block diagram showing another embodiment of the data layout within the data random access memory 122, the weight random access memory 124, and the output buffer 1104 of the neural network unit 121 of Figure 49 when the neural network unit 121 performs the computations associated with a layer of 128 long short-term memory cells 4600 of Figure 46. The example of Figure 53 is similar to the example of Figure 50. However, in Figure 53 the Wi, Wf, and Wo values are in row 0 (rather than row 3 as in Figure 50); the Ui, Uf, and Uo values are in row 1 (rather than row 4 as in Figure 50); the Bi, Bf, and Bo values are in row 2 (rather than row 5 as in Figure 50); and the C values are in row 3 (rather than row 6 as in Figure 50). In addition, the contents of the output buffer 1104 in Figure 53 are similar to those in Figure 50. However, because of the differences between the non-architectural programs of Figure 54 and Figure 51, the contents of the third row (the I, F, O, and C' values) appear in the output buffer 1104 after execution of the instruction at address 7 (rather than the instruction at address 10 as in Figure 50); the contents of the fourth row (the I, F, O, and C values) appear in the output buffer 1104 after execution of the instruction at address 10 (rather than the instruction at address 13 as in Figure 50); the contents of the fifth row (the I, F, O, and tanh(C) values) appear in the output buffer 1104 after execution of the instruction at address 11 (rather than the instruction at address 14 as in Figure 50); and the contents of the sixth row (the H values) appear in the output buffer 1104 after execution of the instruction at address 13 (rather than the instruction at address 16 as in Figure 50), as described below.

Figure 54 is a table showing a program stored in the program memory 129 of the neural network unit 121. The program is executed by the neural network unit 121 of Figure 49 and uses data and weights according to the layout of Figure 53 to perform the computations associated with a long short-term memory cell layer. The example program of Figure 54 is similar to the program of Figure 51. More precisely, the instructions at addresses 0 through 5 are identical in Figures 54 and 51; the instructions at addresses 7 and 8 of Figure 54 are identical to the instructions at addresses 10 and 11 of Figure 51; and the instructions at addresses 10 through 14 of Figure 54 are identical to the instructions at addresses 13 through 17 of Figure 51.

However, the instruction at address 6 of Figure 54 does not clear the accumulator 202 (whereas the instruction at address 6 of Figure 51 does). In addition, the instructions at addresses 7 through 9 of Figure 51 are absent from the non-architectural program of Figure 54. Finally, the instruction at address 9 of Figure 54 is identical to the instruction at address 12 of Figure 51, except that the instruction at address 9 of Figure 54 reads row 3 of the weight random access memory 124 whereas the instruction at address 12 of Figure 51 reads row 6.

Because of the differences between the non-architectural programs of Figure 54 and Figure 51, the layout of Figure 53 uses three fewer rows of the weight random access memory 124, and the program loop contains three fewer instructions. The loop body of the non-architectural program of Figure 54 is essentially only half the size of the loop body of the non-architectural program of Figure 48, and approximately only 80% of the size of the loop body of the non-architectural program of Figure 51.
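The correspondence between the two programs described above can be summarized as a small lookup table. The following is only a sketch: the names `FIG54_TO_FIG51` and `dropped_fig51_addresses` are illustrative and do not appear in the source, and the table covers only the addresses the text mentions (0 through 17 of Figure 51).

```python
# Illustrative mapping (names are assumptions, not from the source):
# Figure 54 instruction address -> corresponding Figure 51 instruction address.
FIG54_TO_FIG51 = {}
for addr in range(0, 6):        # addresses 0-5 are identical in both programs
    FIG54_TO_FIG51[addr] = addr
FIG54_TO_FIG51[6] = 6           # same address, but Figure 54 does not clear the accumulator
FIG54_TO_FIG51[7] = 10          # identical instructions
FIG54_TO_FIG51[8] = 11
FIG54_TO_FIG51[9] = 12          # identical except weight RAM row 3 instead of row 6
for addr in range(10, 15):      # addresses 10-14 match Figure 51 addresses 13-17
    FIG54_TO_FIG51[addr] = addr + 3


def dropped_fig51_addresses(last_fig51_addr=17):
    """Figure 51 addresses that have no counterpart in Figure 54."""
    used = set(FIG54_TO_FIG51.values())
    return sorted(a for a in range(last_fig51_addr + 1) if a not in used)


print(dropped_fig51_addresses())
```

The unmapped addresses are the three Figure 51 instructions (addresses 7 through 9) that the text says are absent from Figure 54.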

Figure 55 is a block diagram showing portions of a neural processing unit 126 according to another embodiment of the invention. More precisely, for a single neural processing unit 126 of the plurality of neural processing units 126 of Figure 49, the figure shows the multiplexing register 208 and its associated inputs 207, 211, and 4905, and the multiplexing register 705 and its associated inputs 206, 711, and 4907. In addition to the inputs of Figure 49, the multiplexing register 208 and the multiplexing register 705 of the neural processing unit 126 each receive an index_within_group input 5599. The index_within_group input 5599 indicates the index of the particular neural processing unit 126 within its neural processing unit group 4901. Thus, for example, in an embodiment in which each neural processing unit group 4901 has four neural processing units 126, within each neural processing unit group 4901 one of the neural processing units 126 receives a value of zero on its index_within_group input 5599, one receives a value of one, one receives a value of two, and one receives a value of three. In other words, the value of the index_within_group input 5599 received by a neural processing unit 126 is the remainder of the index of that neural processing unit 126 within the neural network unit 121 divided by J, where J is the number of neural processing units 126 in a neural processing unit group 4901. Thus, for example, neural processing unit 73 receives a value of one on its index_within_group input 5599, neural processing unit 353 receives a value of three on its index_within_group input 5599, and neural processing unit 6 receives a value of two on its index_within_group input 5599.
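The remainder rule stated above can be written out directly. This is a minimal sketch; the function name is illustrative, and J = 4 is taken from the four-unit group example. Evaluated literally, the rule gives 1 for neural processing unit 73 and 2 for neural processing unit 6; for index 353 it evaluates to 353 mod 4 = 1.

```python
def index_within_group(npu_index: int, group_size: int) -> int:
    """Remainder of the NPU's index within the NNU divided by the group size J."""
    return npu_index % group_size


J = 4  # neural processing units per group in the example embodiment
for npu in (73, 6, 353):
    print(npu, index_within_group(npu, J))
```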

In addition, when the control input 213 specifies a predetermined value, denoted here as "SELF", the multiplexing register 208 selects the output buffer 1104 output 4905 that corresponds to the value of the index_within_group input 5599. Thus, when a non-architectural instruction specifies receiving data from the output buffer 1104 with a value of SELF (denoted OUTBUF[SELF] in the instructions at addresses 2 and 7 of Figure 57), the multiplexing register 208 of each neural processing unit 126 receives its corresponding word from the output buffer 1104. Thus, for example, when the neural network unit 121 executes the non-architectural instructions at addresses 2 and 7 of Figure 57, the multiplexing register 208 of neural processing unit 73 selects the second (index 1) of its four inputs 4905 to receive word 73 from the output buffer 1104, the multiplexing register 208 of neural processing unit 353 selects the fourth (index 3) of its four inputs 4905 to receive word 353 from the output buffer 1104, and the multiplexing register 208 of neural processing unit 6 selects the third (index 2) of its four inputs 4905 to receive word 6 from the output buffer 1104. Although not used in the non-architectural program of Figure 57, a non-architectural instruction may also specify receiving data from the output buffer 1104 with a value of SELF (OUTBUF[SELF]) in a manner that causes the control input 713 to specify the predetermined value, so that the multiplexing register 705 of each neural processing unit 126 receives its corresponding word from the output buffer 1104.
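The SELF selection can be sketched as a two-step lookup: first find the J words of the unit's own output buffer word group, then pick the one at index_within_group. The sketch below is illustrative only; the function name is hypothetical, and it assumes groups are formed from J consecutively indexed units, which is consistent with the remainder rule but is not spelled out here.

```python
def outbuf_self(outbuf, npu_index, group_size):
    """Word an NPU receives when its control input specifies SELF.

    Assumes (illustratively) that group g holds words g*J .. g*J + J - 1.
    """
    group_start = (npu_index // group_size) * group_size
    group_words = outbuf[group_start:group_start + group_size]  # the J inputs 4905
    return group_words[npu_index % group_size]  # selected by index_within_group


# Fill a 512-word buffer so that each word's value equals its own index.
outbuf = list(range(512))
print(outbuf_self(outbuf, 73, 4), outbuf_self(outbuf, 6, 4))
```

With the buffer filled this way, every unit reads back its own index, which is the sense in which each multiplexing register receives "its corresponding word".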

Figure 56 is a block diagram showing an example of the data layout within the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 when the neural network unit performs the computations associated with the Jordan recurrent neural network of Figure 43 using the embodiment of Figure 55. The layout of the weights within the weight random access memory 124 is the same as in the example of Figure 44. The layout of the values within the data random access memory 122 is similar to the example of Figure 44, except that in this example each time step has a corresponding pair of rows that hold the input layer node D values and the output layer node Y values, rather than a group of four rows as in the example of Figure 44. That is, in this example the hidden layer Z values and the context layer C values are not written to the data random access memory 122. Instead, the output buffer 1104 serves as a kind of scratchpad memory for the hidden layer Z values and the context layer C values, as described with respect to the non-architectural program of Figure 57. The OUTBUF[SELF] feedback feature of the output buffer 1104 described above can make the non-architectural program run faster (two writes to and two reads from the data random access memory 122 are replaced by two writes to and two reads from the output buffer 1104) and reduces the data random access memory 122 space used by each time step, so that the data held in the data random access memory 122 in this embodiment can serve approximately twice as many time steps as the embodiment of Figures 44 and 45, namely 32 time steps as shown in the figure.
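The two-rows-per-time-step layout can be expressed as simple index arithmetic. This is a hedged sketch: the function names are illustrative, and the figure of 64 data random access memory rows is an assumption inferred from the rows 0 through 63 used by the 32 time steps.

```python
ROWS_PER_TIME_STEP = 2   # one row of D values plus one row of Y values
DATA_RAM_ROWS = 64       # assumption: inferred from rows 0..63 being used


def data_ram_rows(time_step):
    """Return (row of input node D values, row of output node Y values)."""
    d_row = ROWS_PER_TIME_STEP * time_step
    y_row = d_row + 1
    return d_row, y_row


def max_time_steps(total_rows=DATA_RAM_ROWS, rows_per_step=ROWS_PER_TIME_STEP):
    """Number of time steps whose rows fit in the data RAM."""
    return total_rows // rows_per_step


print(data_ram_rows(0), data_ram_rows(1), data_ram_rows(31), max_time_steps())
```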

Figure 57 is a table showing a program stored in the program memory 129 of the neural network unit 121. The program is executed by the neural network unit 121 and uses data and weights according to the layout of Figure 56 to implement a Jordan recurrent neural network. The non-architectural program of Figure 57 is similar to the non-architectural program of Figure 45; the differences are described below.

The example program of Figure 57 has twelve non-architectural instructions, at addresses 0 through 11. The initialization instruction at address 0 clears the accumulator 202 and initializes the loop counter 3804 to 32, so that the loop body (the instructions at addresses 2 through 11) executes 32 times. The output instruction at address 1 places the zero values of the accumulator 202 (cleared by the instruction at address 0) into the output buffer 1104. It may be observed that the 512 neural processing units 126 correspond to and operate as the 512 hidden layer nodes Z during execution of the instructions at addresses 2 through 6, and correspond to and operate as the 512 output layer nodes Y during execution of the instructions at addresses 7 through 10. That is, the 32 executions of the instructions at addresses 2 through 6 compute the hidden layer node Z values of the 32 corresponding time steps and place them into the output buffer 1104. There they are used by the corresponding 32 executions of the instructions at addresses 7 through 9 to compute the output layer node Y values of the 32 time steps and write them to the data random access memory 122, and by the corresponding 32 executions of the instruction at address 10 to place the context layer node C values of the 32 time steps into the output buffer 1104. (The context layer node C values of the 32nd time step placed into the output buffer 1104 are not used.)

During the first execution of the instructions at addresses 2 and 3 (ADD_D_ACC OUTBUF[SELF] and ADD_D_ACC ROTATE, COUNT=511), each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values of the output buffer 1104, which were generated and written by execution of the instructions at addresses 0 and 1. During the second and subsequent executions of the instructions at addresses 2 and 3, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values of the output buffer 1104, which were generated and written by execution of the instructions at addresses 7 through 8 and 10. More precisely, the instruction at address 2 instructs the multiplexing register 208 of each neural processing unit 126 to select its corresponding output buffer 1104 word, as described above, and add it to the accumulator 202; the instruction at address 3 instructs the neural processing units 126 to rotate the context node C values in the 512-word rotator formed by the collective operation of the connected multiplexing registers 208 of the 512 neural processing units, so that each neural processing unit 126 can accumulate all 512 context node C values into its accumulator 202. The instruction at address 3 does not clear the accumulator 202, so the instructions at addresses 4 and 5 can add the input layer node D values (multiplied by their corresponding weights) to the context layer node C values accumulated by the instructions at addresses 2 and 3.
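The effect of the two instructions at addresses 2 and 3 can be illustrated with a small software model of the rotator. This is a sketch under stated assumptions, not the hardware design: the function name is hypothetical, the rotation direction (unit i takes the word held by unit i - 1) is chosen arbitrarily for illustration, and a 4-unit array stands in for the 512-unit one.

```python
def accumulate_with_rotation(outbuf_words):
    """Model ADD_D_ACC OUTBUF[SELF] followed by ADD_D_ACC ROTATE, COUNT=n-1."""
    n = len(outbuf_words)
    mux = list(outbuf_words)   # address 2: each unit latches its own OUTBUF word
    acc = list(mux)            # ... and adds it to its cleared accumulator
    for _ in range(n - 1):     # address 3: n - 1 rotate-and-add steps
        # Assumed direction: unit i receives the word held by unit i - 1.
        mux = [mux[(i - 1) % n] for i in range(n)]
        acc = [a + m for a, m in zip(acc, mux)]
    return acc


print(accumulate_with_rotation([1, 2, 3, 4]))
```

After n - 1 rotations every unit has seen every word exactly once, so each accumulator ends up holding the sum of all context values regardless of the rotation direction assumed.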

During each execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), each of the 512 neural processing units 126 performs 512 multiplications. It multiplies the 512 input node D values in the row of the data random access memory 122 associated with the current time step (for example, row 0 for time step 0, row 2 for time step 1, and so on through row 62 for time step 31) by the weights in the column corresponding to that neural processing unit 126 in rows 0 through 511 of the weight random access memory 124, generating 512 products. These products, together with the accumulation of the 512 context node C values performed by the instructions at addresses 2 and 3, are accumulated into the accumulator 202 of the respective neural processing unit 126 to compute the hidden layer node Z values.
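A small model can show how one data row, rotated past the units while a new weight row is read at each step, yields a full weighted sum per unit. The sketch below rests on assumptions that are not taken from the source: the rotation direction, the helper `skew_weights`, and in particular the skewed placement of weights within the weight rows, which is chosen here only so that the arithmetic works out to an ordinary matrix-vector product.

```python
def mult_accum_rotate(data_row, weight_rows, acc):
    """Model MULT-ACCUM (first weight row) then MULT-ACCUM ROTATE, COUNT=n-1."""
    n = len(data_row)
    mux = list(data_row)
    acc = list(acc)
    for r in range(n):
        if r > 0:
            # Assumed direction: unit j receives the word held by unit j - 1.
            mux = [mux[(j - 1) % n] for j in range(n)]
        # Unit j multiplies its current data word by column j of weight row r.
        acc = [acc[j] + mux[j] * weight_rows[r][j] for j in range(n)]
    return acc


def skew_weights(w):
    """Hypothetical layout: weight row r, column j holds w[(j - r) % n][j],
    where w[k][j] is the weight from input node k to hidden node j."""
    n = len(w)
    return [[w[(j - r) % n][j] for j in range(n)] for r in range(n)]


w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
d = [1, 10, 100]
print(mult_accum_rotate(d, skew_weights(w), [0, 0, 0]))
```

Passing a non-zero starting accumulator models the fact that the products are added on top of the context sum left by the instructions at addresses 2 and 3.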

During each execution of the instruction at address 6 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to their corresponding words of the output buffer 1104, and the accumulator 202 is cleared.

During each execution of the instructions at addresses 7 and 8 (MULT-ACCUM OUTBUF[SELF], WR ROW 512 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), each of the 512 neural processing units 126 performs 512 multiplications. It multiplies the 512 hidden node Z values in the output buffer 1104 (generated and written by the corresponding execution of the instructions at addresses 2 through 6) by the weights in the column corresponding to that neural processing unit 126 in rows 512 through 1023 of the weight random access memory 124, generating 512 products that are accumulated into the accumulator 202 of the respective neural processing unit 126.

During each execution of the instruction at address 9 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+2), an activation function (such as a hyperbolic tangent, sigmoid, or rectify function) is performed on the 512 accumulated values to compute the output node Y values, which are written to the row of the data random access memory 122 corresponding to the current time step (for example, row 1 for time step 0, row 3 for time step 1, and so on through row 63 for time step 31). The instruction at address 9 does not clear the accumulator 202.

During each execution of the instruction at address 10 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 values accumulated by the instructions at addresses 7 and 8 are placed into the output buffer 1104 for use by the next execution of the instructions at addresses 2 and 3, and the accumulator 202 is cleared.

The loop instruction at address 11 decrements the loop counter 3804 and, if the new loop counter 3804 value is still greater than zero, loops back to the instruction at address 2.
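The control flow of the twelve instructions walked through above can be condensed into a short functional model. This is only a sketch of the described data flow, not of the hardware: the function and parameter names are hypothetical, the rotator is abstracted into plain sums, and a tiny layer size replaces the 512 units.

```python
import math


def jordan_program(d_rows, w_in, w_out, activation=math.tanh):
    """Functional model of the program flow; d_rows holds one D vector per time step.

    w_in[k][j]: weight from input node k to hidden node j (illustrative naming).
    w_out[k][j]: weight from hidden node k to output node j.
    """
    n = len(d_rows[0])
    acc = [0.0] * n                      # address 0: clear the accumulators
    outbuf = list(acc)                   # address 1: zeros become the initial C values
    y_rows = []
    for d in d_rows:                     # address 11: one pass per time step
        c_sum = sum(outbuf)              # addresses 2-3: add all C values (no weights)
        acc = [a + c_sum for a in acc]
        acc = [acc[j] + sum(d[k] * w_in[k][j] for k in range(n))
               for j in range(n)]        # addresses 4-5: add the weighted D values
        outbuf = list(acc)               # address 6: Z values to OUTBUF, clear acc
        acc = [0.0] * n
        acc = [sum(outbuf[k] * w_out[k][j] for k in range(n))
               for j in range(n)]        # addresses 7-8: weighted Z values
        y_rows.append([activation(a) for a in acc])  # address 9: Y to data RAM
        outbuf = list(acc)               # address 10: pre-activation values become C
        acc = [0.0] * n
    return y_rows


identity = [[1, 0], [0, 1]]
print(jordan_program([[1, 2], [0, 0]], identity, identity, activation=lambda x: x))
```

Note that in this model, as in the walkthrough, the context values fed back at address 10 are the accumulated values before the activation function is applied.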

As described in the section corresponding to Figure 44, in the example of performing a Jordan recurrent neural network using the non-architectural program of Figure 57, although an activation function is applied to the accumulator 202 values to generate the output layer node Y values, the example assumes that the accumulator 202 values are passed to the context layer nodes C before the activation function is applied, rather than the actual output layer node Y values. However, for a Jordan recurrent neural network in which the activation function is applied to the accumulator 202 values to generate the context layer nodes C, the instruction at address 10 would be removed from the non-architectural program of Figure 57. In the embodiments described herein, the Elman or Jordan recurrent neural network has a single hidden node layer (as in Figures 40 and 42); however, it should be understood that embodiments of the processor 100 and the neural network unit 121 can, in a manner similar to that described herein, efficiently perform the computations associated with recurrent neural networks having multiple hidden layers.

As described above in the section corresponding to Figure 2, each neural processing unit 126 operates as a neuron of an artificial neural network, and all the neural processing units 126 of the neural network unit 121 compute the neuron output values of one layer of the network in a massively parallel fashion. This parallel approach of the neural network unit, particularly the rotator formed collectively by the multiplexing registers of the neural processing units, is not an intuitive extension of the conventional way of computing the outputs of a neuron layer. More specifically, conventional approaches typically perform the computations associated with a single neuron, or a very small subset of neurons (for example, using parallel arithmetic units to perform the multiplications and additions), then move on to the computations associated with the next neuron of the same layer, and so on in serial fashion until the computations have been performed for all the neurons of the layer. In contrast, in the present invention, during each clock cycle all the neural processing units 126 (neurons) of the neural network unit 121 perform in parallel a small subset (for example, a single multiply and accumulate) of the computations needed to generate all the neuron outputs. After approximately M clock cycles, where M is the number of nodes connected into the current layer, the neural network unit 121 has computed the outputs of all the neurons. In many artificial neural network configurations, because of the large number of neural processing units 126, the neural network unit 121 can compute the neuron output values for all the neurons of the entire layer by the end of the M clock cycles. As described herein, this computation is efficient for all types of artificial neural network computations, including but not limited to feedforward and recurrent neural networks such as Elman, Jordan, and long short-term memory networks. Finally, although embodiments are described herein in which the neural network unit 121 is configured as 512 neural processing units 126 (for example, in a wide word configuration) to perform recurrent neural network computations, the invention is not limited thereto. Embodiments in which the neural network unit 121 is configured as 1024 neural processing units 126 (for example, in a narrow word configuration) to perform recurrent neural network computations, as well as neural network units 121 having numbers of neural processing units 126 other than 512 and 1024 as described above, are also within the scope of the invention.

The foregoing describes only preferred embodiments of the invention and should not be taken to limit the scope in which the invention may be practiced; simple equivalent changes and modifications made in accordance with the claims and the description of the invention remain within the scope covered by this patent. For example, software can enable the function, fabrication, modeling, simulation, description, and/or testing of the devices and methods described herein. This can be accomplished through the use of general programming languages (such as C and C++), hardware description languages (HDL) including Verilog HDL, VHDL, and so on, or other available programs. Such software can be disposed in any known computer-usable medium, such as magnetic tape, semiconductor, magnetic disk, or optical disc (such as CD-ROM or DVD-ROM), or in a network, wireline, wireless, or other communications medium. Embodiments of the devices and methods described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (for example, embodied in a hardware description language), and transformed to hardware in the production of integrated circuits. In addition, the devices and methods described herein may be embodied as a combination of hardware and software. Therefore, none of the embodiments described herein is intended to limit the scope of the invention. In addition, the invention may be implemented within a microprocessor device of a general-purpose computer. Finally, those of ordinary skill in the art may use the concepts and embodiments disclosed herein as a basis for designing or modifying other structures that achieve the same purposes without departing from the scope of the invention.

100‧‧‧processor

101‧‧‧instruction fetch unit

102‧‧‧instruction cache

103‧‧‧architectural instruction

104‧‧‧instruction translator

105‧‧‧microinstruction

106‧‧‧rename unit

108‧‧‧reservation station

112‧‧‧other execution units

114‧‧‧memory subsystem

116‧‧‧general purpose register

118‧‧‧media register

121‧‧‧neural network unit

122‧‧‧data random access memory

126‧‧‧neural processing unit

127‧‧‧control and status register

128‧‧‧sequencer

129‧‧‧program memory

123, 125, 131‧‧‧memory address

133‧‧‧result

Claims

A device, comprising: an output buffer for holding N words, the N words being allocated to N/J mutually exclusive output buffer word groups, each output buffer word group having J of the N words, where J is greater than 2 and N is at least twice J; and an array of N processing units, the N processing units being allocated to N/J mutually exclusive processing unit groups, each processing unit group having J of the N processing units and corresponding to one of the N/J output buffer word groups, each of the processing units comprising: first and second multiplexing registers, each of the multiplexing registers comprising: at least J+1 inputs, a first input of the J+1 inputs receiving an operand from a memory, and the other J inputs of the J+1 inputs receiving the J words of the corresponding output buffer word group; an output; and a control input that controls which of the J+1 inputs is selected for provision on the output; an accumulator having an output that is provided to a corresponding one of the N output buffer words; and an arithmetic unit having first, second, and third inputs, the first and second inputs receiving the outputs of the first and second multiplexing registers respectively and the third input receiving the output of the accumulator, the arithmetic unit performing an operation on the first, second, and third inputs to generate a result for accumulation into the accumulator; wherein the output buffer includes a mask input that controls which of the N words retain their current values and which are updated with the outputs of their corresponding accumulators.

The device of claim 1, wherein the mask input specifies J values, each of the J values controlling whether a corresponding one of the J words of each output buffer word group retains its current value or is updated with the value of its corresponding accumulator.

The device as claimed in the patent application scope, wherein the mask input is generated in response to an instruction executed by the device.

The device of claim 3, further comprising: a program memory for holding a plurality of program instructions, the program instructions including the instruction whose execution generates the mask input; wherein the device is an execution unit of a processor, the processor includes the device, the processor has an architectural instruction set, and the instructions held in the program memory are non-architectural instructions distinct from the architectural instructions of the architectural instruction set of the processor.

The device of claim 5, further comprising: a plurality of activation function units that selectively perform an activation function on the outputs of the accumulators to generate results that are provided to the corresponding output buffer words of the N output buffer words.

The device as claimed in the patent application scope, further comprising: an iteration count, generated in response to an instruction executed by the device, the iteration count controlling the number of times the arithmetic unit performs the operation on the first, second, and third inputs to generate results that are accumulated into the accumulator.

The device as claimed in the scope of the patent application, further comprising: a first memory that provides a first set of N words to the first inputs of the corresponding first multiplexing registers; and a second memory that provides a second set of N words to the first inputs of the corresponding second multiplexing registers.

The device of claim 7, wherein the N words of the output buffer are writable to the first memory or the second memory.

The device as claimed in the scope of the patent application, wherein the control inputs of the first and second multiplexing registers are generated in response to an instruction executed by the device.

The device of claim 9, wherein the control input is provided to each of the N processing units in response to execution of an instruction.

The device of claim 1, wherein the first multiplexing register includes an offset input that receives the output of the first multiplexing register of an adjacent processing unit, and when the control input of the first multiplexing register specifies the offset input, the first multiplexing registers of the N processing units collectively operate as an N-word rotator.
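
The collective rotator behavior can be illustrated as follows. The direction of rotation is not stated in the claim; the sketch assumes that unit i receives the word held by unit i-1, with unit 0 receiving from unit N-1.

```python
def rotate_once(regs):
    """Every register takes its neighbor's value in the same step (assumed direction)."""
    n = len(regs)
    return [regs[(i - 1) % n] for i in range(n)]


def rotate(regs, steps):
    """Repeat the one-word rotation a given number of times."""
    for _ in range(steps):
        regs = rotate_once(regs)
    return regs


one_step = rotate_once([1, 2, 3, 4])
full_turn = rotate([1, 2, 3, 4], 4)
```

After N steps every word has visited every processing unit and the registers return to their starting contents.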

The device of claim 11, wherein the second multiplexing register includes an offset input that receives the output of the second multiplexing register of an adjacent processing unit, and when the control input of the second multiplexing register specifies the offset input, the second multiplexing registers of the N processing units collectively operate as an N-word rotator.

A processor, comprising:
  an execution unit, comprising:
    an output buffer that holds N words, the N words being arranged as N/J mutually exclusive output buffer word groups, each output buffer word group having J of the N words, wherein J is greater than 2 and N is at least twice J;
    an array of N processing units arranged as N/J mutually exclusive processing unit groups, each processing unit group having J of the N processing units and corresponding to one of the N/J output buffer word groups, each of the processing units comprising:
      first and second multiplexing registers, each of the multiplexing registers comprising:
        at least J+1 inputs, a first input of the J+1 inputs receiving an operand from a memory, and the other J inputs of the J+1 inputs receiving the J words of the corresponding output buffer word group;
        an output; and
        a control input that controls which of the J+1 inputs is selected and provided on the output;
      an accumulator having an output that is provided to the corresponding one of the N output buffer words; and
      an arithmetic unit having first, second, and third inputs, the first and second inputs receiving the outputs of the first and second multiplexing registers, respectively, and the third input receiving the output of the accumulator, the arithmetic unit performing an operation on the first, second, and third inputs to generate a result that is accumulated into the accumulator;
    wherein the output buffer includes a mask input that controls which of the N words retain their current values and which are updated with the output of their corresponding accumulator.

The processor of claim 13, wherein the mask input specifies J values, each of the J values controlling whether the corresponding one of the J words of each output buffer word group retains its current value or is updated with the value of its corresponding accumulator.

The processor of claim 13, wherein the mask input is generated in response to an instruction executed by the execution unit.

The processor of claim 15, wherein the execution unit further comprises a program memory that holds a plurality of program instructions, the program instructions including the instruction whose execution generates the mask input; wherein the processor has an architectural instruction set, and the instructions held in the program memory are non-architectural instructions distinct from the architectural instructions of the architectural instruction set of the processor.

The processor of claim 13, wherein the execution unit further comprises a plurality of activation function units that selectively perform an activation function on the output of the accumulator to generate a result that is provided to the corresponding one of the N output buffer words.

The processor of claim 13, wherein the execution unit further comprises an iteration count generated in response to an instruction executed by the execution unit, the iteration count controlling the number of times the arithmetic unit performs the operation on the first, second, and third inputs to generate a result that is accumulated into the accumulator.

The processor of claim 13, wherein the execution unit further comprises: a first memory that provides a first set of N words to the first inputs of the corresponding first multiplexing registers; and a second memory that provides a second set of N words to the first inputs of the corresponding second multiplexing registers.

The processor of claim 13, wherein the control inputs of the first and second multiplexing registers are generated in response to an instruction executed by the execution unit.

The processor of claim 13, wherein the first multiplexing register includes an offset input that receives the output of the first multiplexing register of an adjacent processing unit, and when the control input of the first multiplexing register specifies the offset input, the first multiplexing registers of the N processing units collectively operate as an N-word rotator.

The processor of claim 21, wherein the second multiplexing register includes an offset input that receives the output of the second multiplexing register of an adjacent processing unit, and when the control input of the second multiplexing register specifies the offset input, the second multiplexing registers of the N processing units collectively operate as an N-word rotator.

A computer program product encoded in at least one non-transitory computer-usable medium for use with a computing device, the computer program product comprising:
  computer-usable program code embodied in the medium for describing a device, the computer-usable program code comprising:
    first program code for describing an output buffer that holds N words, the N words being arranged as N/J mutually exclusive output buffer word groups, each output buffer word group having J of the N words, wherein J is greater than 2 and N is at least twice J;
    second program code for describing an array of N processing units arranged as N/J mutually exclusive processing unit groups, each processing unit group having J of the N processing units and corresponding to one of the N/J output buffer word groups, each of the processing units comprising:
      first and second multiplexing registers, each of the multiplexing registers comprising:
        at least J+1 inputs, a first input of the J+1 inputs receiving an operand from a memory, and the other J inputs of the J+1 inputs receiving the J words of the corresponding output buffer word group;
        an output; and
        a control input that controls which of the J+1 inputs is selected and provided on the output;
      an accumulator having an output that is provided to the corresponding one of the N output buffer words; and
      an arithmetic unit having first, second, and third inputs, the first and second inputs receiving the outputs of the first and second multiplexing registers, respectively, and the third input receiving the output of the accumulator, the arithmetic unit performing an operation on the first, second, and third inputs to generate a result that is accumulated into the accumulator;
    wherein the output buffer includes a mask input that controls which of the N words retain their current values and which are updated with the output of their corresponding accumulator.