CN108804139B - Programmable device, method of operation thereof, and computer usable medium

Info

Publication number: CN108804139B
Application number: CN201810620150.0A
Authority: CN (China)
Other versions: CN108804139A (application publication; Chinese (zh))
Prior art keywords: data, instruction, register, weight, value
Inventors: G·葛兰·亨利 (G. Glenn Henry), 泰瑞·派克斯 (Terry Parks)
Current assignee: Shanghai Zhaoxin Semiconductor Co Ltd
Original assignee: Shanghai Zhaoxin Integrated Circuit Co Ltd
Legal status: Active (granted)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30101Special purpose registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/322Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G06F9/325Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for loops, e.g. loop detection or loop counter
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/322Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G06F9/327Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for interrupts
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The present invention relates to a programmable device, a method of operating the same, and a computer usable medium. The programmable device includes: a program memory for holding instructions of a program fetched and executed by the device; a data memory for holding data processed by the instructions; and a status register that holds a status with the following fields: the program memory address from which the most recent instruction was fetched; the data memory address at which the device most recently accessed data; and a repeat count indicating the number of times the operation specified by the current program instruction remains to be performed. A condition register has condition fields corresponding to the status register fields. Control logic generates an interrupt request to a processing core in response to detecting that the status held in the status register satisfies the condition specified in the condition register.

Description

Programmable device, method of operation thereof, and computer usable medium
Technical Field
The invention relates to a neural network unit that interrupts a processing core in dependence on a condition.
Background
Recently, Artificial Neural Networks (ANNs) have attracted renewed interest, and such research is commonly referred to as deep learning, computer learning, and similar terms. The increase in the computing power of general-purpose processors has revived an interest that had subsided decades ago. Recent applications of ANNs include speech recognition and image recognition, among others. There is an increasing demand for improved performance and efficiency of the computations associated with ANNs.
Disclosure of Invention
A programmable device, comprising: an output for generating an interrupt request to a processing core coupled to the device; a program memory for holding instructions of a program fetched and executed by the device; a data memory for holding data processed by the instructions; a status register for holding a status of the device that is updated during operation of the device, the status having fields that include: the program memory address from which the most recent instruction was fetched; the data memory access address at which the device most recently accessed data in the data memory; and a repeat count indicating the number of times the operation specified by the current program instruction remains to be performed; a condition register having condition fields corresponding to the status fields held in the status register, wherein a condition including one or more of the condition fields can be written to the condition register via an instruction of the program; and control logic for generating the interrupt request on the output to the processing core in response to detecting that the status held in the status register satisfies the condition specified in the condition register.
A method of operating a device, the device comprising: a program memory for holding instructions of a program fetched and executed by the device; a data memory for holding data processed by the instructions; and a status register for holding a status of the device that is updated during operation of the device, wherein the status has fields including: the program memory address from which the most recent instruction was fetched; the data memory access address at which the device most recently accessed data in the data memory; and a repeat count indicating the number of times the operation specified by the current program instruction remains to be performed; the device further comprising a condition register having condition fields corresponding to the status fields held in the status register. The method comprises: writing, via an instruction of the program, a condition including one or more of the condition fields to the condition register; and generating an interrupt request to a processing core in response to detecting that the status held in the status register satisfies the condition specified in the condition register.
A non-transitory computer usable medium includes a computer usable program that causes a computer to function as each component in a processor according to the present application.
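The interplay of the status register, the condition register, and the control logic can be illustrated with a short behavioral sketch. The following Python model is an assumption-laden illustration, not the claimed hardware: the field names, the per-field enable mask, and the equality-match rule are all hypothetical.

# Hypothetical model of the claimed status/condition registers and control
# logic. Field names and the matching rule are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Status:
    pc: int = 0            # program memory address of the most recent fetch
    mem_addr: int = 0      # most recent data memory access address
    repeat_count: int = 0  # repetitions of the current instruction still to run

@dataclass
class Condition:
    enabled: dict = field(default_factory=dict)  # field name -> required value

    def satisfied_by(self, status: Status) -> bool:
        # The condition holds when every enabled field equals the status field.
        return all(getattr(status, f) == v for f, v in self.enabled.items())

def control_logic(status: Status, cond: Condition) -> bool:
    """Return True to assert the interrupt request output to the core."""
    return cond.satisfied_by(status)

# Example: interrupt once the program reaches address 4 with no repeats left.
cond = Condition(enabled={"pc": 4, "repeat_count": 0})
print(control_logic(Status(pc=4, mem_addr=17, repeat_count=0), cond))  # True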
Drawings
Fig. 1 is a block diagram illustrating a processor including a Neural Network Unit (NNU).
Fig. 2 is a block diagram illustrating the NPU of fig. 1.
FIG. 3 is a block diagram illustrating an embodiment of an arrangement of N multiplexing registers (mux-regs) of the N NPUs of the NNU of FIG. 1 to illustrate operation of the N multiplexing registers as an N-word rotator or circular shifter for a row of data words received from the data RAM of FIG. 1.
FIG. 4 is a table illustrating a program for storage in the program memory of the NNU of FIG. 1 and execution by the NNU.
FIG. 5 is a timing diagram illustrating the execution of the routine of FIG. 4 by an NNU.
FIG. 6A is a block diagram illustrating the NNU of FIG. 1 executing the routine of FIG. 4.
FIG. 6B is a flow diagram illustrating operation of the processor of FIG. 1 to perform an architectural procedure that uses NNUs to perform multiply-accumulate activation function computations (such as performed by the procedure of FIG. 4) typically associated with neurons of a hidden layer of an artificial neural network.
Fig. 7 is a block diagram illustrating the NPU of fig. 1 according to an alternative embodiment.
Fig. 8 is a block diagram illustrating the NPU of fig. 1 according to an alternative embodiment.
FIG. 9 is a table that illustrates a program for storing in the program memory of the NNU of FIG. 1 and for execution by the NNU.
FIG. 10 is a timing diagram illustrating the execution of the routine of FIG. 9 by an NNU.
FIG. 11 is a block diagram illustrating an embodiment of the NNU of FIG. 1. In the embodiment of FIG. 11, each neuron is divided into two portions, an activation function unit portion and an ALU portion (which also includes the shift register portion), and each activation function unit portion is shared by a plurality of ALU portions.
FIG. 12 is a timing diagram illustrating the NNUs of FIG. 11 executing the routine of FIG. 4.
FIG. 13 is a timing diagram illustrating the NNUs of FIG. 11 executing the routine of FIG. 4.
FIG. 14 is a block diagram illustrating a Move To Neural Network (MTNN) architecture instruction and the operation of the architecture instruction with respect to portions of the NNUs of FIG. 1.
Fig. 15 is a block diagram illustrating a Move From Neural Network (MFNN) architecture instruction and the operation of the architecture instruction with respect to portions of the NNUs of fig. 1.
FIG. 16 is a block diagram illustrating an embodiment of the data RAM of FIG. 1.
FIG. 17 is a block diagram illustrating an embodiment of the weight RAM and buffer of FIG. 1.
FIG. 18 is a block diagram illustrating the dynamically configurable NPU of FIG. 1.
FIG. 19 is a block diagram illustrating an embodiment of an arrangement of 2N multiplexing registers of the N NPUs of the NNU of FIG. 1 to illustrate operation of the 2N multiplexing registers as a rotator for a row of data words received from the data RAM of FIG. 1, in accordance with the embodiment of FIG. 18.
FIG. 20 is a table illustrating a program for storage in and execution by the NNUs of FIG. 1 having NPUs according to the embodiment of FIG. 18.
FIG. 21 is a timing diagram illustrating an NNU executing the program of FIG. 20, where the NNU includes the NPU of FIG. 18 operating in a narrow configuration.
Fig. 22 is a block diagram illustrating the NNU of fig. 1, wherein the NNU includes the NPU of fig. 18 to execute the routine of fig. 20.
FIG. 23 is a block diagram illustrating the dynamically configurable NPU of FIG. 1 in accordance with an alternative embodiment.
FIG. 24 is a block diagram illustrating an example of a data structure used by the NNUs of FIG. 1 to perform convolution operations.
FIG. 25 is a flow diagram illustrating operation of the processor of FIG. 1 to execute an architectural program that uses the NNU to perform a convolution of the convolution kernel of FIG. 24 with the data array.
FIG. 26A is a program listing of an NNU program that performs convolution on the data matrix using the convolution kernel of FIG. 24 and writes it back to the weight RAM.
FIG. 26B is a block diagram that illustrates certain fields of the control registers of the NNU of FIG. 1, according to one embodiment.
FIG. 27 is a block diagram illustrating an example of the weight RAM of FIG. 1 filled with input data to which the NNU of FIG. 1 performs a pooling (pooling) operation.
FIG. 28 is a program listing of an NNU program that pools the input data matrix of FIG. 27 and writes it back to the weight RAM.
FIG. 29A is a block diagram illustrating an embodiment of the control register of FIG. 1.
FIG. 29B is a block diagram illustrating an embodiment of the control register of FIG. 1 according to an alternative embodiment.
FIG. 29C is a block diagram illustrating the reciprocal value of FIG. 29A stored as two parts, according to one embodiment.
FIG. 30 is a block diagram illustrating an embodiment of the AFU of FIG. 2 in greater detail.
FIG. 31 is an example of the operation of the AFU of FIG. 30.
FIG. 32 is a second example of the operation of the AFU of FIG. 30.
FIG. 33 is a third example of the operation of the AFU of FIG. 30.
FIG. 34 is a block diagram illustrating portions of the processor of FIG. 1 and of the NNU of FIG. 1 in more detail.
FIG. 35 is a block diagram that illustrates a processor that includes a variable rate NNU.
FIG. 36A is a timing diagram that illustrates an example of operation of a processor having NNUs that operate in a normal mode, i.e., at a master clock rate.
FIG. 36B is a timing diagram that illustrates an example of operation of a processor having an NNU that operates in a relaxed mode, i.e., at a rate less than the master clock rate.
FIG. 37 is a flow chart illustrating operation of the processor of FIG. 35.
FIG. 38 is a block diagram illustrating the sequencer of the NNU in more detail.
FIG. 39 is a block diagram illustrating certain fields of the control and status registers of an NNU.
FIG. 40 is a block diagram illustrating an embodiment of portions of an NNU.
Fig. 41 is a block diagram showing a processor.
Fig. 42 is a block diagram illustrating the ring stop (ring stop) of fig. 41 in more detail.
FIG. 43 is a block diagram illustrating the slave interface of FIG. 42 in more detail.
FIG. 44 is a block diagram illustrating host interface 0 of FIG. 42 in more detail.
FIG. 45 is a block diagram illustrating portions of the ring stop of FIG. 42 and a ring bus coupling embodiment of an NNU.
FIG. 46 is a block diagram illustrating a ring bus coupling embodiment of an NNU.
FIG. 47 is a block diagram illustrating an embodiment of an NNU.
FIG. 48 is a block diagram illustrating the interrupt condition register of FIG. 47 in greater detail.
FIG. 49 is a block diagram illustrating the status register of FIG. 47 in greater detail.
FIG. 50 is a flowchart illustrating operation of the NNU of FIG. 47 to generate interrupt requests for cores based on conditions.
FIG. 51 is a table illustrating a program for storing in the program memory of the NNU of FIG. 47 and for execution by the NNU.
FIG. 52 is an instruction to set interrupt conditions for storage in and execution by the NNUs of FIG. 47 in accordance with an alternative embodiment.
FIG. 53 is a table illustrating a program for storing in the program memory of the NNU of FIG. 47 and for execution by the NNU.
Detailed Description
Processor with architectural neural network unit
Referring now to FIG. 1, a block diagram is shown illustrating a processor 100 including a Neural Network Unit (NNU) 121. Processor 100 includes an instruction fetch unit 101, an instruction cache 102, an instruction translator 104, a rename unit 106, reservation stations 108, media registers 118, General Purpose Registers (GPRs) 116, execution units 112 other than the NNU 121, and a memory subsystem 114.
Processor 100 is an electronic device that functions as a Central Processing Unit (CPU) on an integrated circuit. Processor 100 receives digital data as input, processes the data according to instructions fetched from memory, and generates as output the results of the operations specified by the instructions. The processor 100 may be used in a desktop computer, a mobile computer, or a tablet computer, and for purposes such as computation, word processing, multimedia display, and internet browsing. The processor 100 may also be provided in an embedded system to control a variety of devices including home appliances, mobile phones, smart phones, vehicles, industrial control devices, and the like. A CPU is the electronic circuitry (i.e., "hardware") that executes the instructions of a computer program (also referred to as a "computer application" or "application") by performing operations on data, including arithmetic operations, logical operations, and input/output operations. An Integrated Circuit (IC) is a set of electronic circuits fabricated on a small piece of semiconductor material, typically silicon. An IC is also referred to as a chip, a microchip, or a die.
Instruction fetch unit 101 controls the fetching of architectural instructions 103 from system memory (not shown) into instruction cache 102. Instruction fetch unit 101 provides to instruction cache 102 a fetch address that specifies the memory address from which processor 100 fetches a cache line of architectural instruction bytes into instruction cache 102. The fetch address is based on the current value of the instruction pointer (not shown), or program counter, of processor 100. Typically, the program counter is incremented sequentially by the instruction size unless a control instruction such as a branch, call, or return instruction is encountered in the instruction stream, or an exception condition such as an interrupt, trap, exception, or fault occurs, in which case the program counter is updated with a non-sequential address such as a branch target address, a return address, or an exception vector. Generally, the program counter is updated in response to the execution of instructions by the execution units 112/121. The program counter may also be updated in response to the detection of an exception condition, such as the instruction translator 104 encountering an instruction 103 that is not defined by the instruction set architecture of processor 100.
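The fetch-address selection just described can be summarized in a few lines. The following Python fragment is a simplified sketch of the priority among exception, redirect, and sequential next-address sources; the function and parameter names are illustrative, not part of the patent.

# Illustrative next-fetch-address selection (names are hypothetical).
def next_fetch_address(pc, insn_size, branch_target=None, exception_vector=None):
    if exception_vector is not None:   # interrupt, trap, exception, or fault
        return exception_vector
    if branch_target is not None:      # branch, call, or return redirect
        return branch_target
    return pc + insn_size              # default: sequential increment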
Instruction cache 102 caches architectural instructions 103 fetched from system memory coupled to processor 100. The architectural instructions 103 include Move To Neural Network (MTNN) instructions and Move From Neural Network (MFNN) instructions, which are described in more detail below. In one embodiment, the architectural instructions 103 are instructions of the x86 Instruction Set Architecture (ISA) with the addition of the MTNN and MFNN instructions. In the context of the present invention, an x86 ISA processor is a processor that, when executing the same machine language instructions, generates the same results at the instruction set architecture level as an Intel® 80386® processor. However, other embodiments contemplate other instruction set architectures, such as the Advanced RISC Machines (ARM®) and SUN SPARC® architectures, among others. The instruction cache 102 provides the architectural instructions 103 to the instruction translator 104, and the instruction translator 104 translates the architectural instructions 103 into microinstructions 105.
The microinstructions 105 are provided to the rename unit 106 and ultimately executed by the execution unit 112/121. The microinstructions 105 implement architectural instructions. Preferably, the instruction translator 104 includes a first portion that translates frequently executed and/or relatively less complex architectural instructions 103 into microinstructions 105. Instruction translator 104 also includes a second portion, where the second portion includes a microcode unit (not shown). The microcode unit includes a microcode memory that holds microcode instructions that implement complex and/or infrequently used instructions of the architectural instruction set. The microcode unit also includes a micro sequencer (micro-sequencer) that provides a non-architected micro program counter (micro-PC) to the microcode memory. Preferably, microcode instructions are translated into microinstructions 105 via a micro-translator (not shown). The selector selects a microinstruction 105 from either the first portion or the second portion to provide to the rename unit 106 depending on whether the microcode unit currently has control.
Rename unit 106 renames architectural registers specified in architectural instructions 103 to physical registers of processor 100. Preferably, processor 100 includes a reorder buffer (not shown). The rename unit 106 allocates entries in a reorder buffer in program order for each microinstruction 105. This enables the processor 100 to retire (retire) the microinstruction 105 and its corresponding architectural instruction 103 in program order. In one embodiment, media registers 118 have a 256-bit width and GPRs 116 have a 64-bit width. In one embodiment, the media registers 118 are x86 media registers such as advanced vector extension (AVX) registers.
In one embodiment, each entry of the reorder buffer includes storage space for the result of the microinstruction 105; further, processor 100 includes an architectural register file that includes a physical register for each architectural register (e.g., the media registers 118, the GPRs 116, and the other architectural registers). (Preferably, there are separate register files for the media registers 118 and the GPRs 116, e.g., due to their different sizes.) For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with the reorder buffer index of the newest older microinstruction 105 that writes to that architectural register. When the execution unit 112/121 completes execution of the microinstruction 105, the execution unit 112/121 writes the result to the reorder buffer entry of the microinstruction 105. When the microinstruction 105 retires, a retirement unit (not shown) writes the result from the microinstruction's reorder buffer entry to the register of the physical register file associated with the architectural destination register specified by the retiring microinstruction 105.
In another embodiment, processor 100 includes a physical register file that includes a greater number of physical registers than the number of architectural registers; in this embodiment, the reorder buffer entries do not include result storage space and there is no separate architectural register file. (Preferably, there are separate physical register files for the media registers 118 and the GPRs 116, e.g., due to their different sizes.) Processor 100 also includes a pointer table with an associated pointer for each architectural register. For the destination operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the destination operand field of the microinstruction 105 with a pointer to a free register in the physical register file. If no register in the physical register file is free, rename unit 106 stalls the pipeline. For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with a pointer to the register in the physical register file assigned to the newest older microinstruction 105 that writes to that architectural register. When the execution unit 112/121 completes execution of the microinstruction 105, it writes the result to the register in the physical register file pointed to by the destination operand field of the microinstruction 105. When the microinstruction 105 retires, the retirement unit copies the destination operand field value of the microinstruction 105 to the pointer in the pointer table associated with the architectural destination register specified by the retiring microinstruction 105.
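As a loose illustration of this second renaming scheme, consider the following Python sketch. It is a simplification under stated assumptions, not the patent's design: it keeps a single speculative mapping table, whereas the text above describes the pointer table being updated at retirement.

# Simplified physical-register-file renaming (illustrative sketch).
class Renamer:
    def __init__(self, num_arch, num_phys):
        self.mapping = list(range(num_arch))              # arch reg -> phys reg
        self.free_list = list(range(num_arch, num_phys))  # unallocated phys regs

    def rename(self, src_regs, dst_reg):
        # Sources read the newest mapping for their architectural register.
        phys_srcs = [self.mapping[s] for s in src_regs]
        if not self.free_list:
            raise RuntimeError("no free physical register: stall the pipeline")
        phys_dst = self.free_list.pop()   # destination gets a free register
        self.mapping[dst_reg] = phys_dst  # later readers see the new mapping
        return phys_srcs, phys_dst

r = Renamer(num_arch=16, num_phys=32)
print(r.rename(src_regs=[1, 2], dst_reg=1))  # ([1, 2], 31); reg 1 now maps to 31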
Reservation station 108 holds microinstructions 105 until they are ready to be issued to execution unit 112/121 for execution. The microinstructions 105 are ready to issue when all of the source operands of the microinstructions 105 are available and the execution unit 112/121 is available to execute the microinstructions 105. Execution unit 112/121 receives register source operands from a reorder buffer or an architectural register file as described in the first embodiment above, or from a physical register file as described in the second embodiment above. Furthermore, execution units 112/121 may receive register source operands directly from execution units 112/121 via a result forwarding bus (not shown). Additionally, the execution unit 112/121 may receive immediate operands specified by the microinstructions 105 from the reservation station 108. As described in more detail below, the MTNN and MFNN architecture instruction 103 includes immediate operands for specifying a function to be performed by the NNU 121, where the function is provided in one of the one or more microinstructions 105 into which the MTNN and MFNN architecture instruction 103 is translated.
Execution units 112 include one or more load/store units (not shown) that load data from memory subsystem 114 and store data to memory subsystem 114. Memory subsystem 114 preferably includes a memory management unit (not shown), which may include, for example, a translation lookaside buffer and a tablewalk unit, a level-1 data cache (in addition to instruction cache 102), a level-2 unified cache, and a bus interface unit for interfacing processor 100 with system memory. In one embodiment, the processor 100 of FIG. 1 is representative of one processing core among a plurality of processing cores sharing a last-level cache memory in a multi-core processor. Execution units 112 may also include integer units, media units, floating-point units, and branch units.
NNU 121 includes a weight Random Access Memory (RAM) 124, a data RAM 122, N Neural Processing Units (NPUs) 126, a program memory 129, a sequencer 128, and control and status registers 127. The NPUs 126 conceptually function as the neurons of a neural network. The weight RAM 124, data RAM 122, and program memory 129 may each be written via the MTNN architectural instruction 103 and read via the MFNN architectural instruction 103. The weight RAM 124 is arranged as W rows of N weight words each, and the data RAM 122 is arranged as D rows of N data words each. Each data word and each weight word is a plurality of bits, preferably 8, 9, 12, or 16 bits. Each data word serves as the output value (sometimes also referred to as an activation) of a neuron of the previous layer in the network, and each weight word serves as a weight associated with a connection into a neuron of the current layer of the network. Although in many applications of NNU 121 the words or operands held in the weight RAM 124 are in fact weights associated with connections into neurons, it should be understood that in other applications of NNU 121 the words held in the weight RAM 124 are not weights, yet are still referred to as "weight words" because they are stored in the weight RAM 124. For example, in certain applications of NNU 121, such as the convolution example of FIGS. 24-26A or the pooling example of FIGS. 27-28, the weight RAM 124 may hold non-weights, such as the elements of a data matrix (e.g., image pixel data). Likewise, although in many applications of NNU 121 the words or operands held in the data RAM 122 are in fact the output or activation values of neurons, it should be understood that in other applications of NNU 121 the words held in the data RAM 122 are not, yet are still referred to as "data words" because they are stored in the data RAM 122. For example, in certain applications of NNU 121, such as the convolution example of FIGS. 24-26A, the data RAM 122 may hold non-neuron outputs, such as the elements of a convolution kernel.
In one embodiment, the NPUs 126 and the sequencer 128 comprise combinational logic, sequential logic, state machines, or a combination thereof. An architectural instruction (e.g., an MFNN instruction 1500) loads the contents of the status register 127 into one of the GPRs 116 to determine the status of NNU 121, e.g., that NNU 121 has completed a command or has completed running a program from the program memory 129, or that NNU 121 is free to receive a new command or to start a new NNU program.
Advantageously, the number of NPUs 126 may be increased as desired, and the size of the weight RAM 124 and data RAM 122 may be expanded in width and depth accordingly. Preferably, the weight RAM 124 is the larger of the two because, in a typical neural network layer, there are many connections, and therefore many weights, associated with each neuron. Various embodiments are described herein regarding the size of the data words and weight words, the sizes of the weight RAM 124 and data RAM 122, and the number of NPUs 126. In one embodiment, an NNU 121 with a 64KB (8192 bits x 64 rows) data RAM 122, a 2MB (8192 bits x 2048 rows) weight RAM 124, and 512 NPUs 126 is implemented in the Taiwan Semiconductor Manufacturing Company (TSMC) 16 nanometer process and occupies an area of approximately 3.3 square millimeters.
Sequencer 128 fetches instructions from the program memory 129 and executes them, and also generates address and control signals to provide to the data RAM 122, weight RAM 124, and NPUs 126. Sequencer 128 generates a memory address 123 and a read command to provide to the data RAM 122 to select one of the D rows of N data words each to provide to the N NPUs 126. Sequencer 128 also generates a memory address 125 and a read command to provide to the weight RAM 124 to select one of the W rows of N weight words each to provide to the N NPUs 126. The sequence of the addresses 123 and 125 that sequencer 128 generates for the NPUs 126 determines the "connections" between the neurons. Sequencer 128 also generates a memory address 123 and a write command to provide to the data RAM 122 to select one of the D rows of N data words each to be written from the N NPUs 126. Sequencer 128 also generates a memory address 125 and a write command to provide to the weight RAM 124 to select one of the W rows of N weight words each to be written from the N NPUs 126. Sequencer 128 also generates a memory address 131 for the program memory 129 to select an NNU instruction that is provided to sequencer 128, as described below. The memory address 131 corresponds to a program counter (not shown) that sequencer 128 typically increments through sequential locations of the program memory 129, unless sequencer 128 encounters a control instruction, such as a loop instruction (see, e.g., FIG. 26A), in which case sequencer 128 updates the program counter to the target address of the control instruction. Sequencer 128 also generates control signals for the NPUs 126 to instruct them to perform various operations or functions, such as initialization, arithmetic/logical operations, rotate and shift operations, activation functions, and write-back operations, examples of which are described in more detail below (see, e.g., micro-operation 3418 of FIG. 34).
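The sequencer's control flow can be sketched compactly. The instruction encoding below (an Insn class with op, target, and count fields) is purely hypothetical; the sketch only illustrates how a program counter advances sequentially except at a loop control instruction, as described above.

from dataclasses import dataclass

@dataclass
class Insn:                 # hypothetical NNU instruction encoding
    op: str
    target: int = 0         # branch target of a loop control instruction
    count: int = 0          # remaining loop iterations

def run(program):
    """Trace the sequencer's program counter (returns the sequence of PCs)."""
    pc, trace = 0, []
    while pc < len(program):
        insn = program[pc]
        trace.append(pc)
        if insn.op == "loop" and insn.count > 0:
            insn.count -= 1
            pc = insn.target    # control instruction: jump to its target
        else:
            pc += 1             # otherwise the program counter increments
    return trace

prog = [Insn("mult-accum"), Insn("loop", target=0, count=2), Insn("write-output")]
print(run(prog))  # [0, 1, 0, 1, 0, 1, 2]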
The N NPUs 126 generate N result words 133, which may be written back to a row of the weight RAM 124 or of the data RAM 122. Preferably, the weight RAM 124 and the data RAM 122 are coupled directly to the N NPUs 126. More specifically, the weight RAM 124 and the data RAM 122 are dedicated to the NPUs 126 and are not shared by the other execution units 112 of processor 100, and the NPUs 126 are able to consume a row from one or both of the weight RAM 124 and the data RAM 122 on each clock cycle in a sustained manner, preferably in a pipelined manner. In one embodiment, the data RAM 122 and the weight RAM 124 are each capable of providing 8192 bits to the NPUs 126 on each clock cycle. As described in more detail below, the 8192 bits may be consumed as 512 16-bit words or as 1024 8-bit words.
Advantageously, the size of the data sets that NNU 121 can process is not limited by the sizes of the weight RAM 124 and the data RAM 122, but only by the size of system memory, since data and weights may be moved between system memory and the weight RAM 124 and data RAM 122 using the MTNN and MFNN instructions (e.g., via the media registers 118). In one embodiment, the data RAM 122 is dual-ported, enabling data words to be written to the data RAM 122 in parallel with data words being read from or written to the data RAM 122. Furthermore, the large memory hierarchy of memory subsystem 114, including the cache memories, provides very high data bandwidth for transfers between system memory and NNU 121. Still further, memory subsystem 114 preferably includes a hardware data prefetcher that tracks memory access patterns (such as loads of neural data and weights from system memory) and performs data prefetching into the cache hierarchy to facilitate high-bandwidth, low-latency transfers to the weight RAM 124 and the data RAM 122.
Although embodiments are described in which one of the operands provided to each NPU 126 is supplied from a weight memory and is referred to as a weight (a term commonly used in neural networks), it should be understood that the operands may be other types of data associated with computations whose speed may be improved by the apparatus.
Referring now to FIG. 2, a block diagram is shown illustrating the NPU 126 of FIG. 1. The NPU 126 operates to perform many functions or operations. In particular, the NPU 126 is advantageously configured to operate as a neuron, or node, in an artificial neural network to perform a classic multiply-accumulate function or operation. That is, generally speaking, the NPU 126 (neuron) is configured to: (1) receive an input value from each neuron having a connection to the NPU 126, typically but not necessarily from the immediately preceding layer of the artificial neural network; (2) multiply each input value by a respective weight value associated with the connection to produce a product; (3) add all the products to produce a sum; and (4) perform an activation function on the sum to produce the output of the neuron. However, rather than performing all the multiplications associated with all the connection inputs and then adding all the products together as in a conventional design, each neuron is advantageously configured to perform, in a given clock cycle, the multiply operation associated with one of the connection inputs and then add (accumulate) the product to the accumulated value of the products associated with the connection inputs processed in the clock cycles up to that point. Assuming there are M connections to the neuron, after all M products have been accumulated (which takes approximately M clock cycles), the neuron performs the activation function on the accumulated value to produce the output, or result. This has the advantage that fewer multipliers are needed within the neuron, along with a smaller, simpler, and faster adder circuit (e.g., a 2-input adder), compared to the adder that would be needed to add all (or even a subset) of the products associated with all the connection inputs. This, in turn, makes it feasible to implement a very large number (N) of neurons (NPUs 126) within the NNU 121, such that after approximately M clock cycles the NNU 121 has produced the outputs of all N of these neurons. Finally, an NNU 121 constructed of such neurons has the advantage of performing efficiently as an artificial neural network layer across a wide range of connection-input counts. That is, as M increases or decreases for different layers, the number of clock cycles required to generate the neuron outputs increases or decreases correspondingly, and the resources (e.g., the multipliers and accumulators) are fully utilized; whereas in a more conventional design, some of the multipliers and part of the adder would go unused for smaller values of M. Thus, the embodiments described herein have the benefit of flexibility and efficiency with respect to the number of connection inputs to the neurons of the NNU 121, and provide extremely high performance.
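The time-multiplexed scheme just described reduces, per neuron, to one multiply and one 2-input add per clock. The following Python function is a behavioral sketch of steps (1) through (4) under that scheme; the sigmoid activation is an arbitrary example choice, not mandated by the text above.

import math

def neuron_output(inputs, weights, activation=lambda s: 1 / (1 + math.exp(-s))):
    # One multiply-accumulate per "clock": M connection inputs take about M steps.
    acc = 0.0
    for x, w in zip(inputs, weights):   # steps (1)-(3): multiply and accumulate
        acc += x * w
    return activation(acc)              # step (4): activation on the final sum

print(neuron_output([0.5, -1.0, 2.0], [0.1, 0.4, 0.25]))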
NPU 126 includes a register 205, a 2-input multiplexing register (mux-reg) 208, an Arithmetic Logic Unit (ALU) 204, an accumulator 202, and an Activation Function Unit (AFU) 212. The register 205 receives a weight word 206 from the weight RAM 124 and provides its output 203 on a subsequent clock cycle. The multiplexing register 208 selects one of its inputs 207 or 211 to store in its register and then provide on its output 209 on a subsequent clock cycle. The input 207 receives a data word from the data RAM 122. The other input 211 receives the output 209 of the adjacent NPU 126. The NPU 126 shown in FIG. 2 is denoted NPU J of the N NPUs 126 of FIG. 1. That is, NPU J is a representative instance of the N NPUs 126. Preferably, the input 211 of the multiplexing register 208 of NPU J receives the output 209 of the multiplexing register 208 of NPU instance J-1, and the output 209 of the multiplexing register 208 of NPU J is provided to the input 211 of the multiplexing register 208 of NPU instance J+1. In this way, the multiplexing registers 208 of the N NPUs 126 collectively operate as an N-word rotator, or circular shifter, as described in more detail below with respect to FIG. 3. A control input 213 controls which of the two inputs the multiplexing register 208 selects to store in its register and subsequently provide on the output 209.
The ALU 204 has three inputs. One input receives the weight word 203 from the register 205. Another input receives the output 209 of the multiplexing register 208. A further input receives the output 217 of the accumulator 202. The ALU 204 performs arithmetic and/or logical operations on its inputs to produce a result provided on its output. Preferably, the arithmetic and/or logical operations performed by the ALU 204 are specified by instructions stored in the program memory 129. For example, the multiply-accumulate instruction of FIG. 4 specifies a multiply-accumulate operation, i.e., the result 215 is the sum of the value 217 of the accumulator 202 and the product of the weight word 203 and the data word on the output 209 of the multiplexing register 208. Other operations that may be specified include, but are not limited to: the result 215 is the pass-through value of the multiplexing register output 209; the result 215 is the pass-through value of the weight word 203; the result 215 is zero; the result 215 is the sum of the value 217 of the accumulator 202 and the weight word 203; the result 215 is the sum of the value 217 of the accumulator 202 and the multiplexing register output 209; the result 215 is the maximum of the value 217 of the accumulator 202 and the weight word 203; and the result 215 is the maximum of the value 217 of the accumulator 202 and the multiplexing register output 209.
The ALU 204 provides its output 215 to the accumulator 202 for storage therein. The ALU 204 includes a multiplier 242 that multiplies the weight word 203 by the data word on the output 209 of the multiplexing register 208 to produce a product 246. In one embodiment, multiplier 242 multiplies two 16-bit operands to produce a 32-bit result. The ALU 204 also includes an adder 244 that adds the product 246 to the output 217 of the accumulator 202 to produce a sum, which is the result 215 stored in the accumulator 202. In one embodiment, the adder 244 adds the 32-bit result of the multiplier 242 to the 41-bit value 217 of the accumulator 202 to produce a 41-bit result. In this way, using the rotator aspect of the multiplexing register 208 over the course of multiple clock cycles, the NPU 126 accomplishes the summing of products required by the neural network for a neuron. The ALU 204 may also include other circuit elements to perform the other arithmetic/logical operations described above. In one embodiment, a second adder subtracts the weight word 203 from the data word on the output 209 of the multiplexing register 208 to produce a difference, which the adder 244 then adds to the output 217 of the accumulator 202 to produce a sum 215, which is the result stored in the accumulator 202. In this manner, the NPU 126 can accomplish the summing of differences over the course of multiple clock cycles. Preferably, the weight word 203 is the same size (in bits) as the data word 209, although they may have different binary point locations, as described in more detail below. Preferably, as described in more detail below, the multiplier 242 and adder 244 are integer multipliers and adders, advantageously yielding an ALU 204 that is less complex, smaller, faster, and lower power than floating-point multipliers and adders. However, it should be understood that in other embodiments the ALU 204 performs floating-point operations.
Although FIG. 2 shows only the multiplier 242 and adder 244 within the ALU 204, the ALU 204 preferably includes other elements to perform the other operations described above. For example, the ALU 204 preferably includes a comparator (not shown) for comparing the accumulator 202 with a data/weight word, and a multiplexer (not shown) for selecting the larger (maximum) of the two values indicated by the comparator for storage in the accumulator 202. As another example, the ALU 204 preferably includes selection logic (not shown) that bypasses the multiplier 242 with the data/weight word, enabling the adder 244 to add the data/weight word to the value 217 of the accumulator 202 to produce a sum for storage in the accumulator 202. These additional operations are described in more detail below (e.g., with respect to FIGS. 18-29A) and may be used to perform, for example, convolution and pooling operations.
AFU 212 receives the output 217 of the accumulator 202. AFU 212 performs an activation function on the output 217 of the accumulator 202 to produce the result 133 of FIG. 1. Generally speaking, the activation function in a neuron of an intermediate layer of an artificial neural network serves to normalize the accumulated sum of products, preferably in a non-linear fashion. To "normalize" the sum, the activation function of the current neuron produces a result value within the range of values that the neurons connected to the current neuron expect to receive as inputs. (The normalized result is sometimes referred to as an "activation," which, as described herein, is the output of the current node; the receiving node multiplies that output by the weight associated with the connection between the output node and the receiving node to produce a product that is accumulated with the other products associated with the other input connections of the receiving node.) For example, the receiving/connected neurons may expect to receive values between 0 and 1 as inputs, in which case the output neuron may need to non-linearly compress and/or adjust (e.g., shift upward to convert negative values to positive values) an accumulated sum that falls outside the 0-to-1 range into a value within the expected range. Thus, AFU 212 performs an operation on the value 217 of the accumulator 202 to bring the result 133 within a known range. The results 133 of all N NPUs 126 may be written back in parallel to the data RAM 122 or the weight RAM 124. Preferably, AFU 212 is configured to perform multiple activation functions, and one of these functions is selected, e.g., by an input from the control register 127, to be performed on the output 217 of the accumulator 202. The activation functions may include, but are not limited to, a step function, a rectify function, a sigmoid function, a hyperbolic tangent (tanh) function, and a softplus function (also referred to as a smooth rectify function). The softplus function is the analytic function f(x) = ln(1 + e^x), i.e., the natural logarithm of the sum of 1 and e^x, where "e" is Euler's number and x is the input 217 to the function. Preferably, the activation functions may also include a pass-through function that passes the value 217 of the accumulator 202, or a portion thereof, unchanged, as described in more detail below. In one embodiment, the circuitry of AFU 212 performs the activation function in a single clock cycle. In one embodiment, for certain activation functions (e.g., the sigmoid, hyperbolic tangent, and softplus functions), AFU 212 includes tables that receive the accumulated value and output a value that closely approximates the value the true activation function would provide.
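For reference, here are plain-math Python definitions of the activation functions named above. These are the mathematical functions themselves, not the AFU 212's single-cycle, table-based hardware approximations.

import math

def step(x):     return 1.0 if x >= 0 else 0.0
def rectify(x):  return max(0.0, x)                 # also known as ReLU
def sigmoid(x):  return 1.0 / (1.0 + math.exp(-x))
def softplus(x): return math.log1p(math.exp(x))     # f(x) = ln(1 + e^x)

for f in (step, rectify, sigmoid, math.tanh, softplus):
    print(f.__name__, f(0.5))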
Preferably, the width (in bits) of the accumulator 202 is greater than the width of the output 133 of AFU 212. For example, in one embodiment, the accumulator is 41 bits wide to avoid loss of precision when accumulating up to 512 32-bit products (as described in more detail below, e.g., with respect to FIG. 30), and the result 133 is 16 bits wide. In one embodiment, an example of which is described in more detail below with respect to FIG. 8, different portions of the "raw" accumulator 202 output 217 value are passed through AFU 212 on successive clock cycles and written back to the data RAM 122 or the weight RAM 124. This enables the raw accumulator 202 value to be loaded back into the media registers 118 via the MFNN instruction, so that instructions executing on the other execution units 112 of processor 100 can perform complex activation functions that AFU 212 cannot perform, such as the well-known softmax activation function (also referred to as the normalized exponential function). In one embodiment, the instruction set architecture of processor 100 includes an instruction, commonly referred to as e^x or exp(x), that performs the exponential function and that may be used to speed up the execution of the softmax activation function by the other execution units 112 of processor 100.
In one embodiment, the NPU 126 is a pipelined design. For example, the NPU 126 may include registers of the ALU 204 (such as registers located between multipliers and adders and/or other circuitry of the ALU 204) and registers that hold the output of the AFU 212, among other things. Other embodiments of the NPU 126 are described below.
Referring now to FIG. 3, a block diagram is shown illustrating an embodiment of the arrangement of the N multiplexing registers 208 of the N NPUs 126 of the NNU 121 of FIG. 1 to illustrate their operation as an N-word rotator, or circular shifter, for a row of data words 207 received from the data RAM 122 of FIG. 1. In the embodiment of FIG. 3, N is 512, such that the NNU 121 has 512 multiplexing registers 208, labeled 0 through 511, corresponding to the 512 NPUs 126, as shown. Each multiplexing register 208 receives its respective data word 207 of one of the D rows of the data RAM 122. That is, multiplexing register 0 receives data word 0 of the data RAM 122 row, multiplexing register 1 receives data word 1, multiplexing register 2 receives data word 2, and so on, and multiplexing register 511 receives data word 511. Furthermore, multiplexing register 1 receives on its other input 211 the output 209 of multiplexing register 0, multiplexing register 2 receives on its other input 211 the output 209 of multiplexing register 1, multiplexing register 3 receives on its other input 211 the output 209 of multiplexing register 2, and so on, multiplexing register 511 receives on its other input 211 the output 209 of multiplexing register 510, and multiplexing register 0 receives on its other input 211 the output 209 of multiplexing register 511. Each multiplexing register 208 receives the control input 213, which controls whether it selects the data word 207 or the rotated input 211. As described in more detail below, in one mode of operation, on a first clock cycle the control input 213 controls each of the multiplexing registers 208 to select the data word 207 for storage in the register and subsequent provision to the ALU 204, and on subsequent clock cycles (e.g., M-1 clock cycles as described above) the control input 213 controls each of the multiplexing registers 208 to select the rotated input 211 for storage in the register and subsequent provision to the ALU 204.
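The rotator behavior (load on the first clock, rotate on subsequent clocks) can be modeled in a few lines. This sketch shrinks N from 512 to 8 for readability; note that mux-reg J takes the output of mux-reg J-1, so the row rotates one position to the right per clock.

def rotator_cycles(row, cycles):
    regs = list(row)          # clock 0: every mux-reg selects its data word 207
    for _ in range(cycles):   # later clocks: every mux-reg selects rotate input 211
        regs = [regs[-1]] + regs[:-1]   # mux-reg J loads mux-reg J-1's output
        yield regs

for state in rotator_cycles(list(range(8)), 3):
    print(state)
# [7, 0, 1, 2, 3, 4, 5, 6], then [6, 7, 0, 1, 2, 3, 4, 5], and so on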
Although in the embodiment depicted in fig. 3 (and fig. 7 and 19 below), the NPU 126 is configured to rotate the value of the multiplexing register 208/705 to the right, i.e., from NPU J to NPU J +1, embodiments are contemplated (such as for the embodiments of fig. 24-26, etc.) in which the NPU 126 is configured to rotate the value of the multiplexing register 208/705 to the left, i.e., from NPU J to NPU J-1. Further, embodiments are contemplated in which the NPU 126 is configured to selectively rotate the value of the multiplexing register 208/705 to the left or right, as specified by the NNU instruction, for example.
Referring now to FIG. 4, a table is shown illustrating a program for storage in the program memory 129 of the NNU 121 of FIG. 1 and execution by the NNU 121. As described above, the exemplary program performs the computations associated with one layer of an artificial neural network. The table of FIG. 4 shows five rows and three columns. Each row corresponds to an address in the program memory 129, which is given in the first column. The second column specifies the instruction, and the third column indicates the number of clock cycles associated with the instruction. Preferably, in a pipelined embodiment, the clock cycle count indicates the effective number of clocks per instruction in a throughput sense, rather than the latency of the instruction. As shown, because of the pipelined nature of the NNU 121, each instruction takes effectively one clock cycle, with the exception of the instruction at address 2, which requires 511 clocks because it effectively repeats itself 511 times, as described in more detail below.
For each instruction of the program, all of the NPUs 126 process the instruction in parallel. That is, all N NPUs 126 execute the instruction in the first row in the same clock cycle(s), all N NPUs 126 execute the instruction in the second row in the same clock cycle(s), and so on. However, other embodiments are described below in which some of the instructions execute in a partially parallel, partially sequential fashion; for example, in embodiments in which NPUs 126 share an activation function unit, such as the embodiment of FIG. 11, the activation function and output instructions at addresses 3 and 4 execute in this fashion. The example of FIG. 4 assumes a layer of 512 neurons (NPUs 126), each with 512 connection inputs from the 512 neurons of the previous layer, for a total of 256K connections. Each neuron receives a 16-bit data value from each connection input and multiplies the 16-bit data value by an appropriate 16-bit weight value.
The first row, at address 0 (although other addresses may be specified), specifies an initialize NPU instruction. The initialize instruction clears the value of the accumulator 202. In one embodiment, the initialize instruction may also specify loading the accumulator 202 with the corresponding word of a row of the data RAM 122 or weight RAM 124 addressed by the instruction. The initialize instruction also loads configuration values into the control register 127, as described in more detail below with respect to FIGS. 29A and 29B. For example, the widths of the data word 207 and weight word 206 may be loaded, which may be used by the ALU 204 to determine the sizes of the operations performed by its circuits and may affect the result 215 stored in the accumulator 202. In one embodiment, the NPU 126 includes circuitry that saturates the output 215 of the ALU 204 before the output 215 is stored in the accumulator 202, and the initialize instruction loads a configuration value into this circuitry to affect the saturation. In one embodiment, the accumulator 202 may also be cleared to a zero value by so specifying in an ALU function instruction (e.g., the multiply-accumulate instruction at address 1) or in an output instruction (such as the write AFU output instruction at address 4).
The second row, at address 1, specifies a multiply-accumulate instruction that instructs the 512 NPUs 126 to load a respective data word from a row of the data RAM 122 and a respective weight word from a row of the weight RAM 124, and to perform a first multiply-accumulate operation on the data word input 207 and the weight word input 206, accumulating with the initialized zero value of the accumulator 202. More specifically, the instruction instructs sequencer 128 to generate a value on the control input 213 that selects the data word input 207. In the example of FIG. 4, the specified data RAM 122 row is row 17 and the specified weight RAM 124 row is row 0, instructing sequencer 128 to output a data RAM address 123 value of 17 and a weight RAM address 125 value of 0. Consequently, the 512 data words of row 17 of the data RAM 122 are provided to the respective data inputs 207 of the 512 NPUs 126, and the 512 weight words of row 0 of the weight RAM 124 are provided to the respective weight inputs 206 of the 512 NPUs 126.
The third row, at address 2, specifies a multiply-accumulate rotate instruction with a count of 511, which instructs each of the 512 NPUs 126 to perform 511 multiply-accumulate operations. The instruction indicates to the 512 NPUs 126 that the data word 209 input to the ALU 204 for each of the 511 multiply-accumulate operations is the rotated value 211 from the adjacent NPU 126. That is, the instruction instructs sequencer 128 to generate a value on the control input 213 that selects the rotated value 211. Furthermore, the instruction instructs the 512 NPUs 126 to load the respective weight value for each of the 511 multiply-accumulate operations from the "next" row of the weight RAM 124. That is, the instruction instructs sequencer 128 to increment the weight RAM address 125 by one relative to its value in the previous clock cycle, which in this example is row 1 on the first clock cycle of the instruction, row 2 on the next clock cycle, row 3 on the next, and so on, through row 511 on the 511th clock cycle. For each of the 511 multiply-accumulate operations, the product of the rotated input 211 and the weight word input 206 is accumulated with the previous value of the accumulator 202. The 512 NPUs 126 perform the 511 multiply-accumulate operations in 511 clock cycles, with each NPU 126 performing a multiply-accumulate operation on a different data word of row 17 of the data RAM 122, namely the data word that the adjacent NPU 126 operated on in the previous cycle, together with the different weight word associated with that data word, which is conceptually a different connection input of the neuron. In this example, it is assumed that each NPU 126 (neuron) has 512 connection inputs, thus involving 512 data words and 512 weight words. Once the last iteration of the multiply-accumulate rotate instruction at address 2 has executed, the accumulator 202 holds the sum of the products of all 512 connection inputs. In one embodiment, rather than having a separate instruction for each type of ALU operation (e.g., multiply-accumulate, maximum of accumulator and weight word, etc., as described above), the instruction set of the NPU 126 includes an "execute" instruction that instructs the ALU 204 to perform the ALU operation specified by the initialize NPU instruction (such as specified in the ALU function 2926 of FIG. 29A).
The fourth row, at address 3, specifies an activation function instruction. The activation function instruction instructs AFU 212 to perform the specified activation function on the value 217 of the accumulator 202 to produce the result 133. The activation functions according to one embodiment are described in more detail below.
The fifth row, at address 4, specifies a write AFU output instruction that instructs the 512 NPUs 126 to write back the output of AFU 212 as the result 133 to one row of data RAM 122 (row 16 in this example). That is, the instruction instructs the sequencer 128 to output a data RAM address 123 with a value of 16 and a write command (as opposed to the read command in the case of the multiply-accumulate instruction at address 1). Preferably, given the pipelined nature of NNU 121, the execution of the write AFU output instruction may overlap with the execution of other instructions, such that the write AFU output instruction effectively executes within a single clock cycle.
Preferably, each NPU 126 is configured as a pipeline that includes various functional elements, such as the multiplexing register 208 (and the multiplexing register 705 of FIG. 7), ALU 204, accumulator 202, AFU 212, multiplexer 802 (of FIG. 8), row buffer 1104 and AFU 1112 (of FIG. 11), and so forth, some of which may themselves be pipelined. In addition to the data words 207 and weight words 206, the pipeline receives instructions from the program memory 129. These instructions flow along the pipeline and control the various functional units. In an alternative embodiment, no activate function instruction is included within the program. Instead, the initialize NPU instruction specifies the activation function to be performed on the value 217 of the accumulator 202, and a value indicating the specified activation function is saved in a configuration register for later use by the AFU 212 portion of the pipeline once the final accumulator 202 value 217 has been generated, i.e., once the last iteration of the multiply-accumulate rotate instruction at address 2 is complete. Preferably, to save power, the AFU 212 portion of the pipeline is inactive until the write AFU output instruction reaches it, at which point the AFU 212 powers up and performs the activation function specified by the initialize instruction on the output 217 of the accumulator 202.
Referring now to FIG. 5, a timing diagram is shown that illustrates the execution of the program of FIG. 4 by NNU 121. Each row of the timing diagram corresponds to a successive clock cycle, as indicated in the first column. The other columns correspond to and indicate the operations of a different one of the 512 NPUs 126. For simplicity and clarity of illustration, only the operations of NPUs 0, 1 and 511 are shown.
At clock 0, each of the 512 NPUs 126 executes the initialization instruction of FIG. 4, which is illustrated in FIG. 5 by assigning a zero value to accumulator 202.
At clock 1, each of the 512 NPUs 126 executes the multiply-accumulate instruction at address 1 in FIG. 4. As shown, NPU 0 accumulates the product of word 0 of row 17 of data RAM 122 and word 0 of row 0 of weight RAM 124 with the value of accumulator 202 (i.e., zero); NPU 1 accumulates the product of word 1 of row 17 of data RAM 122 and word 1 of row 0 of weight RAM 124 with the value of accumulator 202 (i.e., zero); by analogy, NPU 511 accumulates the product of word 511 of row 17 of data RAM 122 and word 511 of row 0 of weight RAM 124 with the value of accumulator 202 (i.e., zero).
At clock 2, each of the 512 NPUs 126 executes the first iteration of the multiply-accumulate rotate instruction at address 2 in FIG. 4. As shown, NPU 0 accumulates the product of the round-robin data word 211 received from the output 209 of the multiplexing register 208 of NPU 511 (i.e., the data word 511 received from data RAM 122) and word 0 of row 1 of weight RAM 124 with the value of the accumulator 202; NPU 1 accumulates the product of the round-robin data word 211 received from output 209 of multiplexing register 208 of NPU 0 (i.e., data word 0 received from data RAM 122) and word 1 of row 1 of weight RAM 124 with the value of accumulator 202; by analogy, NPU 511 accumulates the product of the round-robin data word 211 received from output 209 of multiplexing register 208 of NPU 510 (i.e., data word 510 received from data RAM 122) and word 511 of row 1 of weight RAM 124 with the value of accumulator 202.
At clock 3, each of the 512 NPUs 126 performs a second iteration of the multiply-accumulate round-robin instruction at address 2 in fig. 4. As shown, NPU 0 accumulates the product of the round-robin data word 211 received from the output 209 of the multiplexing register 208 of NPU 511 (i.e., the data word 510 received from the data RAM 122) and word 0 of row 2 of the weight RAM 124 with the value of the accumulator 202; NPU 1 accumulates the product of the round-robin data word 211 received from the output 209 of the multiplexing register 208 of NPU 0 (i.e., the data word 511 received from the data RAM 122) and word 1 of row 2 of the weight RAM 124 with the value of the accumulator 202; by analogy, NPU 511 accumulates the product of the round-robin data word 211 received from output 209 of multiplexing register 208 of NPU 510 (i.e., data word 509 received from data RAM 122) and word 511 of row 2 of weight RAM 124 with the value of accumulator 202. As indicated by the ellipses in FIG. 5, the next 509 clock cycles each continue in turn until clock 512.
At clock 512, each of the 512 NPUs 126 executes the 511th iteration of the multiply-accumulate rotate instruction at address 2 in FIG. 4. As shown, NPU 0 accumulates the product of the round-robin data word 211 received from the output 209 of the multiplexing register 208 of NPU 511 (i.e., data word 1 received from data RAM 122) and word 0 of row 511 of weight RAM 124 with the value of accumulator 202; NPU 1 accumulates the product of the round-robin data word 211 received from output 209 of multiplexing register 208 of NPU 0 (i.e., data word 2 received from data RAM 122) and word 1 of row 511 of weight RAM 124 with the value of accumulator 202; by analogy, NPU 511 accumulates the product of the round-robin data word 211 received from output 209 of multiplexing register 208 of NPU 510 (i.e., data word 0 received from data RAM 122) and word 511 of row 511 of weight RAM 124 with the value of accumulator 202. In one embodiment, multiple clock cycles are required to read a data word and a weight word from data RAM 122 and weight RAM 124 in order to execute the multiply-accumulate instruction at address 1 of FIG. 4; however, data RAM 122, weight RAM 124, and the NPUs 126 are pipelined such that once the first multiply-accumulate operation begins (e.g., as shown during clock 1 of FIG. 5), subsequent multiply-accumulate operations begin in successive clock cycles (e.g., as shown during clocks 2 through 512). Preferably, the NPUs 126 may stall in response to an access of the data RAM 122 and/or weight RAM 124 by an architectural instruction (e.g., an MTNN or MFNN instruction, described later with respect to FIGS. 14 and 15) or by a microinstruction into which an architectural instruction has been translated.
At clock 513, the AFU 212 of each of the 512 NPUs 126 executes the activate function instruction at address 3 in FIG. 4. Finally, at clock 514, each of the 512 NPUs 126 executes the write AFU output instruction at address 4 of FIG. 4 by writing its result 133 back to the corresponding word in row 16 of the data RAM 122, i.e., the result 133 of NPU 0 is written to word 0 of the data RAM 122, the result 133 of NPU 1 is written to word 1 of the data RAM 122, and so on, until the result 133 of NPU 511 is written to word 511 of the data RAM 122. The operation described above with respect to FIG. 5 is also shown in block diagram form in FIG. 6A.
Referring now to FIG. 6A, a block diagram is shown that illustrates execution of the program of FIG. 4 by NNU 121 of FIG. 1. NNU 121 includes 512 NPUs 126, a data RAM 122 that receives address input 123, and a weight RAM 124 that receives address input 125. Although not shown, at clock 0 the 512 NPUs 126 execute the initialization instruction. As shown, at clock 1, the 512 16-bit data words of row 17 are read from the data RAM 122 and provided to the 512 NPUs 126. At clocks 1 through 512, the 512 16-bit weight words of rows 0 through 511, respectively, are read from the weight RAM 124 and provided to the 512 NPUs 126. Although not shown, at clock 1 the 512 NPUs 126 perform their respective multiply-accumulate operations on the loaded data words and weight words. At clocks 2 through 512, the multiplexing registers 208 of the 512 NPUs 126 operate collectively as a round-robin rotator of 512 16-bit words to rotate the previously loaded data words of row 17 of the data RAM 122 to the adjacent NPUs 126, and the NPUs 126 perform multiply-accumulate operations on the rotated data words and the weight words loaded from the weight RAM 124. Although not shown, at clock 513 the 512 AFUs 212 execute the activate function instruction. At clock 514, the 512 NPUs 126 write their 512 16-bit results 133 back to row 16 of the data RAM 122.
It can be seen that the number of clocks required to generate the result words (neuron outputs) and write them back to the data RAM 122 or weight RAM 124 is approximately the square root of the number of data inputs (connections) received by the current layer of the neural network. For example, if the current layer includes 512 neurons each having 512 connections from the previous layer, the total number of connections is 256K, and the number of clocks required to produce the results for the current layer slightly exceeds 512. Thus, NNU 121 provides extremely high performance for neural network computations.
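As a worked check of this claim, using the dimensions of the example (the count of 514 clocks comes from the FIG. 5 timing, clocks 1 through 514):

```latex
\text{connections} = 512 \times 512 = 262144 \ (= 256\text{K}), \qquad
\sqrt{262144} = 512, \qquad
\text{clocks used} \approx 514.
```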
Referring now to FIG. 6B, a flowchart is shown illustrating operation of processor 100 of FIG. 1 to execute an architectural program that uses NNU 121 to perform multiply-accumulate-activation-function computations (such as those performed by the program of FIG. 4) typically associated with neurons of a hidden layer of an artificial neural network. The example of FIG. 6B assumes computations for four hidden layers (denoted by the initialization of the NUM_LAYERS variable at block 602), each having 512 neurons, each of which is connected to all 512 neurons of the previous layer (by means of the program of FIG. 4). However, it should be understood that the numbers of layers and neurons are chosen for illustration purposes, and NNU 121 may be used to perform similar computations for different numbers of hidden layers, different numbers of neurons per layer, and layers that are not fully connected. In one embodiment, the weight values may be set to zero for neurons that do not exist in a layer or for connections to non-existent neurons. Preferably, the architectural program writes a first set of weights to weight RAM 124 and starts NNU 121, and while NNU 121 is performing the computations associated with the first layer, the architectural program writes a second set of weights to weight RAM 124, so that NNU 121 can begin the computations for the second layer as soon as it completes those of the first hidden layer. Thus, the architectural program ping-pongs between two regions of weight RAM 124 in order to keep NNU 121 fully utilized. Flow begins at block 602.
At block 602, as shown and described with respect to FIG. 6A, the processor 100 (i.e., the architectural program running on the processor 100) writes the input values for the current hidden layer of neurons to the data RAM 122, e.g., to row 17 of the data RAM 122. Alternatively, these values may already reside in row 17 of data RAM 122 as the results 133 of the operation of NNU 121 on a previous layer (e.g., a convolution, pooling, or input layer). In addition, the architectural program initializes a variable N to a value of 1. The variable N denotes the current layer among the hidden layers being processed by NNU 121. In addition, the architectural program initializes the variable NUM_LAYERS to a value of 4 because there are four hidden layers in this example. Flow proceeds to block 604.
At block 604, as shown in FIG. 6A, processor 100 writes the weight words for layer 1 to weight RAM 124, e.g., to rows 0 through 511. Flow proceeds to block 606.
At block 606, the processor 100 writes the multiply-accumulate activate function program (e.g., of FIG. 4) to the program memory 129 of the NNU 121 using the MTNN instruction 1400 that specifies the function 1432 to write to the program memory 129. The processor 100 then initiates the NNU program with the MTNN instruction 1400 specifying the function 1432 to begin executing the program. Flow proceeds to decision block 608.
At decision block 608, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, flow proceeds to block 612; otherwise, flow proceeds to block 614.
At block 612, the processor 100 writes the weight words for layer N+1 to the weight RAM 124, e.g., to rows 512 through 1023. Thus, advantageously, the architectural program writes the weight words of the next layer to weight RAM 124 while NNU 121 is performing the hidden layer computations for the current layer, so that NNU 121 can begin the hidden layer computations for the next layer immediately once the computations for the current layer are complete, i.e., once their results have been written to data RAM 122. Flow proceeds to block 614.
At block 614, the processor 100 determines that the currently running NNU program (started at block 606 in the case of layer 1, and at block 618 in the case of layers 2 through 4) has completed. Preferably, processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of NNU 121. In an alternative embodiment, NNU 121 generates an interrupt to indicate that it has completed the multiply-accumulate-activation-function layer program. Flow proceeds to decision block 616.
At decision block 616, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, flow proceeds to block 618; otherwise, flow proceeds to block 622.
At block 618, the processor 100 updates the multiply-accumulate-activation-function program so that it can perform the hidden layer computations for layer N+1. More specifically, the processor 100 updates the data RAM 122 row value of the multiply-accumulate instruction at address 1 of FIG. 4 to the row of the data RAM 122 to which the results of the previous layer were written (e.g., to row 16), and likewise updates the output row (e.g., to row 15). Processor 100 then starts the updated NNU program. Alternatively, the program of FIG. 4 specifies, in the output instruction at address 4, the same row as the row specified by the multiply-accumulate instruction at address 1 (i.e., the row read from data RAM 122). In that embodiment, the current row of input data words is overwritten (which is acceptable as long as that row of data words is not needed for any other purpose, since it has already been read into the multiplexing registers 208 and rotated through the NPUs 126 via the N-word rotator). In this case, at block 618 the NNU program need not be updated, but only restarted. Flow proceeds to block 622.
At block 622, the processor 100 reads the results of the layer N NNU program from the data RAM 122. However, if these results are only used for the next layer, the architectural program need not read these results from the data RAM 122, but instead can retain them in the data RAM 122 for the next hidden layer computation. Flow proceeds to decision block 624.
At decision block 624, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, flow proceeds to block 626; otherwise, the flow ends.
At block 626, the architectural program increments N. Flow returns to decision block 608.
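A minimal sketch of the control flow of blocks 602 through 626 follows, written in C under stated assumptions: the helper functions are hypothetical stand-ins for the MTNN/MFNN instruction sequences described above, and the ping-pong between the two halves of weight RAM 124 is modeled with a simple parity test.

```c
#include <stdbool.h>

#define NUM_LAYERS 4   /* block 602: four hidden layers in this example */

/* Hypothetical helpers standing in for MTNN/MFNN instruction sequences. */
extern void write_data_row17_inputs(void);                     /* block 602 */
extern void write_weight_rows(int first, int last, int layer); /* blocks 604/612 */
extern void write_and_start_nnu_program(void);                 /* block 606 */
extern bool nnu_program_done(void);          /* block 614: read status reg 127 */
extern void update_ram_rows_and_restart(int layer);            /* block 618 */
extern void read_layer_results(int layer);                     /* block 622 */

void hidden_layer_driver(void)
{
    write_data_row17_inputs();            /* block 602: inputs to row 17 */
    write_weight_rows(0, 511, 1);         /* block 604: layer-1 weights */
    write_and_start_nnu_program();        /* block 606 */

    for (int n = 1; ; n++) {              /* n plays the role of variable N */
        if (n < NUM_LAYERS) {             /* block 608 */
            int base = (n % 2) ? 512 : 0; /* block 612: ping-pong the halves */
            write_weight_rows(base, base + 511, n + 1);
        }
        while (!nnu_program_done())       /* block 614: poll for completion */
            ;
        if (n < NUM_LAYERS)               /* block 616 */
            update_ram_rows_and_restart(n + 1);  /* block 618 */
        read_layer_results(n);            /* block 622 */
        if (n >= NUM_LAYERS)              /* block 624 */
            break;                        /* otherwise block 626: n++ */
    }
}
```

The key property of the loop is visible in the first if: the weight words for layer n+1 are written while the NNU program for layer n is still running, which is what keeps NNU 121 fully utilized.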
As can be determined from the example of FIG. 6B, the NPUs 126 (by virtue of the operation of the NNU program of FIG. 4) perform one read of and one write to the data RAM 122 approximately every 512 clock cycles. Further, the NPUs 126 read the weight RAM 124 approximately every clock cycle to read a row of weight words. Thus, the entire bandwidth of weight RAM 124 is consumed by NNU 121 while it performs the hidden layer operations. Further, assuming an embodiment that includes a write-and-read buffer (such as buffer 1704 of FIG. 17), in parallel with the NPU 126 reads, processor 100 writes to the buffer such that buffer 1704 performs a write to weight RAM 124 to write a row of weight words approximately every 16 clock cycles. Thus, in a single-ported embodiment of weight RAM 124 (such as the embodiment described with respect to FIG. 17), the NPUs 126 must stall their reading of weight RAM 124 approximately every 16 clock cycles to allow buffer 1704 to write to weight RAM 124. However, in embodiments where weight RAM 124 is dual-ported, the NPUs 126 need not stall.
Referring now to FIG. 7, a block diagram is shown illustrating the NPU 126 of FIG. 1 in accordance with an alternative embodiment. The NPU 126 of FIG. 7 is similar in many respects to the NPU 126 of FIG. 2. However, the NPU 126 of FIG. 7 additionally includes a second 2-input multiplexing register 705. The multiplexing register 705 selects one of its inputs 206 or 711 to store in its register and then provide on its output 203 in a subsequent clock cycle. Input 206 receives the weight word from weight RAM 124. The other input 711 receives the output 203 of the second multiplexing register 705 of the adjacent NPU 126. Preferably, the input 711 of the multiplexing register 705 of NPU J receives the output 203 of the multiplexing register 705 of NPU 126 instance J-1, and the output 203 of the multiplexing register 705 of NPU J is provided to the input 711 of the multiplexing register 705 of NPU 126 instance J+1. Thus, in the same manner as described above with respect to FIG. 3, the multiplexing registers 705 of the N NPUs 126 collectively operate as an N-word rotator, but for weight words rather than data words. Control input 713 controls which of the two inputs the multiplexing register 705 selects to store in its register and subsequently provide on the output 203.
The inclusion of the multiplexing register 208 and/or the multiplexing register 705 (as well as the multiplexing registers of other embodiments, such as those of FIGS. 18 and 23) to effectively form a large round-robin rotator that rotates a row of data/weights received from data RAM 122 and/or weight RAM 124 has the following advantage: NNU 121 does not require the very large multiplexer that would otherwise be needed between data RAM 122 and/or weight RAM 124 in order to provide the necessary data words/weight words to the appropriate NPUs 126.
Writing back accumulator values in addition to activation function results
In some applications, it is useful for the processor 100 to receive back (e.g., into the media registers 118 via the MFNN instruction of FIG. 15) the raw accumulator 202 values 217, so that instructions executing on other execution units 112 may perform computations on them. For example, in one embodiment, to reduce the complexity of AFU 212, AFU 212 is not configured to perform a soft maximum activation function. Thus, NNU 121 may output the raw accumulator 202 values 217, or a subset thereof, to data RAM 122 or weight RAM 124, where the architectural program subsequently reads them from data RAM 122 or weight RAM 124 and performs the soft maximum computation on the raw values. However, the use of the raw accumulator 202 values 217 is not limited to performing soft maximum operations, and other uses are contemplated.
Referring now to FIG. 8, a block diagram is shown illustrating the NPU 126 of FIG. 1 in accordance with an alternative embodiment. The NPU 126 of FIG. 8 is similar in many respects to the NPU 126 of FIG. 2. However, the NPU 126 of FIG. 8 includes a multiplexer (mux) 802 within AFU 212, where AFU 212 has a control input 803. The width (in bits) of the accumulator 202 is greater than the width of a data word. Multiplexer 802 has a plurality of inputs that receive data-word-width portions of the output 217 of accumulator 202. In one embodiment, the accumulator 202 is 41 bits wide and the NPU 126 is configured to output a 16-bit result word 133; thus, for example, multiplexer 802 (or multiplexer 3032 and/or multiplexer 3037 of FIG. 30) has three inputs that receive bits [15:0], bits [31:16], and bits [47:32], respectively, of the accumulator 202. Preferably, output bits not provided by the accumulator 202 (e.g., bits [47:41]) are forced to zero.
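The word selection performed by multiplexer 802 amounts to simple shifts and masks. In this sketch (the function name and the use of a 64-bit container are illustrative assumptions), the 41-bit accumulator value is held in a 64-bit integer and bits [47:41] read as zero, as stated above:

```c
#include <stdint.h>

/* Model of multiplexer 802 selecting one 16-bit word of the 41-bit
 * accumulator output 217; bits above bit 40 are forced to zero. */
static uint16_t acc_word(int64_t acc217, int word_sel)
{
    uint64_t padded = (uint64_t)acc217 & ((1ULL << 41) - 1); /* bits [47:41] = 0 */
    switch (word_sel) {
    case 0:  return (uint16_t)(padded >> 0);  /* bits [15:0]  */
    case 1:  return (uint16_t)(padded >> 16); /* bits [31:16] */
    default: return (uint16_t)(padded >> 32); /* bits [47:32]; top 7 bits zero */
    }
}
```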
In response to a write ACC instruction (e.g., the write ACC instructions at addresses 3 through 5 of FIG. 9 described below), sequencer 128 generates a value on control input 803 to control multiplexer 802 to select one of the words (e.g., 16 bits) of accumulator 202. Preferably, multiplexer 802 also has one or more inputs that receive the outputs of activation function circuits (e.g., elements 3022, 3024, 3026, 3018, 3014, and 3016 of FIG. 30), each of which produces an output that is the width of a data word. In response to an instruction such as the write AFU output instruction at address 4 of FIG. 4, the sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the activation function circuit outputs, rather than one of the words of the accumulator 202.
Referring now to FIG. 9, a table is shown illustrating a program for storage in the program memory 129 of NNU 121 of FIG. 1 and execution by NNU 121. The example program of FIG. 9 is similar in many respects to the program of FIG. 4. Specifically, the instructions at addresses 0 through 2 are identical. However, the instructions at addresses 3 and 4 of FIG. 4 are replaced in FIG. 9 by write ACC instructions, which instruct the 512 NPUs 126 to write their accumulator 202 outputs 217 back as results 133 into three rows of the data RAM 122 (rows 16 through 18 in this example). That is, the write ACC instructions instruct the sequencer 128 to output a data RAM address 123 with a value of 16 and a write command in a first clock cycle, a data RAM address 123 with a value of 17 and a write command in a second clock cycle, and a data RAM address 123 with a value of 18 and a write command in a third clock cycle. Preferably, the execution of the write ACC instructions may overlap with the execution of other instructions, such that the write ACC instructions effectively execute in three clock cycles, one for each row of the data RAM 122 written. In one embodiment, the user specifies values in the activation function 2934 and output command 2956 fields of the control register 127 (of FIG. 29A) to accomplish the writing of the desired portions of the accumulator 202 to the data RAM 122 or weight RAM 124. Alternatively, rather than writing back the entire contents of the accumulator 202, the write ACC instruction may optionally write back a subset of the accumulator 202. In one embodiment, the accumulator 202 may be written back in a canonical form, as described in more detail below with respect to FIGS. 29 through 31.
Referring now to FIG. 10, a timing diagram is shown illustrating the execution of the program of FIG. 9 by NNU 121. The timing diagram of FIG. 10 is similar to that of FIG. 5, and clocks 0 through 512 are the same. However, at clocks 513 through 515, the AFU 212 of each of the 512 NPUs 126 executes one of the write ACC instructions at addresses 3 through 5 of FIG. 9. Specifically, at clock 513, each of the 512 NPUs 126 writes bits [15:0] of its accumulator 202 output 217 back as result 133 to the corresponding word in row 16 of the data RAM 122; at clock 514, each of the 512 NPUs 126 writes bits [31:16] back as result 133 to the corresponding word in row 17 of the data RAM 122; and at clock 515, each of the 512 NPUs 126 writes bits [40:32] back as result 133 to the corresponding word in row 18 of the data RAM 122. Preferably, bits [47:41] are forced to zero.
Shared AFU
Referring now to FIG. 11, a block diagram is shown illustrating an embodiment of NNU 121 of FIG. 1. In the embodiment of FIG. 11, each neuron is divided into two parts, an activation function unit part and an ALU part (the ALU part also includes the shift register part), and each activation function unit part is shared by multiple ALU parts. In FIG. 11, the ALU parts are referred to as NPUs 126, and the shared activation function unit parts are referred to as AFUs 1112. This contrasts with the embodiment of FIG. 2, in which each neuron contains its own AFU 212. Thus, for example, in one embodiment the NPU 126 (ALU part) of the embodiment of FIG. 11 includes the accumulator 202, ALU 204, multiplexing register 208, and register 205 of FIG. 2, but not the AFU 212. In the embodiment of FIG. 11, NNU 121 includes, by way of example, 512 NPUs 126; however, other embodiments with other numbers of NPUs 126 are contemplated. In the example of FIG. 11, the 512 NPUs 126 are grouped into 64 groups of 8 NPUs 126 each, referred to as groups 0 through 63 in FIG. 11.
NNU 121 also includes a row buffer 1104 and a plurality of shared AFUs 1112 coupled between the NPUs 126 and the row buffer 1104. The width (in bits) of the row buffer 1104 is the same as that of a row of the data RAM 122 or weight RAM 124, e.g., 512 words. There is one AFU 1112 per group of NPUs 126, i.e., each AFU 1112 has a corresponding group of NPUs 126; thus, in the embodiment of FIG. 11, there are 64 AFUs 1112 corresponding to the 64 groups of NPUs 126. Each of the 8 NPUs 126 within a group shares the corresponding AFU 1112. Other embodiments with different numbers of AFUs 1112 and different numbers of NPUs 126 per group are contemplated. For example, other embodiments are contemplated in which two, four, or sixteen NPUs 126 in a group share an AFU 1112.
The motivation for sharing the AFUs 1112 is to reduce the size of NNU 121. The size reduction is obtained at the cost of some performance. That is, depending on the sharing ratio, several additional clocks may be required to produce the results 133 for the entire array of NPUs 126, as shown in FIG. 12 below; in this case, seven additional clock cycles are required because of the 8:1 sharing ratio. However, generally speaking, the number of additional clocks (e.g., 7) is relatively small compared to the number of clocks required to generate the accumulated sums (e.g., 512 clocks for a layer having 512 connections per neuron). Thus, the relatively small performance impact (e.g., approximately one percent more computation time) may be a cost-effective trade-off for the reduced size of NNU 121.
In one embodiment, each NPU 126 includes an AFU 212 configured to perform relatively simple activation functions, so that these simple AFUs 212 can be relatively small and can therefore be contained within each NPU 126, while the shared, or complex, AFU 1112 performs relatively complex activation functions and is therefore considerably larger than the simple AFU 212. In such an embodiment, the additional clock cycles are required only when a complex activation function is specified that requires the shared complex AFU 1112, and not when an activation function is specified that the simple AFU 212 is configured to perform.
Referring now to FIGS. 12 and 13, two timing diagrams illustrating the execution of the program of FIG. 4 by NNU 121 of FIG. 11 are shown. The timing diagram of FIG. 12 is similar to that of FIG. 5, and clocks 0 through 512 are the same. However, at clock 513 the operation differs from that described in the timing diagram of FIG. 5 because the NPUs 126 of FIG. 11 share the AFUs 1112; that is, the NPUs 126 in a group share the AFU 1112 associated with that group, and FIG. 11 illustrates this sharing.
Each row of the timing diagram of fig. 13 corresponds to a successive clock cycle indicated in the first column. The other columns correspond to and indicate the operation of different AFUs 1112 in the 64 AFUs 1112. For simplicity and clarity of illustration, only the operations of AFUs 0, 1, and 63 are shown. The clock cycles of FIG. 13 correspond to the clock cycles of FIG. 12, but the sharing of AFU 1112 by NPU 126 is shown in a different manner. As shown in FIG. 13, at clocks 0-512, each AFU 1112 of 64 AFUs 1112 is inactive and the NPU 126 executes an initialize NPU instruction, a multiply-accumulate instruction, and a multiply-accumulate rotation instruction.
As shown in both FIGS. 12 and 13, at clock 513, AFU 0 (the AFU 1112 associated with group 0) begins performing the specified activation function on the value 217 of the accumulator 202 of NPU 0 (i.e., the first NPU 126 in group 0), and the output of AFU 0 will be stored to word 0 of the row buffer 1104. Also at clock 513, each of the AFUs 1112 begins performing the specified activation function on the accumulator 202 of the first NPU 126 in its corresponding group of NPUs 126. Thus, as shown in FIG. 13, at clock 513, AFU 0 begins performing the specified activation function on the accumulator 202 of NPU 0 to produce the result that will be stored to word 0 of the row buffer 1104; AFU 1 begins performing the specified activation function on the accumulator 202 of NPU 8 to produce the result that will be stored to word 8 of the row buffer 1104; by analogy, AFU 63 begins performing the specified activation function on the accumulator 202 of NPU 504 to produce the result that will be stored to word 504 of the row buffer 1104.
As shown, at clock 514, AFU 0 (the AFU 1112 associated with group 0) begins performing the specified activation function on the value 217 of the accumulator 202 of NPU 1 (i.e., the second NPU 126 in group 0), and the output of AFU 0 will be stored to word 1 of the row buffer 1104. Also at clock 514, each of the AFUs 1112 begins performing the specified activation function on the accumulator 202 of the second NPU 126 in its corresponding group of NPUs 126. Thus, as shown in FIG. 13, at clock 514, AFU 0 begins performing the specified activation function on the accumulator 202 of NPU 1 to produce the result that will be stored to word 1 of the row buffer 1104; AFU 1 begins performing the specified activation function on the accumulator 202 of NPU 9 to produce the result that will be stored to word 9 of the row buffer 1104; by analogy, AFU 63 begins performing the specified activation function on the accumulator 202 of NPU 505 to produce the result that will be stored to word 505 of the row buffer 1104. This pattern continues until clock cycle 520, at which point AFU 0 (the AFU 1112 associated with group 0) begins performing the specified activation function on the value 217 of the accumulator 202 of NPU 7 (i.e., the eighth and last NPU 126 in group 0), and the output of AFU 0 will be stored to word 7 of the row buffer 1104. Also at clock 520, each of the AFUs 1112 begins performing the specified activation function on the accumulator 202 of the eighth NPU 126 in its corresponding group of NPUs 126. Thus, as shown in FIG. 13, at clock 520, AFU 0 begins performing the specified activation function on the accumulator 202 of NPU 7 to produce the result that will be stored to word 7 of the row buffer 1104; AFU 1 begins performing the specified activation function on the accumulator 202 of NPU 15 to produce the result that will be stored to word 15 of the row buffer 1104; by analogy, AFU 63 begins performing the specified activation function on the accumulator 202 of NPU 511 to produce the result that will be stored to word 511 of the row buffer 1104.
At clock 521, once all 512 results associated with the 512 NPUs 126 have been generated and written to the row buffer 1104, the row buffer 1104 begins writing its contents to the data RAM 122 or the weight RAM 124. In this manner, the AFU 1112 of each of the 64 groups of NPUs 126 executes its portion of the activate function instruction at address 3 of FIG. 4.
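The sharing schedule of FIGS. 12 and 13 amounts to each AFU stepping through the eight accumulators of its group, one per clock, with all 64 AFUs stepping in parallel. A behavioral sketch (the names and the int64_t/int16_t widths are illustrative assumptions):

```c
#include <stdint.h>

#define NUM_NPUS   512
#define GROUP_SIZE 8                        /* 8:1 sharing ratio */
#define NUM_AFUS   (NUM_NPUS / GROUP_SIZE)  /* 64 AFUs */

extern int16_t activation_fn(int64_t acc);  /* hypothetical activation function */

/* Clocks 513-520: at step k, AFU g serves NPU g*8+k; the results land in
 * the row buffer, which is written to RAM in one operation afterward. */
void shared_afu_pass(const int64_t acc[NUM_NPUS], int16_t row_buffer[NUM_NPUS])
{
    for (int k = 0; k < GROUP_SIZE; k++)        /* one clock per step */
        for (int g = 0; g < NUM_AFUS; g++) {    /* all 64 AFUs in parallel */
            int npu = g * GROUP_SIZE + k;
            row_buffer[npu] = activation_fn(acc[npu]);
        }
}
```

The outer loop has GROUP_SIZE iterations, which is where the seven extra clock cycles of the 8:1 sharing ratio come from.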
As described in more detail below, e.g., with respect to fig. 29A-33, embodiments that share AFU 1112 between groups of ALUs 204 (such as the embodiment in fig. 11, etc.) may be particularly advantageous in conjunction with integer ALUs 204.
MTNN and MFNN architecture instructions
Referring now to FIG. 14, a block diagram is shown illustrating a move to neural network (MTNN) architectural instruction 1400 and its operation with respect to portions of the NNU 121 of FIG. 1. The MTNN instruction 1400 includes an operation code (opcode) field 1402, a src1 field 1404, a src2 field 1406, a gpr field 1408, and an immediate field 1412. The MTNN instruction 1400 is an architectural instruction, i.e., it is contained within the instruction set architecture of the processor 100. Preferably, the instruction set architecture associates a predetermined value of the opcode field 1402 with the MTNN instruction 1400 to distinguish the MTNN instruction 1400 from the other instructions of the instruction set architecture. The opcode 1402 of the MTNN instruction 1400 may or may not include a prefix, such as is common, for example, in the x86 architecture.
The immediate field 1412 provides a value specifying a function 1432 to the control logic 1434 of NNU 121. Preferably, the function 1432 is provided as an immediate operand of a microinstruction 105 of FIG. 1. The functions 1432 that may be performed by NNU 121 include, but are not limited to, writing to data RAM 122, writing to weight RAM 124, writing to program memory 129, writing to control register 127, starting execution of a program in program memory 129, pausing execution of a program in program memory 129, requesting notification (e.g., an interrupt) of completion of execution of a program in program memory 129, and resetting NNU 121. Preferably, the NNU instruction set includes an instruction whose result indicates that the NNU program has completed. Alternatively, the NNU instruction set includes an explicit interrupt-generating instruction. Preferably, resetting NNU 121 includes effectively forcing NNU 121 back to a reset state (e.g., clearing the internal state machine and setting it to an idle state), except that the contents of data RAM 122, weight RAM 124, and program memory 129 remain intact. In addition, internal registers such as accumulator 202 are not affected by the reset function and must be cleared explicitly, for example by the initialize NPU instruction at address 0 of FIG. 4. In one embodiment, function 1432 may include a direct execution function, in which the first source register contains a micro-operation (see, e.g., micro-operation 3418 of FIG. 34). The direct execution function instructs NNU 121 to directly execute the specified micro-operation. Thus, rather than writing instructions to program memory 129 and subsequently instructing NNU 121 to execute the instructions in program memory 129, by way of the MTNN instruction 1400 (or the MFNN instruction 1500 of FIG. 15) the architectural program may directly control NNU 121 to perform operations. FIG. 14 shows an example of the function 1432 of writing to the data RAM 122.
The gpr field 1408 specifies a GPR within the general register file 116. In one embodiment, each GPR is 64 bits. As shown, the general register file 116 provides the value from the selected GPR to NNU 121, which uses it as address 1422. Address 1422 selects a row of the memory specified in function 1432. In the case of data RAM 122 or weight RAM 124, address 1422 additionally selects a block of data within the selected row whose size is twice that of a media register (e.g., 512 bits). Preferably, this location lies on a 512-bit boundary. In one embodiment, a multiplexer selects either address 1422 (or address 1522 in the case of the MFNN instruction 1500 described below) or the address 123/125/131 from sequencer 128 to provide to data RAM 122/weight RAM 124/program memory 129. In one embodiment, as described in more detail below, the data RAM 122 is dual-ported, enabling the NPUs 126 to read/write the data RAM 122 in parallel with the media registers 118 reading/writing the data RAM 122. In one embodiment, weight RAM 124 is also dual-ported for a similar purpose.
The src1 field 1404 and src2 field 1406 each specify a media register in the media register file 118. In one embodiment, each media register 118 is 256 bits. As shown, the media register file 118 provides the concatenated data (e.g., 512 bits) from the selected media registers to the data RAM 122 (or weight RAM 124 or program memory 129) for writing to the location specified by address 1422 within the selected row 1428. Advantageously, by executing a series of MTNN instructions 1400 (and MFNN instructions 1500 described below), an architectural program executing on processor 100 can fill rows of the data RAM 122 and rows of the weight RAM 124 and write a program, such as the programs described herein (e.g., of FIGS. 4 and 9), to the program memory 129 to cause NNU 121 to perform operations on the data and weights at very high speed, thereby implementing an artificial neural network. In one embodiment, the architectural program directly controls NNU 121 rather than writing a program to the program memory 129.
In one embodiment, the MTNN instruction 1400 specifies a starting source register and a number of source registers, i.e., Q, rather than specifying two source registers (e.g., 1404 and 1406). This form of the MTNN instruction 1400 instructs the processor 100 to write the media register 118 designated as the starting source register and the next Q-1 subsequent media registers 118 to the NNU 121, i.e., to the designated data RAM 122 or weight RAM 124. Preferably, the instruction translator 104 translates the MTNN instruction 1400 into as many microinstructions as are necessary to write all Q specified media registers 118. For example, in one embodiment, when the MTNN instruction 1400 specifies the start source register as MR4 and Q is 8, the instruction translator 104 translates the MTNN instruction 1400 into four micro instructions, wherein the first micro instruction is written to MR4 and MR5, the second micro instruction is written to MR6 and MR7, the third micro instruction is written to MR8 and MR9, and the fourth micro instruction is written to MR10 and MR 11. In an alternative embodiment where the data path from the media registers 118 to the NNU 121 is 1024 bits instead of 512 bits, the instruction translator 104 translates the MTNN instruction 1400 into two micro instructions, where the first micro instruction writes MR 4-MR 7 and the second micro instruction writes MR 8-MR 11. Similar embodiments are contemplated in which the MFNN instructions 1500 specify a starting destination register and a number of destination registers such that each MFNN instruction 1500 is capable of reading a block of data in a row of the data RAM 122 or weight RAM 124 that is larger than a single media register 118.
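The translation rule just described reduces to a short loop. This sketch is an illustrative model only, not the actual instruction translator 104; the emitter function is hypothetical.

```c
/* Hypothetical emitter standing in for microinstruction generation. */
extern void emit_write_microinstruction(int mr_lo, int mr_hi);

/* Model of translating an MTNN that names a starting media register and a
 * count q: each micro-op carries two 256-bit media registers (512 bits). */
void translate_mtnn(int start_mr, int q)
{
    for (int i = 0; i < q; i += 2)
        emit_write_microinstruction(start_mr + i, start_mr + i + 1);
}
```

With start_mr = 4 and q = 8, this emits four micro-ops covering MR4/MR5, MR6/MR7, MR8/MR9, and MR10/MR11, as in the example above; with the 1024-bit data path the step would be four registers, yielding the two micro-ops described.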
Referring now to FIG. 15, a block diagram is shown illustrating a move from neural network (MFNN) architectural instruction 1500 and its operation with respect to portions of the NNU 121 of FIG. 1. The MFNN instruction 1500 includes an opcode field 1502, a dst field 1504, a gpr field 1508, and an immediate field 1512. The MFNN instruction 1500 is an architectural instruction, i.e., it is contained within the instruction set architecture of the processor 100. Preferably, the instruction set architecture associates a predetermined value of the opcode field 1502 with the MFNN instruction 1500 to distinguish the MFNN instruction 1500 from the other instructions of the instruction set architecture. The opcode 1502 of the MFNN instruction 1500 may or may not include a prefix, such as is common, for example, in the x86 architecture.
The immediate field 1512 provides a value for specifying a function 1532 to the control logic 1434 of the NNU 121. Preferably, the function 1532 is provided as an immediate operand to the microinstruction 105 of FIG. 1. Functions 1532 that may be performed by NNU 121 include, but are not limited to, read data RAM 122, read weight RAM 124, read program memory 129, and read status register 127. Fig. 15 shows an example of a function 1532 of the read data RAM 122.
The gpr field 1508 specifies a GPR within the general register file 116. As shown, the general register file 116 provides the value from the selected GPR to NNU 121, which uses it as address 1522 and operates in a manner similar to the address 1422 of FIG. 14 to select a row of the memory specified in function 1532. In the case of data RAM 122 or weight RAM 124, address 1522 additionally selects a block of data within the selected row whose size is that of a media register (e.g., 256 bits). Preferably, this location lies on a 256-bit boundary.
The dst field 1504 specifies a media register in the media register file 118. As shown, the media register file 118 receives data (e.g., 256 bits) into the selected media register from the data RAM 122 (or weight RAM 124 or program memory 129), the data being read from the location specified by address 1522 within the selected row 1528.
NNU internal RAM port configuration
Referring now to FIG. 16, a block diagram illustrating an embodiment of the data RAM 122 of FIG. 1 is shown. Data RAM 122 includes a memory array 1606, a read port 1602, and a write port 1604. Memory array 1606 holds the data words and is preferably arranged, as described above, in D rows of N words each. In one embodiment, memory array 1606 comprises an array of 64 horizontally arranged static RAM cells (each cell being 128 bits wide and 64 bits tall) to provide a 64KB data RAM 122 that is 8192 bits wide and has 64 rows, occupying a die area of approximately 0.2 square millimeters. However, other embodiments are contemplated.
The read port 1602 is preferably coupled in a multiplexed fashion to the NPUs 126 and to the media registers 118. (More precisely, the media registers 118 may be coupled to the read port 1602 via a result bus, which may also provide data to the reorder buffer and/or the result forwarding buses to the other execution units 112.) The NPUs 126 share the read port 1602 with the media registers 118 to read the data RAM 122. The write port 1604 is also preferably coupled in a multiplexed fashion to the NPUs 126 and to the media registers 118. The NPUs 126 share the write port 1604 with the media registers 118 to write the data RAM 122. Thus, advantageously, the media registers 118 can write to the data RAM 122 in parallel while the NPUs 126 are reading from the data RAM 122, or the NPUs 126 can write to the data RAM 122 in parallel while the media registers 118 are reading from the data RAM 122. This may advantageously provide improved performance. For example, the NPUs 126 can read the data RAM 122 (e.g., to continue performing computations) while the media registers 118 write more data words to the data RAM 122. As another example, the NPUs 126 can write computation results to the data RAM 122 while the media registers 118 read computation results from the data RAM 122. In one embodiment, the NPUs 126 can write a row of computation results to the data RAM 122 while also reading a row of data words from the data RAM 122. In one embodiment, the memory array 1606 is configured as memory banks. When the NPUs 126 access the data RAM 122, all of the banks are activated to access an entire row of the memory array 1606; whereas when the media registers 118 access the data RAM 122, only the specified banks are activated. In one embodiment, each bank is 128 bits wide and the media registers 118 are 256 bits wide, so, for example, two banks are activated on each media register 118 access. In one embodiment, one of the ports 1602/1604 is a read/write port. In one embodiment, both of the ports 1602/1604 are read/write ports.
An advantage of the rotator capability of the NPUs 126 described herein is that it helps to reduce the number of rows needed in the memory array 1606 of the data RAM 122, and thus keeps the array relatively small, compared with the memory array that would otherwise be required to continuously provide data to the NPUs 126 and retrieve results from them while the NPUs 126 perform their computations, in order to ensure that the NPUs 126 are highly utilized.
Internal RAM cache
Referring now to FIG. 17, a block diagram illustrating an embodiment of the weight RAM 124 and buffer 1704 of FIG. 1 is shown. The weight RAM 124 includes a memory array 1706 and a port 1702. The memory array 1706 holds the weight words and is preferably arranged, as described above, in W rows of N words each. In one embodiment, memory array 1706 comprises an array of 128 horizontally arranged static RAM cells (each cell being 64 bits wide and 2048 bits tall) to provide a 2MB weight RAM 124 that is 8192 bits wide and has 2048 rows, occupying a die area of approximately 2.4 square millimeters. However, other embodiments are contemplated.
The port 1702 is preferably coupled in a multiplexed fashion to the NPUs 126 and to the buffer 1704. The NPUs 126 and the buffer 1704 read and write the weight RAM 124 via the port 1702. The buffer 1704 is also coupled to the media registers 118 of FIG. 1, so that the media registers 118 read and write the weight RAM 124 through the buffer 1704. Thus, advantageously, while the NPUs 126 are reading or writing the weight RAM 124, the media registers 118 can in parallel write to or read from the buffer 1704 (although the NPUs 126, if currently executing, are preferably stalled to avoid accessing the weight RAM 124 while the buffer 1704 is accessing it). This may advantageously improve performance, particularly because the media register 118 reads and writes of the weight RAM 124 are relatively much smaller than the NPU 126 reads and writes of the weight RAM 124. For example, in one embodiment, the NPUs 126 read/write 8192 bits (one row) at a time, whereas the media registers 118 are 256 bits wide and each MTNN instruction 1400 writes two media registers 118, i.e., 512 bits. Thus, in the case where the architectural program executes sixteen MTNN instructions 1400 to fill the buffer 1704, the NPUs 126 and the architectural program conflict over access to the weight RAM 124 less than approximately six percent of the time. In an alternative embodiment, the instruction translator 104 translates an MTNN instruction 1400 into two microinstructions 105, each of which writes a single media register 118 to the buffer 1704, in which case the NPUs 126 and the architectural program conflict over access to the weight RAM 124 even less frequently.
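The "less than approximately six percent" figure can be checked from the stated widths, assuming one stall cycle each time the full buffer 1704 is flushed to weight RAM 124:

```latex
\frac{8192 \ \text{bits per row}}{2 \times 256 \ \text{bits per MTNN}} = 16 \ \text{MTNN instructions per row},
\qquad
\frac{1 \ \text{stall cycle}}{16 + 1 \ \text{cycles}} \approx 5.9\% < 6\%.
```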
In an embodiment that includes the buffer 1704, writing to the weight RAM 124 with an architectural program requires multiple MTNN instructions 1400. One or more MTNN instructions 1400 specify a function 1432 for writing specified data blocks of the buffer 1704, followed by an MTNN instruction 1400 that specifies a function 1432 instructing NNU 121 to write the contents of the buffer 1704 into a specified row of the weight RAM 124, where the size of a data block is twice the width of a media register 118 and the data blocks are naturally aligned within the buffer 1704. In one embodiment, each MTNN instruction 1400 that specifies a function 1432 for writing specified data blocks of the buffer 1704 contains a bitmask having one bit per data block of the buffer 1704. The data from the two specified source registers 118 is written into each data block of the buffer 1704 whose corresponding bit in the bitmask is set. This is useful for writing repeated data values to a row of the weight RAM 124. For example, to zero out the buffer 1704 (and a subsequent row of the weight RAM 124), the programmer can load the source registers with zero and set all the bits of the bitmask. In addition, the bitmask enables the programmer to write only selected data blocks of the buffer 1704, thereby preserving the previous data in the other data blocks.
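The bitmask behavior can be sketched as follows (the block size, mask width, and function name are illustrative assumptions): each set bit copies the 512-bit source data into the corresponding block of the buffer, while clear bits leave the prior contents intact.

```c
#include <stdint.h>
#include <string.h>

#define ROW_BITS   8192
#define BLOCK_BITS 512                        /* two 256-bit media registers */
#define NUM_BLOCKS (ROW_BITS / BLOCK_BITS)    /* 16 blocks per buffer */

/* Model of an MTNN write-buffer function with a per-block bitmask;
 * src models the concatenated pair of source media registers. */
void write_buffer_blocks(uint8_t buffer[ROW_BITS / 8],
                         const uint8_t src[BLOCK_BITS / 8],
                         uint16_t bitmask)
{
    for (int b = 0; b < NUM_BLOCKS; b++)
        if (bitmask & (1u << b))              /* only selected blocks are written */
            memcpy(&buffer[b * (BLOCK_BITS / 8)], src, BLOCK_BITS / 8);
}
```

Loading src with zeroes and setting all sixteen mask bits zeroes the whole buffer, matching the zeroing example above.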
In an embodiment that includes the buffer 1704, reading the weight RAM 124 with an architectural program requires multiple MFNN instructions 1500. An initial MFNN instruction 1500 specifies a function 1532 for loading the buffer 1704 from a specified row of the weight RAM 124, and then one or more MFNN instructions 1500 specify a function 1532 for reading a specified data block of the buffer 1704 into a destination register, where the size of a data block is the width of a media register 118 and the data blocks are naturally aligned within the buffer 1704. Other embodiments are contemplated in which the weight RAM 124 includes multiple buffers 1704 to further reduce contention between the NPUs 126 and the architectural program for access to the weight RAM 124, by increasing the number of accesses that the architectural program can make while the NPUs 126 are executing, which increases the likelihood that the buffer accesses can be performed during clock cycles in which the NPUs 126 do not need to access the weight RAM 124.
Although fig. 16 depicts a dual port data RAM 122, other embodiments are contemplated in which the weight RAM 124 is also dual port. Further, while FIG. 17 depicts a buffer for weight RAM 124, other embodiments are contemplated in which data RAM 122 also has an associated buffer similar to buffer 1704.
Dynamically configurable NPU
Referring now to FIG. 18, a block diagram is shown illustrating a dynamically configurable NPU 126 of FIG. 1. The NPU 126 of FIG. 18 is similar in many respects to the NPU 126 of FIG. 2. However, the NPU 126 of FIG. 18 is dynamically configurable to operate in one of two different configurations. In the first configuration, the NPU 126 of FIG. 18 operates similarly to the NPU 126 of FIG. 2. That is, in the first configuration (referred to herein as the "wide" configuration or "single" configuration), the ALU 204 of the NPU 126 performs operations on a single wide data word and a single wide weight word (e.g., 16 bits) to produce a single wide result. In contrast, in the second configuration (referred to herein as the "narrow" configuration or "dual" configuration), the NPU 126 performs operations on two narrow data words and two corresponding narrow weight words (e.g., 8 bits) to produce two corresponding narrow results. In one embodiment, the configuration (wide or narrow) of the NPU 126 is established by the initialize NPU instruction (e.g., the instruction at address 0 of FIG. 20, described below). Alternatively, the configuration may be established by an MTNN instruction whose function 1432 specifies configuring the NPU 126 to the desired configuration (wide or narrow). Preferably, a configuration register is written by the program memory 129 instruction or the MTNN instruction that determines the configuration (wide or narrow). For example, the outputs of the configuration register are provided to the ALU 204, the AFU 212, and the logic that generates the multiplexing register control signal 213. Generally speaking, the elements of the NPU 126 of FIG. 18 perform functions similar to those of the like-numbered elements of FIG. 2, and the descriptions thereof should be consulted for an understanding of FIG. 18. The embodiment of FIG. 18, including its differences from FIG. 2, will now be described.
The NPU 126 of FIG. 18 includes two registers 205A and 205B, two 3-input multiplexing registers 208A and 208B, an ALU 204, two accumulators 202A and 202B, and two AFUs 212A and 212B. Each of the registers 205A/205B is half the width (e.g., 8 bits) of the register 205 of FIG. 2. Each of the registers 205A/205B receives a respective narrow weight word 206A/206B (e.g., 8 bits) from the weight RAM 124 and provides its output 203A/203B to the operand selection logic 1898 of the ALU 204 in a subsequent clock cycle. When the NPU 126 is in the wide configuration, the registers 205A/205B effectively operate together to receive a wide weight word 206A/206B (e.g., 16 bits) from the weight RAM 124, in a manner similar to the register 205 of the embodiment of FIG. 2; and when the NPU 126 is in the narrow configuration, the registers 205A/205B effectively operate independently, each receiving a narrow weight word 206A/206B (e.g., 8 bits) from the weight RAM 124, such that the NPU 126 is effectively two separate narrow NPUs. Nevertheless, the same output bits of the weight RAM 124 are coupled to and provided to the registers 205A/205B regardless of the configuration of the NPU 126. For example, the register 205A of NPU 0 receives byte 0, the register 205B of NPU 0 receives byte 1, the register 205A of NPU 1 receives byte 2, the register 205B of NPU 1 receives byte 3, and so on, with the register 205B of NPU 511 receiving byte 1023.
Each of the multiplexing registers 208A/208B is half the width (e.g., 8 bits) of the multiplexing register 208 of FIG. 2. Multiplexing register 208A selects one of its inputs 207A, 211A, and 1811A to store in its register and then provide on output 209A in a subsequent clock cycle, and multiplexing register 208B selects one of its inputs 207B, 211B, and 1811B to store in its register and then provide on output 209B to the operand selection logic 1898 in a subsequent clock cycle. Input 207A receives a narrow data word (e.g., 8 bits) from the data RAM 122, and input 207B likewise receives a narrow data word from the data RAM 122. When the NPU 126 is in the wide configuration, the multiplexing registers 208A/208B effectively operate together to receive a wide data word 207A/207B (e.g., 16 bits) from the data RAM 122, in a manner similar to the multiplexing register 208 of the embodiment of FIG. 2; and when the NPU 126 is in the narrow configuration, the multiplexing registers 208A/208B effectively operate independently, each receiving a narrow data word 207A/207B (e.g., 8 bits) from the data RAM 122, such that the NPU 126 is effectively two separate narrow NPUs. Nevertheless, the same output bits of the data RAM 122 are coupled to and provided to the multiplexing registers 208A/208B regardless of the configuration of the NPU 126. For example, the multiplexing register 208A of NPU 0 receives byte 0, the multiplexing register 208B of NPU 0 receives byte 1, the multiplexing register 208A of NPU 1 receives byte 2, the multiplexing register 208B of NPU 1 receives byte 3, and so on, with the multiplexing register 208B of NPU 511 receiving byte 1023.
Input 211A receives the output 209A of the multiplexing register 208A of the adjacent NPU 126, and input 211B receives the output 209B of the multiplexing register 208B of the adjacent NPU 126. As shown, input 1811A receives the output 209B of the multiplexing register 208B of the adjacent NPU 126, and input 1811B receives the output 209A of the multiplexing register 208A of the current NPU 126 itself. Among the N NPUs 126 shown in FIG. 1, the NPU 126 shown in FIG. 18 is denoted NPU J; that is, NPU J is a representative instance of the N NPUs. Preferably, the input 211A of the multiplexing register 208A of NPU J receives the output 209A of the multiplexing register 208A of NPU 126 instance J-1, the input 1811A of the multiplexing register 208A of NPU J receives the output 209B of the multiplexing register 208B of NPU 126 instance J-1, and the output 209A of the multiplexing register 208A of NPU J is provided to both the input 211A of the multiplexing register 208A of NPU 126 instance J+1 and the input 1811B of the multiplexing register 208B of NPU 126 instance J; and the input 211B of the multiplexing register 208B of NPU J receives the output 209B of the multiplexing register 208B of NPU 126 instance J-1, the input 1811B of the multiplexing register 208B of NPU J receives the output 209A of the multiplexing register 208A of NPU 126 instance J, and the output 209B of the multiplexing register 208B of NPU J is provided to both the input 1811A of the multiplexing register 208A of NPU 126 instance J+1 and the input 211B of the multiplexing register 208B of NPU 126 instance J+1.
Control input 213 controls which of these three inputs is selected by multiplexing registers 208A/208B for storage in the respective register and subsequent provision on the respective output 209A/209B. In the event that the NPU 126 is instructed (e.g., by a multiply-accumulate instruction at address 1 of fig. 20 as described below) to load a row from the data RAM 122, the control inputs 213 control the respective multiplexing registers 208A/208B to select respective narrow data words 207A/207B (e.g., 8 bits) from corresponding narrow words of the selected row of the data RAM 122, whether the NPU 126 is in the wide or narrow configuration.
Where the NPU 126 is instructed (e.g., by a multiply-accumulate rotate instruction at address 2 of fig. 20 as described below) to rotate the value of a previously received data line, the control input 213 controls each of the multiplexing registers 208A/208B to select a respective input 1811A/1811B if the NPU 126 is in a narrow configuration. In this case, the multiplexing registers 208A/208B operate effectively independently, such that the NPU 126 is effectively two separate narrow NPUs. As such, the multiplexing registers 208A and 208B of the N NPUs 126 collectively operate as a 2N narrow-word round-robin, as described in more detail below with respect to fig. 19.
Where the NPU 126 is instructed to rotate the values of a previously received data row, if the NPU 126 is in the wide configuration, the control input 213 controls each of the multiplexing registers 208A/208B to select the corresponding input 211A/211B. In this case, the multiplexing registers 208A/208B effectively operate together as if the NPU 126 were a single wide NPU 126. As such, in a manner similar to that described with respect to FIG. 3, the multiplexing registers 208A and 208B of the N NPUs 126 collectively operate as a rotator of N wide words.
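By way of illustration only, the following Python sketch models the dual-mode rotation behavior just described; the function names and toy values are assumptions for illustration and are not part of the embodiments.

# Illustrative behavioral model of the dual-mode rotator formed by the
# 2N multiplexing registers 208A/208B (an assumption-level sketch, not RTL).
N = 512                     # NPUs; 2*N narrow (8-bit) positions

def rotate_narrow(regs):
    # Narrow configuration: each of the 2N registers loads its neighbor's
    # value (via inputs 1811A/1811B), so the array behaves as a single
    # rotator of 2N narrow words.
    return [regs[(j - 1) % len(regs)] for j in range(len(regs))]

def rotate_wide(regs):
    # Wide configuration: the A/B pair of each NPU moves together
    # (via inputs 211A/211B), i.e., a rotator of N wide words.
    n = len(regs)
    out = [0] * n
    for j in range(0, n, 2):          # NPU J holds bytes j (A) and j+1 (B)
        out[j] = regs[(j - 2) % n]    # the pair comes from NPU J-1
        out[j + 1] = regs[(j - 1) % n]
    return out

row = list(range(2 * N))              # bytes 0..1023 of a data RAM row
assert rotate_narrow(row)[0] == 1023          # NPU 0-A receives byte 1023
assert rotate_wide(row)[:2] == [1022, 1023]   # NPU 0 receives NPU 511's wide word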
ALU 204 includes operand selection logic 1898, a wide multiplier 242A, a narrow multiplier 242B, a wide 2-input multiplexer 1896A, a narrow 2-input multiplexer 1896B, a wide adder 244A, and a narrow adder 244B. Effectively, ALU 204 comprises operand selection logic 1898, a wide ALU 204A (comprising wide multiplier 242A, wide multiplexer 1896A, and wide adder 244A), and a narrow ALU 204B (comprising narrow multiplier 242B, narrow multiplexer 1896B, and narrow adder 244B). Preferably, wide multiplier 242A multiplies two wide words and is similar to multiplier 242 of FIG. 2 (e.g., a 16-bit by 16-bit multiplier). Narrow multiplier 242B multiplies two narrow words (e.g., an 8-bit by 8-bit multiplier producing a 16-bit result). When the NPU 126 is in the narrow configuration, wide multiplier 242A, with the help of operand selection logic 1898, effectively acts as a narrow multiplier to multiply two narrow words, such that the NPU 126 effectively functions as two narrow NPUs. Preferably, wide adder 244A adds the output of wide multiplexer 1896A and output 217A of wide accumulator 202A to produce sum 215A for provision to wide accumulator 202A, and is similar to adder 244 of FIG. 2. Narrow adder 244B adds the output of narrow multiplexer 1896B and output 217B of narrow accumulator 202B to produce sum 215B for provision to narrow accumulator 202B. In one embodiment, narrow accumulator 202B is 28 bits wide to avoid loss of precision when accumulating up to 1024 16-bit products. When NPU 126 is in the wide configuration, narrow multiplier 242B, narrow multiplexer 1896B, narrow adder 244B, narrow accumulator 202B, and narrow AFU 212B are preferably inactive to reduce power consumption.
As described in more detail below, operand selection logic 1898 selects operands from 209A, 209B, 203A, and 203B to provide to the other elements of ALU 204. Preferably, operand selection logic 1898 also performs other functions, such as sign extension of signed narrow data words and weight words. For example, if NPU 126 is in the narrow configuration, operand selection logic 1898 sign-extends a narrow data word and a narrow weight word to the width of a wide word before providing them to wide multiplier 242A. Similarly, if ALU 204 is instructed to pass a narrow data/weight word through (skipping wide multiplier 242A via wide multiplexer 1896A), operand selection logic 1898 sign-extends the narrow data/weight word to the width of a wide word before providing it to wide adder 244A. Logic to perform this sign extension function is preferably also present in ALU 204 of NPU 126 of FIG. 2.
Wide multiplexer 1896A receives the output of wide multiplier 242A and the operand from operand selection logic 1898 and selects one of these inputs to provide to wide adder 244A, and narrow multiplexer 1896B receives the output of narrow multiplier 242B and the operand from operand selection logic 1898 and selects one of these inputs to provide to narrow adder 244B.
The operands provided by operand selection logic 1898 depend on the configuration of the NPU 126 and on the arithmetic and/or logical operation ALU 204 performs, as specified by the function of the instruction being executed by the NPU 126. For example, if the instruction instructs ALU 204 to perform a multiply-accumulate and NPU 126 is in the wide configuration, operand selection logic 1898 provides to one input of wide multiplier 242A the wide word formed by the concatenation of outputs 209A and 209B, and to the other input the wide word formed by the concatenation of outputs 203A and 203B, while narrow multiplier 242B is inactive, such that NPU 126 functions as a single wide NPU 126 similar to NPU 126 of FIG. 2. Whereas if the instruction instructs ALU 204 to perform a multiply-accumulate and NPU 126 is in the narrow configuration, operand selection logic 1898 provides an extended, or widened, version of narrow data word 209A to one input of wide multiplier 242A and an extended version of narrow weight word 203A to the other input; additionally, operand selection logic 1898 provides narrow data word 209B to one input of narrow multiplier 242B and narrow weight word 203B to the other input. To extend, or widen, a narrow word, operand selection logic 1898 sign-extends the narrow word if it is signed, whereas if the narrow word is unsigned, operand selection logic 1898 pads the high-order bits of the narrow word with zeroes.
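By way of illustration, the widening behavior just described may be modeled as follows (an illustrative sketch; the function name and encoding are assumptions, not the embodiment's logic):

# Illustrative model of the widening performed by operand selection logic
# 1898 on a narrow (8-bit) word destined for the wide (16-bit) datapath.
def widen(narrow, signed):
    # Sign-extend a signed narrow word; zero-pad an unsigned one.
    if signed and (narrow & 0x80):
        return narrow | 0xFF00        # replicate the sign bit upward
    return narrow & 0x00FF            # high-order bits filled with zero

assert widen(0x7F, signed=True) == 0x007F    # +127 remains +127
assert widen(0x80, signed=True) == 0xFF80    # -128 remains -128 (two's complement)
assert widen(0x80, signed=False) == 0x0080   # 128 remains 128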
As another example, if NPU 126 is in the wide configuration and the instruction instructs ALU 204 to perform an accumulation of weight words, wide multiplier 242A is skipped and operand selection logic 1898 provides the concatenation of outputs 203A and 203B to wide multiplexer 1896A for provision to wide adder 244A. Whereas if NPU 126 is in the narrow configuration and the instruction instructs ALU 204 to perform an accumulation of weight words, wide multiplier 242A is skipped and operand selection logic 1898 provides an extended version of output 203A to wide multiplexer 1896A for provision to wide adder 244A; and narrow multiplier 242B is skipped and operand selection logic 1898 provides an extended version of output 203B to narrow multiplexer 1896B for provision to narrow adder 244B.
As another example, if NPU 126 is in the wide configuration and the instruction instructs ALU 204 to perform an accumulation of data words, wide multiplier 242A is skipped and operand selection logic 1898 provides the concatenation of outputs 209A and 209B to wide multiplexer 1896A for provision to wide adder 244A. Whereas if NPU 126 is in the narrow configuration and the instruction instructs ALU 204 to perform an accumulation of data words, wide multiplier 242A is skipped and operand selection logic 1898 provides an extended version of output 209A to wide multiplexer 1896A for provision to wide adder 244A; and narrow multiplier 242B is skipped and operand selection logic 1898 provides an extended version of output 209B to narrow multiplexer 1896B for provision to narrow adder 244B. Accumulation of weight/data words is useful for performing averaging operations, such as may be used in the pooling layers of certain artificial neural network applications (e.g., image processing).
Preferably, the NPU 126 further comprises: a second wide multiplexer (not shown) for skipping wide adder 244A in order to load wide accumulator 202A with wide data/weight words in a wide configuration or with expanded narrow data/weight words in a narrow configuration; and a second narrow multiplexer (not shown) for skipping narrow adder 244B in order to load narrow accumulator 202B with narrow data/weight words in a narrow configuration. Preferably, ALU 204 also includes wide and narrow comparator/multiplexer combinations (not shown) that receive the respective accumulator values 217A/217B and the respective multiplexer 1896A/1896B outputs to select a maximum value between the accumulator values 217A/217B and the data/weight words 209A/B/203A/B, as described in more detail below, e.g., with respect to FIGS. 27 and 28, such operations being used in pooling layers for certain artificial neural network applications. Further, operand selection logic 1898 is configured to provide operands having a value of zero (for adding zeros or for clearing accumulators) and to provide operands having a value of one (for multiplying by one).
Narrow AFU 212B receives output 217B of narrow accumulator 202B and performs an activation function thereon to produce narrow result 133B, while wide AFU 212A receives output 217A of wide accumulator 202A and performs an activation function thereon to produce wide result 133A. When NPU 126 is in the narrow configuration, wide AFU 212A accordingly considers output 217A of wide accumulator 202A and performs an activation function thereon to produce a narrow result (e.g., 8 bits), as described in more detail below, e.g., with respect to fig. 29A-30.
From the above description it can be seen that, advantageously, a single NPU 126, when in the narrow configuration, effectively operates as two narrow NPUs, thus providing roughly twice the throughput for smaller words as compared with the wide configuration. For example, assume a neural network layer having 1024 neurons, each receiving 1024 narrow inputs (and having narrow weight words) from the previous layer, resulting in one million connections. Compared with an NNU 121 having 512 wide-configuration NPUs 126, an NNU 121 having 512 narrow-configuration NPUs 126 is able to handle four times the number of connections (1M connections vs. 256K connections) in approximately twice the time (about 1026 clocks vs. 514 clocks), albeit on narrow words rather than wide words, which amounts to roughly twice the throughput.
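The comparison can be checked with simple arithmetic; the following illustrative calculation merely restates the numbers above (variable names are illustrative only):

# Back-of-the-envelope check of the throughput comparison, under the stated
# assumptions (512 NPUs, one multiply-accumulate per narrow NPU per clock).
wide_connections = 512 * 512          # 512 wide neurons x 512 inputs = 256K
narrow_connections = 1024 * 1024      # 1024 narrow neurons x 1024 inputs = 1M
wide_clocks, narrow_clocks = 514, 1026
assert narrow_connections == 4 * wide_connections
assert narrow_clocks < 2 * wide_clocks + 2    # about twice the clocks
# Four times the connections in about twice the time: roughly double throughput.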
In one embodiment, dynamically configurable NPU 126 of fig. 18 includes a 3-input multiplexing register similar to multiplexing registers 208A and 208B in place of registers 205A and 205B to implement a rotator for a row of weight words received from weight RAM 124 in a manner somewhat similar to that described for the embodiment of fig. 7 but in the dynamically configurable manner described for fig. 18.
Referring now to FIG. 19, a block diagram is shown illustrating an embodiment of the arrangement of the 2N multiplexing registers 208A/208B of the N NPUs 126 of NNU 121 of FIG. 1 according to the embodiment of FIG. 18, illustrating their operation as a rotator for a row of data words 207 received from data RAM 122 of FIG. 1. In the embodiment of FIG. 19, as shown, N is 512, such that NNU 121 has 1024 multiplexing registers 208A/208B, denoted 0 through 511, corresponding to the 512 NPUs 126 (effectively 1024 narrow NPUs). The two narrow NPUs within each NPU 126 are denoted A and B, and within each multiplexing register 208, the designation of the corresponding narrow NPU is shown. More specifically, multiplexing register 208A of NPU 126 instance 0 is designated 0-A, multiplexing register 208B of NPU 126 instance 0 is designated 0-B, multiplexing register 208A of NPU 126 instance 1 is designated 1-A, multiplexing register 208B of NPU 126 instance 1 is designated 1-B, multiplexing register 208A of NPU 126 instance 511 is designated 511-A, and multiplexing register 208B of NPU 126 instance 511 is designated 511-B, these designations also corresponding to the narrow NPUs of FIG. 21 described below.
Each multiplexing register 208A receives a respective narrow data word 207A of one of the D rows of data RAM 122, and each multiplexing register 208B receives a respective narrow data word 207B of one of the D rows of data RAM 122. That is, multiplexing register 0-A receives narrow data word 0 of the data RAM 122 row, multiplexing register 0-B receives narrow data word 1, multiplexing register 1-A receives narrow data word 2, multiplexing register 1-B receives narrow data word 3, and so on, through multiplexing register 511-A, which receives narrow data word 1022, and multiplexing register 511-B, which receives narrow data word 1023. Further, multiplexing register 1-A receives on its input 211A the output 209A of multiplexing register 0-A, and multiplexing register 1-B receives on its input 211B the output 209B of multiplexing register 0-B, and so on, through multiplexing register 511-A, which receives on its input 211A the output 209A of multiplexing register 510-A, and multiplexing register 511-B, which receives on its input 211B the output 209B of multiplexing register 510-B; and multiplexing register 0-A receives on its input 211A the output 209A of multiplexing register 511-A, and multiplexing register 0-B receives on its input 211B the output 209B of multiplexing register 511-B. Finally, multiplexing register 1-A receives on its input 1811A the output 209B of multiplexing register 0-B, and multiplexing register 1-B receives on its input 1811B the output 209A of multiplexing register 1-A, and so on, through multiplexing register 511-A, which receives on its input 1811A the output 209B of multiplexing register 510-B, and multiplexing register 511-B, which receives on its input 1811B the output 209A of multiplexing register 511-A; and multiplexing register 0-A receives on its input 1811A the output 209B of multiplexing register 511-B, and multiplexing register 0-B receives on its input 1811B the output 209A of multiplexing register 0-A. Each of the multiplexing registers 208A/208B receives a control input 213 that controls whether to select the data word 207A/207B, the post-rotation input 211A/211B, or the post-rotation input 1811A/1811B. As described in more detail below, in one mode of operation, during a first clock cycle the control input 213 controls each of the multiplexing registers 208A/208B to select the data word 207A/207B for storage in the register and subsequent provision to ALU 204; and during subsequent clock cycles (e.g., the M-1 clock cycles described above), the control input 213 controls each of the multiplexing registers 208A/208B to select the post-rotation input 1811A/1811B for storage in the register and subsequent provision to ALU 204.
Referring now to FIG. 20, a table is shown illustrating a program stored in program memory 129 of NNU 121 of FIG. 1 and executed by NNU 121, where NNU 121 has NPUs 126 according to the embodiment of FIG. 18. The example program of FIG. 20 is similar in many respects to the program of FIG. 4; the differences are described below. The initialize NPU instruction at address 0 specifies that the NPUs 126 are to be in the narrow configuration. Further, as shown, the multiply-accumulate rotate instruction at address 2 specifies a count of 1023 and takes 1023 clock cycles. This is because the example of FIG. 20 assumes a layer effectively having 1024 narrow (e.g., 8-bit) neurons (NPUs), each with 1024 connection inputs from the 1024 neurons of the previous layer, for a total of 1024K connections. Each neuron receives an 8-bit data value from each connection input and multiplies that 8-bit data value by an appropriate 8-bit weight value.
Referring now to FIG. 21, a timing diagram is shown that illustrates an NNU121 executing the routine of FIG. 20, where the NNU121 includes the NPU 126 of FIG. 18 operating in a narrow configuration. The timing diagram of FIG. 21 is similar in many respects to the timing diagram of FIG. 5; however, the differences will be explained below.
In the timing diagram of FIG. 21, the NPUs 126 are in the narrow configuration because the initialize NPU instruction at address 0 initializes them to the narrow configuration. Consequently, the 512 NPUs 126 effectively operate as 1024 narrow NPUs (or neurons), designated in the columns as NPU 0-A and NPU 0-B (the two narrow NPUs of NPU 126 instance 0), NPU 1-A and NPU 1-B (the two narrow NPUs of NPU 126 instance 1), …, NPU 511-A and NPU 511-B (the two narrow NPUs of NPU 126 instance 511). For simplicity and clarity of illustration, only the operations of narrow NPUs 0-A, 0-B, and 511-B are shown. Because the multiply-accumulate rotate instruction at address 2 specifies a count of 1023, which requires 1023 clock cycles, the rows of the timing diagram of FIG. 21 include up to clock cycle 1026.
At clock 0, each of the 1024 narrow NPUs executes the initialize NPU instruction at address 0 of FIG. 20, i.e., the initialization instruction that assigns a zero value to accumulator 202, as shown in FIG. 5.
At clock 1, the 1024 narrow NPUs each execute the multiply-accumulate instruction at address 1 of FIG. 20. As shown, narrow NPU 0-A accumulates the product of narrow word 0 of row 17 of data RAM 122 and narrow word 0 of row 0 of weight RAM 124 with the value of accumulator 202A (i.e., zero); narrow NPU 0-B accumulates the product of narrow word 1 of row 17 of data RAM 122 and narrow word 1 of row 0 of weight RAM 124 with the value of accumulator 202B (i.e., zero); and so on, narrow NPU 511-B accumulates the product of narrow word 1023 of row 17 of data RAM 122 and narrow word 1023 of row 0 of weight RAM 124 with the value of accumulator 202B (i.e., zero).
At clock 2, the 1024 narrow NPUs each perform the first iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 20. As shown, narrow NPU 0-A accumulates the product of rotated narrow data word 1811A received from output 209B of multiplexing register 208B of narrow NPU 511-B (i.e., narrow data word 1023 received from data RAM 122) and narrow word 0 of row 1 of weight RAM 124 with value 217A of accumulator 202A; narrow NPU 0-B accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of narrow NPU 0-A (i.e., narrow data word 0 received from data RAM 122) and narrow word 1 of row 1 of weight RAM 124 with value 217B of accumulator 202B; and so on, narrow NPU 511-B accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of narrow NPU 511-A (i.e., narrow data word 1022 received from data RAM 122) and narrow word 1023 of row 1 of weight RAM 124 with value 217B of accumulator 202B.
At clock 3, the 1024 narrow NPUs each perform the second iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 20. As shown, narrow NPU 0-A accumulates the product of rotated narrow data word 1811A received from output 209B of multiplexing register 208B of narrow NPU 511-B (i.e., narrow data word 1022 received from data RAM 122) and narrow word 0 of row 2 of weight RAM 124 with value 217A of accumulator 202A; narrow NPU 0-B accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of narrow NPU 0-A (i.e., narrow data word 1023 received from data RAM 122) and narrow word 1 of row 2 of weight RAM 124 with value 217B of accumulator 202B; and so on, narrow NPU 511-B accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of narrow NPU 511-A (i.e., narrow data word 1021 received from data RAM 122) and narrow word 1023 of row 2 of weight RAM 124 with value 217B of accumulator 202B. As indicated by the ellipsis in FIG. 21, this continues for each of the following 1021 clock cycles, until clock 1024.
At clock 1024, the 1024 narrow NPUs each perform the 1023rd iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 20. As shown, narrow NPU 0-A accumulates the product of rotated narrow data word 1811A received from output 209B of multiplexing register 208B of narrow NPU 511-B (i.e., narrow data word 1 received from data RAM 122) and narrow word 0 of row 1023 of weight RAM 124 with value 217A of accumulator 202A; narrow NPU 0-B accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of NPU 0-A (i.e., narrow data word 2 received from data RAM 122) and narrow word 1 of row 1023 of weight RAM 124 with value 217B of accumulator 202B; and so on, narrow NPU 511-B accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of NPU 511-A (i.e., narrow data word 0 received from data RAM 122) and narrow word 1023 of row 1023 of weight RAM 124 with value 217B of accumulator 202B.
At clock 1025, the AFUs 212A/212B of each of the 1024 narrow NPUs execute the activation function instruction at address 3 of FIG. 20. Finally, at clock 1026, the 1024 narrow NPUs each execute the write AFU output instruction at address 4 of FIG. 20 by writing their narrow results 133A/133B back to the corresponding narrow words of row 16 of data RAM 122, i.e., narrow result 133A of NPU 0-A is written to narrow word 0 of data RAM 122, narrow result 133B of NPU 0-B is written to narrow word 1 of data RAM 122, and so on through narrow result 133B of NPU 511-B, which is written to narrow word 1023 of data RAM 122. The operations described above with respect to FIG. 21 are also shown in block diagram form in FIG. 22.
Referring now to FIG. 22, a block diagram is shown illustrating NNU 121 of FIG. 1, including the NPUs 126 of FIG. 18, executing the program of FIG. 20. NNU 121 includes the 512 NPUs 126 (i.e., 1024 narrow NPUs), data RAM 122 receiving its address input 123, and weight RAM 124 receiving its address input 125. Although not shown, at clock 0 the 1024 narrow NPUs execute the initialize instruction of FIG. 20. As shown, at clock 1, the 1024 8-bit data words of row 17 are read out of data RAM 122 and provided to the 1024 narrow NPUs. At clocks 1 through 1024, the 1024 8-bit weight words of rows 0 through 1023, respectively, are read out of weight RAM 124 and provided to the 1024 narrow NPUs. Although not shown, at clock 1 the 1024 narrow NPUs perform their respective multiply-accumulate operations on the loaded data words and weight words. At clocks 2 through 1024, the multiplexing registers 208A/208B of the 1024 narrow NPUs operate as a rotator of 1024 8-bit words to rotate the previously loaded data words of row 17 of data RAM 122 to the adjacent narrow NPU, and the narrow NPUs perform multiply-accumulate operations on the respective rotated data words and the respective narrow weight words loaded from weight RAM 124. Although not shown, at clock 1025 the 1024 narrow AFUs 212A/212B execute the activation function instruction. At clock 1026, the 1024 narrow NPUs write their respective 1024 8-bit results 133A/133B back to row 16 of data RAM 122.
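By way of illustration, the clock-by-clock behavior of FIGs. 21 and 22 may be condensed into the following software model (an assumption-level sketch with toy values, not NNU microcode): one multiply-accumulate on weight row 0 followed by 1023 multiply-accumulate-rotate iterations, with a check that each narrow neuron accumulates a full 1024-term dot product.

# Illustrative software model of the narrow-configuration program of FIG. 20.
NUM = 1024
data_row = [d % 251 for d in range(NUM)]                                # toy row 17 of data RAM
weights = [[(r + 3 * c) % 17 for c in range(NUM)] for r in range(NUM)]  # toy weight rows 0..1023

acc = [0] * NUM
regs = list(data_row)                 # clock 1: row 17 loaded into the mux-registers
for r in range(NUM):                  # weight rows 0..1023
    for j in range(NUM):
        acc[j] += regs[j] * weights[r][j]
    regs = [regs[(j - 1) % NUM] for j in range(NUM)]   # rotate by one narrow word

# Neuron j sees data word i when weight row (j - i) mod 1024 is applied:
for j in (0, 1, 511, 1023):
    assert acc[j] == sum(data_row[i] * weights[(j - i) % NUM][j] for i in range(NUM))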
It may be observed that the embodiment of FIG. 18 may be advantageous over the embodiment of FIG. 2 because it provides the programmer the flexibility to perform computations with wide data and weight words (e.g., 16 bits) when the particular application being modeled requires that degree of precision, and with narrow data and weight words (e.g., 8 bits) when the application tolerates less precision. From one perspective, for applications with narrow data, the embodiment of FIG. 18 can provide twice the throughput of the embodiment of FIG. 2 at the cost of the additional narrow elements (e.g., multiplexing register 208B, register 205B, narrow ALU 204B, narrow accumulator 202B, narrow AFU 212B), which increase the area of the NPU 126 by approximately 50%.
Three-mode NPU
Referring now to FIG. 23, a block diagram is shown illustrating the dynamically configurable NPU 126 of FIG. 1 according to an alternative embodiment. The NPU 126 of FIG. 23 is configurable not only in the wide and narrow configurations, but also in a third configuration, referred to herein as the "funnel" configuration. The NPU 126 of FIG. 23 is similar in many respects to the NPU 126 of FIG. 18. However, the wide adder 244A of FIG. 18 is replaced in the NPU 126 of FIG. 23 by a 3-input wide adder 2344A that receives a third addend 2399, which is an extended version of the output of narrow multiplexer 1896B. The program for operating an NNU 121 having the NPUs 126 of FIG. 23 is similar in many respects to the program of FIG. 20. However, the initialize NPU instruction at address 0 initializes the NPUs 126 to the funnel configuration rather than the narrow configuration, and the multiply-accumulate rotate instruction at address 2 has a count of 511 rather than 1023.
In the funnel configuration, the operation of the NPU 126 is similar to operation in the narrow configuration when executing a multiply-accumulate instruction such as that at address 1 of FIG. 20, in the following respects: the NPU 126 receives two narrow data words 207A/207B and two narrow weight words 206A/206B; wide multiplier 242A multiplies data word 209A by weight word 203A to produce product 246A, which wide multiplexer 1896A selects; and narrow multiplier 242B multiplies data word 209B by weight word 203B to produce product 246B, which narrow multiplexer 1896B selects. However, wide adder 2344A adds both product 246A (selected by wide multiplexer 1896A) and product 246B/2399 (selected by narrow multiplexer 1896B) to value 217A of wide accumulator 202A, while narrow adder 244B and narrow accumulator 202B are inactive. Further, when executing a multiply-accumulate rotate instruction such as that at address 2 of FIG. 20 in the funnel configuration, control input 213 causes the multiplexing registers 208A/208B to rotate by two narrow words (e.g., 16 bits), that is, the multiplexing registers 208A/208B select their respective inputs 211A/211B, just as in the wide configuration. However, wide multiplier 242A multiplies data word 209A by weight word 203A to produce product 246A selected by wide multiplexer 1896A; narrow multiplier 242B multiplies data word 209B by weight word 203B to produce product 246B selected by narrow multiplexer 1896B; and wide adder 2344A adds both product 246A (selected by wide multiplexer 1896A) and product 246B/2399 (selected by narrow multiplexer 1896B) to value 217A of wide accumulator 202A, while narrow adder 244B and narrow accumulator 202B are inactive, as described above. Finally, when executing an activation function instruction such as that at address 3 of FIG. 20 in the funnel configuration, wide AFU 212A performs the activation function on the resulting sum 215A to produce narrow result 133A, while narrow AFU 212B is inactive. Consequently, only the narrow NPUs denoted A produce a narrow result 133A, and the narrow results 133B produced by the narrow NPUs denoted B are invalid. Thus, the written-back result row (e.g., row 16, as specified by the instruction at address 4 of FIG. 20) contains holes, because only the narrow results 133A are valid and the narrow results 133B are invalid. Conceptually, then, each neuron (NPU 126 of FIG. 23) processes two connection data inputs per clock cycle, i.e., it multiplies two narrow data words by their respective weights and accumulates the two products, in contrast with the embodiments of FIGs. 2 and 18, in which each neuron processes one connection data input per clock cycle.
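By way of illustration, the funnel-configuration dataflow may be modeled in software as follows; this is an assumption-level sketch with toy values, not the embodiment's logic, and it merely checks that each NPU's wide accumulator sees all 1024 narrow inputs exactly once across the 512 steps.

# Illustrative model of the funnel configuration of FIG. 23: each NPU
# consumes two narrow data/weight pairs per step, the rotator moves by
# two narrow words, and only the A-side result per NPU is valid.
N = 512
data = [d % 11 for d in range(2 * N)]                              # 1024 narrow inputs
wts = [[(3 * r + c) % 5 for c in range(2 * N)] for r in range(N)]  # 512 weight rows

acc = [0] * N                         # one wide accumulator 202A per NPU
regs = list(data)
for r in range(N):                    # 1 multiply-accumulate + 511 rotations
    for j in range(N):                # wide adder 2344A adds two products per step
        acc[j] += regs[2 * j] * wts[r][2 * j] + regs[2 * j + 1] * wts[r][2 * j + 1]
    regs = regs[-2:] + regs[:-2]      # rotate by two narrow words

# Each NPU sees each of the 1024 narrow inputs exactly once over 512 steps
# (checked here for NPU 5):
seen = {(2 * (5 - r)) % (2 * N) + k for r in range(N) for k in (0, 1)}
assert seen == set(range(2 * N))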
It may be observed with respect to the embodiment of FIG. 23 that the number of result words (neuron outputs) produced and written back to data RAM 122 or weight RAM 124 is half the square root of the number of data inputs (connections) received, and that the written-back result rows have holes, i.e., every other narrow word result is invalid; more specifically, the narrow NPU results denoted B are meaningless. Thus, the embodiment of FIG. 23 is particularly efficient for neural networks having two consecutive layers in which the first layer has twice as many neurons as the second (e.g., a first layer of 1024 neurons fully connected to a second layer of 512 neurons). Furthermore, the other execution units 112 (e.g., media units, such as x86 AVX units) may, if necessary, perform pack operations on a scattered (i.e., holed) result row to make it compact (i.e., without holes) for use in subsequent computations, while NNU 121 performs other computations associated with other rows of data RAM 122 and/or weight RAM 124.
Hybrid NNU operation: convolution capability and pooling capability
An advantage of NNU 121 according to embodiments described herein is that NNU 121 is capable of operating concurrently in a manner resembling a coprocessor, in that it executes its own internal program, and in a manner resembling a processor's execution unit, in that it executes architectural instructions (or microinstructions translated from them) issued to it. The architectural instructions are part of an architectural program executed by the processor that includes NNU 121. In this way, NNU 121 operates in a hybrid manner, which is advantageous because it provides the ability to sustain high utilization of NNU 121. For example, FIGs. 24 through 26 illustrate NNU 121 performing a convolution operation at high NNU utilization, and FIGs. 27 through 28 illustrate NNU 121 performing a pooling operation; such operations are required by convolution layers, pooling layers, and other digital data computing applications, such as image processing (e.g., edge detection, sharpening, blurring, recognition/classification). However, the hybrid operation of NNU 121 is not limited to performing convolution or pooling operations; the hybrid feature may also be used to perform other operations, such as the conventional neural network multiply-accumulate and activation function operations described above with respect to FIGs. 4 through 13. That is, processor 100 (more specifically, reservation station 108) issues MTNN instructions 1400 and MFNN instructions 1500 to NNU 121, in response to which NNU 121 writes data into memories 122/124/129 and reads results from memories 122/124, while concurrently NNU 121 reads and writes memories 122/124/129 in response to executing the program written by processor 100 (via MTNN 1400 instructions) to program memory 129.
Referring now to FIG. 24, a block diagram is shown illustrating an example of data structures used by NNU 121 of FIG. 1 to perform a convolution operation. The block diagram includes a convolution kernel 2402, a data array 2404, and the data RAM 122 and weight RAM 124 of FIG. 1. Preferably, the data array 2404 (e.g., of image pixels) is held in system memory (not shown) attached to processor 100 and is loaded into weight RAM 124 of NNU 121 by processor 100 executing MTNN instructions 1400. A convolution operation convolves a first matrix with a second matrix, the second matrix being referred to herein as the convolution kernel. As described in the context of the present disclosure, a convolution kernel is a matrix of coefficients, which may also be referred to as weights, parameters, elements, or values. Preferably, the convolution kernel 2402 is static data of the architectural program being executed by processor 100.
The data array 2404 is a two-dimensional array of data values, each data value (e.g., an image pixel value) having the size of a word (e.g., 16 bits or 8 bits) of data RAM 122 or weight RAM 124. In this example, the data values are 16-bit words and NNU 121 is configured as 512 wide-configuration NPUs 126. Further, in one embodiment, as described in more detail below, the NPUs 126 include a multiplexing register (such as multiplexing register 705 of FIG. 7) for receiving the weight word 206 from weight RAM 124, enabling the NPUs 126 to collectively rotate a row of data values received from weight RAM 124. In this example, the data array 2404 is a 2560-column by 1600-row pixel array. As shown, when the architectural program convolves the data array 2404 with the convolution kernel 2402, it divides the data array 2404 into 20 data blocks, each a 512 x 400 data matrix 2406.
In this example, the convolution kernel 2402 is a 3 x 3 matrix of coefficients, weights, parameters, or elements. The first row of coefficients is denoted C0,0, C0,1, and C0,2; the second row is denoted C1,0, C1,1, and C1,2; and the third row is denoted C2,0, C2,1, and C2,2. For example, a convolution kernel that may be used to perform edge detection has the following coefficients: 0, 1, 0, 1, -4, 1, 0, 1, 0. As another example, a convolution kernel that may be used to Gaussian-blur an image has the following coefficients: 1, 2, 1, 2, 4, 2, 1, 2, 1. In this case, a division is typically performed on the final accumulated value, the divisor being the sum of the absolute values of the elements of the convolution kernel 2402 (16 in this example). As another example, the divisor may be the number of elements of the convolution kernel 2402. As another example, the divisor may be a value that compresses the convolution result back into the desired range of values, determined from the element values of the convolution kernel 2402, the desired range, and the range of the input values of the matrix being convolved.
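For concreteness, the example kernels and the divisor convention just described can be written out as follows (an illustrative sketch; the variable names are not from the embodiments):

# The example 3x3 kernels above, and the divisor for the Gaussian blur case
# computed as the sum of the absolute values of the kernel elements.
edge_detect = [[0, 1, 0],
               [1, -4, 1],
               [0, 1, 0]]
gaussian_blur = [[1, 2, 1],
                 [2, 4, 2],
                 [1, 2, 1]]
divisor = sum(abs(c) for row in gaussian_blur for c in row)
assert divisor == 16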
As shown in FIG. 24 and described in more detail with respect to FIG. 25, the architectural program writes the coefficients of the convolution kernel 2402 to data RAM 122. Preferably, every word of each of nine consecutive rows of data RAM 122 (nine being the number of elements in the convolution kernel 2402) is written with a different element of the convolution kernel 2402, in row-major order. That is, as shown, each word of one row is written with the first coefficient C0,0; the next row is written with the second coefficient C0,1; the next row is written with the third coefficient C0,2; the next row is written with the fourth coefficient C1,0; and so on, until each word of the ninth row is written with the ninth coefficient C2,2. To convolve a data matrix 2406 of a data block of the data array 2404, the NPUs 126 repeatedly read, in order, the nine rows of data RAM 122 holding the convolution kernel 2402 coefficients, as described in more detail below, particularly with respect to FIG. 26.
As shown in FIG. 24 and described in more detail with respect to FIG. 25, the architectural program writes the values of a data matrix 2406 to weight RAM 124. As the NNU program performs the convolution, it writes the result matrix back to weight RAM 124. Preferably, as described in more detail with respect to FIG. 25, the architectural program writes a first data matrix 2406 to weight RAM 124 and starts NNU 121, and while NNU 121 is convolving the first data matrix 2406 with the convolution kernel 2402, the architectural program writes a second data matrix 2406 to weight RAM 124, so that NNU 121 can begin convolving the second data matrix 2406 as soon as it completes the convolution of the first data matrix 2406. In this way, the architectural program alternates between the two regions of weight RAM 124 to keep NNU 121 fully utilized. Thus, the example of FIG. 24 shows a first data matrix 2406A, corresponding to a first data block occupying rows 0 through 399 of weight RAM 124, and a second data matrix 2406B, corresponding to a second data block occupying rows 500 through 899 of weight RAM 124. Furthermore, as shown, NNU 121 writes the convolution results back to rows 900 through 1299 and rows 1300 through 1699 of weight RAM 124, and the architectural program subsequently reads these results out of weight RAM 124. The data values of a data matrix 2406 held in weight RAM 124 are denoted "Dx,y", where "x" is the weight RAM 124 row number and "y" is the word, or column, number of weight RAM 124. Thus, for example, data word 511 in row 399, which is received by multiplexing register 705 of NPU 511, is denoted D399,511 in FIG. 24.
Referring now to FIG. 25, a flowchart is shown illustrating operation of processor 100 of FIG. 1 to execute an architectural program that uses NNU 121 to perform a convolution of the convolution kernel 2402 with the data array 2404 of FIG. 24. Flow begins at block 2502.
At block 2502, processor 100 (i.e., the architectural program running on processor 100) writes the convolution kernel 2402 of FIG. 24 to data RAM 122 in the manner shown and described with respect to FIG. 24. The architectural program also initializes a variable N to a value of 1. The variable N denotes the current data block of the data array 2404 being processed by NNU 121. The architectural program also initializes a variable NUM_CHUNKS to a value of 20. Flow proceeds to block 2504.
At block 2504, as shown in FIG. 24, processor 100 writes the data matrix 2406 of data block 1 (e.g., data matrix 2406A) to weight RAM 124. Flow proceeds to block 2506.
At block 2506, the processor 100 writes the convolution program to the program memory 129 of the NNU 121 using the MTNN instruction 1400 that specifies the function 1432 to write to the program memory 129. The processor 100 then initiates an NNU convolution procedure using the MTNN instruction 1400 that specifies the function 1432 that initiates execution of the procedure. An example of an NNU convolution procedure is described in more detail below with respect to fig. 26A. Flow proceeds to decision block 2508.
At decision block 2508, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to block 2512; otherwise, flow proceeds to block 2514.
At block 2512, as shown in FIG. 24, processor 100 writes the data matrix 2406 of data block N+1 (e.g., data matrix 2406B of data block 2) to weight RAM 124. Thus, advantageously, while NNU 121 performs the convolution on the current data block, the architectural program writes the data matrix 2406 of the next data block to weight RAM 124, so that NNU 121 can immediately begin performing the convolution on the next data block once the convolution of the current data block is complete (i.e., written to weight RAM 124). Flow proceeds to block 2514.
At block 2514, the processor 100 determines that the currently running NNU program (beginning at block 2506 in the case of data block 1 and beginning at block 2518 in the case of data blocks 2-20) has completed. Preferably, processor 100 makes this determination by executing MFNN instruction 1500 to read status register 127 of NNU 121. In an alternative embodiment, NNU 121 generates an interrupt, thereby indicating that it has completed the convolution procedure. Flow proceeds to decision block 2516.
At decision block 2516, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to block 2518; otherwise, flow proceeds to block 2522.
At block 2518, processor 100 updates the convolution program so that it can convolve data block N+1. More specifically, processor 100 updates the weight RAM 124 row value of the initialize NPU instruction at address 0 to the first row of the data matrix 2406 (e.g., to row 0 for data matrix 2406A or row 500 for data matrix 2406B) and updates the output row (e.g., to row 900 or 1300). Processor 100 then starts the updated NNU convolution program. Flow proceeds to block 2522.
At block 2522, processor 100 reads the results of the NNU convolution procedure for data block N from weight RAM 124. Flow proceeds to decision block 2524.
At decision block 2524, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to block 2526; otherwise, flow ends.
At block 2526, the architectural program increments N by 1. Flow returns to decision block 2508.
Referring now to FIG. 26A, a program listing is shown of an NNU program that performs the convolution of a data matrix 2406 with the convolution kernel 2402 of FIG. 24 and writes the results back to weight RAM 124. The program loops a number of times through a loop body of instructions at addresses 1 through 9. An initialize NPU instruction at address 0 specifies the number of times each NPU 126 executes the loop body; the loop count value in the example of FIG. 26A is 400, corresponding to the number of rows in a data matrix 2406 of FIG. 24, and the loop instruction at the end of the loop (at address 10) decrements the current loop count value and, if the result is non-zero, causes control to return to the top of the loop body (i.e., to the instruction at address 1). The initialize NPU instruction also clears accumulator 202 to zero. Preferably, the loop instruction at address 10 also clears accumulator 202 to zero. Alternatively, as described above, the multiply-accumulate instruction at address 1 may specify clearing accumulator 202 to zero.
For each execution of the loop body of the program, the 512 NPUs 126 concurrently perform 512 convolutions of the 3 x 3 convolution kernel 2402 with 512 respective 3 x 3 sub-matrices of the data matrix 2406. The convolution is the sum of the nine products of an element of the convolution kernel 2402 and the corresponding element of the respective sub-matrix. In the embodiment of FIG. 26A, the origin (center element) of each of the 512 respective 3 x 3 sub-matrices is the data word Dx+1,y+1 of FIG. 24, where y (the column number) is the NPU 126 number and x (the row number) is the current weight RAM 124 row number read by the multiply-accumulate instruction at address 1 of the program of FIG. 26A (this row number is also initialized by the initialize NPU instruction at address 0, incremented at each of the multiply-accumulate instructions at addresses 3 and 5, and updated by the decrement instruction at address 9). Thus, for each loop of the program, the 512 NPUs 126 compute 512 convolutions and write the 512 convolution results back to a specified row of weight RAM 124. In this description, edge handling is omitted for simplicity, although it should be noted that using the collective rotating feature of the NPUs 126 will cause two of the columns to wrap from one vertical edge of the data matrix 2406 (e.g., of the image, in the case of image processing) to the other vertical edge (e.g., from the left edge to the right edge or vice versa). The loop body will now be described.
Address 1 holds a multiply-accumulate instruction that specifies row 0 of data RAM 122 and implicitly uses the current weight RAM 124 row, which is preferably held within sequencer 128 (and is initialized to zero by the instruction at address 0 for the first pass through the loop body). That is, the instruction at address 1 causes each NPU 126 to read its respective word from row 0 of data RAM 122 and its respective word from the current weight RAM 124 row, and to perform a multiply-accumulate operation on the two words. Thus, for example, NPU 5 multiplies C0,0 by Dx,5 (where "x" is the current weight RAM 124 row), adds the result to the value 217 of accumulator 202, and writes the sum back to accumulator 202.
Address 2 holds a multiply-accumulate instruction that specifies incrementing the data RAM 122 row (i.e., to row 1) and then reading the row from the incremented address of data RAM 122. The instruction also specifies rotating the value in the multiplexing register 705 of each NPU 126 to the adjacent NPU 126, which in this case is the row of data matrix 2406 values just read from weight RAM 124 in response to the instruction at address 1. In the embodiments of FIGs. 24 through 26, the NPUs 126 are configured to rotate the values of the multiplexing registers 705 to the left, i.e., from NPU J to NPU J-1, rather than from NPU J to NPU J+1 as described above with respect to FIGs. 3, 7, and 19. It should be appreciated that in embodiments in which the NPUs 126 are configured to rotate to the right, the architectural program may write the coefficient values of the convolution kernel 2402 to data RAM 122 in a different order (e.g., rotated about its center column) to accomplish a similar convolution result. Furthermore, the architectural program may perform additional preprocessing of the convolution kernel 2402 (e.g., transposition) as needed. Further, the instruction specifies a count value of 2. Thus, the instruction at address 2 causes each NPU 126 to read its respective word from row 1 of data RAM 122, receive the rotated word into multiplexing register 705, and perform a multiply-accumulate operation on the two words. Because the count value is 2, the instruction also causes each NPU 126 to repeat the foregoing operation. That is, sequencer 128 increments the data RAM 122 row address 123 (i.e., to row 2), and each NPU 126 reads its respective word from row 2 of data RAM 122, receives the rotated word into multiplexing register 705, and performs a multiply-accumulate operation on the two words. Thus, for example, assuming the current weight RAM 124 row is 27, after executing the instruction at address 2, NPU 5 will have accumulated into its accumulator 202 the product of C0,1 and D27,6 and the product of C0,2 and D27,7. Thus, upon completion of the instructions at addresses 1 and 2, the product of C0,0 and D27,5, the product of C0,1 and D27,6, and the product of C0,2 and D27,7 will have been accumulated into accumulator 202, along with all other accumulated values from previous passes through the loop body.
The instructions at addresses 3 and 4 perform operations similar to those at addresses 1 and 2, but, by virtue of the weight RAM 124 row increment indicator, on the next row of weight RAM 124 and on the next three rows of data RAM 122 (i.e., rows 3 through 5). That is, for example, for NPU 5, upon completion of the instructions at addresses 1 through 4, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, and the product of C1,2 and D28,7 will have been accumulated into accumulator 202, along with all other accumulated values from previous passes through the loop body.
The instructions at addresses 5 and 6 perform operations similar to those at addresses 3 and 4, but on the next row of weight RAM 124 and the next three rows (i.e., rows 6 through 8) of data RAM 122. That is, for example, for NPU 5, upon completion of the instructions at addresses 1 through 6, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, the product of C1,2 and D28,7, the product of C2,0 and D29,5, the product of C2,1 and D29,6, and the product of C2,2 and D29,7 will have been accumulated into accumulator 202, along with all other accumulated values from previous passes through the loop body. That is, upon completion of the instructions at addresses 1 through 6, and assuming the weight RAM 124 row was 27 at the beginning of the loop body, NPU 5 will have convolved the convolution kernel 2402 with, for example, the following 3 x 3 sub-matrix:
D27,5  D27,6  D27,7
D28,5  D28,6  D28,7
D29,5  D29,6  D29,7
More generally, upon completion of the instructions at addresses 1 through 6, each of the 512 NPUs 126 will have convolved the convolution kernel 2402 with the following 3 x 3 sub-matrix:
Dr,n      Dr,n+1      Dr,n+2
Dr+1,n    Dr+1,n+1    Dr+1,n+2
Dr+2,n    Dr+2,n+1    Dr+2,n+2
where r is the row address value of the weight RAM 124 at the beginning of the loop body, and n is the number of the NPU 126.
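By way of illustration, the per-pass computation just described may be modeled as follows (an illustrative sketch, not the NNU program itself; the function name and toy matrix are assumptions):

# Reference model of one loop-body pass: the 3x3 convolution with origin
# D[r+1][n+1], with columns wrapping at the edge as the rotator wraps words.
def convolve_at(D, K, r, n):
    cols = len(D[0])
    return sum(K[i][j] * D[r + i][(n + j) % cols]
               for i in range(3) for j in range(3))

D = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
K_identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
assert convolve_at(D, K_identity, 0, 0) == D[1][1]   # picks out the origin element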
The instruction at address 7 causes the value 217 of accumulator 202 to be passed through AFU 212. The pass-through function passes a word whose size in bits (16 bits in this example) equals that of the words read from data RAM 122 and weight RAM 124. Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below. Alternatively, rather than a pass-through activation function, a divide activation function may be specified, such as described herein with respect to FIGs. 29A and 30, that divides the value 217 of accumulator 202 by a divisor using one of the "dividers" 3014/3016 of FIG. 30. For example, in the case of a convolution kernel 2402 with coefficients such as the sixteenth-valued coefficients of the Gaussian blur kernel described above, the activation function instruction at address 7 may specify a divide activation function (e.g., divide by 16) rather than a pass-through function. Alternatively, the architectural program may perform the divide-by-16 on the convolution kernel 2402 coefficients before writing them to data RAM 122 and adjust the location of the binary point of the convolution kernel 2402 values accordingly, e.g., using the data binary point 2922 of FIG. 29A, described below.
The instruction at address 8 writes the output of AFU 212 to the row in weight RAM 124 specified by the current value of the output row register initialized by the instruction at address 0 and incremented on each pass through the loop by means of an increment indicator within the instruction.
As may be determined from the example of FIGs. 24 through 26 with the 3 x 3 convolution kernel 2402, the NPUs 126 read weight RAM 124 approximately every three clock cycles to read a row of the data matrix 2406, and write the convolution result matrix to weight RAM 124 approximately every 12 clock cycles. Additionally, assuming an embodiment that includes a write and read buffer such as buffer 1704 of FIG. 17, processor 100 reads and writes weight RAM 124 concurrently with the NPU 126 reads and writes, such that buffer 1704 performs approximately one write and one read of weight RAM 124 every 16 clock cycles, to write the data matrices 2406 and read the convolution result matrices, respectively. Thus, approximately half of the bandwidth of weight RAM 124 is consumed by the hybrid manner in which NNU 121 performs the convolution kernel operation. Although this example involves a 3 x 3 convolution kernel 2402, other sizes of convolution kernel may be employed, such as 2x2, 4x4, 5x5, 6x6, 7x7, 8x8 and other matrices, in which case the NNU program will vary. In the case of a larger convolution kernel, because the counts of the rotating versions of the multiply-accumulate instructions (e.g., those at addresses 2, 4, and 6 of the program of FIG. 26A, along with the additional such instructions a larger convolution kernel would need) are larger, the percentage of time the NPUs 126 read weight RAM 124 is smaller, and therefore the percentage of the weight RAM 124 bandwidth consumed is also smaller.
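The bandwidth estimate above can be checked with simple arithmetic; the following illustrative calculation merely restates the access rates given in the text:

# Approximate weight RAM 124 accesses per clock in the 3x3 example.
npu_reads = 1 / 3        # NPUs read a data-matrix row about every 3 clocks
npu_writes = 1 / 12      # a result row is written about every 12 clocks
buffer_io = 2 / 16       # buffer 1704: ~1 read + ~1 write per ~16 clocks
total = npu_reads + npu_writes + buffer_io
assert abs(total - 0.54) < 0.01      # roughly half of one access per clock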
Alternatively, rather than writing the convolution results back to different rows of weight RAM 124 (e.g., rows 900 through 1299 and 1300 through 1699), the architectural program configures the NNU program to overwrite rows of the input data matrix 2406 once they are no longer needed. For example, in the case of a 3 x 3 convolution kernel, rather than writing the data matrix 2406 to rows 0 through 399 of weight RAM 124, the architectural program writes it to rows 2 through 401, and the NNU program is configured to write the convolution results to weight RAM 124 starting at row 0 and incrementing the output row on each pass through the loop body. In this way, the NNU program overwrites only rows that are no longer needed. For example, after the first pass through the loop body (or, more specifically, after execution of the instruction at address 1, which loads row 0 of weight RAM 124), the data of row 0 can be overwritten, whereas the data of rows 1 through 3 are needed for the second pass through the loop body and therefore are not overwritten by the first pass; similarly, after the second pass through the loop body, the data of row 1 can be overwritten, whereas the data of rows 2 through 4 are needed for the third pass and are not overwritten by the second pass; and so on. In such an embodiment, the height of each data matrix 2406 (data block) may be larger (e.g., 800 rows), resulting in fewer data blocks.
Alternatively, rather than writing the convolution results back to weight RAM 124, the architectural program configures the NNU program to write the convolution results back to rows of data RAM 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program reads the results from data RAM 122 as NNU 121 writes them (e.g., using the most-recently-written data RAM 122 row address 2606 of FIG. 26B, described below). This alternative may be advantageous in an embodiment in which weight RAM 124 is single-ported and data RAM 122 is dual-ported.
From the operation of NNU 121 according to the embodiment of FIGS. 24-26A, it can be seen that each execution of the program of FIG. 26A requires approximately 5000 clock cycles, and thus, the convolution of the entire 2560 × 1600 data array 2404 of FIG. 24 requires approximately 100000 clock cycles, significantly less than the number of clock cycles required to perform the same task in a conventional manner.
Referring now to FIG. 26B, a block diagram is shown illustrating certain fields of status register 127 of NNU 121 of FIG. 1 according to one embodiment. Status register 127 includes: a field 2602 indicating the address of the weight RAM 124 row most recently written by the NPUs 126; a field 2606 indicating the address of the data RAM 122 row most recently written by the NPUs 126; a field 2604 indicating the address of the weight RAM 124 row most recently read by the NPUs 126; and a field 2608 indicating the address of the data RAM 122 row most recently read by the NPUs 126. This enables the architectural program executing on processor 100 to determine the progress of NNU 121 as it reads and/or writes data RAM 122 and/or weight RAM 124. With this capability, together with the option of overwriting the input data matrix as described above (or of writing the results to data RAM 122 as described above), the data array 2404 of FIG. 24 may be processed as, for example, 5 data blocks of 512 x 1600 rather than 20 data blocks of 512 x 400, as follows. Processor 100 writes the first 512 x 1600 data block into weight RAM 124 starting at row 2 and starts the NNU program (with a loop count of 1600 and an initialized weight RAM 124 output row of 0). As NNU 121 executes the NNU program, processor 100 monitors the weight RAM 124 output location/address in order to (1) read (using MFNN instructions 1500) the rows of weight RAM 124 holding valid convolution results written by NNU 121 (starting at row 0), and (2) overwrite those rows, once their results have been read, with the second 512 x 1600 data matrix 2406 (starting at row 2), so that when NNU 121 completes the NNU program on the first 512 x 1600 data block, processor 100 can immediately update the NNU program as needed and start it again to process the second 512 x 1600 data block. This process is repeated three more times for the remaining three 512 x 1600 data blocks to achieve high utilization of NNU 121.
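By way of illustration, the progress-monitoring flow that fields 2602/2606 enable might be sketched as follows; mfnn_read_status and mfnn_read_row are hypothetical stand-ins for MFNN-instruction-based accessor routines and are not an API of the embodiments:

# Hypothetical host-side sketch: drain each convolution result row as soon
# as field 2602 shows the NNU has written it, freeing the row to be
# overwritten with the next data block.
def drain_results(mfnn_read_status, mfnn_read_row, num_rows):
    results = []
    for row in range(num_rows):
        while mfnn_read_status("wram_last_written_row") < row:  # field 2602
            pass                      # architectural program polls status register 127
        results.append(mfnn_read_row(row))
    return results

# Toy stand-ins: pretend the NNU has already written rows 0..3.
rows = {r: [r] * 512 for r in range(4)}
assert drain_results(lambda field: 3, rows.get, 4)[2][0] == 2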
Advantageously, in one embodiment, AFU 212 has the ability to efficiently perform effective division on the value 217 of accumulator 202, as described in more detail below with respect to fig. 29A, 29B, and 30. For example, the activate function NNU instruction that divides the value 217 of the accumulator 202 by 16 may be used for the Gaussian blur matrix described above.
Although the convolution kernel 2402 used in the example of fig. 24 is a small static convolution kernel that is applied to the entire data array 2404, in other embodiments, the convolution kernel may be a large matrix with unique weights associated with different data values of the data array 2404, such as is common in convolutional neural networks. When NNU 121 is used in this manner, the architecture program can interchange the data matrix with the locations of the convolution kernels, i.e., place the data matrix within data RAM 122 and the convolution kernels within weight RAM 124, and the number of rows that can be processed by a particular execution of the NNU program can be relatively small.
Referring now to FIG. 27, a block diagram is shown illustrating an example of populating weight RAM 124 of FIG. 1 with input data on which NNU 121 of FIG. 1 performs a pooling operation. The pooling operation, performed by a pooling layer of an artificial neural network, reduces the dimensionality of an input data matrix (e.g., an image or a convolved image) by taking sub-regions, or sub-matrices, of the input matrix and computing the maximum or average value of each sub-matrix; these maxima or averages become the result, or pooled, matrix. In the example of FIGs. 27 and 28, the pooling operation computes the maximum value of each sub-matrix. Pooling operations are particularly useful for artificial neural networks that, for example, perform object classification or detection. Generally, a pooling operation effectively reduces the size of the input matrix by the factor of the number of elements in the examined sub-matrix; in particular, it reduces the input matrix in each dimension by the number of elements of the sub-matrix in the corresponding dimension. In the example of FIG. 27, the input data is a 512 x 1600 matrix of wide words (e.g., 16 bits) stored in rows 0 through 1599 of weight RAM 124. In FIG. 27, each word is labeled with its row and column position; e.g., the word at row 0, column 0 is labeled D0,0; the word at row 0, column 1 is labeled D0,1; the word at row 0, column 2 is labeled D0,2; and so on through the word at row 0, column 511, labeled D0,511. Similarly, the word at row 1, column 0 is labeled D1,0; the word at row 1, column 1 is labeled D1,1; the word at row 1, column 2 is labeled D1,2; and so on through the word at row 1, column 511, labeled D1,511; and likewise through the word at row 1599, column 0, labeled D1599,0, the word at row 1599, column 1, labeled D1599,1, and so on through the word at row 1599, column 511, labeled D1599,511.
Referring now to FIG. 28, a program listing is shown of an NNU program that performs a pooling operation on the input data matrix of FIG. 27 and writes the results back to weight RAM 124. In the example of FIG. 28, the pooling operation computes the maximum value of each 4 x 4 sub-matrix of the input data matrix. The program loops a number of times through a loop body of the instructions at addresses 1 through 10. The initialize NPU instruction at address 0 specifies the number of times each NPU 126 executes the loop body; in the example of FIG. 28 the loop count value is 400, and the loop instruction at the end of the loop (at address 11) decrements the current loop count value and, if the result is non-zero, transfers control back to the top of the loop body (i.e., to the instruction at address 1). The NNU program effectively treats the input data matrix in weight RAM 124 as 400 mutually exclusive groups of four adjacent rows, namely rows 0-3, rows 4-7, rows 8-11, and so on through rows 1596-1599. Each group of four adjacent rows includes 128 4 x 4 sub-matrices, namely the 4 x 4 sub-matrices of elements formed by the intersection of the group's four rows and four adjacent columns (columns 0-3, columns 4-7, columns 8-11, and so on through columns 508-511). Of the 512 NPUs 126, every fourth NPU 126 (128 NPUs 126 in all) performs a pooling operation on its corresponding 4 x 4 sub-matrix, while the other three-quarters of the NPUs 126 are unused. More specifically, NPUs 0, 4, 8, and so on through NPU 508 each perform a pooling operation on their respective 4 x 4 sub-matrix, whose leftmost column number corresponds to the NPU number and whose bottom row corresponds to the current weight RAM 124 row value, which is initialized to zero by the initialize instruction at address 0 and incremented by 4 on each iteration of the loop body, as described in more detail below. The 400 iterations of the loop body correspond to the number of groups of 4 x 4 sub-matrices in the input data matrix of FIG. 27 (the 1600 rows of the input data matrix divided by 4). The initialize NPU instruction also clears accumulator 202. Preferably, the loop instruction at address 11 also clears accumulator 202. Alternatively, the maxwacc instruction at address 1 may specify that accumulator 202 be cleared.
For each iteration of the loop body, the 128 used NPUs 126 concurrently perform 128 pooling operations on the 128 respective 4 x 4 sub-matrices of the current four-row group of the input data matrix. More specifically, each pooling operation determines the maximum-valued element of the sixteen elements of its 4 x 4 sub-matrix. In the embodiment of FIG. 28, for each NPU y of the 128 used NPUs 126, the lower-left element of its 4 x 4 sub-matrix is element Dx,y of FIG. 27, where x is the current weight RAM 124 row number at the beginning of the loop body, which is read by the maxwacc instruction at address 1 of the program of FIG. 28 (the row number is also initialized by the initialize NPU instruction at address 0 and incremented each time the maxwacc instructions at addresses 3, 5 and 7 are executed). Thus, for each iteration of the loop body, the 128 used NPUs 126 write back to the designated row of weight RAM 124 the respective maximum-valued elements of the 128 respective 4 x 4 sub-matrices of the current row group. The loop body is described below.
At address 1 is a maxwacc instruction that implicitly uses the current weight RAM 124 row, which is preferably held in sequencer 128 (and which is initialized to zero for the first pass through the loop body by the instruction at address 0). The instruction at address 1 causes each NPU 126 to read its corresponding word from the current row of weight RAM 124, compare that word to the accumulator 202 value 217, and store the maximum of the two values in accumulator 202. Thus, for example, NPU 8 determines the maximum of the accumulator 202 value 217 and data word Dx,8 (where "x" is the current weight RAM 124 row) and writes the maximum value back to accumulator 202.
At address 2 is a maxwacc instruction that specifies rotating the value in the multiplexing register 705 of each NPU 126 (in this case, the row of input data matrix values just read from weight RAM 124 in response to the instruction at address 1) to the adjacent NPU 126. In the embodiment of FIGS. 27-28, the NPUs 126 are configured to rotate the multiplexing register 705 values to the left, i.e., from NPU J to NPU J-1, as described above with respect to FIGS. 24-26. Furthermore, the instruction specifies a count value of 3. Thus, the instruction at address 2 causes each NPU 126 to receive the rotated word into its multiplexing register 705, determine the maximum of the rotated word and the accumulator 202 value 217, and then repeat this operation two more times. That is, each NPU 126 three times receives the rotated word into its multiplexing register 705 and determines the maximum of the rotated word and the accumulator 202 value 217. Thus, for example, assuming the current weight RAM 124 row at the beginning of the loop body is 36, and taking NPU 8 as an example, after execution of the instructions at addresses 1 and 2, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the beginning of the loop body and the four weight RAM 124 words D36,8, D36,9, D36,10 and D36,11.
The operation performed by the maxwacc instructions at addresses 3 and 4 is similar to that of the instructions at addresses 1 and 2; however, by virtue of the weight RAM 124 row increment indicator, the maxwacc instructions at addresses 3 and 4 operate on the next row of weight RAM 124. That is, assuming the current weight RAM 124 row at the beginning of the loop body is 36, and taking NPU 8 as an example, after completion of the instructions at addresses 1 through 4, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the beginning of the loop body and the eight weight RAM 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10 and D37,11.
The operation performed by the maxwacc instructions at addresses 5 through 8 is similar to that of the instructions at addresses 3 and 4; however, the instructions at addresses 5 through 8 operate on the next two rows of weight RAM 124. That is, assuming the current weight RAM 124 row at the beginning of the loop body is 36, and taking NPU 8 as an example, after completion of the instructions at addresses 1 through 8, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the beginning of the loop body and the sixteen weight RAM 124 words D36,8 through D36,11, D37,8 through D37,11, D38,8 through D38,11 and D39,8 through D39,11. That is, after completion of the instructions at addresses 1 through 8, NPU 8 will have determined the maximum of the following 4 x 4 sub-matrix:
D39,8   D39,9   D39,10   D39,11
D38,8   D38,9   D38,10   D38,11
D37,8   D37,9   D37,10   D37,11
D36,8   D36,9   D36,10   D36,11
More generally, upon completion of the instructions at addresses 1 through 8, each of the 128 used NPUs 126 will have determined the maximum of the following 4 x 4 sub-matrix:
Dr+3,n   Dr+3,n+1   Dr+3,n+2   Dr+3,n+3
Dr+2,n   Dr+2,n+1   Dr+2,n+2   Dr+2,n+3
Dr+1,n   Dr+1,n+1   Dr+1,n+2   Dr+1,n+3
Dr,n     Dr,n+1     Dr,n+2     Dr,n+3
where r is the weight RAM 124 row address value at the beginning of the loop body and n is the number of the NPU 126.
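By way of illustration (this is not NNU code), a plain-Python reference of the computation the FIG. 28 program performs might look like the following; the function name and the use of NumPy are assumptions made for the sketch.

```python
import numpy as np

def nnu_max_pool_4x4(data):
    """Reference model of the FIG. 28 program's result: the maximum over
    each 4 x 4 sub-matrix of a (rows x 512) input matrix."""
    rows, cols = data.shape                  # e.g., 1600 x 512
    out = np.empty((rows // 4, cols // 4), dtype=data.dtype)
    for r in range(0, rows, 4):              # one loop-body iteration per group
        for n in range(0, cols, 4):          # handled by NPU n (n = 0, 4, 8, ...)
            out[r // 4, n // 4] = data[r:r + 4, n:n + 4].max()
    return out

pooled = nnu_max_pool_4x4(np.arange(1600 * 512).reshape(1600, 512))
assert pooled.shape == (400, 128)            # one result row per 4 input rows
assert pooled[0, 0] == 3 * 512 + 3           # max of D0..3, columns 0..3
```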
The instruction at address 9 passes the accumulator 202 value 217 through AFU 212. The pass-through function passes a word whose size (in bits) equals that of a word read from weight RAM 124 (i.e., 16 bits in this example). Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below.
The instruction at address 10 writes the accumulator 202 value 217 to the row of weight RAM 124 specified by the current value of the output row register, which was initialized by the instruction at address 0 and which is incremented on each pass through the loop body by virtue of an increment indicator in the instruction. More specifically, the instruction at address 10 writes a wide word (e.g., 16 bits) of accumulator 202 to weight RAM 124. Preferably, the instruction writes the 16 bits as specified by the output binary point 2916, as described in more detail below with respect to FIGS. 29A and 29B.
It can be seen that each row written to weight RAM 124 by an iteration of the loop body contains holes of invalid data. That is, wide words 1 through 3, 5 through 7, 9 through 11, and so on through wide words 509 through 511 of result 133 are invalid, or unused. In one embodiment, AFU 212 includes a multiplexer that enables packing of the results into adjacent words of a row buffer (such as row buffer 1104 of FIG. 11) for write-back to the output weight RAM 124 row. Preferably, the activation function instruction specifies the number of words in each hole, and that number is used to control the multiplexer's packing of the results. In one embodiment, the number of words in a hole may be specified as a value from 2 to 6, in order to pack the output of pooling of 3 x 3, 4 x 4, 5 x 5, 6 x 6 or 7 x 7 sub-matrices. Alternatively, the architectural program executing on processor 100 reads the resulting sparse (i.e., hole-containing) result rows from weight RAM 124 and performs the packing function using other execution units 112, such as a media unit using architectural pack instructions, e.g., x86 SSE instructions. Advantageously, in a concurrent manner similar to that described above and exploiting the hybrid nature of NNU 121, the architectural program executing on processor 100 may read status register 127 to monitor the most recently written row of weight RAM 124 (e.g., field 2602 of FIG. 26B) in order to read a sparse result row, pack it, and write it back to the same row of weight RAM 124, so that it is ready for use as the input data matrix of the next layer of the neural network, such as a convolution layer or a classic neural network layer (i.e., a multiply-accumulate layer). Furthermore, although the embodiments described herein perform the pooling operation on 4 x 4 sub-matrices, the NNU program of FIG. 28 may be modified to perform the pooling operation on sub-matrices of other sizes, such as 3 x 3, 5 x 5, 6 x 6 or 7 x 7.
It can also be seen that the number of result rows written to weight RAM 124 is one-quarter of the number of rows of the input data matrix. Finally, in this example the data RAM 122 is not used. However, alternatively, the data RAM 122, rather than the weight RAM 124, may be used to perform the pooling operation.
In the example of FIGS. 27 and 28, the pooling operation computes the maximum value of each sub-region. However, the program of FIG. 28 may be modified to compute the average value of each sub-region, e.g., by replacing the maxwacc instructions with sumwacc instructions (which add the weight word to the accumulator 202 value 217) and changing the activation function instruction at address 9 to divide the accumulated result (preferably via reciprocal multiplication, as described below) by the number of elements of each sub-region, which is sixteen in this example.
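Continuing the reference sketch above (again an illustration, not NNU code), the average-pooling modification corresponds to the following; since 16 is a power of two, the division can be realized by a right shift, which the hardware could also achieve via the shift amount described below rather than via reciprocal multiplication.

```python
def nnu_avg_pool_4x4(data):
    """Reference model of the sumwacc/divide modification: the average over
    each 4 x 4 sub-matrix of a (rows x 512) input matrix."""
    rows, cols = data.shape
    out = np.empty((rows // 4, cols // 4), dtype=np.int64)
    for r in range(0, rows, 4):
        for n in range(0, cols, 4):
            acc = int(data[r:r + 4, n:n + 4].sum())   # sumwacc accumulation
            out[r // 4, n // 4] = acc >> 4            # divide by 16 elements
    return out
```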
From the operation of NNU 121 according to the embodiment of FIGS. 27 and 28, it can be seen that each execution of the program of FIG. 28 performs the pooling operation on the entire 512 x 1600 data matrix of FIG. 27 in approximately 6000 clock cycles, which may be considerably fewer than the number of clock cycles required to perform a similar task by conventional means.
Alternatively, rather than writing the results back to weight RAM 124, the architectural program configures the NNU program to write the results of the pooling operation back to rows of data RAM 122, and the architectural program reads the results from data RAM 122 as NNU 121 writes them (e.g., using the address of the most recently written data RAM 122 row 2606 of FIG. 26B). This alternative may be advantageous in an embodiment with a single-ported weight RAM 124 and a dual-ported data RAM 122.
Fixed-point arithmetic with user-supplied binary points, full-precision fixed-point accumulation, user-specified reciprocal value, random rounding of accumulator values, and selectable activation/output functions
In general, hardware units that perform arithmetic in digital computing devices are commonly divided into "integer" units and "floating point" units, according to whether they perform arithmetic operations on integer or floating point numbers. A floating point number has a magnitude (or mantissa) and an exponent, and usually a sign. The exponent indicates the location of the radix point (typically the binary point) relative to the magnitude. By contrast, an integer has no exponent, only a magnitude, and usually a sign. A floating point unit has the advantage of enabling a programmer to work with numbers drawn from an enormous range of values, with the hardware taking care of adjusting the exponent values of the numbers as needed, without programmer intervention. For example, assume the two floating point numbers 0.111 x 10^29 and 0.81 x 10^31 are multiplied. (A decimal, i.e., base-10, example is used here, although floating point units most commonly work with base-2 floating point numbers.) The floating point unit automatically takes care of multiplying the mantissas, adding the exponents, and then normalizing the result back to a value of 0.8991 x 10^59. For another example, assume the same two floating point numbers are added. The floating point unit automatically takes care of aligning the binary points of the mantissas before adding them to produce a resulting sum of 0.81111 x 10^31.
However, the complexity associated with floating point units, and the consequent increases in size, power consumption, clocks per instruction and/or cycle time, are well known. Indeed, for this reason many devices (e.g., embedded processors, microcontrollers, and relatively low-cost and/or low-power microprocessors) do not include a floating point unit. As may be observed from the examples above, some of the complexities of floating point units include: logic that performs the exponent calculations associated with floating point addition and multiplication/division (i.e., adders that add/subtract the exponents of the operands to produce the resulting exponent value for floating point multiplication/division, and subtractors that subtract the operand exponents to determine the binary point alignment shift amount for floating point addition); shifters that accomplish the binary point alignment of the mantissas for floating point addition; and shifters that normalize floating point results. Furthermore, floating point units typically require: logic to perform rounding of floating point results; logic to convert between integer and floating point formats and between different floating point precision formats (e.g., extended precision, double precision, single precision, half precision); leading-zero and leading-one detectors; and logic to deal with special floating point numbers, such as denormalized numbers, NaNs (not-a-number values) and infinities.
Additionally, there is the disadvantage that correctness verification of a floating point unit becomes significantly more complex because of the greatly increased numerical space that must be verified in the design, which may lengthen the product development cycle and time to market. Still further, as described above, floating point arithmetic implies the storage and use of separate mantissa and exponent fields for each floating point number involved in a computation, which may increase the amount of storage required and/or reduce precision given an equal amount of storage relative to storing integers. Many of these disadvantages are avoided by integer units that perform arithmetic operations on integers.
Programmers frequently write programs that process fractional numbers, i.e., numbers that are not whole numbers. Such programs may run on processors that have no floating point unit or, even if the processor has one, on which the integer instructions executed by the processor's integer units are faster. To take advantage of the potential performance benefit of integer units, programmers employ the well-known technique of fixed-point arithmetic on fixed-point numbers. Such programs include instructions that execute on integer units to process integer, or fixed-point, data. The software knows the data is fractional and includes instructions (e.g., alignment shifts) that perform operations on the integer data to account for the fact that the data is actually fractional. Essentially, the fixed-point software manually performs some or all of the functions that a floating point unit performs.
As used herein, a "fixed point" number (or value or operand or input or output) is a number whose stored bits are understood to contain bits representing the fractional part of the fixed point number (referred to herein as "fractional bits"). The fixed-point number of storage bits is contained within a memory or register, such as an 8-bit or 16-bit word within the memory or register. In addition, the stored bits of the fixed-point number are all used to represent a magnitude, and in some cases, one of the bits is used to represent a sign, but the fixed-point number has no stored bit used to represent an exponent of the number. Further, the number of decimal places or binary decimal point positions of the fixed-point number is specified at the time of storage, which is different from the storage bits of the fixed-point number, and the number of decimal places or binary decimal point positions is indicated in a shared or global manner for a fixed-point number set to which the fixed-point number belongs (e.g., a set of input operands, a set of accumulated values, or a set of output results of an array of processing units, etc.).
Advantageously, in the embodiments described herein, the ALUs are integer units, but the activation function units include fixed-point arithmetic hardware assist, or acceleration. This enables the ALU portions to be smaller and faster, which facilitates the use of more ALUs within a given die space. This implies more neurons per unit of die space, which is particularly advantageous in a neural network unit.
Furthermore, advantageously, in contrast to floating point numbers, which require exponent storage bits for each individual number, embodiments are described in which fixed point numbers are expressed with an indication of the number of storage bits that are fractional bits for an entire set of numbers, the indication residing in a single, shared storage that globally applies to all the numbers of the set, e.g., the set of inputs to a series of operations, the set of accumulated values of a series of operations, or the set of outputs. Preferably, a user of the NNU is able to specify the number of fractional storage bits for each such set of numbers. Thus, it should be understood that although in many contexts (e.g., common mathematics) the term "integer" refers to a signed whole number, i.e., a number without a fractional portion, in the present context the term "integer" may refer to a number having a fractional portion. Furthermore, in the present context, the term "integer" is intended to distinguish from floating point numbers, for which a portion of the bits of their individual storage space is used to express an exponent of the floating point number. Similarly, integer arithmetic operations, such as the integer multiplies, adds or compares performed by an integer unit, assume the operands do not have exponents; therefore, the integer elements of an integer unit, such as integer multipliers, integer adders and integer comparators, do not include logic to deal with exponents, e.g., they do not shift mantissas to align binary points for addition or comparison operations, and they do not add exponents for multiplication operations.
Additionally, embodiments described herein include a large hardware integer accumulator to accumulate a large series of integer operations (e.g., on the order of 1000 multiply-accumulates) without loss of precision. This enables the NNU to avoid dealing with floating point numbers while retaining full precision in the accumulated values, without saturating them or producing inaccurate results due to overflow. Once the series of integer operations has accumulated a result into the full-precision accumulator, fixed-point hardware assist performs the necessary scaling and saturating to convert the full-precision accumulated value to an output value, using the user-specified indications of the number of fractional bits of the accumulated value and of the desired number of fractional bits of the output value, as described in more detail below.
Preferably, and as described in more detail below, the activation function unit may selectively perform random rounding on the accumulator value as it compresses it from its full-precision form for use as an input to an activation function, or for pass-through. Finally, the NPUs may be instructed to apply different activation functions and/or to output a variety of different forms of the accumulator value, as dictated by the different needs of a given layer of a neural network.
Referring now to FIG. 29A, a block diagram illustrating an embodiment of the control register 127 of FIG. 1 is shown. The control register 127 may comprise a plurality of control registers 127. As shown, the control register 127 includes the following fields: configuration 2902, signed data 2912, signed weight 2914, data binary point 2922, weight binary point 2924, ALU function 2926, round control 2932, activation function 2934, reciprocal 2942, shift amount 2944, output RAM 2952, output binary point 2954, and output command 2956. The control register 127 values may be written both by an MTNN instruction 1400 and by an instruction of an NNU program, such as an initialize instruction.
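For orientation (a minimal sketch, not part of the patent), these fields can be pictured as a record; the field names follow the text, while the types and encodings are assumptions made for the sketch:

```python
from dataclasses import dataclass

@dataclass
class ControlRegister127:
    configuration: str        # 2902: 'narrow', 'wide', or 'funnel'
    signed_data: bool         # 2912: data words are signed
    signed_weight: bool       # 2914: weight words are signed
    data_binary_point: int    # 2922: fractional bits in data words
    weight_binary_point: int  # 2924: fractional bits in weight words
    alu_function: str         # 2926: e.g., multiply-accumulate, max, pass
    round_control: str        # 2932: 'none', 'nearest', or 'random'
    activation_function: str  # 2934: e.g., sigmoid, tanh, pass-through
    reciprocal: int           # 2942: user-specified reciprocal value
    shift_amount: int         # 2944: right-shift for divide by power of two
    output_ram: str           # 2952: 'data' or 'weight'
    output_binary_point: int  # 2954: fractional bits in output result
    output_command: int       # 2956: one of the predetermined values below
```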
The configuration 2902 value specifies whether NNU 121 is in a narrow, wide or funnel configuration, as described above. The configuration 2902 implies the size of the input words received from data RAM 122 and weight RAM 124. In the narrow and funnel configurations, the input word size is narrow (e.g., 8 or 9 bits), whereas in the wide configuration the input word size is wide (e.g., 12 or 16 bits). The configuration 2902 also implies the size of the output result 133, which is the same as the input word size.
The signed data value 2912, if true, indicates that the data words received from data RAM 122 are signed values; if false, it indicates they are unsigned values. The signed weight value 2914, if true, indicates that the weight words received from weight RAM 124 are signed values; if false, it indicates they are unsigned values.
The data binary point 2922 value indicates the location of the binary point of the data words received from data RAM 122. Preferably, the data binary point 2922 value indicates the number of bit positions from the right at which the binary point sits. In other words, data binary point 2922 indicates how many of the least significant bits of a data word are fractional bits, i.e., to the right of the binary point. Similarly, the weight binary point 2924 value indicates the location of the binary point of the weight words received from weight RAM 124. Preferably, when the ALU function 2926 is a multiply-accumulate or an output accumulator, NPU 126 determines the number of bits to the right of the binary point of the value held in accumulator 202 as the sum of the data binary point 2922 and the weight binary point 2924. Thus, for example, if the data binary point 2922 value is 5 and the weight binary point 2924 value is 3, the value in accumulator 202 has 8 bits to the right of the binary point. When the ALU function 2926 is a sum/maximum of the accumulator and a data/weight word, or a pass-through of a data/weight word, NPU 126 determines the number of bits to the right of the binary point of the accumulator 202 value as the data binary point 2922 or the weight binary point 2924, respectively. In an alternate embodiment, described below with respect to FIG. 29B, a single accumulator binary point 2923 is specified, rather than individual data binary point 2922 and weight binary point 2924 values.
The ALU function 2926 specifies the function performed by ALU 204 of NPU 126. As described above, the ALU functions 2926 may include, but are not limited to: multiply data word 209 and weight word 203 and accumulate the product with accumulator 202; sum accumulator 202 and weight word 203; sum accumulator 202 and data word 209; maximum of accumulator 202 and data word 209; maximum of accumulator 202 and weight word 203; output accumulator 202; pass through data word 209; pass through weight word 203; output zero. In one embodiment, the ALU function 2926 is specified by an NNU initialize instruction and used by ALU 204 in response to an execute instruction (not shown). In one embodiment, the ALU function 2926 is specified by individual NNU instructions, such as the multiply-accumulate and maxwacc instructions described above.
The round control 2932 specifies the form of rounding to be used by the rounder 3004 (of FIG. 30). In one embodiment, the specifiable rounding modes include, but are not limited to: no rounding, round to nearest, and random rounding. Preferably, processor 100 includes a random bit source 3003 (of FIG. 30) that generates random bits 3005, which are sampled and used to perform the random rounding in order to reduce the likelihood of a rounding bias. In one embodiment, when the round bit is one and the sticky bit is zero, NPU 126 rounds up if the sampled random bit 3005 is true and does not round up if the random bit 3005 is false. In one embodiment, the random bit source 3003 generates the random bits 3005 by sampling random electrical characteristics of processor 100, such as thermal noise across a semiconductor diode or resistor, although other embodiments are contemplated.
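As a behavioral sketch (not the hardware), the random-rounding rule just stated can be written as follows. The text only defines the tie case (round bit one, sticky bit zero); treating the other cases as ordinary round-to-nearest is an assumption of this sketch, as is the function name.

```python
import random

def stochastic_round_shift(value, shift, random_bit=None):
    """Right-shift `value` by `shift` bits with random rounding: the round
    bit is the most significant shifted-out bit, the sticky bit the OR of
    the remaining shifted-out bits; a sampled random bit decides exact ties."""
    if shift <= 0:
        return value
    kept = value >> shift
    round_bit = (value >> (shift - 1)) & 1
    sticky = (value & ((1 << (shift - 1)) - 1)) != 0
    if round_bit and sticky:                 # more than halfway: round up
        return kept + 1
    if round_bit and not sticky:             # exactly halfway: random tie-break
        if random_bit is None:
            random_bit = random.getrandbits(1)
        return kept + random_bit
    return kept                              # less than halfway: truncate

# Shifting out the bits 100 (round = 1, sticky = 0) rounds up when the
# sampled random bit is 1, as in the example of FIG. 31 below.
assert stochastic_round_shift(0b11101010100, 3, random_bit=1) == 0b11101011
```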
The activation function 2934 specifies the function applied to the accumulator 202 value 217 to produce the output 133 of NPU 126. As described above and in more detail below, the activation functions 2934 include, but are not limited to: a sigmoid function; a hyperbolic tangent function; a soft plus function; a rectify function; divide by a specified power of two; multiply by a user-specified reciprocal value to accomplish an effective division; pass-through of the full accumulator; and pass-through of the accumulator at the standard size, described in more detail below. In one embodiment, the activation function is specified by an NNU activation function instruction. Alternatively, the activation function is specified by an initialize instruction and applied in response to an output instruction (e.g., the write AFU output instruction at address 4 of FIG. 4); in this embodiment, the activation function instruction at address 3 of FIG. 4 is subsumed into the output instruction.
The reciprocal 2942 value specifies a value that is multiplied by the accumulator 202 value 217 to accomplish an effective division of the accumulator 202 value 217. That is, the user specifies as the reciprocal 2942 value the reciprocal of the actually desired divisor. This is useful, for example, in conjunction with convolution or pooling operations, as described herein. Preferably, the user specifies the reciprocal 2942 value in two parts, as described in more detail below with respect to FIG. 29C. In one embodiment, the control register 127 includes a field (not shown) that enables the user to specify division by one of a plurality of built-in divisor values whose sizes are those of commonly used convolution kernels, e.g., 9, 25, 36 or 49. In such an embodiment, AFU 212 may store the reciprocals of the built-in divisors for multiplication by the accumulator 202 value 217.
The shift amount 2944 specifies the number of bits by which a shifter of AFU 212 right-shifts the accumulator 202 value 217 to accomplish a division by a power of two. This may also be useful in conjunction with convolution kernels whose size is a power of two.
The value of output RAM 2952 specifies which of data RAM 122 and weight RAM 124 is to receive output result 133.
The output binary point 2954 value indicates the location of the binary point of the output result 133. Preferably, the output binary point 2954 value indicates the number of bit positions from the right at which the binary point of the output result 133 sits. In other words, output binary point 2954 indicates how many of the least significant bits of the output result 133 are fractional bits, i.e., to the right of the binary point. AFU 212 performs rounding, compression, saturation and size conversion based on the output binary point 2954 value (and, in most cases, also on the data binary point 2922 value, the weight binary point 2924 value, the activation function 2934 value and/or the configuration 2902 value).
The output command 2956 controls various aspects of the output result 133. In one embodiment, AFU 212 employs the notion of a standard size, which is twice the size (in bits) of the width specified by configuration 2902. Thus, for example, if configuration 2902 implies that the size of the input words received from data RAM 122 and weight RAM 124 is 8 bits, the standard size is 16 bits; in another example, if configuration 2902 implies that the size of the input words is 16 bits, the standard size is 32 bits. As described herein, the size of the accumulator 202 is larger (e.g., the narrow accumulator 202B is 28 bits and the wide accumulator 202A is 41 bits) in order to maintain full precision of the intermediate computations (e.g., of 1024 and 512 NNU multiply-accumulate instructions, respectively). Consequently, the accumulator 202 value 217 is larger (in bits) than the standard size, and for most values of the activation function 2934 (except the full-accumulator pass-through), AFU 212 (e.g., the CCS 3008 described below with respect to FIG. 30) compresses the accumulator 202 value 217 down to a value of the standard size. A first predetermined value of the output command 2956 instructs AFU 212 to perform the specified activation function 2934 to produce an internal result the same size as the original input words (i.e., half the standard size) and to output that internal result as output result 133. A second predetermined value of the output command 2956 instructs AFU 212 to perform the specified activation function 2934 to produce an internal result twice the size of the original input words (i.e., the standard size) and to output the lower half of that internal result as output result 133; and a third predetermined value of the output command 2956 instructs AFU 212 to output the upper half of the standard-size internal result as output result 133. A fourth predetermined value of the output command 2956 instructs AFU 212 to output the raw least significant word (whose width is specified by configuration 2902) of accumulator 202 as output result 133; a fifth predetermined value instructs AFU 212 to output the raw middle significant word of accumulator 202 as output result 133; and a sixth predetermined value instructs AFU 212 to output the raw most significant word of accumulator 202 as output result 133, as described above with respect to FIGS. 8 through 10. As described above, outputting the full accumulator 202 size or the standard-size internal result may be advantageous, for example, to enable other execution units 112 of processor 100 to perform activation functions such as the softmax activation function.
Although the fields of FIG. 29A (as well as FIGS. 29B and 29C) are described as being located in control register 127, in other embodiments, one or more fields may be located in other portions of NNU 121. Preferably, a number of fields may be included in the NNU instruction itself and decoded by the sequencer 128 to generate the micro-operations 3416 (of FIG. 34) for controlling the ALU 204 and/or the AFU 212. In addition, these fields may be included within a micro-operation 3414 (of FIG. 34) stored in the media register 118, the micro-operation 3414 controlling the ALU 204 and/or AFU 212. In such embodiments, the use of an initialize NNU instruction may be minimized, and in other embodiments, the initialize NNU instruction is eliminated.
As described above, an NNU instruction may specify that an ALU operation be performed on memory operands (e.g., words from data RAM 122 and/or weight RAM 124) or on a rotated operand (e.g., from the multiplexing registers 208/705). In one embodiment, an NNU instruction may also specify an operand as the registered output of an activation function (e.g., the output of register 3038 of FIG. 30). Additionally, as described above, an NNU instruction may specify incrementing the current row address of data RAM 122 or weight RAM 124. In one embodiment, the NNU instruction may specify an immediate signed integer delta value that is added to the current row to accomplish incrementing or decrementing by a value other than one.
Referring now to FIG. 29B, a block diagram illustrating an embodiment of the control register 127 of FIG. 1 according to an alternate embodiment is shown. The control register 127 of FIG. 29B is similar to the control register 127 of FIG. 29A; however, the control register 127 of FIG. 29B includes an accumulator binary point 2923. The accumulator binary point 2923 indicates the binary point location of accumulator 202. Preferably, the accumulator binary point 2923 value indicates the number of bit positions from the right at which the binary point sits. In other words, the accumulator binary point 2923 indicates how many of the least significant bits of accumulator 202 are fractional bits, i.e., to the right of the binary point. In this embodiment, the accumulator binary point 2923 is specified explicitly, rather than being determined implicitly as described above with respect to the embodiment of FIG. 29A.
Referring now to FIG. 29C, a block diagram illustrating an embodiment of the reciprocal 2942 of FIG. 29A stored as two parts is shown, according to one embodiment. The first part 2962 is a shift value that indicates the number of suppressed leading zeros 2962 in the true reciprocal value that the user desires to be multiplied by the accumulator 202 value 217. The number of leading zeros is the number of consecutive zeros immediately to the right of the binary point. The second part 2964 is the leading-zero-suppressed reciprocal 2964 value, i.e., the true reciprocal value with all leading zeros removed. In one embodiment, the number of suppressed leading zeros 2962 is stored as four bits and the leading-zero-suppressed reciprocal 2964 value is stored as an 8-bit unsigned value.
To illustrate by example, assume the user desires the accumulator 202 value 217 to be multiplied by the reciprocal of 49. The binary representation of the reciprocal of 49 expressed with 13 fractional bits is 0.0000010100111, which has five leading zeros. In this case, the user populates the number of suppressed leading zeros 2962 with the value 5 and the leading-zero-suppressed reciprocal 2964 with the value 10100111. After the reciprocal multiplier ("divider A") 3014 (of FIG. 30) multiplies the accumulator 202 value 217 by the leading-zero-suppressed reciprocal 2964 value, it right-shifts the resulting product by the number of suppressed leading zeros 2962. Such an embodiment may advantageously accomplish high precision with a relatively small number of bits used to express the reciprocal 2942 value.
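A minimal arithmetic sketch of this two-part scheme (the function name and the fixed 8-bit reciprocal width are assumptions of the sketch):

```python
def reciprocal_divide(acc, suppressed_zeros, recip_bits, recip_width=8):
    """Multiply by the leading-zero-suppressed reciprocal (an unsigned
    fraction of recip_width bits), then shift right by the suppressed
    leading-zero count, approximating acc / divisor."""
    product = acc * recip_bits                    # "divider A" 3014
    return product >> (recip_width + suppressed_zeros)

# Divide by 49: 1/49 = 0.0000010100111...b, i.e., five suppressed leading
# zeros and leading-zero-suppressed reciprocal bits 10100111.
q = reciprocal_divide(49_000, 5, 0b10100111)
assert abs(q - 1000) <= 2    # 49000/49 = 1000, up to reciprocal truncation
```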
Referring now to FIG. 30, a block diagram illustrating an embodiment of AFU 212 of FIG. 2 in more detail is shown. AFU 212 includes: the control register 127 of FIG. 1; a positive form converter (PFC) and output binary point aligner (OBPA) 3002 that receives the accumulator 202 value 217; a rounder 3004 that receives the accumulator 202 value 217 and an indication of the number of bits shifted out by the OBPA 3002; a random bit source 3003, as described above, that generates random bits 3005; a first multiplexer 3006 that receives the output of the PFC and OBPA 3002 and the output of the rounder 3004; a standard-size compressor (CCS) and saturator 3008 that receives the output of the first multiplexer 3006; a bit selector and saturator 3012 that receives the output of the CCS and saturator 3008; a rectifier 3018 that receives the output of the CCS and saturator 3008; a reciprocal multiplier 3014 that receives the output of the CCS and saturator 3008; a right shifter 3016 that receives the output of the CCS and saturator 3008; a hyperbolic tangent (tanh) module 3022 that receives the output of the bit selector and saturator 3012; a sigmoid module 3024 that receives the output of the bit selector and saturator 3012; a soft plus module 3026 that receives the output of the bit selector and saturator 3012; a second multiplexer 3032 that receives the outputs of the tanh module 3022, the sigmoid module 3024, the soft plus module 3026, the rectifier 3018, the reciprocal multiplier 3014 and the right shifter 3016, as well as the passed-through standard-size output 3028 of the CCS and saturator 3008; a sign restorer 3034 that receives the output of the second multiplexer 3032; a size converter and saturator 3036 that receives the output of the sign restorer 3034; a third multiplexer 3037 that receives the output of the size converter and saturator 3036 and the accumulator output 217; and an output register 3038 that receives the output of the multiplexer 3037 and whose output is the result 133 of FIG. 1.
The PFC and OBPA 3002 receive the accumulator 202 value 217. Preferably, the accumulator 202 value 217 is a full-precision value, as described above. That is, the accumulator 202 has a sufficient number of storage bits to hold an accumulated value that is the sum, generated by the integer adder 244, of a series of products generated by the integer multiplier 242, without discarding any bits of the individual products of the multiplier 242 or of the sums of the adder 244, so that no precision is lost. Preferably, the accumulator 202 has at least a sufficient number of bits to hold the maximum number of accumulations of products that an NNU 121 is programmable to perform. For example, referring to the program of FIG. 4, the maximum number of product accumulations the NNU 121 is programmable to perform when in the wide configuration is 512, and the accumulator 202 bit width is 41. For another example, referring to the program of FIG. 20, the maximum number of product accumulations the NNU 121 is programmable to perform when in the narrow configuration is 1024, and the accumulator 202 bit width is 28. In general, the full-precision accumulator 202 includes at least Q bits, where Q = M + log2(P), where M is the bit width of the integer products of the multiplier 242 (e.g., 16 bits for a narrow multiplier 242, or 32 bits for a wide multiplier 242) and P is the maximum allowable number of the integer products that may be accumulated into the accumulator 202; in the wide-configuration example, Q = 32 + log2(512) = 41. Preferably, the maximum number of product accumulations is specified via the programming specifications to the programmer of NNU 121. In one embodiment, the sequencer 128 enforces a maximum value of the count of a multiply-accumulate NNU instruction (e.g., the instruction at address 2 of FIG. 4) of, e.g., 511, on the assumption of one previous multiply-accumulate instruction that loads the row of data/weight words 206/207 from the data/weight RAM 122/124 (e.g., the instruction at address 1 of FIG. 4).
Advantageously, including an accumulator 202 with a bit width large enough to accumulate the maximum allowed number of full-precision values may simplify the design of the ALU 204 portion of the NPU 126. In particular, it may alleviate the need for logic to saturate the sums generated by the integer adder 244, which a smaller accumulator would require because it would overflow, and which would also require keeping track of the accumulator's binary point location to determine whether an overflow had occurred in order to know whether saturation was needed. To illustrate by example the problem with a design that includes a non-full-precision accumulator and instead includes saturating logic to handle overflows of the non-full-precision accumulator, assume the following.
(1) The range of the data word values is between 0 and 1 and all of the storage bits are used to store fractional bits. The range of the weight word values is between -8 and +8 and all but three of the storage bits are used to store fractional bits. And the range of the accumulated values input to a hyperbolic tangent activation function is between -8 and +8 and all but three of the storage bits are used to store fractional bits.
(2) The bit width of the accumulator is non-full-precision (e.g., only the bit width of the products).
(3) The final accumulated value would be somewhere between -8 and +8 (e.g., +4.2), assuming the accumulator were full-precision; however, the products before a certain "point A" in the series tend to be positive much more frequently, whereas the products after point A tend to be negative much more frequently.
In this situation, an inaccurate result (i.e., a result other than +4.2) might be obtained. This is because, at some point before point A, the accumulator may be saturated to its maximum value of +8 when it should have had a larger value, e.g., +8.2, causing loss of the remaining +0.2. The accumulator could even remain at the saturated value for more of the product accumulations, resulting in the loss of even more positive value. Thus, the final value of the accumulator could be a smaller number than it would be if the accumulator had a full-precision bit width (i.e., smaller than +4.2).
The PFC 3002 converts the accumulator 202 value 217 to a positive form if the value is negative, and generates an additional bit that indicates whether the original value was positive or negative, which bit is passed down the AFU 212 pipeline along with the value. Converting to a positive form simplifies subsequent operations of AFU 212. For example, it enables only positive values to enter the tanh module 3022 and the sigmoid module 3024, which simplifies those modules. Additionally, it simplifies the rounder 3004 and the saturator 3008.
The OBPA 3002 shifts, or scales, the positive-form value right to align it with the output binary point 2954 specified in the control register 127. Preferably, the OBPA 3002 calculates the shift amount as the difference obtained by subtracting the number of fractional bits of the output (e.g., specified by the output binary point 2954) from the number of fractional bits of the accumulator 202 value 217 (e.g., specified by the accumulator binary point 2923, or by the sum of the data binary point 2922 and the weight binary point 2924). Thus, for example, if the accumulator 202 binary point 2923 is 8 (as in the embodiment above) and the output binary point 2954 is 3, the OBPA 3002 shifts the positive-form value right by 5 bits to produce the result provided to the multiplexer 3006 and to the rounder 3004.
The rounder 3004 rounds the accumulator 202 value 217. Preferably, the rounder 3004 generates a rounded version of the positive-form value generated by the PFC and OBPA 3002 and provides the rounded version to the multiplexer 3006. The rounder 3004 rounds according to the round control 2932 described above, which may specify random rounding using the random bit 3005, as described herein. The multiplexer 3006 selects one of its inputs (i.e., either the positive-form value from the PFC and OBPA 3002 or the rounded version from the rounder 3004) based on the round control 2932 (which, as described herein, may specify random rounding) and provides the selected value to the CCS and saturator 3008. Preferably, if the round control 2932 specifies no rounding, the multiplexer 3006 selects the output of the PFC and OBPA 3002, and otherwise selects the output of the rounder 3004. Other embodiments are contemplated in which the AFU 212 performs additional rounding. For example, in one embodiment, the bit selector 3012 rounds based on the lost low-order bits when it compresses the output bits of the CCS and saturator 3008 (described below). For another example, in one embodiment, the product of the reciprocal multiplier 3014 (described below) is rounded. For another example, in one embodiment, the size converter 3036 rounds when converting to the proper output size (described below), which may involve losing low-order bits used in the rounding determination.
The CCS 3008 compresses the output value of the multiplexer 3006 to the standard size. Thus, for example, if the NPU 126 is in the narrow or funnel configuration 2902, the CCS 3008 compresses the 28-bit output value of the multiplexer 3006 to 16 bits; whereas if the NPU 126 is in the wide configuration 2902, the CCS 3008 compresses the 41-bit output value of the multiplexer 3006 to 32 bits. However, before compressing to the standard size, if the pre-compressed value is larger than the largest value expressible in the standard form, the saturator 3008 saturates the pre-compressed value to the largest value expressible in the standard form. For example, if any of the bits of the pre-compressed value to the left of the most significant standard-form bit has a value of 1, the saturator 3008 saturates to the maximum value (e.g., to all ones).
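As a sketch of this saturate-then-compress step (an illustration for the positive-form values the CCS receives, not the hardware; the function name is an assumption):

```python
def compress_to_standard(value, standard_bits=16):
    """If any bit left of the standard-size field is set, saturate to all
    ones; otherwise keep the low standard_bits bits unchanged."""
    max_standard = (1 << standard_bits) - 1
    return max_standard if value > max_standard else value

# Narrow configuration: 28-bit accumulator values compress to 16 bits.
assert compress_to_standard(0x0001234) == 0x1234    # fits: passes through
assert compress_to_standard(0x1234567) == 0xFFFF    # overflows: saturates
```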
Preferably, the tanh module 3022, sigmoid module 3024 and soft plus module 3026 comprise lookup tables, e.g., programmable logic arrays (PLA), read-only memories (ROM), combinational logic gates, and so forth. In one embodiment, in order to simplify and reduce the size of these modules 3022/3024/3026, they are provided an input value that has a 3.4 form, i.e., three whole bits and four fractional bits; that is, the input value has four bits to the right of the binary point and three bits to the left of the binary point. These values are chosen because at the extremes of the input value range (-8, +8) of the 3.4 form, the output values asymptotically approach their minimum/maximum values. However, other embodiments are contemplated that place the binary point at a different location, e.g., in a 4.3 form or a 2.5 form. The bit selector 3012 selects the bits of the CCS and saturator 3008 output that satisfy the 3.4 form criterion, which involves compression, i.e., some bits are lost, since the standard form has a larger number of bits. However, before selecting/compressing the CCS and saturator 3008 output value, if the pre-compressed value is greater than the largest value expressible in the 3.4 form, the saturator 3012 saturates the pre-compressed value to the largest value expressible in the 3.4 form. For example, if any of the bits of the pre-compressed value to the left of the most significant 3.4-form bit has a value of 1, the saturator 3012 saturates to the maximum value (e.g., to all ones).
The tanh module 3022, sigmoid module 3024 and soft plus module 3026 perform their respective activation functions (described above) on the 3.4-form value output by the CCS and saturator 3008 to produce a result. Preferably, the result of the tanh module 3022 and the sigmoid module 3024 is a 7-bit result in a 0.7 form, i.e., zero whole bits and seven fractional bits; that is, the output value has seven bits to the right of the binary point. Preferably, the result of the soft plus module 3026 is a 7-bit result in a 3.4 form, i.e., the same form as the input to the module 3026. Preferably, the outputs of the tanh module 3022, sigmoid module 3024 and soft plus module 3026 are extended to the standard form (e.g., with leading zeros added as necessary) and aligned so as to have the binary point specified by the output binary point 2954 value.
The rectifier 3018 generates a rectified version of the output value of the CCS and saturator 3008. That is, if the output value of the CCS and saturator 3008 (whose sign, as described above, is piped down with it) is negative, the rectifier 3018 outputs a value of zero; otherwise, the rectifier 3018 outputs its input value. Preferably, the output of the rectifier 3018 is in the standard form and has the binary point specified by the output binary point 2954 value.
The reciprocal multiplier 3014 multiplies the output of the CCS and saturator 3008 by the user-specified reciprocal value specified in the reciprocal 2942 value to produce its standard-size product, which is effectively the quotient of the output of the CCS and saturator 3008 and the divisor whose reciprocal the 2942 value is. Preferably, the output of the reciprocal multiplier 3014 is in the standard form and has the binary point specified by the output binary point 2954 value.
The right shifter 3016 shifts the output of the CCS and saturator 3008 by the user-specified number of bits specified in the shift amount value 2944 to produce its standard-size quotient. Preferably, the output of the right shifter 3016 is in the standard form and has the binary point specified by the output binary point 2954 value.
The multiplexer 3032 selects the appropriate input specified by the activation function 2934 value and provides the selection to the sign restorer 3034, which converts the positive-form output of the multiplexer 3032 to a negative form, e.g., to two's-complement form, if the original accumulator 202 value 217 was a negative value.
The size converter 3036 converts the output of the sign restorer 3034 to the proper size based on the output command 2956 value, described above with respect to FIG. 29A. Preferably, the output of the sign restorer 3034 has a binary point specified by the output binary point 2954 value. Preferably, for the first predetermined value of the output command 2956, the size converter 3036 discards the bits of the upper half of the sign restorer 3034 output. Furthermore, if the output of the sign restorer 3034 is positive and exceeds the maximum value expressible in the word size specified by configuration 2902, or is negative and is less than the minimum value expressible in the word size, the saturator 3036 saturates its output to the maximum/minimum value expressible in the word size, respectively. For the second and third predetermined values, the size converter 3036 passes the sign restorer 3034 output through.
The multiplexer 3037 selects either the size converter and saturator 3036 output or the accumulator 202 output 217, based on the output command 2956, to provide to the output register 3038. More specifically, for the first and second predetermined values of the output command 2956, the multiplexer 3037 selects the lower word (whose size is specified by configuration 2902) of the output of the size converter and saturator 3036. For the third predetermined value, the multiplexer 3037 selects the upper word of the output of the size converter and saturator 3036. For the fourth predetermined value, the multiplexer 3037 selects the lower word of the raw accumulator 202 value 217; for the fifth predetermined value, the multiplexer 3037 selects the middle word of the raw accumulator 202 value 217; and for the sixth predetermined value, the multiplexer 3037 selects the upper word of the raw accumulator 202 value 217. As described above, preferably the AFU 212 pads the upper bits of the upper word of the raw accumulator 202 value 217 with zeros.
Referring now to FIG. 31, an example of operation of the AFU 212 of FIG. 30 is shown. As shown, the configuration 2902 is set to the narrow configuration of the NPUs 126. Additionally, the signed data 2912 and signed weight 2914 values are true. Additionally, the data binary point 2922 value indicates that the binary point for the data RAM 122 words is located 7 bits from the right, and an example value of the first data word received by one of the NPUs 126 is shown as 0.1001110. Still further, the weight binary point 2924 value indicates that the binary point for the weight RAM 124 words is located 3 bits from the right, and an example value of the first weight word received by one of the NPUs 126 is shown as 00001.010.
The 16-bit product of the first data word and the first weight word (which product is accumulated with the initial zero value of the accumulator 202) is shown as 000000.1100001100. Because the data binary point 2922 is 7 and the weight binary point 2924 is 3, the implied accumulator 202 binary point is located 10 bits from the right. In the case of the narrow configuration, the accumulator 202 is 28 bits wide, in the example embodiment. In the example, the accumulator 202 value 217 after all the ALU operations (e.g., all 1024 multiply-accumulates of FIG. 20) are performed is shown as 000000000000000001.1101010100.
The output binary point 2954 value indicates that the output binary point is located 7 bits from the right. Therefore, after passing through the OBPA 3002 and the CCS 3008, the accumulator 202 value 217 is scaled, rounded and compressed to the standard-form value of 000000001.1101011. In the example, the output binary point location indicates 7 fractional bits and the accumulator 202 binary point location indicates 10 fractional bits. Therefore, the OBPA 3002 calculates a difference of 3 and scales the accumulator 202 value 217 by shifting it right 3 bits. This is indicated in FIG. 31 by the loss of the 3 least significant bits (binary 100) of the accumulator 202 value 217. Further in the example, the round control 2932 value indicates that random rounding is used, and in the example it is assumed that the sampled random bit 3005 is true. Consequently, according to the description above, the least significant bit was rounded up, because the round bit of the accumulator 202 value 217 (the most significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) was a one, and the sticky bit (the Boolean OR of the 2 least significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) was a zero.
In the example, the activation function 2934 indicates that a sigmoid function is to be used. Consequently, the bit selector 3012 selects the bits of the standard-form value such that the input to the sigmoid module 3024 has three whole bits and four fractional bits, as described above, i.e., the value 001.1101, as shown. The sigmoid module 3024 outputs a value that is put into the standard form, namely the value 000000000.1101110, as shown.
The output command 2956 of the example specifies the first predetermined value, i.e., to output a result of the word size indicated by configuration 2902, which in this case is a narrow word (8 bits). Consequently, the size converter 3036 converts the standard-form sigmoid output value to an 8-bit quantity having an implied binary point located 7 bits from the right, yielding an output value of 01101110, as shown.
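The arithmetic of this example can be checked in plain Python (fractional-bit counts tracked by hand; stochastic_round_shift is the sketch given earlier):

```python
data    = 0b01001110     # 0.1001110  with 7 fractional bits -> 0.609375
weight  = 0b00001010     # 00001.010  with 3 fractional bits -> 1.25
product = data * weight  # 7 + 3 = 10 fractional bits
assert product == 0b0000001100001100      # 000000.1100001100 = 0.76171875

acc = 0b11101010100      # final accumulator value, 10 fractional bits
out = stochastic_round_shift(acc, 10 - 7, random_bit=1)   # align to 7 bits
assert out == 0b11101011                  # 000000001.1101011, as in FIG. 31
```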
Referring now to FIG. 32, a second example of operation of the AFU 212 of FIG. 30 is shown. The example of FIG. 32 illustrates operation of the AFU 212 when the activation function 2934 indicates a pass-through of the accumulator 202 value 217 at the standard size. As shown, the configuration 2902 is set to the narrow configuration of the NPUs 126.
In the example, the accumulator 202 is 28 bits wide, and the accumulator 202 binary point is located 10 bits from the right (either because the sum of the data binary point 2922 and the weight binary point 2924 is 10 according to one embodiment, or because the accumulator binary point 2923 is explicitly specified as having a value of 10 according to an alternate embodiment, as described above). In the example, FIG. 32 shows the accumulator 202 value 217 after all the ALU operations are performed, namely 000001100000011011.1101111010.
In this example, the value of the output binary point 2954 indicates that there are 4 bits to the right of the output binary point. Thus, after passing through the OBPA 3002 and the CCS 3008, the value 217 of the accumulator 202 is saturated and compressed to the canonical form value 111111111111.1111 as shown, which is received by the multiplexer 3032 as the canonical size pass-through value 3028.
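A minimal sketch of this saturate-and-compress behavior, for positive values only and with the canonical width and fraction-bit counts of this example (not the patent's circuit), is:

```python
def saturate_compress(acc_value, acc_frac_bits, out_frac_bits, canon_bits=16):
    """Align the binary point, then clamp to the largest canonical value."""
    shifted = acc_value >> (acc_frac_bits - out_frac_bits)   # 10 - 4 = 6 bits here
    return min(shifted, (1 << canon_bits) - 1)               # saturate on overflow

acc = 0b000001100000011011_1101111010                        # 28 bits, 10 fraction bits
assert saturate_compress(acc, 10, 4) == 0b111111111111_1111  # 111111111111.1111
```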
In this example, two output commands 2956 are shown. The first output command 2956 specifies the second predetermined value, namely to output the lower word of the canonical form size. Since the size indicated by the configuration 2902 is a narrow word (8 bits), which implies a canonical size of 16 bits, the size converter 3036 selects the lower 8 bits of the canonical size pass-through value 3028 to produce the 8-bit value 11111111 as shown. The second output command 2956 specifies the third predetermined value, namely to output the upper word of the canonical form size. Thus, the size converter 3036 selects the upper 8 bits of the canonical size pass-through value 3028 to produce the 8-bit value 11111111 as shown.
Referring now to FIG. 33, a third example of the operation of the AFU 212 of FIG. 30 is shown. The example of FIG. 33 illustrates the operation of the AFU 212 when the activation function 2934 indicates that the entire raw accumulator 202 value 217 is to be passed through. As shown, the configuration 2902 is set to the wide configuration (e.g., 16-bit input words) of the NPUs 126.
In this example, the accumulator 202 is 41 bits wide, and the binary point of the accumulator 202 is located such that 8 bits lie to its right (because, as described above, the sum of the data binary point 2912 and the weight binary point 2914 is 8 according to one embodiment, or because the accumulator binary point 2923 is explicitly specified as having a value of 8 according to another embodiment). FIG. 33 shows the value 217 of the accumulator 202 after all the ALU operations have been performed, namely 001000000000000000001100000011011.11011110.
In this example, three output commands 2956 are shown. The first output command 2956 specifies the fourth predetermined value, namely to output the lower word of the raw accumulator 202 value; the second output command 2956 specifies the fifth predetermined value, namely to output the middle word of the raw accumulator 202 value; and the third output command 2956 specifies the sixth predetermined value, namely to output the upper word of the raw accumulator 202 value. Since the size indicated by the configuration 2902 is a wide word (16 bits), FIG. 33 shows that in response to the first output command 2956, the multiplexer 3037 selects the 16-bit value 0001101111011110; in response to the second output command 2956, the multiplexer 3037 selects the 16-bit value 0000000000011000; and in response to the third output command 2956, the multiplexer 3037 selects the 16-bit value 0000000001000000.
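The three raw pass-through output commands of this example amount to simple 16-bit word selection, as the following illustrative Python fragment shows (values from FIG. 33; the accumulator is zero-extended above bit 40 for the upper word):

```python
acc = 0b001000000000000000001100000011011_11011110  # 41-bit raw accumulator value

low = acc & 0xFFFF           # fourth predetermined value: 0001101111011110
mid = (acc >> 16) & 0xFFFF   # fifth predetermined value:  0000000000011000
high = (acc >> 32) & 0xFFFF  # sixth predetermined value:  0000000001000000

assert (low, mid, high) == (0b0001101111011110, 0b0000000000011000, 0b0000000001000000)
```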
As described above, the NNU 121 advantageously operates on integer data rather than floating-point data. This advantageously simplifies each NPU 126, or at least the ALU 204 portion. For example, the ALU 204 need not include the adder that would be required in a floating-point implementation to add the exponents of the multiplicands of the multiplier 242. Similarly, the ALU 204 need not include the shifter that would be required in a floating-point implementation to align the binary points of the addends of the adder 234. As those skilled in the art will appreciate, floating-point units are generally very complex; these are merely simplified examples for the ALU 204, and other simplifications are enjoyed by the instant integer embodiments with hardware fixed-point assist that enables a user to specify the relevant binary points. The fact that the ALUs 204 are integer units may advantageously result in smaller (and faster) NPUs 126 than a floating-point embodiment, which further facilitates the incorporation of a large array of NPUs 126 into the NNU 121. The AFU 212 portion handles the scaling and saturating of the value 217 of the accumulator 202 based on the (preferably user-specified) number of fraction bits of the accumulated value and the number of fraction bits desired in the output value. Advantageously, any additional complexity, and the attendant increase in size, power and/or time, in the fixed-point hardware assist of the AFU 212 may be amortized by sharing an AFU 212 among the ALU 204 portions, as described with respect to the embodiment of FIG. 11, for example, since the number of AFUs 1112 may be reduced in a shared embodiment.
Advantageously, the embodiments described herein enjoy many of the benefits associated with the reduced complexity of hardware integer arithmetic units (as compared to floating-point arithmetic units), while still providing arithmetic operations on fractional numbers, i.e., numbers having a binary point. An advantage of floating-point arithmetic is that it accommodates data whose individual values may fall anywhere within a very wide range (limited in practice only by the size of the exponent range, which may be very large); that is, each floating-point number has its own potentially unique exponent value. However, the embodiments described herein recognize and exploit the fact that there are certain applications in which the input data are highly parallel and their values lie within a relatively narrow range, such that the "exponent" of all the parallel values can be the same. Thus, these embodiments enable a user to specify the binary point position once for all the input values and/or accumulated values. Similarly, by recognizing and exploiting the similar range characteristics of the parallel outputs, these embodiments enable a user to specify the binary point position once for all the output values. An artificial neural network is one example of such an application, although embodiments of the invention may also be used to perform computations for other applications. By specifying the binary point position once for the inputs rather than for each individual input number, the embodiments may use storage more efficiently (e.g., require less memory) than a floating-point implementation, and/or, for a similar amount of memory, provide greater precision, since the bits that would serve as an exponent in a floating-point implementation can instead be used to specify more precision in the magnitude.
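The storage-efficiency point can be made concrete with a small sketch: one user-specified binary point position quantizes an entire set of parallel values, so no per-value exponent bits are stored. The following Python fragment is illustrative only; the helper names and sample values are not from the patent.

```python
def quantize(values, frac_bits):
    """Encode decimals as plain integers sharing one implied binary point."""
    return [round(v * (1 << frac_bits)) for v in values]

def dequantize(ints, frac_bits):
    return [i / (1 << frac_bits) for i in ints]

weights = [0.375, -0.125, 0.8125]         # parallel values in a narrow range
encoded = quantize(weights, frac_bits=7)  # [48, -16, 104]: no exponents stored
assert dequantize(encoded, frac_bits=7) == weights
```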
Advantageously, the embodiments also recognize the potential loss of precision (e.g., overflow or loss of the less significant fraction bits) that may be experienced while performing accumulation over a large series of integer operations, and provide a solution, primarily in the form of an accumulator large enough to avoid loss of precision. For example, accumulating 1024 of the 16-bit products of the narrow configuration can grow the sum by up to 10 additional bits, which the 28-bit accumulator 202 accommodates without overflow.
Direct execution of NNU micro-operations
Referring now to FIG. 34, a block diagram illustrating partial details of the processor 100 and the NNU 121 of FIG. 1 is shown. The NNU 121 includes the pipeline stages 3401 of the NPUs 126. The pipeline stages 3401, separated by staging registers, include combinatorial logic, such as Boolean logic gates, multiplexers, adders, multipliers, comparators, etc., that implements the operations of the NPUs 126 as described herein. The pipeline stages 3401 receive a micro-operation 3418 from a multiplexer 3402. The micro-operation 3418 flows down the pipeline stages 3401 and controls their combinatorial logic. The micro-operation 3418 is a collection of bits. Preferably, the micro-operation 3418 includes the bits of the memory address 123 of the data RAM 122, the bits of the memory address 125 of the weight RAM 124, the bits of the memory address 131 of the program memory 129, the bits of the control signals 213/713 of the multiplexing registers 208/705, the bits of the control signals 803 of the multiplexer 802, and the many fields of the control register 127 (e.g., of FIGS. 29A-29C), among others. In one embodiment, the micro-operation 3418 includes approximately 120 bits. The multiplexer 3402 receives micro-operations from three different sources and selects one of them as the micro-operation 3418 to provide to the pipeline stages 3401.
One source of micro-operations for multiplexer 3402 is sequencer 128 of FIG. 1. The sequencer 128 decodes NNU instructions received from the program memory 129 and in response generates micro-operations 3416 that are provided to a first input of the multiplexer 3402.
The second source of micro-operations to the multiplexer 3402 is a decoder 3404 that receives microinstructions 105 from the reservation stations 108 of FIG. 1, along with operands from the GPRs 116 and the media registers 118. Preferably, as described above, the microinstructions 105 are generated by the instruction translator 104 in response to translating MTNN instructions 1400 and MFNN instructions 1500. The microinstructions 105 may include an immediate field that specifies a particular function (specified by an MTNN instruction 1400 or an MFNN instruction 1500), such as starting or stopping the execution of a program in the program memory 129, directly executing a micro-operation from the media registers 118, or reading/writing a memory of the NNU 121 as described above. The decoder 3404 decodes the microinstructions 105 and in response generates a micro-operation 3412 that is provided to the second input of the multiplexer 3402. Preferably, in response to some functions 1432/1532 of an MTNN instruction 1400/MFNN instruction 1500, such as writing the control register 127, starting the execution of a program in the program memory 129, pausing the execution of a program in the program memory 129, waiting for a program in the program memory 129 to complete execution, reading the status register 127, and resetting the NNU 121, it is not necessary for the decoder 3404 to generate a micro-operation 3412 to send down the pipeline 3401.
The third source of micro-operations to the multiplexer 3402 is the media registers 118 themselves. Preferably, as described above with respect to FIG. 14, an MTNN instruction 1400 may specify a function that instructs the NNU 121 to directly execute a micro-operation 3414 provided from the media registers 118 to the third input of the multiplexer 3402. Directly executing micro-operations 3414 provided by the architectural media registers 118 may be particularly useful for testing (e.g., built-in self test (BIST)) and debugging of the NNU 121.
Preferably, the decoder 3404 generates a mode indicator 3422 that controls the selection of the multiplexer 3402. When an MTNN instruction 1400 specifies a function to start running a program from the program memory 129, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3416 from the sequencer 128, until either an error occurs or the decoder 3404 encounters an MTNN instruction 1400 that specifies a function to stop running the program from the program memory 129. When an MTNN instruction 1400 specifies a function that instructs the NNU 121 to directly execute a micro-operation 3414 provided from a media register 118, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3414 from the specified media register 118. Otherwise, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3412 from the decoder 3404.
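The selection behavior of the multiplexer 3402 can be summarized in a short behavioral model. The following Python sketch is illustrative only; the mode encodings are hypothetical, since the patent does not specify them.

```python
FROM_SEQUENCER, FROM_MEDIA_REG, FROM_DECODER = range(3)  # hypothetical encodings

def select_micro_op(mode, uop_3416, uop_3412, uop_3414):
    """Behavioral model of multiplexer 3402 selecting micro-operation 3418."""
    if mode == FROM_SEQUENCER:   # a program in program memory 129 is running
        return uop_3416          # micro-operation from sequencer 128
    if mode == FROM_MEDIA_REG:   # MTNN function: execute directly from media regs
        return uop_3414          # micro-operation from media registers 118
    return uop_3412              # otherwise: micro-operation from decoder 3404
```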
Variable rate neural network unit
A situation may arise in which the NNU 121 runs a program and then sits idle, waiting for the processor 100 to do something it needs done before the NNU 121 can run its next program. For example, assume a situation similar to that described with respect to FIGS. 3 through 6A, in which the NNU 121 runs a multiply-accumulate-activation function program (which may also be referred to as a feed-forward neural network layer program) two or more times in succession. It takes the processor 100 significantly longer to write the 512KB of weight values that the NNU program will use on its next run into the weight RAM 124 than it takes the NNU 121 to run the program. Stated alternatively, the NNU 121 may run the program in a relatively short amount of time and then sit idle while the processor 100 finishes writing the next weight values into the weight RAM 124 for the next program run. This situation is illustrated visually in FIG. 36A, which is described in more detail below. In such a situation, it may be advantageous to run the NNU 121 at a slower rate and take longer to run the program, so that the energy the NNU 121 requires to run the program is spread over a longer period of time, which may tend to keep the NNU 121, and therefore the processor 100, at a lower temperature. This situation is referred to as mitigation mode and is illustrated visually in FIG. 36B, which is described in more detail below.
Referring now to FIG. 35, a block diagram illustrating a processor 100 having a variable rate NNU 121 is shown. The processor 100 is similar in many respects to the processor 100 of FIG. 1, and elements with like reference numerals are similar. The processor 100 of FIG. 35 also includes clock generation logic 3502 coupled to the functional units of the processor 100, namely the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106, the reservation stations 108, the NNU 121, the other execution units 112, the memory subsystem 114, the general purpose registers 116 and the media registers 118. The clock generation logic 3502 includes a clock generator, such as a phase-locked loop (PLL), that generates a clock signal having a master clock rate, or frequency. For example, the master clock rate may be 1 GHz, 1.5 GHz, 2 GHz, etc. The clock rate indicates the number of cycles of the clock signal per second, e.g., the number of oscillations between the high and low states. Preferably, the clock signal has a balanced duty cycle, i.e., it is high for half the cycle and low for the other half; alternatively, the clock signal has an unbalanced duty cycle in which it is in the high state longer than it is in the low state, or vice versa. Preferably, the PLL is configurable to generate the master clock signal at multiple clock rates. Preferably, the processor 100 includes a power management module that automatically adjusts the master clock rate based on various factors, including the dynamically detected operating temperature, the utilization of the processor 100, and commands from system software (e.g., the operating system, the BIOS) indicating desired performance and/or power savings metrics. In one embodiment, the power management module includes microcode of the processor 100.
The clock generation logic 3502 also includes a clock distribution network, or clock tree. The clock tree distributes the master clock signal to the functional units of the processor 100, as shown in FIG. 35, namely clock signal 3506-1 to the instruction fetch unit 101, clock signal 3506-2 to the instruction cache 102, clock signal 3506-10 to the instruction translator 104, clock signal 3506-9 to the rename unit 106, clock signal 3506-8 to the reservation stations 108, clock signal 3506-7 to the NNU 121, clock signal 3506-4 to the other execution units 112, clock signal 3506-3 to the memory subsystem 114, clock signal 3506-5 to the general purpose registers 116, and clock signal 3506-6 to the media registers 118, which are referred to collectively as clock signals 3506. The clock tree includes nodes, or wires, that transmit the master clock signals 3506 to their respective functional units. Additionally, the clock generation logic 3502 preferably includes clock buffers that regenerate the master clock signal, particularly for more distant nodes, and/or boost the voltage level of the master clock signal as needed to provide cleaner clock signals. Additionally, each functional unit may also include its own sub-clock tree, as needed, that regenerates and/or boosts the corresponding master clock signal 3506 it receives.
The NNU 121 includes clock reduction logic 3504 that receives a mitigation indicator 3512 and the master clock signal 3506-7 and, in response, generates a secondary clock signal. The secondary clock signal has a clock rate that is either the same as the master clock rate or, in mitigation mode, is reduced relative to the master clock rate by an amount programmed into the mitigation indicator 3512, thereby potentially providing thermal benefits. The clock reduction logic 3504 is similar in many respects to the clock generation logic 3502 in that it includes a clock distribution network, or tree, that distributes the secondary clock signal to the various blocks of the NNU 121, namely clock signal 3508-1 to the array of NPUs 126, clock signal 3508-2 to the sequencer 128, and clock signal 3508-3 to the interface logic 3514, which are referred to collectively or individually as secondary clock signals 3508. Preferably, as shown with respect to FIG. 34, the NPUs 126 include a plurality of pipeline stages 3401 that include pipeline staging registers that receive the secondary clock signal 3508-1 from the clock reduction logic 3504.
The NNU 121 also includes interface logic 3514 that receives the master clock signal 3506-7 and the secondary clock signal 3508-3. The interface logic 3514 is coupled between the lower portions of the front end of the processor 100 (e.g., the reservation stations 108, the media registers 118 and the general purpose registers 116) and the various blocks of the NNU 121, namely the clock reduction logic 3504, the data RAM 122, the weight RAM 124, the program memory 129 and the sequencer 128. The interface logic 3514 includes a data RAM buffer 3522, a weight RAM buffer 3524, the decoder 3404 of FIG. 34, and the mitigation indicator 3512. The mitigation indicator 3512 holds a value that specifies the rate at which the array of NPUs 126 will execute the NNU program instructions. Preferably, the mitigation indicator 3512 specifies a divisor value N by which the clock reduction logic 3504 divides the master clock signal 3506-7 to generate the secondary clock signal 3508, such that the secondary clock rate is 1/N. Preferably, the value of N is programmable to any one of a plurality of different predetermined values to cause the clock reduction logic 3504 to generate the secondary clock signal 3508 at a corresponding plurality of different rates, all of which are less than the master clock rate.
In one embodiment, the clock reduction logic 3504 includes a clock divider circuit that divides the master clock signal 3506-7 by the value of the mitigation indicator 3512. In one embodiment, the clock reduction logic 3504 includes a clock gate (e.g., an AND gate) that gates the master clock signal 3506-7 with an enable signal that is true only once every N cycles of the master clock signal 3506-7. For example, the enable signal may be generated by a circuit that includes a counter that counts up to N. When accompanying logic detects that the counter output matches N, the logic generates a true pulse on the secondary clock signal 3508 and resets the counter. Preferably, the value of the mitigation indicator 3512 is programmable by an architectural instruction, such as an MTNN instruction 1400 of FIG. 14. Preferably, the architectural program running on the processor 100 programs the mitigation value into the mitigation indicator 3512 just before it instructs the NNU 121 to start running the NNU program, as described in more detail with respect to FIG. 37.
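The counter-based clock-gating variant can be modeled behaviorally as follows. This Python sketch is illustrative only; it models the enable signal, not the AND-gate circuit itself.

```python
class ClockReducer:
    """Produces an enable pulse once every n master clock cycles."""
    def __init__(self, n):
        self.n = n              # divisor N from mitigation indicator 3512
        self.count = 0

    def tick(self):
        """Call once per master clock cycle; returns the enable signal."""
        self.count += 1
        if self.count == self.n:
            self.count = 0      # reset the counter
            return True         # secondary clock domain advances this cycle
        return False

reducer = ClockReducer(4)       # mitigation mode: secondary rate = 1/4 master rate
assert [reducer.tick() for _ in range(8)] == [False, False, False, True] * 2
```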
The weight RAM buffer 3524 is coupled between the weight RAM 124 and the media registers 118 for buffering data transfers between them. Preferably, the weight RAM buffer 3524 is similar to one or more of the embodiments of the buffer 1704 of FIG. 17. Preferably, the portion of the weight RAM buffer 3524 that receives data from the media registers 118 is clocked by the master clock signal 3506-7 at the master clock rate, and the portion of the weight RAM buffer 3524 that receives data from the weight RAM 124 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the master clock rate depending on the value programmed into the mitigation indicator 3512 (i.e., depending on whether the NNU 121 is operating in mitigation mode or normal mode). In one embodiment, the weight RAM 124 is single-ported, as described above with respect to FIG. 17, and is accessible in an arbitrated fashion both by the media registers 118 via the weight RAM buffer 3524 and by the NPUs 126 or the row buffer 1104 of FIG. 11. In an alternative embodiment, the weight RAM 124 is dual-ported, as described above with respect to FIG. 16, and each port is accessible in parallel both by the media registers 118 via the weight RAM buffer 3524 and by the NPUs 126 or the row buffer 1104.
Likewise, the data RAM buffer 3522 is coupled between the data RAM 122 and the media registers 118 for buffering data transfers between them. Preferably, the data RAM buffer 3522 is similar to one or more of the embodiments of the buffer 1704 of FIG. 17. Preferably, the portion of the data RAM buffer 3522 that receives data from the media registers 118 is clocked by the master clock signal 3506-7 at the master clock rate, and the portion of the data RAM buffer 3522 that receives data from the data RAM 122 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the master clock rate depending on the value programmed into the mitigation indicator 3512 (i.e., depending on whether the NNU 121 is operating in mitigation mode or normal mode). In one embodiment, the data RAM 122 is single-ported, as described above with respect to FIG. 17, and is accessible in an arbitrated fashion both by the media registers 118 via the data RAM buffer 3522 and by the NPUs 126 or the row buffer 1104 of FIG. 11. In an alternative embodiment, the data RAM 122 is dual-ported, as described above with respect to FIG. 16, and each port is accessible in parallel both by the media registers 118 via the data RAM buffer 3522 and by the NPUs 126 or the row buffer 1104.
Preferably, the interface logic 3514 includes the data RAM buffer 3522 and the weight RAM buffer 3524 to provide synchronization between the master clock domain and the secondary clock domain, regardless of whether the data RAM 122 and/or the weight RAM 124 are single-ported or dual-ported. Preferably, the data RAM 122, the weight RAM 124 and the program memory 129 each comprise a static RAM (SRAM) that includes respective read enable, write enable and memory select signals.
As described above, the NNU 121 is an execution unit of the processor 100. An execution unit is a functional unit of a processor that executes the microinstructions into which architectural instructions are translated (such as the microinstructions 105 into which the architectural instructions 103 of FIG. 1 are translated) or that executes the architectural instructions 103 themselves. An execution unit receives operands from the general purpose registers of the processor, such as the GPRs 116 and the media registers 118. An execution unit, in response to executing a microinstruction or architectural instruction, generates results that may be written to the general purpose registers. The microinstructions implement the architectural instructions; examples of the architectural instructions 103 are the MTNN instruction 1400 and the MFNN instruction 1500 described with respect to FIGS. 14 and 15, respectively. More specifically, the collective execution by the execution unit of the one or more microinstructions into which an architectural instruction is translated performs, on inputs specified by the architectural instruction, the operation specified by the architectural instruction to produce the result defined by the architectural instruction.
Referring now to FIG. 36A, a timing diagram illustrating an example of the operation of the processor 100 with the NNU 121 operating in normal mode, i.e., at the master clock rate, is shown. Time progresses from left to right in the timing diagram. The processor 100 is running an architectural program at the master clock rate. More specifically, the front end of the processor 100 (e.g., the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106 and the reservation stations 108) fetches, decodes and issues architectural instructions to the NNU 121 and the other execution units 112 at the master clock rate.
Initially, the architectural program executes an architectural instruction (e.g., an MTNN instruction 1400) that the front end of the processor 100 issues to the NNU 121 to instruct the NNU 121 to start running its NNU program in the program memory 129. Prior to this, the architectural program executed an architectural instruction to write to the mitigation indicator 3512 a value that specifies the master clock rate, i.e., to place the NNU 121 in normal mode. More specifically, the value programmed into the mitigation indicator 3512 causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at the master clock rate of the master clock signal 3506. Preferably, in this case the clock buffers of the clock reduction logic 3504 simply boost the master clock signal 3506. Also prior to this, the architectural program executed architectural instructions to write the data RAM 122 and the weight RAM 124 and to write the NNU program into the program memory 129. In response to the MTNN instruction 1400 that starts the NNU program, the NNU 121 starts running the NNU program at the master clock rate, since the mitigation indicator 3512 was programmed with the master rate value. After starting the NNU 121 running, the architectural program continues executing architectural instructions at the master clock rate, including and predominantly MTNN instructions 1400 to write and/or read the data RAM 122 and the weight RAM 124, in preparation for the next instance, or invocation, or run of the NNU program.
As shown in the example of FIG. 36A, the NNU 121 completes its run of the NNU program in significantly less time (e.g., one-fourth the time) than it takes the architectural program to complete its writes/reads of the data RAM 122 and the weight RAM 124. For example, all at the master clock rate, the NNU 121 may take approximately 1000 clock cycles to run the NNU program, while the architectural program takes approximately 4000 clock cycles. Consequently, the NNU 121 sits idle the remainder of the time, which is a significant amount of time in this example, e.g., approximately 3000 master clock cycles. As shown in the example of FIG. 36A, this pattern repeats another time, and possibly many more times, depending upon the size and configuration of the neural network. Because the NNU 121 may be a relatively large and transistor-dense functional unit of the processor 100, it may generate a significant amount of heat, particularly when running at the master clock rate.
Referring now to FIG. 36B, a timing diagram illustrating an example of the operation of the processor 100 with the NNU 121 operating in mitigation mode, i.e., at a rate less than the master clock rate, is shown. The timing diagram of FIG. 36B is similar in many respects to that of FIG. 36A, i.e., the processor 100 runs an architectural program at the master clock rate. It is assumed in this example that the architectural program and the NNU program of FIG. 36B are the same as those of FIG. 36A. However, before starting the NNU program, the architectural program executes an MTNN instruction 1400 that programs the mitigation indicator 3512 with a value that causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at a secondary clock rate that is less than the master clock rate. That is, the architectural program places the NNU 121 in the mitigation mode of FIG. 36B rather than the normal mode of FIG. 36A. Consequently, the NPUs 126 execute the NNU program at the secondary clock rate, which in mitigation mode is less than the master clock rate. In this example, it is assumed that the mitigation indicator 3512 is programmed with a value that specifies a secondary clock rate of one-fourth the master clock rate. Consequently, as may be seen by comparing FIGS. 36A and 36B, the NNU 121 takes approximately four times longer to run the NNU program in mitigation mode than in normal mode, making the amount of time the NNU 121 sits idle relatively short. Accordingly, the energy used by the NNU 121 of FIG. 36B to run the NNU program is consumed over a period approximately four times longer than when the NNU 121 runs the program in normal mode in FIG. 36A. Thus, the NNU 121 running the NNU program generates heat at approximately one-fourth the rate in FIG. 36B as in FIG. 36A, which may have the thermal benefits described herein.
Referring now to FIG. 37, a flowchart illustrating operation of the processor 100 of FIG. 35 is shown. The operations illustrated by this flowchart are in many respects identical to the operations described above with respect to fig. 35, 36A, and 36B. Flow begins at block 3702.
At block 3702, the processor 100 executes the MTNN instruction 1400 to write the weights to the weight RAM 124 and to write the data to the data RAM 122. Flow proceeds to block 3704.
At block 3704, the processor 100 executes an MTNN instruction 1400 to program the mitigation indicator 3512 with a value that specifies a rate lower than the master clock rate, i.e., to place the NNU 121 in mitigation mode. Flow proceeds to block 3706.
At block 3706, in a manner similar to that shown in FIG. 36B, the processor 100 executes an MTNN instruction 1400 to instruct the NNU 121 to start running the NNU program. Flow proceeds to block 3708.
At block 3708, the NNU 121 begins running the NNU program. In parallel, the processor 100 executes the MTNN instruction 1400 to write new weights to the weight RAM 124 (and possibly new data to the data RAM 122), and/or executes the MFNN instruction 1500 to read results from the data RAM 122 (and possibly results from the weight RAM 124). Flow proceeds to block 3712.
At block 3712, the processor 100 executes an MFNN instruction 1500 (e.g., reading the status register 127) to detect that the NNU 121 has finished running its program. Assuming the architectural program selected a good value for the mitigation indicator 3512, the amount of time it takes the NNU 121 to run the NNU program will be approximately the same as the amount of time it takes the processor 100 to execute the portion of the architectural program that accesses the weight RAM 124 and/or the data RAM 122, as shown in FIG. 36B. Flow proceeds to block 3714.
At block 3714, the processor 100 executes an MTNN instruction 1400 to program the mitigation indicator 3512 with a value that specifies the master clock rate, i.e., to place the NNU 121 in normal mode. Flow proceeds to block 3716.
At block 3716, in a manner similar to that shown in FIG. 36A, the processor 100 executes an MTNN instruction 1400 to instruct the NNU 121 to start running the NNU program. Flow proceeds to block 3718.
At block 3718, the NNU 121 begins running the NNU program in the normal mode. Flow ends at block 3718.
As described above, running an NNU program in mitigation mode spreads, relative to running it in normal mode (i.e., at the master clock rate of the processor), the time over which the NNU runs the program, which can provide a thermal benefit. More specifically, while the NNU runs a program in mitigation mode, the devices (e.g., transistors, capacitors, wires) will likely operate at lower temperatures because the NNU generates heat at a slower rate, which heat is dissipated by the NNU (e.g., the semiconductor devices, metal layers and underlying substrate) and the surrounding package and cooling solution (e.g., heat sink, fan). This, in general, also lowers the device temperatures in other portions of the processor die. The lower operating temperature of the devices, in particular their junction temperatures, can have the benefit of less leakage current. Furthermore, since the amount of current drawn per unit time is smaller, inductive noise and IR drop noise may be reduced. Still further, the lower temperature has a positive effect on the negative-bias temperature instability (NBTI) and positive-bias temperature instability (PBTI) of the MOSFETs of the processor, thereby increasing the reliability and/or lifetime of the devices, and consequently of the processor portions. The lower temperature may also mitigate Joule heating and electromigration in the metal layers of the processor.
Communication mechanism between architectural and non-architectural programs for NNU shared resources
As described above, for example with respect to FIGS. 24 through 28 and FIGS. 35 through 37, the data RAM 122 and the weight RAM 124 are shared resources. Both the NPUs 126 and the front end of the processor 100 share the data RAM 122 and the weight RAM 124. More specifically, both the NPUs 126 and the front end of the processor 100 (e.g., the media registers 118) write to and read from the data RAM 122 and the weight RAM 124. In other words, an architectural program running on the processor 100 shares the data RAM 122 and the weight RAM 124 with an NNU program running on the NNU 121, and, as described above, in some situations this requires the control of flow between the architectural program and the NNU program. This resource sharing also applies, to some extent, to the program memory 129, since the architectural program writes it and the sequencer 128 reads it. The embodiments described here provide a high performance solution to controlling the flow of access to the shared resources between the architectural program and the NNU program.
In the embodiments described herein, the NNU program is also referred to as a non-architectural program, the NNU instructions are also referred to as non-architectural instructions, and the NNU instruction set (also referred to above as the NPU instruction set) is also referred to as a non-architectural instruction set. The non-architectural instruction set is different from the architectural instruction set. In embodiments in which the processor 100 includes an instruction translator 104 for translating architectural instructions into microinstructions, the non-architectural instruction set is also different from the microinstruction set.
Referring now to FIG. 38, a block diagram illustrating the sequencer 128 of the NNU 121 in more detail is shown. As described above, the sequencer 128 provides the memory address 131 to the program memory 129 to select a non-architectural instruction that is provided to the sequencer 128. As shown in FIG. 38, the memory address 131 is held in a program counter 3802 of the sequencer 128. The sequencer 128 generally increments through the sequential addresses of the program memory 129 unless it encounters a non-architectural control instruction, such as a loop or branch instruction, in which case the sequencer 128 updates the program counter 3802 to the target address of the control instruction, i.e., to the address of the non-architectural instruction at the target of the control instruction. Thus, the address 131 held in the program counter 3802 specifies the address in the program memory 129 of the non-architectural instruction of the non-architectural program currently being fetched for execution by the NPUs 126. Advantageously, the value of the program counter 3802 may be obtained by the architectural program via the NNU program counter field 3912 of the status register 127, as described below with respect to FIG. 39. This enables the architectural program to decide where to read/write data with respect to the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program.
The sequencer 128 also includes a loop counter 3804 that is used in conjunction with non-architectural loop instructions, such as the loop-to-1 instruction at address 10 of FIG. 26A and the loop-to-1 instruction at address 11 of FIG. 28. In the examples of FIGS. 26A and 28, the loop counter 3804 is loaded with the value specified in the non-architectural initialize instruction at address 0, e.g., the value 400. Each time the sequencer 128 encounters the loop instruction and jumps to the target instruction (e.g., the multiply-accumulate instruction at address 1 of FIG. 26A or the maxwacc instruction at address 1 of FIG. 28), the sequencer 128 decrements the loop counter 3804. Once the loop counter 3804 reaches zero, the sequencer 128 proceeds to the next sequential non-architectural instruction. In an alternative embodiment, the first time a loop instruction is encountered, the loop counter 3804 is loaded with a loop count value specified in the loop instruction itself, obviating the need to initialize the loop counter 3804 with a non-architectural initialize instruction. Thus, the value of the loop counter 3804 indicates how many more times the loop body of the non-architectural program will be executed. Advantageously, the value of the loop counter 3804 may be obtained by the architectural program via the loop count 3914 field of the status register 127, as described below with respect to FIG. 39. This enables the architectural program to decide where to read/write data with respect to the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program. In one embodiment, the sequencer 128 includes three additional loop counters to accommodate nested loops in the non-architectural program, and the values of the other three loop counters are also readable via the status register 127. A bit in the loop instruction indicates which of the four loop counters is used by the instant loop instruction.
The sequencer 128 also includes an iteration counter 3806. The iteration counter 3806 is used in conjunction with non-architectural instructions such as the multiply-accumulate instructions at address 2 of FIGS. 4, 9, 20 and 26A and the maxwacc instruction at address 2 of FIG. 28, which will be referred to hereafter as "execute" instructions. In the examples above, the execute instructions specify iteration counts of 511, 1023, 2 and 3, respectively. When the sequencer 128 encounters an execute instruction that specifies a non-zero iteration count, the sequencer 128 loads the iteration counter 3806 with the specified value. Additionally, the sequencer 128 generates an appropriate micro-operation 3418 to control the logic in the pipeline stages 3401 of the NPUs 126 of FIG. 34 for execution, and decrements the iteration counter 3806. If the iteration counter 3806 is greater than zero, the sequencer 128 again generates the appropriate micro-operation 3418 to control the logic in the NPUs 126 and decrements the iteration counter 3806. The sequencer 128 continues in this fashion until the iteration counter 3806 reaches zero. Thus, the value of the iteration counter 3806 indicates how many more times the operation specified in the non-architectural execute instruction (e.g., multiply-accumulate, maximum, sum of the accumulator and a data/weight word) will be performed. Advantageously, the value of the iteration counter 3806 may be obtained by the architectural program via the iteration count 3916 field of the status register 127, as described below with respect to FIG. 39. This enables the architectural program to decide where to read/write data with respect to the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program.
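The interplay of the program counter 3802, loop counter 3804 and iteration counter 3806 can be modeled behaviorally. The following Python sketch is illustrative only: the instruction encodings are invented for the example, and an execute instruction with iteration count C is modeled as issuing its operation C+1 times.

```python
def run(program, loop_init):
    """Behavioral model of sequencer 128 stepping a tiny non-architectural program."""
    pc, loop_counter, issued = 0, loop_init, 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "EXECUTE":             # arg = iteration count loaded into 3806
            issued += arg + 1           # a micro-operation 3418 issued each iteration
            pc += 1
        elif op == "LOOP":              # arg = target address of the loop body
            loop_counter -= 1           # decremented on each taken jump
            pc = arg if loop_counter > 0 else pc + 1
    return issued

program = [("EXECUTE", 511), ("LOOP", 0)]  # a loop body akin to FIGS. 26A/28
assert run(program, loop_init=400) == 400 * 512
```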
Referring now to FIG. 39, a block diagram illustrating certain fields of the control and status register 127 of the NNU 121 is shown. As described above with respect to FIG. 26B, the fields include the address 2602 of the weight RAM row most recently written by the NPUs 126 executing the non-architectural program, the address 2604 of the weight RAM row most recently read by the NPUs 126 executing the non-architectural program, the address 2606 of the data RAM row most recently written by the NPUs 126 executing the non-architectural program, and the address 2608 of the data RAM row most recently read by the NPUs 126 executing the non-architectural program. Additionally, the fields include an NNU program counter 3912, a loop count 3914 and an iteration count 3916. As described above, the architectural program may read the status register 127 (e.g., with an MFNN instruction 1500) into the media registers 118 and/or the general purpose registers 116, including the NNU program counter 3912, loop count 3914 and iteration count 3916 field values. The program counter 3912 value reflects the value of the program counter 3802 of FIG. 38. The loop count 3914 value reflects the value of the loop counter 3804. The iteration count 3916 value reflects the value of the iteration counter 3806. In one embodiment, the sequencer 128 updates the program counter 3912, loop count 3914 and iteration count 3916 field values each time it modifies the program counter 3802, loop counter 3804 or iteration counter 3806, so that the field values are current when the architectural program reads them. In another embodiment, when the NNU 121 executes an architectural instruction that reads the status register 127, the NNU 121 simply obtains the program counter 3802 value, loop counter 3804 value and iteration counter 3806 value and provides them back in response to the architectural instruction (e.g., into a media register 118 or a general purpose register 116).
As may be observed from the above, the values of the fields of the status register 127 of FIG. 39 may be characterized as information about the progress of the non-architectural program during its execution by the NNU. Certain specific aspects of the non-architectural program's progress have been described above, such as the program counter 3802 value, the loop counter 3804 value, the iteration counter 3806 value, the weight RAM 124 address 125 most recently written/read 2602/2604, and the data RAM 122 address 123 most recently written/read 2606/2608. The architectural program executing on the processor 100 may read the non-architectural program progress values of FIG. 39 from the status register 127 and use the information to make decisions, e.g., by architectural instructions such as compare and branch instructions. For example, the architectural program decides which rows to write/read data/weights with respect to the data RAM 122 and/or the weight RAM 124 in order to control the flow of data in and out of the data RAM 122 or the weight RAM 124, particularly for overlapped execution instances of large data sets and/or of different non-architectural instructions. Examples of the architectural program making such decisions are described below.
For example, as described above with respect to FIG. 26A, the architectural program configures the non-architectural program to write the results of the convolutions back to rows of the data RAM 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program reads the results from the data RAM 122 as the NNU 121 writes them, using the address 2606 of the most recently written data RAM 122 row.
For another example, as described above with respect to FIG. 26B, the architectural program uses the information from the fields of the status register 127 of FIG. 38 to determine the progress of the non-architectural program in performing the convolution of the data array 2404 of FIG. 24 in 5 chunks of 512 x 1600. The architectural program writes the first 512 x 1600 chunk of the 2560 x 1600 data array 2404 into the weight RAM 124 and starts the non-architectural program, which has a loop count of 1600 and an initialized weight RAM 124 output row of 0. While the NNU 121 executes the non-architectural program, the architectural program reads the status register 127 to determine the most recently written row 2602 of the weight RAM 124 so that it may read the valid convolution results written by the non-architectural program and overwrite them with the next 512 x 1600 chunk once it has read them; in this way, when the NNU 121 completes the non-architectural program on the first 512 x 1600 chunk, the processor 100 can immediately update the non-architectural program as needed and start it again to process the next 512 x 1600 chunk.
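The polling flow just described might look as follows in sketch form. The helpers stand in for MFNN/MTNN instruction sequences and are hypothetical, as is the chunk bookkeeping; the sketch only illustrates the use of the most-recently-written row field 2602.

```python
def read_status_field(field):     # hypothetical: an MFNN read of status register 127
    ...

def read_weight_ram(row):         # hypothetical: MFNN reads of one weight RAM 124 row
    ...

def write_weight_ram(row, data):  # hypothetical: MTNN writes of one weight RAM 124 row
    ...

def drain_and_refill(next_chunk, rows=1600):
    """Read convolution results as they become valid; overwrite with the next chunk."""
    last_row_read = -1
    while last_row_read < rows - 1:
        newest = read_status_field("weight_ram_last_written")  # field 2602
        for row in range(last_row_read + 1, newest + 1):
            results = read_weight_ram(row)          # valid results, consumed by caller
            write_weight_ram(row, next_chunk[row])  # overwrite with the next chunk's row
        last_row_read = newest
```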
For another example, assume the architectural program has the NNU 121 perform a series of classic neural network multiply-accumulate-activation function operations in which the weights are stored in the weight RAM 124 and the results are written back to the data RAM 122. In this case, once the non-architectural program has read a row of the weight RAM 124, it will not read it again. Thus, the architectural program may be configured to begin overwriting the weights in the weight RAM 124 with new weights for the next instance of execution of the non-architectural program (e.g., for the next neural network layer) once the current weights have been read/used by the non-architectural program. In this case, the architectural program reads the status register 127 to obtain the address 2604 of the most recently read weight RAM row in order to decide where the new set of weights may be written into the weight RAM 124.
For another example, assume the architectural program knows that the non-architectural program includes an execute instruction with a large iteration count, such as the non-architectural multiply-accumulate instruction at address 2 of FIG. 20. In this case, the architectural program may need to know the iteration count 3916 in order to know approximately how many clock cycles it will take to complete the non-architectural instruction so that the architectural program can decide which of two or more actions to take next. For example, if the time is long, the architectural program may relinquish control to another architectural program, such as the operating system. Similarly, assume the architectural program knows that the non-architectural program includes a loop body with a large loop count, such as the non-architectural program of FIG. 28. In this case, the architectural program may need to know the loop count 3914 in order to know approximately how many clock cycles it will take to complete the non-architectural program so that the architectural program can decide which of two or more actions to take next.
For another example, assume the architectural program has the NNU 121 perform a pooling operation similar to that described with respect to FIGS. 27 and 28, in which the data to pool is stored in the weight RAM 124 and the results are written back to the weight RAM 124. However, assume that, unlike the examples of FIGS. 27 and 28, the results are written back to the upper 400 rows of the weight RAM 124, e.g., rows 1600 to 1999. In this case, once the non-architectural program has read four rows of the weight RAM 124 that it pools, it will not read them again. Thus, the architectural program may be configured to begin overwriting the data in the weight RAM 124 with new data once the current four rows have been read/used by the non-architectural program (e.g., with the weights for the next instance of execution of the non-architectural program, for example, to perform classic multiply-accumulate-activation function operations on the pooled data). In this case, the architectural program reads the status register 127 to obtain the address 2604 of the most recently read weight RAM row in order to decide where the new set of weights may be written into the weight RAM 124.
The examples above may also be performed by the embodiments described below with respect to FIGS. 41 through 46, in which the NNU 121 is coupled by a ring bus to the processing cores, to which system memory is also coupled, rather than being an execution unit of a processing core. In such embodiments, having the architectural program decide where to read/write data with respect to the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program may be particularly advantageous, since the transfer of data/weights between the NNU 121 (e.g., the data RAM 122 and the weight RAM 124) and the cores and/or system memory may incur longer latencies than in the embodiments in which the NNU 121 is an execution unit of a core. Furthermore, it may be advantageous to enable the NNU 121 to interrupt a core in a very fine-grained manner in order to cope with the interrupt latencies associated with the core and its operating system; such embodiments are described below with respect to FIGS. 47 through 53.
Referring now to FIG. 40, a block diagram illustrating an embodiment of portions of the NNU 121 is shown. The NNU 121 includes a move unit 5802, a move register 5804, data multiplexing registers 208, weight multiplexing registers 705, NPUs 126, multiplexers 5806, output units 5808, and output registers 1104. The data multiplexing registers 208 and the weight multiplexing registers 705 are the same as described above, but modified to additionally receive inputs from the move register 5804 and from additional neighboring NPUs 126. In one embodiment, in addition to the output 209 from NPU J+1 as described above, the data multiplexing register 208 receives on its input 211 the outputs 209 from NPUs J-1 and J-4; likewise, in addition to the output 203 from NPU J+1 as described above, the weight multiplexing register 705 receives on its input 711 the outputs 203 from NPUs J-1 and J-4. The output register 1104 is the same as the buffer referred to above as the row buffer 1104 and the output buffer 1104. The output units 5808 are similar in many respects to the activation function units 212/1112 described above and may include activation functions (e.g., sigmoid, hyperbolic tangent, rectify, softplus); however, the output units 5808 preferably also include a re-quantization unit for re-quantizing the value of the accumulator 202, embodiments of which are described below. The NPUs 126 are similar in many respects to those described above. As described above, different embodiments are contemplated in which the data word and weight word widths may be of various sizes (e.g., 8 bits, 9 bits, 12 bits or 16 bits), and multiple word sizes may be supported by a given embodiment (e.g., 8 bits and 16 bits). However, a representative embodiment is shown in the figures below in which the data word and weight word widths held in the memories 122/124, the move register 5804, the multiplexing registers 208/705 and the output registers 1104 are 8-bit words, i.e., bytes.
FIG. 40 shows a cross-section of the NNU 121. For example, the NPU 126 shown is representative of the array of NPUs 126 (such as described above). The representative NPU 126 is referred to as NPU[J] 126 of the N NPUs 126, where J is between 0 and N-1. As described above, N is a large number, preferably a power of 2. As described above, N may be 512, 1024 or 2048. In one embodiment, N is 4096. Because of the large number of NPUs 126 in the array, it is advantageous for each NPU 126 to be as small as possible in order to keep the size of the NNU 121 within desired limits and/or to accommodate more NPUs 126 in order to increase the speed of the neural network related computations performed by the NNU 121.
Furthermore, although the move unit 5802 and the move register 5804 are each N bytes wide, only a portion of the move register 5804 is shown. Specifically, the portion of the move register 5804 whose output 5824 provides a byte, denoted move register[J] 5804, to the multiplexing registers 208/705 of NPU[J] 126 is shown. Furthermore, although the output 5822 of the move unit 5802 provides N bytes (to the memories 122/124 and the move register 5804), only byte J is shown being provided for loading into move register[J] 5804, which in turn provides byte J on its output 5824 to the data multiplexing register 208 and the weight multiplexing register 705.
Furthermore, although the NNU 121 includes multiple output units 5808, only a single output unit 5808 is shown in FIG. 40, namely the output unit 5808 that operates on the accumulator outputs 217 of the group of NPUs 126 that includes NPU[J] 126 (such as described above with respect to FIG. 11). The output unit 5808 is denoted output unit[J/4] because, in the embodiment of FIG. 40, each output unit 5808 is shared by a group of four NPUs 126. Likewise, although the NNU 121 includes multiple multiplexers 5806, only a single multiplexer 5806 is shown in FIG. 40, namely the multiplexer 5806 that receives the accumulator outputs 217 of the group of NPUs 126 that includes NPU[J] 126. Likewise, the multiplexer 5806 is denoted multiplexer[J/4] because it selects one of the four accumulator 202 outputs 217 to provide to output unit[J/4] 5808.
Finally, although the output register 1104 is N bytes wide, only a single 4-byte segment, denoted output register[J/4] 1104, is shown in FIG. 40, where the 4-byte segment receives the four quantized bytes produced by output unit[J/4] 5808 from the four NPUs 126 of the NPU group that includes NPU[J] 126. All N bytes of the output 133 of the output register 1104 are provided to the move unit 5802, although only the four bytes of the 4-byte segment of output register[J/4] 1104 are shown in FIG. 40. Additionally, the four bytes of the 4-byte segment of output register[J/4] 1104 are provided as inputs to the multiplexing registers 208/705, as described in more detail with respect to FIGS. 49 and 52 of the prior application.
Although the multiplexing registers 208/705 are shown in FIG. 40 as distinct from the NPU 126, there is a corresponding pair of multiplexing registers 208/705 associated with each NPU 126, and the multiplexing registers 208/705 may be considered part of the NPU 126, as described above with respect to FIGS. 2 and 7, for example.
The output 5822 of the move unit 5802 is coupled to the move register 5804, the data RAM 122 and the weight RAM 124, each of which may be written by the output 5822. The output 5822 of the move unit 5802, the move register 5804, the data RAM 122 and the weight RAM 124 are all N bytes wide (e.g., N is 4096). The move unit 5802 receives N quantized bytes from any of five different sources and selects one of them as its input: the data RAM 122, the weight RAM 124, the move register 5804, the output register 1104, and an immediate value. Preferably, the move unit 5802 comprises a plurality of multiplexers interconnected so as to be able to perform operations on its input to produce its output 5822, which operations are now described.
The operations the move unit 5802 may perform on its input include: passing the input through to the output; rotating the input by a specified amount; and extracting and compacting specified bytes of the input. The operation is specified in a MOVE instruction fetched from the program memory 129. In one embodiment, the rotation amounts that may be specified are 8, 16, 32 and 64 bytes. In one embodiment, the rotation direction is to the left, although other embodiments are contemplated in which the rotation direction is to the right, or either direction. In one embodiment, the extract-and-compact operation is performed within input blocks of a predetermined size. The block size is specified in the MOVE instruction. In one embodiment, the predetermined block sizes are 16, 32 and 64 bytes, and the blocks lie on boundaries aligned to the specified block size. Thus, for example, when the MOVE instruction specifies a block size of 32, the move unit 5802 extracts the specified bytes within each 32-byte block of the N input bytes (e.g., 128 blocks if N is 4096) and compacts them within the corresponding 32-byte block (preferably at one end of the block). In one embodiment, the NNU 121 also includes an N-bit mask register (not shown) associated with the move register 5804. A MOVE instruction that specifies a load-mask-register operation may specify a row of the data RAM 122 or the weight RAM 124 as its source. In response to a MOVE instruction that specifies the load-mask-register operation, the move unit 5802 extracts bit 0 from each of the N words of the RAM row and stores the N bits into the corresponding bits of the N-bit mask register. The bits of the bit mask are used as write enables/disables for the corresponding bytes of the move register 5804 during the execution of a subsequent MOVE instruction that writes the move register 5804. In an alternative embodiment, a 64-bit mask is specified by an INITIALIZE instruction and loaded into the mask register before the execution of a MOVE instruction that specifies the extract-and-compact function; in response to the MOVE instruction, the move unit 5802 extracts, within each block (e.g., of the 128 blocks), the bytes specified by the 64-bit mask stored in the mask register. In an alternative embodiment, the MOVE instruction that specifies the extract-and-compact operation also specifies a stride and an offset; in response to such a MOVE instruction, the move unit 5802 extracts every stride-th byte within each block, starting with the byte specified by the offset, and compacts the extracted bytes together. For example, if the MOVE instruction specifies a stride of 3 and an offset of 2, the move unit 5802 extracts every third byte, starting with byte 2, within each block.
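The stride/offset form of the extract-and-compact operation can be illustrated with a small model. This Python sketch is illustrative only, a pure software model of the data movement rather than the multiplexer network.

```python
def extract_compact(row, block_size, stride, offset):
    """Within each aligned block, keep every stride-th byte starting at the
    offset and compact the kept bytes at the front of the block (zero fill)."""
    out = bytearray(len(row))
    for base in range(0, len(row), block_size):
        block = row[base:base + block_size]
        kept = block[offset::stride]             # e.g. bytes 2, 5, 8, ... of the block
        out[base:base + len(kept)] = kept
    return bytes(out)

row = bytes(range(64))                           # two 32-byte input blocks
packed = extract_compact(row, block_size=32, stride=3, offset=2)
assert packed[:10] == bytes([2, 5, 8, 11, 14, 17, 20, 23, 26, 29])
```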
Ring bus connected neural network unit
Embodiments were described above in which NNU 121 is an execution unit of processor 100. An embodiment will now be described in which NNU 121 resides on a ring bus along with a plurality of conventional processing cores of a multicore processor and operates as a neural network accelerator shared by the cores, performing neural-network-related computations on behalf of the cores faster than those cores could perform them. In many respects, NNU 121 operates like a peripheral device, in that a program running on a core can control NNU 121 to perform neural-network-related computations. Preferably, the multicore processor and NNU 121 are fabricated on a single integrated circuit. Since the size of NNU 121 may be quite large, particularly for embodiments in which the number of NPUs 126 and the size of the memories 122/124 are large (e.g., 4096 NPUs 126 with 4096-byte-wide data RAM 122 and weight RAM 124), such an embodiment may provide the advantage that the size of the cores is not increased by the size of NNU 121; rather, because there are fewer NNUs 121 than cores and the cores share the NNU 121, the integrated circuit can be smaller, albeit in exchange for potentially lower performance.
Referring now to FIG. 41, a block diagram is shown illustrating a processor 100. The processor 100 includes a plurality of ring stations 4004, wherein the plurality of ring stations 4004 are connected to each other in a bidirectional manner to form a ring bus 4024. The embodiment of FIG. 41 includes seven ring stations denoted 4004-0, 4004-1, 4004-2, 4004-3, 4004-N, 4004-D, and 4004-U. Processor 100 includes four core complexes 4012, referred to as core complex 0 4012-0, core complex 1 4012-1, core complex 2 4012-2 and core complex 3 4012-3, respectively, which include respective ring stations 4004-0, 4004-1, 4004-2 and 4004-3 for coupling the core complexes 4012 to the ring bus 4024. The processor 100 also includes an uncore portion 4016 that includes ring station 4004-U for coupling the uncore 4016 to the ring bus 4024. The processor 100 further includes a dynamic random access memory (DRAM) controller 4018 coupled to the ring bus 4024 by ring station 4004-D. Finally, processor 100 includes NNU 121 coupled to the ring bus 4024 by ring station 4004-N. In one embodiment, described in U.S. non-provisional applications 15366027, 15366053, and 15366057, each filed on December 1, 2016, and incorporated herein by reference in their entirety (hereinafter the "dual-use NNU memory array applications"), NNU 121 comprises a memory array that may be used either as memory for the array of NPUs 126 of NNU 121 (e.g., weight RAM 124 of FIG. 1) or as cache memory shared by the core complexes 4012, e.g., as a victim cache or as a last-level cache (LLC) slice. Although the example of FIG. 41 includes four core complexes 4012, other embodiments having a different number of core complexes 4012 are also contemplated. For example, in one embodiment, processor 100 includes eight core complexes 4012.
The uncore 4016 includes a bus controller 4014 for controlling access of the processor 100 to a system bus 4022 to which peripheral devices, such as a video controller, a disk controller, or a peripheral bus controller (e.g., PCI-E), may be coupled. In one embodiment, system bus 4022 is the well-known V4 bus. Uncore 4016 may also include other functional units, such as a power management unit and private RAM (e.g., non-architectural memory used by microcode of the cores 4002). In an alternative embodiment, DRAM controller 4018 is coupled to the system bus, and NNU 121 accesses system memory via ring bus 4024, bus controller 4014, and DRAM controller 4018.
The DRAM controller 4018 controls a DRAM (e.g., an asynchronous DRAM or a synchronous DRAM (sdram), such as a double data rate synchronous DRAM, a direct Rambus DRAM, or a reduced latency DRAM) as a system memory. Core complex 4012, uncore 4016, and NNU 121 access system memory via ring bus 4024. More specifically, NNU 121 reads weights and data of the neural network from the system memory into data RAM 122 and weight RAM 124, and writes the neural network results from data RAM 122 and weight RAM 124 to the system memory via ring bus 4024. In addition, when operating as a victim cache, the memory array (e.g., data RAM 122 or weight RAM 124) evicts cache lines to system memory under the control of cache control logic. Further, when operating as an LLC slice, the memory array and cache control logic fills cache lines from system memory and writes back and evicts cache lines to system memory.
The four core complexes 4012 include respective LLC slices 4006-0, 4006-1, 4006-2, and 4006-3, wherein each LLC slice is coupled to a ring station 4004 and is referred to individually as LLC slice 4006 or collectively as LLC slices 4006. Each core 4002 includes a cache memory, such as a level 2 (L2) cache 4008, coupled to the ring station 4004. Each core 4002 may also include a level 1 cache (not shown). In one embodiment, the cores 4002 are x86 instruction set architecture (ISA) cores, although other embodiments are contemplated in which the cores 4002 are cores of another ISA (e.g., ARM, SPARC, MIPS, etc.).
As shown in FIG. 41, LLC slices 4006-0, 4006-1, 4006-2, and 4006-3 collectively form the LLC 4005 of processor 100 shared by the core complexes 4012. Each LLC slice 4006 includes a memory array and cache control logic. As described in the dual-use NNU memory array applications incorporated by reference above, a mode indicator can be set such that the memory array of NNU 121 serves as an additional (e.g., fifth or ninth) slice 4006-4 of LLC 4005. In one embodiment, each LLC slice 4006 comprises a 2MB memory array, although other embodiments having different sizes are contemplated. Moreover, embodiments are contemplated in which the sizes of the memory array and of the LLC slices 4006 differ. Preferably, LLC 4005 is inclusive of the L2 caches 4008 as well as any other caches in the cache hierarchy (e.g., L1 caches).
The ring bus 4024, or ring 4024, is a scalable bidirectional interconnect that facilitates communication between coherent components, including the DRAM controller 4018, the uncore 4016, and the LLC slices 4006. The ring 4024 comprises two unidirectional rings, each of which in turn comprises five sub-rings: a Request sub-ring, for transmitting most types of request packets, including loads; a Snoop sub-ring, for transmitting snoop request packets; an Acknowledge sub-ring, for transmitting response packets; a Data sub-ring, for transmitting data packets and certain request items, including writes; and a Credit sub-ring, for sending and obtaining credits in remote queues. Nodes are attached to the ring 4024 via a ring station 4004, where the ring station 4004 contains queues for sending and receiving packets on the ring 4024, such as the queues described in more detail with respect to FIGS. 42 through 44. A queue is either an egress queue, which initiates requests on the ring 4024 on behalf of an attached component for receipt in a remote queue, or an ingress queue, which receives requests from the ring 4024 for forwarding to an attached component. Before an egress queue initiates a request on the ring, it first obtains a credit on the credit ring from the remote destination ingress queue. This ensures that the remote ingress queue has resources available to process the request when it arrives. When an egress queue wishes to send a transaction packet on the ring 4024, it may do so only if doing so does not preempt an incoming packet that is ultimately destined for a remote node. When an incoming packet arrives at a ring station 4004 from either direction, the destination ID of the packet is examined to determine whether the ring station 4004 is the final destination of the packet. If the destination ID is not equal to the node ID of the ring station 4004, the packet proceeds to the next ring station 4004 on the subsequent clock. Otherwise, the packet leaves the ring 4024 in the same clock to be consumed by whichever ingress queue is implicated by the transaction type of the packet.
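The per-clock forwarding decision just described can be sketched as follows; the packet representation and function name are illustrative only:

```python
# Minimal sketch (names assumed) of the forwarding decision a ring station
# makes each clock: a packet is consumed at the station whose node ID matches
# its destination ID; otherwise it travels on to the next station.
def ring_station_step(packet: dict, node_id: int) -> str:
    if packet["dest_id"] == node_id:
        return "consume"   # handed to the ingress queue for its transaction type
    return "forward"       # moves to the next ring station on the next clock

assert ring_station_step({"dest_id": 3}, node_id=3) == "consume"
assert ring_station_step({"dest_id": 3}, node_id=5) == "forward"
```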
In general, LLC 4005 comprises N LLC slices 4006, wherein each of the N slices 4006 is responsible for caching a different, approximately 1/N, portion of the physical address space of processor 100, as determined by a hashing algorithm, or simply hash. The hash is a function that takes a physical address as input and selects the LLC slice responsible for caching that physical address. When a request must be made to LLC 4005 from a core 4002 or a snooping agent, the request must be sent to the appropriate LLC slice 4006, i.e., the slice responsible for caching the physical address of the request. The appropriate LLC slice 4006 is determined by applying the hash to the physical address of the request.
The hashing algorithm is a surjective (onto) function whose domain is the set of physical addresses, or a subset thereof, and whose range is the number of LLC slices 4006 currently included. More specifically, the range is the set of indices of the LLC slices 4006 (e.g., 0 to 7 in the case of eight LLC slices 4006). The function may be computed by examining an appropriate subset of the physical address bits. For example, in a system with eight LLC slices 4006, the output of the hashing algorithm may simply be PA[10:8], i.e., three of the physical address bits, namely bits 8 through 10. In another embodiment in which the number of LLC slices 4006 is 8, the output of the hash is a logical function of other address bits, e.g., the three bits produced as {PA[17], PA[14], PA[12]^PA[10]^PA[9]}.
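The following runnable Python sketch computes both example hashes; the helper names and the sample address are arbitrary:

```python
# Sketch of the two 8-slice hash variants above; PA[b] denotes physical
# address bit b.
def bit(pa: int, b: int) -> int:
    return (pa >> b) & 1

def hash_simple(pa: int) -> int:
    return (pa >> 8) & 0x7                 # PA[10:8]

def hash_xor(pa: int) -> int:
    # {PA[17], PA[14], PA[12]^PA[10]^PA[9]} as a 3-bit slice index
    return (bit(pa, 17) << 2) | (bit(pa, 14) << 1) | (bit(pa, 12) ^ bit(pa, 10) ^ bit(pa, 9))

pa = 0x35A40                               # arbitrary sample address
assert 0 <= hash_simple(pa) < 8 and 0 <= hash_xor(pa) < 8
```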
All requestors of LLC 4005 must use the same hashing algorithm at any time during which caching is being performed. Since the hash specifies where an address is cached and where snoops are to be sent during operation, the hash is changed only through coordination among all of the cores 4002, the LLC slices 4006, and the snooping agents. As described in the dual-use NNU memory array applications, updating the hashing algorithm basically comprises: (1) synchronizing all cores 4002 to prevent new cacheable accesses; (2) performing a write-back invalidate of all LLC slices 4006 currently included in LLC 4005, which causes modified cache lines to be written back to system memory and all cache lines to be invalidated (as described below, the write-back invalidate may be a selective write-back invalidate, in which only those cache lines whose addresses the new hashing algorithm hashes to a different slice than the old hashing algorithm are evicted, i.e., invalidated and, if modified, written back before invalidation); (3) broadcasting a hash update message to each core 4002 and snooping source, instructing each of them to change to the new hash (from an inclusive hash to an exclusive hash, or vice versa, as described below); (4) updating the mode input of the selection logic that controls access to the memory array; and (5) resuming execution with the new hashing algorithm.
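As a hedged illustration of the selective write-back invalidate of step (2), the sketch below evicts only lines whose slice assignment changes under the new hash; the CacheLine type and the write_back callback are assumptions of the sketch:

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    address: int
    modified: bool

def selective_wbinvd(cache_lines, old_hash, new_hash, write_back):
    """Evict only lines whose slice assignment changes under the new hash."""
    for line in list(cache_lines):
        if old_hash(line.address) != new_hash(line.address):
            if line.modified:
                write_back(line)      # modified lines are written back first
            cache_lines.remove(line)  # then the line is invalidated

lines = [CacheLine(0x300, True)]      # old slice 3, new slice 1 -> evicted
selective_wbinvd(lines, old_hash=lambda a: (a >> 8) & 7,
                 new_hash=lambda a: (a >> 9) & 7,
                 write_back=lambda line: None)
assert lines == []
```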
The hashing algorithms described above are suitable when the number N of LLC slices 4006 is 8, a power of 2, and they can easily be modified to accommodate other powers of 2, e.g., PA[9:8] for 4 slices or PA[11:8] for 16 slices. However, depending on whether NNU LLC slice 4006-4 is included in LLC 4005 (and depending on the number of core complexes 4012), N may or may not be a power of 2. Thus, as described in the dual-use NNU memory array applications, at least two different hashes may be used when the NNU 121 memory array has a dual use.
In an alternative embodiment, NNU 121 and DRAM controller 4018 are both coupled to a single ring station 4004. The single ring station 4004 includes an interface that enables NNU 121 and DRAM controller 4018 to transmit requests and data between each other rather than via ring bus 4024. This may be advantageous because it may reduce traffic on ring bus 4024 and provide high transmission performance between NNU 121 and system memory.
Preferably, processor 100 is fabricated on a single integrated circuit or chip. Thus, data transfer may be achieved between system memory and/or LLC 4005 and NNU 121 at a very high sustainable rate, which may be very advantageous for neural network applications, particularly neural network applications where the amount of weights and/or data is relatively large. That is, although not an execution unit of core 4002 as in the embodiment of fig. 1, NNU 121 is tightly coupled to core 4002, which may provide significant memory performance advantages over, for example, a neural network unit coupled to a peripheral bus such as a PCIe bus.
Referring now to fig. 42, a block diagram is shown that illustrates the ring station 4004-N of fig. 41 in greater detail. The ring station 4004-N includes a slave interface 6301, a first master interface 6302-0, referred to as master interface 0, and a second master interface 6302-1, referred to as master interface 1. Master interface 0 6302-0 and master interface 1 6302-1 are referred to individually and generically as master interface 6302 or collectively as master interfaces 6302. The ring station 4004-N further comprises three arbiters 6362, 6364 and 6366 coupled to respective buffers 6352, 6354 and 6356 that provide outgoing Requests (REQ), Data (DATA) and Acknowledgements (ACK), respectively, on the first unidirectional ring 4024-0 of the ring bus 4024; these three arbiters 6362, 6364 and 6366 receive incoming Requests (REQ), Data (DATA) and Acknowledgements (ACK), respectively, on the first unidirectional ring 4024-0. The ring station 4004-N comprises three additional arbiters 6342, 6344 and 6346 coupled to respective additional buffers 6332, 6334 and 6336 that provide outgoing Requests (REQ), Data (DATA) and Acknowledgements (ACK), respectively, on the second unidirectional ring 4024-1 of the ring bus 4024; these three arbiters 6342, 6344 and 6346 receive incoming Requests (REQ), Data (DATA) and Acknowledgements (ACK), respectively, on the second unidirectional ring 4024-1. The request, data, and acknowledge sub-rings of each unidirectional ring of the ring bus 4024 are described above. The snoop and credit sub-rings are not shown, but the slave 6301 and master 6302 interfaces are also coupled to the snoop and credit sub-rings.
Slave interface 6301 includes a load queue 6312 and a store queue 6314; master interface 0 6302-0 includes a load queue 6322 and a store queue 6324; and master interface 1 6302-1 includes a load queue 6332 and a store queue 6334. The load queue 6312 of the slave interface 6301 receives and queues requests from both unidirectional rings 4024-0 and 4024-1 of the ring bus 4024 and provides queued data to each of the respective arbiters 6364 and 6344 of the ring bus 4024. The store queue 6314 of the slave interface 6301 receives and queues data from both directions of the ring bus 4024 and provides acknowledgements to each of the respective arbiters 6366 and 6346 of the ring bus 4024. The load queue 6322 of master interface 0 6302-0 receives data from the second unidirectional ring 4024-1 and provides queued requests to the arbiter 6362 of the first unidirectional ring 4024-0. The store queue 6324 of master interface 0 6302-0 receives acknowledgements from the second unidirectional ring 4024-1 and provides queued data to the arbiter 6364 of the first unidirectional ring 4024-0. The load queue 6332 of master interface 1 6302-1 receives data from the first unidirectional ring 4024-0 and provides queued requests to the arbiter 6342 of the second unidirectional ring 4024-1. The store queue 6334 of master interface 1 6302-1 receives acknowledgements from the first unidirectional ring 4024-0 and provides queued data to the arbiter 6344 of the second unidirectional ring 4024-1. The load queue 6312 of the slave interface 6301 provides queued requests to NNU 121 and receives data from NNU 121. The store queue 6314 of the slave interface 6301 provides queued requests and data to NNU 121 and receives acknowledgements from NNU 121. The load queue 6322 of master interface 0 6302-0 receives and queues requests from NNU 121 and provides data to NNU 121. The store queue 6324 of master interface 0 6302-0 receives and queues requests and data from NNU 121 and provides acknowledgements to NNU 121. The load queue 6332 of master interface 1 6302-1 receives and queues requests from NNU 121 and provides data to NNU 121. The store queue 6334 of master interface 1 6302-1 receives and queues requests and data from NNU 121 and provides acknowledgements to NNU 121.
In general, slave interface 6301 receives requests made by the cores 4002 to load data from NNU 121 (received by load queue 6312) and requests made by the cores 4002 to store data to NNU 121 (received by store queue 6314), although slave interface 6301 may also receive such requests from other ring bus 4024 agents. For example, via slave interface 6301, a core 4002 may: write control data to, and read status data from, the control/status register 127; write instructions to the program memory 129; write/read data/weights with respect to the data RAM 122 and weight RAM 124; and write control words to the bus control memory 6636 to program the DMA controllers 6602 (see fig. 45) of NNU 121. More specifically, in embodiments in which NNU 121 is located on the ring bus 4024 rather than being an execution unit of a core 4002, a core 4002 may write to the control/status register 127 to instruct NNU 121 to perform operations similar to those described for the MTNN instruction 1400 of fig. 14, and may read from the control/status register 127 to instruct NNU 121 to perform operations similar to those described for the MFNN instruction 1500 of fig. 15. The list of operations includes, but is not limited to: starting execution of the program in program memory 129, pausing execution of the program in program memory 129, requesting notification (e.g., an interrupt) of completion of execution of the program in program memory 129, resetting NNU 121, writing the DMA base registers, and writing a strobe address to cause a row buffer to be written to or read from the data/weight RAM 122/124. In addition, slave interface 6301 may generate interrupts (e.g., PCI interrupts) to each core 4002 at the request of NNU 121. Preferably, the sequencer 128 instructs the slave interface 6301 to generate an interrupt, for example in response to decoding an instruction fetched from the program memory 129. Alternatively, a DMAC 6602 may instruct the slave interface 6301 to generate an interrupt, for example in response to completing a DMA operation (e.g., after writing data words that are the result of a neural network layer computation from the data RAM 122 to system memory). In one embodiment, the interrupt comprises a vector, such as an 8-bit x86 interrupt vector. Preferably, a flag in the control word read by a DMAC 6602 from the bus control memory 6636 specifies whether the DMAC 6602 instructs the slave interface 6301 to generate an interrupt upon completion of the DMA operation.
In general, NNU 121 generates requests via the master interfaces 6302 to write data to system memory (received by the store queues 6324/6334) and requests via the master interfaces 6302 to read data from system memory (received by the load queues 6322/6332) (e.g., via the DRAM controller 4018), although the master interfaces 6302 may also receive requests from NNU 121 to read/write data with respect to other ring bus 4024 agents. For example, NNU 121 may transfer data/weights from system memory to the data RAM 122 and weight RAM 124, and may transfer data from the data RAM 122 and weight RAM 124 to system memory, via the master interfaces 6302.
Preferably, the various entities of NNU 121 accessible via ring bus 4024 (such as data RAM 122, weight RAM 124, program memory 129, bus control memory 6636, and control/status registers 127, etc.) are memory mapped into system memory space. In one embodiment, the accessible NNU 121 entities are memory mapped via Peripheral Component Interconnect (PCI) configuration registers of the PCI configuration protocol, which is well known.
An advantage of having two master interfaces 6302 in the ring station 4004-N is that it enables NNU 121 to transmit and/or receive simultaneously with respect to both system memory (via the DRAM controller 4018) and the various LLC slices 4006, or alternatively, with respect to system memory in parallel at twice the bandwidth of an embodiment having a single master interface.
In one embodiment, the data RAM 122 is 64KB, arranged as 16 rows of 4KB each, thus requiring 4 bits to specify its row address; the weight RAM 124 is 8MB, arranged as 2K rows of 4KB each, thus requiring 11 bits to specify its row address; the program memory 129 is 8KB, arranged as 1K rows of 64 bits each, thus requiring 10 bits to specify its row address; the bus control memory 6636 is 1KB, arranged as 128 rows of 64 bits each, thus requiring 7 bits to specify its row address; and each of the queues 6312/6314/6322/6324/6332/6334 includes 16 entries, thus requiring 4 bits to specify the index of an entry. In addition, the data sub-ring of each unidirectional ring 4024 of the ring bus 4024 is 64 bytes wide. Accordingly, a 64-byte portion is referred to herein as a block, a block of data, etc. ("data" may be used generically to refer to both data and weights). Thus, the rows of the data RAM 122 or weight RAM 124, although not addressable at the block level, are each subdivided into 64 blocks; in addition, the data/weight write buffers 6612/6622 (of fig. 45) and the data/weight read buffers 6614/6624 (of fig. 45) are each also subdivided into 64 blocks of 64 bytes each and are addressable at the block level; thus, 6 bits are required to specify the address of a block within a row/buffer. The following description assumes these sizes for ease of illustration; however, various other embodiments of different sizes are contemplated.
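These bit widths follow directly from the quoted sizes; the short Python check below reproduces them (all sizes taken from the text above):

```python
# Worked check of the address widths quoted above: a 4KB row is 64 blocks of
# 64 bytes (6 block-address bits); the 8MB weight RAM has 2K rows (11 bits);
# the 64KB data RAM has 16 rows (4 bits).
import math

ROW_BYTES, BLOCK_BYTES = 4096, 64
assert math.log2(ROW_BYTES // BLOCK_BYTES) == 6      # block-within-row bits
assert math.log2(8 * 2**20 // ROW_BYTES) == 11       # weight RAM row bits (8MB)
assert math.log2(64 * 2**10 // ROW_BYTES) == 4       # data RAM row bits (64KB)
```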
Referring now to FIG. 43, a block diagram is shown illustrating the slave interface 6301 of FIG. 42 in greater detail. The slave interface 6301 includes the load queue 6312, the store queue 6314, the arbiters 6342, 6344, 6346, 6362, 6364, and 6366, and the buffers 6332, 6334, 6336, 6352, 6354, and 6356, which are coupled to the ring bus 4024 of FIG. 42. FIG. 43 also shows other requesters 6472 (e.g., master interface 0 6302-0) that present requests to arbiter 6362 and other requesters 6474 (e.g., master interface 1 6302-1) that present requests to arbiter 6342.
The slave load queue 6312 includes a queue of entries 6412 coupled to a request arbiter 6416 and a data arbiter 6414. In the illustrated embodiment, the queue includes 16 entries 6412. Each entry 6412 includes storage for an address, a source identifier, a direction, a transaction identifier, and a data block associated with the request. The address specifies the location within NNU 121 from which the requested data is to be loaded for return to the requesting ring bus 4024 agent (e.g., a core 4002). The address may specify a block location within the control/status register 127, the data RAM 122, or the weight RAM 124. In the case where the address specifies a block location within the data RAM 122/weight RAM 124, the upper bits specify the row of the data RAM 122/weight RAM 124 and the lower bits (e.g., 6 bits) specify the block within the specified row. Preferably, the lower bits are used to control the data/weight read buffer multiplexer 6615/6625 to select the appropriate block within the data/weight read buffer 6614/6624 (see FIG. 45). The source identifier specifies the requesting ring bus 4024 agent. The direction specifies on which of the two unidirectional rings 4024-0 or 4024-1 the data is to be sent back to the requesting agent. The transaction identifier is specified by the requesting agent and is returned by the ring station 4004-N to the requesting agent along with the requested data.
Each entry 6412 also has an associated state. A finite state machine (FSM) updates the state. In one embodiment, the FSM operates as follows. When the load queue 6312 detects a load request destined for it on the ring bus 4024, the load queue 6312 allocates an available entry 6412 and fills the allocated entry 6412, and the FSM updates the state of the allocated entry 6412 to requestor-NNU. The request arbiter 6416 arbitrates between the requestor-NNU entries 6412. When the allocated entry 6412 wins arbitration and is sent as a request to NNU 121, the FSM marks the entry 6412 as pending-NNU-data. When NNU 121 responds with the requested data, the load queue 6312 loads the data into the entry 6412 and marks the entry 6412 as requestor-data-ring. The data arbiter 6414 arbitrates between the requestor-data-ring entries 6412. When an entry 6412 wins arbitration and the data is sent on the ring bus 4024 to the ring bus 4024 agent that requested the data, the FSM marks the entry 6412 as available and issues a credit on its credit ring.
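A minimal Python model of this entry state machine follows; the state and event names track the prose above, while everything else is illustrative:

```python
from enum import Enum, auto

# Hedged model of the slave load queue entry FSM; transitions follow the
# prose, and the event names are assumptions of the sketch.
class SlaveLoadState(Enum):
    AVAILABLE = auto()
    REQUESTOR_NNU = auto()       # allocated, waiting to win arbitration to the NNU
    PENDING_NNU_DATA = auto()    # request sent, waiting for NNU data
    REQUESTOR_DATA_RING = auto() # data in hand, waiting for the data ring

TRANSITIONS = {
    (SlaveLoadState.AVAILABLE, "load_request_allocated"): SlaveLoadState.REQUESTOR_NNU,
    (SlaveLoadState.REQUESTOR_NNU, "won_nnu_arbitration"): SlaveLoadState.PENDING_NNU_DATA,
    (SlaveLoadState.PENDING_NNU_DATA, "nnu_data_returned"): SlaveLoadState.REQUESTOR_DATA_RING,
    (SlaveLoadState.REQUESTOR_DATA_RING, "sent_on_data_ring"): SlaveLoadState.AVAILABLE,
}

def step(state: SlaveLoadState, event: str) -> SlaveLoadState:
    return TRANSITIONS.get((state, event), state)   # credit issued on the last step

assert step(SlaveLoadState.AVAILABLE, "load_request_allocated") is SlaveLoadState.REQUESTOR_NNU
```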
The slave store queue 6314 comprises a queue of entries 6422 coupled to a request arbiter 6426 and an acknowledge arbiter 6424. In the illustrated embodiment, the queue includes 16 entries 6422. Each entry 6422 includes storage for an address, a source identifier, and the data associated with the request. The address specifies the location within NNU 121 to which the data provided by the requesting ring bus 4024 agent (e.g., a core 4002) is to be stored. The address may specify a block location within the control/status register 127, the data RAM 122 or the weight RAM 124, a location within the program memory 129, or a location within the bus control memory 6636. In the case where the address specifies a block location within the data RAM 122/weight RAM 124, the upper bits specify the row of the data RAM 122/weight RAM 124 and the lower bits (e.g., 6 bits) specify the block within the specified row. Preferably, the lower bits are used to control the data/weight demultiplexer 6611/6621 to select the appropriate block of the data/weight write buffer 6612/6622 to be written (see FIG. 45). The source identifier specifies the requesting ring bus 4024 agent.
Each entry 6422 also has an associated state. A finite state machine (FSM) updates the state. In one embodiment, the FSM operates as follows. When the store queue 6314 detects a store request destined for it on the ring bus 4024, the store queue 6314 allocates an available entry 6422 and fills the allocated entry 6422, and the FSM updates the state of the allocated entry 6422 to requestor-NNU. The request arbiter 6426 arbitrates between the requestor-NNU entries 6422. When an entry 6422 wins arbitration and is sent to NNU 121 along with its data, the FSM marks the entry 6422 as pending-NNU-acknowledge. When NNU 121 responds with an acknowledgement, the FSM marks the entry 6422 as requestor-acknowledge-ring. The acknowledge arbiter 6424 arbitrates between the requestor-acknowledge-ring entries 6422. When an entry 6422 wins arbitration and an acknowledgement is sent on the acknowledge ring to the ring bus 4024 agent that requested the store, the FSM marks the entry 6422 as available and issues a credit on its credit ring. The store queue 6314 also receives a wr busy signal from NNU 121, which indicates that the store queue 6314 must not send requests to NNU 121 until the wr busy signal is no longer asserted.
Referring now to FIG. 44, a block diagram is shown illustrating master interface 0 6302-0 of FIG. 42 in greater detail. Although FIG. 44 shows master interface 0 6302-0, it is also representative of the details of master interface 1 6302-1 of FIG. 42, and the two are therefore referred to generically as master interface 6302. The master interface 6302 includes the load queue 6322, the store queue 6324, the arbiters 6362, 6364, and 6366, and the buffers 6352, 6354, and 6356 coupled to the ring bus 4024 of FIG. 42. FIG. 44 also shows other acknowledge requesters 6576 (e.g., the slave interface 6301) that present acknowledge requests to the arbiter 6366.
The master interface 6302 also includes an arbiter 6534 (not shown in FIG. 42), which receives requests from the load queue 6322 and from other requesters 6572 (e.g., the DRAM controller 4018 in embodiments in which NNU 121 and the DRAM controller 4018 share ring station 4004-N) and presents the winning request to the arbiter 6362 of FIG. 42. The master interface 6302 also includes a buffer 6544, which receives data associated with an entry 6512 of the load queue 6322 from the ring bus 4024 and provides it to NNU 121. The master interface 6302 also includes an arbiter 6554 (not shown in FIG. 42), which receives data from the store queue 6324 and from other requesters 6574 (e.g., the DRAM controller 4018 in embodiments in which NNU 121 and the DRAM controller 4018 share ring station 4004-N) and presents the winning data to the arbiter 6364 of FIG. 42. The master interface 6302 also includes a buffer 6564, which receives acknowledgements from the ring bus 4024 associated with an entry 6522 of the store queue 6324 and provides them to NNU 121.
The load queue 6322 includes a queue of entries 6512 coupled to an arbiter 6514. In the illustrated embodiment, the queue includes 16 entries 6512. Each entry 6512 includes storage for an address and a destination identifier. The address specifies an address (46 bits in one embodiment) in the ring bus 4024 address space (e.g., a system memory location). The destination identifier specifies the ring bus 4024 agent (e.g., system memory) from which the data is to be loaded.
The load queue 6322 receives a master load request from NNU 121 (e.g., from a DMAC 6602) to load data from a ring bus 4024 agent (e.g., system memory) into the data RAM 122, weight RAM 124, program memory 129, or bus control memory 6636. The master load request specifies a destination identifier, a ring bus address, and the index of the load queue 6322 entry 6512 to be used. When the load queue 6322 receives a master load request from NNU 121, the load queue 6322 fills the indexed entry 6512, and the FSM updates the entry 6512 state to requestor-credit. When the load queue 6322 obtains a credit from the credit ring to send a request for the data to the destination ring bus 4024 agent (e.g., system memory), the FSM updates the state to requestor-request-ring. The arbiter 6514 arbitrates between the requestor-request-ring entries 6512 (and the arbiter 6534 arbitrates between the load queue 6322 and the other requesters 6572). When an entry 6512 is granted the request ring, the request is sent on the request ring to the destination ring bus 4024 agent (e.g., system memory), and the FSM updates the state to pending-data-ring. When the ring bus 4024 responds with the data (e.g., from system memory), the data is received in the buffer 6544 and provided to NNU 121 (e.g., to the data RAM 122, weight RAM 124, program memory 129, or bus control memory 6636), and the FSM updates the entry 6512 state to available. Preferably, the index of the entry 6512 is included within the data packet to enable the load queue 6322 to determine the entry 6512 with which the data packet is associated. The load queue 6322 preferably provides the entry 6512 index to NNU 121 along with the data to enable NNU 121 to determine with which entry 6512 the data is associated and to enable NNU 121 to reuse the entry 6512.
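The lifecycle of a master load entry can be sketched as follows; the Ring stand-in class and method names are assumptions of the sketch, and the key point is that a credit is obtained before the request is launched:

```python
class Ring:
    """Trivial stand-in for a sub-ring; queues and arbitration are omitted."""
    def acquire(self): pass                    # obtain a credit (blocking, notionally)
    def send(self, what): pass                 # launch a packet after arbitration
    def receive(self): return b"\x00" * 64     # one 64-byte data block arrives

def master_load_entry(credit_ring: Ring, request_ring: Ring, data_ring: Ring) -> bytes:
    credit_ring.acquire()       # requestor-credit: wait for a remote-queue credit
    request_ring.send("load")   # requestor-request-ring -> pending-data-ring
    return data_ring.receive()  # block arrives with the entry index; entry freed

block = master_load_entry(Ring(), Ring(), Ring())
assert len(block) == 64
```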
The master store queue 6324 comprises a queue of entries 6522 coupled to an arbiter 6524. In the illustrated embodiment, the queue includes 16 entries 6522. Each entry 6522 includes storage for an address, a destination identifier, a data field for holding the data to be stored, and a coherency flag. The address specifies an address in the ring bus 4024 address space (e.g., a system memory location). The destination identifier specifies the ring bus 4024 agent (e.g., system memory) into which the data is to be stored. The coherency flag is sent with the data to the destination agent. If the coherency flag is set, it instructs the DRAM controller 4018 to snoop LLC 4005 and invalidate the copy in LLC 4005, if one exists. Otherwise, the DRAM controller 4018 writes the data to system memory without snooping LLC 4005.
The store queue 6324 receives a master store request from NNU 121 (e.g., from a DMAC 6602) to store data from the data RAM 122 or weight RAM 124 to a ring bus 4024 agent (e.g., system memory). The master store request specifies the destination identifier, the ring bus address, the index of the store queue 6324 entry 6522 to be used, and the data to store. When the store queue 6324 receives a master store request from NNU 121, the store queue 6324 fills the indexed entry 6522, and the FSM updates the entry 6522 state to requestor-credit. When the store queue 6324 obtains a credit from the credit ring to send the data to the destination ring bus 4024 agent (e.g., system memory), the FSM updates the state to requestor-data-ring. The arbiter 6524 arbitrates between the requestor-data-ring entries 6522 (and the arbiter 6554 arbitrates between the store queue 6324 and the other requesters 6574). When an entry 6522 is granted the data ring, the data is sent on the data ring to the destination ring bus 4024 agent (e.g., system memory), and the FSM updates the state to pending-acknowledge-ring. When the ring bus 4024 responds with an acknowledgement of the data (e.g., from system memory), the acknowledgement is received in the buffer 6564. The store queue 6324 then provides the acknowledgement to NNU 121 to inform NNU 121 that the store has been performed, and the FSM updates the entry 6522 state to available. Preferably, the store queue 6324 need not arbitrate to provide acknowledgements to NNU 121 (e.g., because a DMAC 6602 exists for each store queue 6324, as in the embodiment of FIG. 45). However, in embodiments in which the store queue 6324 must arbitrate to provide acknowledgements, the FSM updates the entry 6522 state to requestor-NNU-complete when the ring bus 4024 responds with an acknowledgement, and updates the entry 6522 state to available once the entry 6522 wins arbitration and the acknowledgement is provided to NNU 121. Preferably, the index of the entry 6522 is included within the acknowledgement packet received from the ring bus 4024, which enables the store queue 6324 to determine the entry 6522 with which the acknowledgement packet is associated. The store queue 6324 provides the entry 6522 index to NNU 121 along with the acknowledgement to enable NNU 121 to determine with which entry 6522 the acknowledgement is associated and to enable NNU 121 to reuse the entry 6522.
Referring now to FIG. 45, a block diagram is shown illustrating a portion of the ring station 4004-N of FIG. 42 and a ring bus coupled embodiment of NNU 121. The slave interface 6301, master interface 0 6302-0 and master interface 1 6302-1 of the ring station 4004-N are shown. The ring bus coupled embodiment of NNU 121 of FIG. 45 includes the data RAM 122, weight RAM 124, program memory 129, sequencer 128 and control/status register 127 described in detail above. The ring bus coupled embodiment of NNU 121 is similar in many respects to the execution unit embodiments described above, and for brevity those aspects will not be described again. The ring bus coupled embodiment of NNU 121 also includes the elements of FIG. 40, e.g., the move unit 5802, move register 5804, multiplexing registers 208/705, NPUs 126, multiplexer 5806, output unit 5808, and output register 1104. NNU 121 further includes a first direct memory access controller (DMAC0) 6602-0, a second direct memory access controller (DMAC1) 6602-1, a bus control memory 6636, data demultiplexers 6611, data write buffers 6612, a data RAM multiplexer 6613, data read buffers 6614, data read buffer multiplexers 6615, weight demultiplexers 6621, weight write buffers 6622, a weight RAM multiplexer 6623, weight read buffers 6624, weight read buffer multiplexers 6625, a slave multiplexer 6691, a master 0 multiplexer 6693, and a master 1 multiplexer 6692. In one embodiment, there are three instances of each of the data demultiplexer 6611, data write buffer 6612, data read buffer 6614, data read buffer multiplexer 6615, weight demultiplexer 6621, weight write buffer 6622, weight read buffer 6624, and weight read buffer multiplexer 6625, one associated with each of the slave interface 6301, master interface 0 6302-0, and master interface 1 6302-1 of the ring bus 4024. In one embodiment, the three instances, associated respectively with the slave interface 6301, master interface 0 6302-0, and master interface 1 6302-1 of the ring bus 4024, are paired to support data transfers in a double-buffered manner.
The data demultiplexers 6611 are coupled to receive data blocks from the slave interface 6301, master interface 0 6302-0, and master interface 1 6302-1, respectively. The data demultiplexers 6611 are further coupled to the data write buffers 6612, the data write buffers 6612 are coupled to the data RAM multiplexer 6613, the data RAM multiplexer 6613 is coupled to the data RAM 122, the data RAM 122 is coupled to the data read buffers 6614, the data read buffers 6614 are coupled to the data read buffer multiplexers 6615, and the data read buffer multiplexers 6615 are coupled to the slave multiplexer 6691, the master 0 multiplexer 6693, and the master 1 multiplexer 6692, respectively. The slave multiplexer 6691 is coupled to the slave interface 6301, the master 0 multiplexer 6693 is coupled to master interface 0 6302-0, and the master 1 multiplexer 6692 is coupled to master interface 1 6302-1. The weight demultiplexers 6621 are coupled to receive data blocks from the slave interface 6301, master interface 0 6302-0, and master interface 1 6302-1, respectively. The weight demultiplexers 6621 are further coupled to the weight write buffers 6622, the weight write buffers 6622 are coupled to the weight RAM multiplexer 6623, the weight RAM multiplexer 6623 is coupled to the weight RAM 124, the weight RAM 124 is coupled to the weight read buffers 6624, the weight read buffers 6624 are coupled to the weight read buffer multiplexers 6625, and the weight read buffer multiplexers 6625 are coupled to the slave multiplexer 6691, the master 0 multiplexer 6693, and the master 1 multiplexer 6692, respectively. The data RAM multiplexer 6613 and the weight RAM multiplexer 6623 are also coupled to the output register 1104 and the move register 5804. The data RAM 122 and weight RAM 124 are also coupled to the move unit 5802 and to the data multiplexing registers 208 and weight multiplexing registers 705 of the NPUs 126, respectively. The control/status register 127 is coupled to the slave interface 6301. The bus control memory 6636 is coupled to the slave interface 6301, the sequencer 128, DMAC0 6602-0 and DMAC1 6602-1. The program memory 129 is coupled to the slave interface 6301 and the sequencer 128. The sequencer 128 is coupled to the program memory 129, the bus control memory 6636, the NPUs 126, the move unit 5802, and the output unit 5808. DMAC0 6602-0 is also coupled to master interface 0 6302-0, and DMAC1 6602-1 is also coupled to master interface 1 6302-1.
The data write buffer 6612, data read buffer 6614, weight write buffer 6622, and weight read buffer 6624 are the width of the data RAM 122 and weight RAM 124, i.e., the width of the NPU 126 array, generally referred to herein as N. Thus, for example, in one embodiment there are 4096 NPUs 126 and the width of the data write buffer 6612, data read buffer 6614, weight write buffer 6622 and weight read buffer 6624 is 4096 bytes, although other embodiments are contemplated in which N is a value other than 4096. The data RAM 122 and weight RAM 124 are written an entire N-word row at a time. The output register 1104, the move register 5804, and the data write buffer 6612 write to the data RAM 122 via the data RAM multiplexer 6613, which selects one of them to write a row of words to the data RAM 122. The output register 1104, the move register 5804, and the weight write buffer 6622 write to the weight RAM 124 via the weight RAM multiplexer 6623, which selects one of them to write a row of words to the weight RAM 124. Control logic (not shown) controls the data RAM multiplexer 6613 to arbitrate among the data write buffer 6612, the move register 5804, and the output register 1104 for access to the data RAM 122, and controls the weight RAM multiplexer 6623 to arbitrate among the weight write buffer 6622, the move register 5804, and the output register 1104 for access to the weight RAM 124. The data RAM 122 and weight RAM 124 are also read an entire N-word row at a time. The NPUs 126, the move unit 5802, and the data read buffer 6614 read a row of words from the data RAM 122. The NPUs 126, the move unit 5802, and the weight read buffer 6624 read a row of words from the weight RAM 124. The control logic also controls the NPUs 126 (data multiplexing registers 208 and weight multiplexing registers 705), the move unit 5802, and the data read buffer 6614 to determine which of them, if any, reads the row of words output by the data RAM 122. In one embodiment, the micro-operations 3418 described with respect to FIG. 34 may include at least some of the control logic signals that control the data RAM multiplexer 6613, the weight RAM multiplexer 6623, the NPUs 126, the move unit 5802, the move register 5804, the output register 1104, the data read buffer 6614, and the weight read buffer 6624.
The data write buffer 6612, data read buffer 6614, weight write buffer 6622, and weight read buffer 6624 are addressable in block-size-aligned blocks. Preferably, the block size of the data write buffer 6612, data read buffer 6614, weight write buffer 6622 and weight read buffer 6624 matches the width of the data sub-ring of the ring bus 4024. This makes the ring bus 4024 well suited to reading/writing the data/weight RAM 122/124 as follows. In general, the ring bus 4024 performs a block-size write to each block of the data write buffer 6612, and once all the blocks of the data write buffer 6612 are filled, the data write buffer 6612 writes its N-word contents to an entire row of the data RAM 122. Likewise, the ring bus 4024 performs a block-size write to each block of the weight write buffer 6622, and once all the blocks of the weight write buffer 6622 are filled, the weight write buffer 6622 writes its N-word contents to an entire row of the weight RAM 124. In one embodiment, NNU 121 includes a row address register (not shown) associated with each data/weight write buffer 6612/6622. The row address register is updated each time the ring station 4004-N writes a block of data/weights into the buffer 6612/6622. However, before the row address register is updated, its current value is compared with the new value, and if the two values differ (i.e., a new row of the data RAM 122/weight RAM 124 is being written), this triggers a write of the data/weight write buffer 6612/6622 to the data RAM 122/weight RAM 124. In one embodiment, a write to the program memory 129 also triggers the writing of the data/weight write buffer 6612/6622 to the data RAM 122/weight RAM 124. Conversely, an N-word row is read from the data RAM 122 into the data read buffer 6614; the ring bus 4024 then performs a block-size read from each block of the data read buffer 6614. Similarly, an N-word row is read from the weight RAM 124 into the weight read buffer 6624; the ring bus 4024 then performs a block-size read from each block of the weight read buffer 6624. Although the data RAM 122 and the weight RAM 124 are depicted as dual-ported memories in FIG. 45, they are preferably single-ported, such that a single data RAM 122 port is shared by the data RAM multiplexer 6613 and the data read buffer 6614, and a single weight RAM 124 port is shared by the weight RAM multiplexer 6623 and the weight read buffer 6624. Thus, an advantage of the full-row read/write arrangement is that it enables the data RAM 122 and weight RAM 124 to be smaller by being single-ported (in one embodiment the weight RAM 124 is 8MB and the data RAM 122 is 64KB), while the ring bus 4024 reads and writes of the data RAM 122 and weight RAM 124 consume less bandwidth than if individual blocks were written, thus freeing more bandwidth for the NPUs 126, output register 1104, move register 5804, and move unit 5802 to access the N-word-wide rows.
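A hedged Python sketch of this fill-then-commit behavior follows; the class and method names are illustrative, and the row-address-change and buffer-full triggers follow the prose above:

```python
# Sketch of a write buffer that commits a full 4096-byte row either when all
# 64 blocks are filled or when a block arrives for a different row.
class WriteBuffer:
    BLOCKS, BLOCK_BYTES = 64, 64

    def __init__(self, ram_rows: dict):
        self.ram_rows = ram_rows                 # row number -> 4096-byte row
        self.blocks = [None] * self.BLOCKS
        self.row = None

    def write_block(self, row: int, block_idx: int, data: bytes):
        if self.row is not None and row != self.row:
            self.commit()                        # row-address change triggers commit
        self.row = row
        self.blocks[block_idx] = data
        if all(b is not None for b in self.blocks):
            self.commit()                        # buffer full: write the entire row

    def commit(self):
        self.ram_rows[self.row] = b"".join(
            b or bytes(self.BLOCK_BYTES) for b in self.blocks)
        self.blocks = [None] * self.BLOCKS

ram = {}
buf = WriteBuffer(ram)
for blk in range(64):
    buf.write_block(row=5, block_idx=blk, data=bytes([blk]) * 64)
assert len(ram[5]) == 4096
```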
The control/status register 127 is provided to the slave interface 6301. The slave multiplexer 6691 receives the output of the data read buffer multiplexer 6615 associated with the slave interface 6301 and the output of the weight read buffer multiplexer 6625 associated with the slave interface 6301 and selects one of them to provide to the slave interface 6301. In this manner, the slave load queue 6312 receives data for responding to load requests made to the control/status register 127, the data RAM 122, or the weight RAM 124. The master 0 multiplexer 6693 receives the output of the data read buffer multiplexer 6615 associated with master interface 0 6302-0 and the output of the weight read buffer multiplexer 6625 associated with master interface 0 6302-0 and selects one of them to provide to master interface 0 6302-0. In this manner, master interface 0 6302-0 receives data for responding to store requests made by its store queue 6324. The master 1 multiplexer 6692 receives the output of the data read buffer multiplexer 6615 associated with master interface 1 6302-1 and the output of the weight read buffer multiplexer 6625 associated with master interface 1 6302-1 and selects one of them to provide to master interface 1 6302-1. In this manner, master interface 1 6302-1 receives data for responding to store requests made by its store queue 6334. If the slave interface 6301 load queue 6312 requests a read from the data RAM 122, the slave multiplexer 6691 selects the output of the data read buffer multiplexer 6615 associated with the slave interface 6301; whereas if the slave interface 6301 load queue 6312 requests a read from the weight RAM 124, the slave multiplexer 6691 selects the output of the weight read buffer multiplexer 6625 associated with the slave interface 6301. Similarly, if the master interface 0 6302-0 store queue requests a read from the data RAM 122, the master 0 multiplexer 6693 selects the output of the data read buffer multiplexer 6615 associated with master interface 0 6302-0; whereas if the master interface 0 6302-0 store queue requests a read from the weight RAM 124, the master 0 multiplexer 6693 selects the output of the weight read buffer multiplexer 6625 associated with master interface 0 6302-0. Finally, if the master interface 1 6302-1 store queue requests a read from the data RAM 122, the master 1 multiplexer 6692 selects the output of the data read buffer multiplexer 6615 associated with master interface 1 6302-1; whereas if the master interface 1 6302-1 store queue requests a read from the weight RAM 124, the master 1 multiplexer 6692 selects the output of the weight read buffer multiplexer 6625 associated with master interface 1 6302-1. Thus, a ring bus 4024 agent (e.g., a core 4002) may read from the control/status register 127, data RAM 122, or weight RAM 124 via the slave interface 6301 load queue 6312. In addition, a ring bus 4024 agent (e.g., a core 4002) may write to the control/status register 127, data RAM 122, weight RAM 124, program memory 129, or bus control memory 6636 via the slave interface 6301 store queue 6314. More specifically, a core 4002 may write a program (e.g., a program that performs fully-connected, convolution, pooling, LSTM, or other recurrent neural network layer computations) to the program memory 129 and then write to the control/status register 127 to start the program.
In addition, a core 4002 may write control words to the bus control memory 6636 to cause a DMAC 6602 to perform DMA operations between the data RAM 122 or weight RAM 124 and a ring bus 4024 agent (e.g., system memory or LLC 4005). The sequencer 128 may also write control words to the bus control memory 6636 to cause a DMAC 6602 to perform DMA operations between the data RAM 122 or weight RAM 124 and a ring bus 4024 agent. Finally, as described in more detail below, a DMAC 6602 may perform DMA operations to perform transfers between ring bus 4024 agents (e.g., system memory or LLC 4005) and the data/weight RAM 122/124.
The slave interface 6301, master interface 0 6302-0, and master interface 1 6302-1 are each coupled to provide data blocks to their respective data demultiplexers 6611 and weight demultiplexers 6621. Arbitration logic (not shown) arbitrates for access to the data RAM 122 among the output register 1104, the move register 5804, and the data write buffers 6612 of the slave interface 6301, master interface 0 6302-0 and master interface 1 6302-1, and arbitrates for access to the weight RAM 124 among the output register 1104, the move register 5804, and the weight write buffers 6622 of the slave interface 6301, master interface 0 6302-0 and master interface 1 6302-1. In one embodiment, the write buffers 6612/6622 take precedence over the output register 1104 and move register 5804, and the slave interface 6301 takes precedence over the master interfaces 6302. In one embodiment, each data demultiplexer 6611 has 64 outputs (each output preferably 64 bytes wide) coupled to the 64 blocks of the respective data write buffer 6612. The data demultiplexer 6611 provides the received block on the output coupled to the appropriate block of the data write buffer 6612. Likewise, each weight demultiplexer 6621 has 64 outputs (each output preferably 64 bytes wide) coupled to the 64 blocks of the respective weight write buffer 6622. The weight demultiplexer 6621 provides the received block on the output coupled to the appropriate block of the weight write buffer 6622.
When the slave store queue 6314 provides a data block to its data/weight demultiplexer 6611/6621, the slave store queue 6314 also provides, as a control input to the data/weight demultiplexer 6611/6621, the address of the appropriate block of the data/weight write buffer 6612/6622 to be written. The block address is the lower six bits of the address held in the entry 6422, which was specified by the ring bus 4024 agent (e.g., a core 4002) that generated the slave store transaction. Conversely, when the slave load queue 6312 requests a data block from its data/weight read buffer multiplexer 6615/6625, the slave load queue 6312 also provides, as a control input to the data/weight read buffer multiplexer 6615/6625, the address of the appropriate block of the data/weight read buffer 6614/6624 to be read. The block address is the lower six bits of the address held in the entry 6412, which was specified by the ring bus 4024 agent (e.g., a core 4002) that generated the slave load transaction. Preferably, a core 4002 may perform a slave store transaction via the slave interface 6301 (e.g., to a predetermined ring bus 4024 address) to cause NNU 121 to write the contents of the data/weight write buffer 6612/6622 to the data/weight RAM 122/124; conversely, a core 4002 may perform a slave store transaction via the slave interface 6301 (e.g., to a predetermined ring bus 4024 address) to cause NNU 121 to read a row of the data/weight RAM 122/124 into the data/weight read buffer 6614/6624.
When a master interface 6302 load queue 6322/6332 provides a data block to its data/weight demultiplexer 6611/6621, the load queue 6322/6332 also provides the index of the entry 6512 to the corresponding DMAC 6602 that issued the load request to the load queue 6322/6332. To transfer an entire 4KB of data from system memory to a row of the data/weight RAM 122/124, the DMAC 6602 must generate 64 master load requests to the load queue 6322/6332. The DMAC 6602 logically divides the 64 master load requests into four groups of sixteen requests each. The DMAC 6602 issues the sixteen requests within a group to the corresponding 16 entries 6512 of the load queue 6322/6332. The DMAC 6602 maintains a state associated with each entry 6512 index. The state indicates which of the four groups is currently using the entry to load a data block. Thus, as described in more detail below, when the DMAC 6602 receives an entry 6512 index from the load queue 6322/6332, logic of the DMAC 6602 constructs the block address by concatenating the group number with the index, and provides the constructed block address as a control input to the data/weight demultiplexer 6611/6621.
Conversely, when a master interface 6302 store queue 6324/6334 requests a data block from its data/weight read buffer multiplexer 6615/6625, the store queue 6324/6334 also provides the index of the entry 6522 to the corresponding DMAC 6602 that issued the store request to the store queue 6324/6334. To transfer an entire 4KB of data from a row of the data/weight RAM 122/124 to system memory, the DMAC 6602 must generate 64 master store requests to the store queue 6324/6334. The DMAC 6602 logically divides the 64 store requests into four groups of sixteen requests each. The DMAC 6602 issues the sixteen requests within a group to the corresponding 16 entries 6522 of the store queue 6324/6334. The DMAC 6602 maintains a state associated with each entry 6522 index. The state indicates which of the four groups is currently using the entry to store a data block. Thus, as described in more detail below, when the DMAC 6602 receives an entry 6522 index from the store queue 6324/6334, logic of the DMAC 6602 constructs the block address by concatenating the group number with the index, and provides the constructed block address as a control input to the data/weight read buffer multiplexer 6615/6625.
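As a worked example of this bookkeeping, the 6-bit block address is the 2-bit group number concatenated with the 4-bit entry index (sixteen entries times four groups covering the 64 blocks of a row); the function name below is illustrative:

```python
# 2-bit group number (0..3) concatenated with a 4-bit entry index (0..15)
# yields the 6-bit block address (0..63) within a 4KB row.
def block_address(group: int, entry_index: int) -> int:
    assert 0 <= group < 4 and 0 <= entry_index < 16
    return (group << 4) | entry_index

assert block_address(0, 0) == 0
assert block_address(3, 15) == 63     # last block of the 4KB row
```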
Referring now to FIG. 46, a block diagram illustrating a ring bus coupled embodiment of NNU 121 is shown. FIG. 46 is identical to FIG. 34 in some respects, and like-numbered elements are alike. Like FIG. 34, FIG. 46 illustrates the ability of NNU 121 to receive micro-operations from multiple sources for provision to its pipeline. However, in the embodiment of FIG. 46, NNU 121 is coupled to the cores 4002 via the ring bus 4024 as in FIG. 41; the differences will now be described.
In the embodiment of FIG. 46, the multiplexer 3402 receives micro-operations from five different sources. The multiplexer 3402 provides the selected micro-operation 3418 to the NPU 126 pipeline stages 3401, the data RAM 122 and weight RAM 124, the move unit 5802, and the output unit 5808 for their control, as described above. As described with respect to FIG. 34, the first source is the sequencer 128, which produces micro-operations 3416. The second source is a version of the decoder 3404 of FIG. 34 modified to receive the data blocks of store requests stored by a core 4002 through the slave interface 6301 store queue 6314. As described above with respect to FIG. 34, a data block may include information similar to that of the microinstructions translated from an MTNN instruction 1400 or MFNN instruction 1500. The decoder 3404 decodes the data block and in response produces a micro-operation 3412. One example is a micro-operation 3412 produced in response to a request received from the slave interface 6301 store queue 6314 to write data to the data/weight RAM 122/124, or in response to a request received from the slave interface 6301 load queue 6312 to read data from the data/weight RAM 122/124. The third source is the data blocks of store requests stored directly by a core 4002 through the slave interface 6301 store queue 6314, in which the core 4002 includes a micro-operation 3414 that NNU 121 directly executes, as described above with respect to FIG. 34. Preferably, the core 4002 stores to different memory-mapped addresses within the ring bus 4024 address space to enable the decoder 3404 to distinguish the second and third sources of micro-operations. The fourth source is the micro-operations 7217 produced by the DMACs 6602. The fifth source is a no-op micro-operation 7219, in response to which NNU 121 maintains its state.
In one embodiment, the five sources have a priority scheme enforced by the decoder 3404, in which the direct micro-operation 3414 has the highest priority; the micro-operations 3412 produced by the decoder 3404 in response to slave store operations of the slave interface 6301 have the second-highest priority; the micro-operations 7217 produced by the DMACs 6602 have the next-highest priority; the micro-operations 3416 produced by the sequencer 128 have the next-highest priority; and the no-op micro-operation 7219 is the default (i.e., lowest priority) source, selected by the multiplexer 3402 when no other source is requesting. According to one embodiment, when a DMAC 6602 or the slave interface 6301 needs to access the data RAM 122 or weight RAM 124, it takes precedence over the program running on the sequencer 128, and the decoder 3404 stalls the sequencer 128 until the DMAC 6602 and the slave interface 6301 have completed their accesses.
Conditional interrupts to core
As described above, NNU 121 may generate an interrupt request for core 4002. The interrupt request may notify core 4002 that an event has occurred, so that core 4002 may act accordingly. The following are examples of events associated with an interrupt request from NNU 121 for core 4002.
First, NNU 121 can generate results (e.g., outputs of a neural network layer) that are accessible to core 4002. These results may be available in data RAM 122 or weight RAM 124 for core 4002 to read; alternatively, NNU 121 may have already transferred the results to system memory (e.g., using one of the DMACs 6602). Second, a program currently running on NNU 121 may require core 4002 to provide more input data for processing, e.g., for a new time step (e.g., in a Recurrent Neural Network (RNN) such as an LSTM layer), for a new network layer, or for a different set of nodes within the current network layer (e.g., where the number of neurons in that layer is greater than the number of NPUs 126 in NNU 121). For example, core 4002 may provide the data by writing it to data RAM 122/weight RAM 124; alternatively, core 4002 can inform NNU 121 of the location of the data in system memory so that NNU 121 can transfer the data from system memory to data RAM 122/weight RAM 124. Third, core 4002 may need to provide more weights for neural network computations, e.g., for a new network layer or for a different set of nodes within the current network layer (e.g., where the number of neurons in that layer is greater than the number of NPUs 126 in NNU 121). Core 4002 may provide the weights by, for example, writing them to data RAM 122/weight RAM 124; alternatively, core 4002 can inform NNU 121 of the location of the weights in system memory so that NNU 121 can transfer the weights from system memory to data RAM 122/weight RAM 124. Fourth, core 4002 may need to provide a new program for execution by NNU 121, i.e., a new program to be loaded into program memory 129. Core 4002 can supply the program by writing it into the program memory 129; alternatively, core 4002 can notify NNU 121 of the location of the program in system memory so that NNU 121 can transfer the program from system memory to program memory 129. Fifth, NNU 121 may simply need to let core 4002 know that the program is complete.
A problem is that in many systems, depending on the instruction set architecture of core 4002 and/or the operating system running on core 4002, there may be a relatively large interrupt latency, which is the time from when a system device such as NNU 121 generates an interrupt request for core 4002 until core 4002 reads the status of the device to determine the occurrence of the event associated with the interrupt request. Interrupt latency can result in the device sitting idle for a relatively large amount of time. The resulting relatively low utilization of the device may be particularly detrimental to the performance of the system, especially if the device is an accelerator such as a neural network computation accelerator like NNU 121.
Embodiments are described that enable NNU 121 to reduce this latency. Preferably, the program executes a set interrupt condition instruction, e.g., at its start, that sets an interrupt condition; when the condition is satisfied, NNU 121 interrupts core 4002 while the program continues to run. Preferably, the interrupt condition can be dynamically programmed as a combination of values of the operating state of NNU 121. In each of the cases described above, NNU 121 is caused to interrupt core 4002 a certain number of clock cycles before the interrupt request related event, i.e., approximately the number of clocks from the time NNU 121 asserts the interrupt request signal until the first instruction of the interrupt service routine that accesses NNU 121 in response to the interrupt request related event is executed (e.g., to read a status register in NNU 121 to determine whether the event has actually completed, to begin writing/reading data/weights with respect to data RAM 122/weight RAM 124, to send NNU 121 a pointer to the address of the data/weights to be read/written with respect to data RAM 122/weight RAM 124, to begin writing the program to program memory 129, etc.).
Referring now to FIG. 47, a block diagram is shown that illustrates an embodiment of NNU 121. NNU 121 is similar in many respects to the embodiments of NNU 121 described above, and like numbered elements are similar, with differences as described herein. In particular, NNU 121 includes the same data RAM 122, weight RAM 124, program memory 129, sequencer 128, and array of NPUs 126 as described above. Additionally, NNU 121 includes an interrupt condition register 4706, a status register 4704, and control logic 4702. The interrupt condition register 4706 and status register 4704 are coupled to the sequencer 128 and control logic 4702. The status register 4704 holds the status of NNU 121 during its operation. The status may include various fields, embodiments of which are described in more detail below with respect to FIG. 49. A set interrupt condition instruction 4722 stored in the program memory 129 and fetched by the sequencer 128 writes an interrupt condition to the interrupt condition register 4706. An embodiment of an interrupt condition, which may include a combination of various fields corresponding to the status fields, is described in more detail below with respect to FIG. 48. The combination may be selected by each set interrupt condition instruction 4722. As described in more detail below, control logic 4702 has an output on which control logic 4702 generates an interrupt request 4712 for processing core 4002 when the status satisfies the interrupt condition. Although FIG. 47 only shows the sequencer 128 updating the status register 4704, the various fields of the status register 4704 may also be updated by the operation of other elements of NNU 121. Preferably, NNU 121 includes multiple interrupt condition registers 4706 to enable a program to set multiple interrupt conditions.
Referring now to FIG. 48, a block diagram is shown illustrating the interrupt condition register 4706 of FIG. 47 in greater detail. The interrupt condition register 4706 includes the following fields: weight RAM write address 4802, weight RAM read address 4804, data RAM write address 4806, data RAM read address 4808, program counter 4812, loop count 4814, and iteration count 4816 (also referred to as repeat count 4816). These fields each have a corresponding valid bit, denoted V in FIG. 48. As described in more detail below, in determining whether the status held in the status register 4704 satisfies an interrupt condition, the control logic 4702 only considers the fields in the status register 4704 that correspond to fields in the interrupt condition register 4706 whose valid bit is set.
As described in more detail below, the weight RAM write address 4802, weight RAM read address 4804, data RAM write address 4806, data RAM read address 4808, program counter 4812, loop count 4814, and iteration count 4816 fields correspond to similarly named fields in the status register 4704. Preferably, the status held in the status register 4704 satisfies the interrupt condition specified in the interrupt condition register 4706 (e.g., at block 5012 of FIG. 50) if, for each field in the interrupt condition register 4706 whose valid bit is set, the value of the interrupt condition field matches the value of the corresponding status field in the status register 4704. In this manner, the interrupt condition may be viewed as the combined value of the valid fields of the interrupt condition register 4706.
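For illustration only, the matching rule may be sketched in C as follows; the structures, field order, and field widths are assumptions of the sketch and do not limit the register layouts of FIGS. 48 and 49:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_FIELDS 7

    /* Field order assumed by this sketch for both registers:
     * 0: weight RAM write address   4: program counter
     * 1: weight RAM read address    5: loop count
     * 2: data RAM write address     6: iteration (repeat) count
     * 3: data RAM read address                                   */
    typedef struct {
        uint16_t field[NUM_FIELDS];  /* status register 4704 fields */
    } nnu_status_t;

    typedef struct {
        uint16_t field[NUM_FIELDS];  /* interrupt condition register 4706 */
        uint8_t  valid;              /* one valid bit (V) per field       */
    } irq_cond_t;

    /* The status satisfies the interrupt condition if every condition
     * field whose valid bit is set matches the corresponding status
     * field; fields whose valid bit is clear are ignored. */
    static bool condition_satisfied(const nnu_status_t *s, const irq_cond_t *c)
    {
        for (int i = 0; i < NUM_FIELDS; i++)
            if (((c->valid >> i) & 1) && s->field[i] != c->field[i])
                return false;
        return true;
    }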
Referring now to FIG. 49, a block diagram illustrating the status register 4704 of FIG. 47 in greater detail is shown. The status register 4704 includes the following fields: weight RAM write address 4902, weight RAM read address 4904, data RAM write address 4906, data RAM read address 4908, program counter 4912 (also referred to as program memory address 4912), loop count 4914, and iteration count 4916 (also referred to as repeat count 4916). The weight RAM write address 4902, weight RAM read address 4904, data RAM write address 4906, data RAM read address 4908, program counter 4912, loop count 4914, and iteration count 4916 hold values that indicate the operational state of NNU 121. That is, these fields are updated during operation of NNU 121. Preferably, the values of the fields are the same as those described above with respect to the status register 127 of FIG. 39. The weight RAM write address 4902 specifies the address of the row of weight RAM 124 that was most recently written (e.g., address 125 of FIG. 47). The weight RAM read address 4904 specifies the address of the row of weight RAM 124 that was most recently read (e.g., address 125 of FIG. 47). The data RAM write address 4906 specifies the address of the row of data RAM 122 that was most recently written (e.g., address 123 of FIG. 47). The data RAM read address 4908 specifies the address of the row of data RAM 122 that was most recently read (e.g., address 123 of FIG. 47). The data RAM 122 and weight RAM 124 may be read from and written to by the NPUs 126 or the ring stations 4004-N, for example, via output register 1104, shift register 5804, move unit 5802, write buffers 6612/6622, and/or read buffers 6614/6624. The program counter 4912 specifies the address from which the sequencer 128 most recently fetched an instruction from the program memory 129 (e.g., address 131 of FIG. 47; e.g., the value of program counter 3802 of FIG. 38). The loop count 4914 indicates the number of times a loop of the program remains to be executed (e.g., the value of the loop counter 3804 of FIG. 38). The iteration count 4916 indicates the number of times the operation specified in the current program instruction remains to be performed (e.g., the value of the iteration counter 3806 of FIG. 38).
Referring now to FIG. 50, a flowchart is shown illustrating the operation of NNU 121 in FIG. 47 to generate interrupt requests for core 4002 based on conditions. Flow begins at block 5002.
At block 5002, the sequencer 128 fetches the set interrupt condition instruction 4722 from the program memory 129 at the address 131 held in the program counter 3802 and decodes the instruction 4722. The set interrupt condition instruction 4722 specifies an interrupt condition. Flow proceeds to block 5004.
At block 5004, in response to decoding the set interrupt condition instruction 4722, the sequencer 128 generates a micro-operation 3416 that writes the interrupt condition specified in the set interrupt condition instruction 4722 to the interrupt condition register 4706. Flow proceeds to block 5006.
At block 5006, the sequencer 128 continues to fetch and decode instructions from the program memory 129 and generate micro-operations 3416 for execution by the NPUs 126. This results in changes to the state of NNU 121, including updates to the status register 4704 (e.g., updates to the program counter 4912, the loop count 4914, the iteration count 4916, and the data/weight RAM read/write addresses 4908/4906/4904/4902). Flow proceeds to block 5008.
At block 5008, control logic 4702 monitors the status register 4704 to determine whether the status held in the status register 4704 satisfies the interrupt condition specified in the interrupt condition register 4706. Preferably, the status satisfies the interrupt condition if, for each field in the interrupt condition register 4706 whose valid bit is set, the value of the interrupt condition field matches the value of the corresponding status field in the status register 4704. Flow proceeds to decision block 5012.
At decision block 5012, if the status satisfies the interrupt condition, flow proceeds to block 5014; otherwise, flow returns to block 5008.
At block 5014, control logic 4702 generates interrupt request 4712 for core 4002. Flow ends at block 5014.
Referring now to FIG. 51, a table is shown that illustrates a program for storage in program memory 129 of NNU 121 of FIG. 47 and execution by NNU 121. The exemplary program performs computations associated with a layer of an artificial neural network as described above. For example, the program may be used to perform the multiply-accumulate computations associated with a fully-connected neural network layer, where 30 different instances of input data for the layer are stored in rows 0-29 of data RAM 122 and the associated weights are stored in rows 0-511 of weight RAM 124, and NNU 121 writes the 30 outputs of the layer corresponding to the 30 input data instances to rows 30-59 of data RAM 122. To accomplish this, the program contains a loop body that is executed 30 times, as specified by the loop count of 30 in the initialize instruction at address 0. Similar to that described above with respect to FIG. 4 (in the example of FIG. 4, different data RAM 122 addresses are used, and there is no loop), each execution instance of the loop body performs 512 multiply-accumulate operations and outputs the results to a different row of data RAM 122. In this example, assume that the interrupt latency of core 4002 is approximately 600 clock cycles, and it is desirable to have core 4002 begin reading the 30 outputs from data RAM 122 immediately after the 30th output is written to data RAM 122 (referred to herein as the interrupt request related event). That is, it is desirable to have NNU 121 generate interrupt request 4712 for core 4002 approximately 600 clocks before the interrupt request related event. Advantageously, the program includes a set interrupt condition instruction 4722 (at address 1) that specifies an interrupt condition causing NNU 121 to generate interrupt request 4712 for core 4002 approximately 600 clock cycles before the interrupt request related event (i.e., the writing of the 30th output to row 59 of data RAM 122).
At address 0, the initialize instruction specifies a loop count value of 30, which is the number of times each NPU 126 executes the loop body comprising the instructions at addresses 2-4. A loop instruction at the end of the loop body (address 5) decrements the loop count 4914 value and, if the result is non-zero, returns control to the top of the loop body (i.e., to the instruction at address 2). Preferably, the initialize instruction also clears the accumulator 202. Preferably, the loop instruction at address 5 also clears the accumulator 202. Alternatively, the multiply-accumulate instruction at address 2 may specify clearing of the accumulator 202. The initialize instruction also initializes the current data RAM 122 input row to 0 and the current data RAM 122 output row to 30, such that rows 0-29 are read and rows 30-59 are written by the 30 corresponding execution instances of the loop body.
At address 1, the set interrupt condition instruction 4722 loads the interrupt condition register 4706 with the condition specified by the set interrupt condition instruction 4722. In the example of FIG. 51, the condition is the combination of: the program counter 4912 equals the value of LABEL1 (which is 3, the address of the multiply-accumulate instruction at LABEL1), the loop count 4914 equals 1, and the repeat count 4916 equals 86. As explained below, these values cause NNU 121 to generate interrupt request 4712 for core 4002 approximately 600 clocks before the interrupt request related event occurs, namely NNU 121 writing the result of the last (30th) execution instance of the loop body to row 59 of data RAM 122.
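Continuing the non-limiting C sketch introduced above with respect to FIG. 48, the FIG. 51 condition could be encoded as follows (field indices follow the sketch's assumed field order):

    /* Hypothetical encoding of the FIG. 51 interrupt condition. */
    static irq_cond_t fig51_condition(void)
    {
        irq_cond_t c = {0};
        c.field[4] = 3;   /* program counter == LABEL1 (address 3)      */
        c.field[5] = 1;   /* loop count == 1                            */
        c.field[6] = 86;  /* repeat count == 86 (~600 clocks remain)    */
        c.valid = (1u << 4) | (1u << 5) | (1u << 6);
        return c;
    }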
At addresses 2 and 3, as described in detail above, the multiply-accumulate instructions perform 512 multiply-accumulate operations on a row of data read from the data RAM 122, which is rotated among the NPUs 126, with weights read from 512 different rows of the weight RAM 124, to produce results that are accumulated in the accumulators 202 of the NPUs 126, e.g., in a manner similar to that described above with respect to FIG. 4. More specifically, the instruction at address 3 specifies a repeat count of 511.
At address 4, the accumulated multiply-accumulate result is output to the current output row of data RAM 122 (row 30 on the first execution instance of the loop body and row 59 on the last/30th execution instance). In one embodiment, the output instruction performs an activation function on the value of the accumulator 202 before writing the result to the data RAM 122.
As can be appreciated from the program of FIG. 51 and the above description, each execution instance of the loop body requires approximately 514 clock cycles (1 clock for the multiply-accumulate at address 2, 511 clocks for the multiply-accumulate at address 3, 1 clock for the output instruction at address 4, and 1 clock for the loop instruction at address 5). In this example, assume that the interrupt latency of core 4002 is approximately 600 clock cycles. As a result, advantageously, control logic 4702 will generate interrupt request 4712 for core 4002 approximately 600 clock cycles before the interrupt request related event (i.e., the writing of the 30th output to row 59 of data RAM 122). This is because, when program counter 4912 equals 3 and loop count 4914 equals 1 and repeat count 4916 equals 86 (i.e., when control logic 4702 generates interrupt request 4712), NNU 121 will typically take 86 more clocks to perform the last 86 iterations of the multiply-accumulate at address 3; then 2 clocks to execute the instructions at addresses 4 and 5; then, in the last execution instance of the loop body, 1 clock to execute the instruction at address 2; then 511 clocks for the instruction at address 3; and then 1 clock for the instruction at address 4 (which writes the 30th output to row 59 of data RAM 122), for a total of approximately 600 clocks. In this example, assume that the clock cycles of NNU 121 and core 4002 are the same. However, other embodiments are contemplated in which the clock periods of the two differ; in such an embodiment, the value of the interrupt condition is selected to account for the difference in the clock cycles of NNU 121 and core 4002.
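The clock accounting above can be checked with a short, non-limiting C sketch; the constants merely restate the instruction timings assumed in this example:

    /* Clocks remaining when the FIG. 51 condition first holds, i.e.,
     * when the repeat count reaches 86 during the second-to-last
     * execution instance of the loop body. */
    static unsigned clocks_until_event(unsigned repeat_remaining)
    {
        return repeat_remaining /* finish multiply-accumulate at address 3 */
             + 1 + 1            /* output (address 4) + loop (address 5)   */
             + 1                /* address 2 in the last loop body instance */
             + 511              /* address 3 in the last loop body instance */
             + 1;               /* address 4 writes the 30th output         */
    }
    /* clocks_until_event(86) == 601, i.e., approximately 600 clocks. */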
Referring now to FIG. 52, a set interrupt condition instruction 4722 is shown for storage in program memory 129 of NNU 121 of FIG. 47 and execution by NNU 121, according to an alternative embodiment. The set interrupt condition instruction 4722 of FIG. 52 may be substituted at address 1 in the program of FIG. 51 to accomplish a result similar to that of FIG. 51 by using a different interrupt condition. The interrupt condition specified in FIG. 52 is: data RAM write address 4906 equals 57 and repeat count 4916 equals 86. During the 28th execution instance of the loop body, the output instruction at address 4 writes the 28th output to row 57 of data RAM 122. Then, during the 29th execution instance of the loop body and during execution of the instruction at address 3, the repeat count 4916 will be decremented to the value 86, which will cause control logic 4702 to generate interrupt request 4712 for core 4002. This will be approximately 600 clock cycles before the interrupt request related event (i.e., the writing of the 30th output to row 59 of data RAM 122). This is because, when the data RAM write address 4906 equals 57 and the repeat count 4916 equals 86 (i.e., when control logic 4702 generates interrupt request 4712), NNU 121 will typically take 86 more clocks to perform the last 86 iterations of the multiply-accumulate at address 3; then 2 clocks to execute the instructions at addresses 4 and 5; then, in the last execution instance of the loop body, 1 clock to execute the instruction at address 2; then 511 clocks for the instruction at address 3; and then 1 clock for the instruction at address 4 (which writes the 30th output to row 59 of data RAM 122), for a total of approximately 600 clocks.
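Using the same hypothetical encoding introduced with respect to FIG. 48, the FIG. 52 alternative condition would be:

    /* Hypothetical encoding of the FIG. 52 interrupt condition. */
    static irq_cond_t fig52_condition(void)
    {
        irq_cond_t c = {0};
        c.field[2] = 57;   /* data RAM write address == 57 */
        c.field[6] = 86;   /* repeat count == 86           */
        c.valid = (1u << 2) | (1u << 6);
        return c;
    }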
Referring now to FIG. 53, a table is shown that illustrates a program for storage in program memory 129 of NNU 121 of FIG. 47 and execution by NNU 121. The exemplary program performs computations associated with a layer of an artificial neural network as described above. For example, as described with respect to the program of FIG. 26, the program may be used to perform a convolution of a data matrix with a convolution kernel (e.g., the convolution kernels of FIG. 24) and write the results back to weight RAM 124. To accomplish this, the program contains a loop body that is executed 400 times, as specified by the loop count of 400 in the initialize instruction at address 0. Each execution instance of the loop body performs 9 multiply-accumulate operations (the instructions at addresses 2-7), outputs the results to a different row of weight RAM 124 (the instruction at address 8), decrements the weight RAM 124 row register (the instruction at address 9), and loops back to the top of the loop body (the instruction at address 10). Thus, each execution instance of the loop body takes approximately 12 clock cycles.
Again, in this example, assume that the interrupt latency of core 4002 is approximately 600 clock cycles, and that it is desirable to have core 4002 begin reading the 400 outputs from weight RAM 124 immediately after the 400th output is written to weight RAM 124 (referred to herein as the interrupt request related event). That is, it is desirable to have NNU 121 generate interrupt request 4712 for core 4002 approximately 600 clocks before the interrupt request related event. Advantageously, the program includes a set interrupt condition instruction 4722 (at address 1) that specifies an interrupt condition such that NNU 121 generates interrupt request 4712 for core 4002 approximately 600 clock cycles before the interrupt request related event.
In the example of FIG. 53, the set interrupt condition instruction 4722 loads the interrupt condition register 4706 with the following interrupt condition: the loop count 4914 equals 50. This interrupt condition value causes NNU 121 to generate interrupt request 4712 for core 4002 approximately 600 clocks before the interrupt request related event occurs, namely NNU 121 writing the result of the last (400th) execution instance of the loop body to weight RAM 124. This is because, when the loop count 4914 equals 50 (i.e., when control logic 4702 generates interrupt request 4712), NNU 121 will typically take approximately 12 clocks for each of the remaining 50 loop body execution instances, for a total of approximately 600 clocks.
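Again for illustration only, and reusing the C sketch introduced with respect to FIG. 48, the FIG. 53 condition and its clock accounting are:

    /* Hypothetical encoding of the FIG. 53 interrupt condition. */
    static irq_cond_t fig53_condition(void)
    {
        irq_cond_t c = {0};
        c.field[5] = 50;    /* loop count == 50                   */
        c.valid = 1u << 5;  /* only the loop count field is valid */
        return c;
    }
    /* 50 remaining loop body executions x ~12 clocks each == ~600 clocks. */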
An advantage of the described embodiments is that they enable a programmer to control, as accurately as possible, the number of clocks before an NNU 121 interrupt request related event occurs at which NNU 121 generates the interrupt request. This is particularly advantageous in view of the facts that: (1) a single NNU 121 instruction may require thousands of clocks to complete (e.g., with a large repeat count), and (2) NNU 121 programs may have loops with relatively large loop counts, yet may need to generate an interrupt request while the program is looping, i.e., the interrupt request may need to be generated at a particular iteration of the loop (i.e., before all iterations are completed), and in some cases at a particular point within that iteration.
It should be understood that in some cases the interrupt condition may be set such that NNU 121 interrupts core 4002 "too early". That is, at the end of the interrupt latency, the event associated with the interrupt request may not yet have occurred. (However, the operating system (e.g., a device driver) may still perform the correct operation by reading the status register 127 of NNU 121 to determine whether the event has occurred before continuing.) Thus, advantageously, a programmer can tailor the interrupt condition to adjust the amount of time before the interrupt request related event at which NNU 121 generates interrupt request 4712, according to whether it is more critical to reduce wasted utilization of NNU 121 or of core 4002. That is, if core 4002 is considered the more critical resource, the programmer may cause NNU 121 to generate interrupt request 4712 a number of clock cycles less than the interrupt latency of core 4002/the operating system before the interrupt request related event; conversely, if NNU 121 is considered the more critical resource, the programmer can cause NNU 121 to generate interrupt request 4712 a number of clock cycles greater than the interrupt latency before the interrupt request related event. Another advantage is increased productivity for programmers developing programs to run on NNU 121.
Although embodiments have been described in which the apparatus that generates an interrupt request according to a condition is a neural network unit, in other embodiments the apparatus may be another programmable device capable of executing a program. For example, embodiments are contemplated in which the device is an encryption/decryption unit, a compression/decompression unit, a multimedia encoder/decoder unit, a database indexing unit, or a graphics processing unit.
While various embodiments of the present invention have been described herein, they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant computer arts that various changes in form and detail can be made therein without departing from the scope of the invention. For example, software can enable the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described herein. This may be accomplished using general programming languages (e.g., C, C++), hardware description languages (HDL) including Verilog HDL, VHDL, and so on, or other available tools. Such software can be disposed on any known computer usable medium such as magnetic tape, semiconductor, magnetic disk, optical disk (e.g., CD-ROM, DVD-ROM, etc.), network, wired, or other communications medium. Embodiments of the apparatus and methods described herein may be included in a semiconductor intellectual property core, such as a processor core (e.g., embodied or specified in HDL), and transformed into hardware through the fabrication of integrated circuits. Furthermore, the apparatus and methods described herein may be embodied as a combination of hardware and software. Thus, the present invention should not be limited by any of the exemplary embodiments described herein, but should be defined only in accordance with the following claims and their equivalents. In particular, the present invention may be implemented within a processor device that may be used in a general purpose computer. Finally, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the scope of the invention as defined by the appended claims.
Cross Reference to Related Applications
This application is related to the following U.S. non-provisional applications, each of which is incorporated herein by reference in its entirety.
[Table of related U.S. non-provisional applications; reproduced as an image in the original publication.]
Each of the above non-provisional applications claims priority based on the following U.S. provisional applications, each of which is incorporated herein by reference in its entirety.
[Table of U.S. provisional applications; reproduced as an image in the original publication.]
The present application is also related to the following U.S. non-provisional applications, each of which is incorporated herein by reference in its entirety.
[Table of related U.S. non-provisional applications; reproduced as an image in the original publication.]
The present application is also related to the following U.S. non-provisional applications, each of which is incorporated herein by reference in its entirety.
[Table of related U.S. non-provisional applications; reproduced as an image in the original publication.]

Claims (18)

1. A programmable device, comprising:
an output for generating an interrupt request for a processing core coupled to the apparatus;
a program memory for holding instructions of a program picked up and executed by the apparatus;
a data memory for holding data processed by the instructions;
a status register to hold a status of the device that is updated during operation of the device, the status having fields that include:
a program memory address at which a most recent instruction was fetched from the program memory;
a data memory access address at which the device has last accessed data in the data memory; and
a repeat count for indicating a number of times the operation specified in the current program instruction has yet to be performed;
a condition register having condition fields corresponding to the status fields held in the status register, wherein a condition including one or more of the condition fields is writable to the condition register via an instruction of the program; and
control logic for generating an interrupt request on the output for the processing core in response to detecting that the status held in the status register satisfies the condition specified in the condition register.
2. The apparatus of claim 1, wherein,
the condition fields each have a corresponding valid bit; and
the status held in the status register satisfies the condition specified in the condition register if, for each condition field in the condition register whose valid bit is set, the value of the condition field matches the value of the corresponding status field in the status register.
3. The apparatus of claim 1, wherein,
the state also has fields including:
a loop count for indicating a number of times a loop of the program remains to be executed.
4. The apparatus of claim 1, further comprising:
a weight memory for holding weights associated with neural network computations;
the state also has fields including:
a weight memory access address at which the device has last accessed a weight in the weight memory.
5. The apparatus of claim 4, wherein,
the weight memory access address includes the following two addresses:
a weight memory read address at which the device last read a weight from the weight memory; and
a weight memory write address at which the device last wrote a weight to the weight memory.
6. The apparatus of claim 1, wherein,
the apparatus comprises a neural network unit to accelerate computations associated with a neural network.
7. The apparatus of claim 1, further comprising:
a ring station to couple the apparatus to a ring bus to which the processing core is also coupled.
8. The apparatus of claim 1, wherein,
the data memory access address includes the following two addresses:
a data memory read address at which the device last read data from the data memory; and
a data memory write address at which the device last wrote data to the data memory.
9. A method of operation of a device, the device comprising: a program memory for holding instructions of a program picked up and executed by the device; a data memory for holding data processed by the instructions; and a status register to hold a status of the device that is updated during operation of the device, wherein the status has fields comprising: a program memory address at which a most recent instruction was fetched from the program memory; a data memory access address at which the device has last accessed data in the data memory; and a repeat count for indicating a number of times an operation specified in a current program instruction has yet to be performed, the device further comprising a condition register having condition fields corresponding to the status fields held in the status register, the method comprising:
writing, via an instruction of the program, a condition that includes one or more of the condition fields to the condition register; and
in response to detecting that the status held in the status register satisfies the condition specified in the condition register, generating an interrupt request for a processing core.
10. The method of claim 9, wherein,
the condition fields each have a corresponding valid bit; and
the status held in the status register satisfies the condition specified in the condition register if, for each condition field in the condition register whose valid bit is set, the value of the condition field matches the value of the corresponding status field in the status register.
11. The method of claim 9, wherein,
the state also has fields including:
a loop count for indicating a number of times a loop of the program remains to be executed.
12. The method of claim 9, wherein,
the device further comprises a weight memory for holding weights associated with neural network computations;
the state also has fields including:
a weight memory access address at which the device has last accessed a weight in the weight memory.
13. The method of claim 12, wherein,
the weight memory access address includes the following two addresses:
a weight memory read address at which the device last read a weight from the weight memory; and
a weight memory write address at which the device last wrote a weight to the weight memory.
14. The method of claim 9, wherein,
the device comprises a neural network unit to accelerate computations associated with a neural network.
15. The method of claim 9, wherein the device further comprises:
a ring station to couple the device to a ring bus to which the processing core is also coupled.
16. The method of claim 9, wherein,
the data memory access address includes the following two addresses:
a data memory read address at which the device last read data from the data memory; and
a data memory write address at which the device last wrote data to the data memory.
17. A computer readable storage medium comprising a computer usable program for causing a computer to perform the method of any one of claims 9 to 16.
18. The computer-readable storage medium of claim 17, wherein the computer-readable storage medium is selected from the group consisting of disks, tapes, and other magnetic, optical, and electronic storage media.
CN201810620150.0A 2017-06-16 2018-06-15 Programmable device, method of operation thereof, and computer usable medium Active CN108804139B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762521250P 2017-06-16 2017-06-16
US62/521,250 2017-06-16

Publications (2)

Publication Number Publication Date
CN108804139A CN108804139A (en) 2018-11-13
CN108804139B true CN108804139B (en) 2020-10-20

Family

ID=64086445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810620150.0A Active CN108804139B (en) 2017-06-16 2018-06-15 Programmable device, method of operation thereof, and computer usable medium

Country Status (1)

Country Link
CN (1) CN108804139B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111314270B (en) * 2018-12-12 2022-09-30 上海领甲数据科技有限公司 Data encryption and decryption method based on validity period uniform distribution symmetric algorithm
CN109960533B (en) * 2019-03-29 2023-04-25 合芯科技(苏州)有限公司 Floating point operation method, device, equipment and storage medium
CN110134441B (en) * 2019-05-23 2020-11-10 苏州浪潮智能科技有限公司 RISC-V branch prediction method, apparatus, electronic device and storage medium
CN110516789B (en) * 2019-08-09 2022-02-18 苏州浪潮智能科技有限公司 Method and device for processing instruction set in convolutional network accelerator and related equipment
CN110717588B (en) * 2019-10-15 2022-05-03 阿波罗智能技术(北京)有限公司 Apparatus and method for convolution operation
CN110928816B (en) * 2019-10-28 2021-06-08 北京时代民芯科技有限公司 On-chip configurable interrupt control system circuit
CN110851259B (en) * 2019-11-12 2021-03-05 上海燧原智能科技有限公司 Interrupt control method, interrupt controller, computer device and storage medium
CN111638910B (en) * 2020-05-22 2022-07-05 中国人民解放军国防科技大学 Shift type and pointer type mixed register queue data storage method and system
CN111738432B (en) * 2020-08-10 2020-12-29 电子科技大学 Neural network processing circuit supporting self-adaptive parallel computation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6282628B1 (en) * 1999-02-24 2001-08-28 International Business Machines Corporation Method and system for a result code for a single-instruction multiple-data predicate compare operation
CN1490722A (en) * 2003-09-19 2004-04-21 清华大学 Graded task switching method based on PowerPC processor structure
CN1508690A (en) * 2002-12-19 2004-06-30 International Business Machines Corp. Method and system for tracing repeated instruction
CN1726469A (en) * 2002-12-05 2006-01-25 国际商业机器公司 Processor virtualization mechanism via an enhanced restoration of hard architected states
CN102214085A (en) * 2010-04-12 2011-10-12 瑞萨电子株式会社 Microcomputer and interrupt control method
CN106355246A (en) * 2015-10-08 2017-01-25 上海兆芯集成电路有限公司 Tri-configuration neural network element

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6282628B1 (en) * 1999-02-24 2001-08-28 International Business Machines Corporation Method and system for a result code for a single-instruction multiple-data predicate compare operation
CN1726469A (en) * 2002-12-05 2006-01-25 国际商业机器公司 Processor virtualization mechanism via an enhanced restoration of hard architected states
CN1508690A (en) * 2002-12-19 2004-06-30 International Business Machines Corp. Method and system for tracing repeated instruction
CN1490722A (en) * 2003-09-19 2004-04-21 清华大学 Graded task switching method based on PowerPC processor structure
CN102214085A (en) * 2010-04-12 2011-10-12 瑞萨电子株式会社 Microcomputer and interrupt control method
CN106355246A (en) * 2015-10-08 2017-01-25 上海兆芯集成电路有限公司 Tri-configuration neural network element

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Yanjun. Research on Design Methods for Application-Specific Instruction Set Processors. China Doctoral Dissertations Full-text Database, Information Science and Technology Series. 2009, (No. 6), p. I137-5. *
Research on Design Methods for Application-Specific Instruction Set Processors; Zhang Yanjun; China Doctoral Dissertations Full-text Database, Information Science and Technology Series; 20090615 (No. 6); Chapter 2, Appendix A *

Also Published As

Publication number Publication date
CN108804139A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN111680790B (en) Neural network unit
CN108133269B (en) Processor having memory array operable as cache memory or neural network unit memory
CN108133267B (en) Processor having memory array operable as last level cache slice or neural network unit memory
US11029949B2 (en) Neural network unit
CN108804139B (en) Programmable device, method of operation thereof, and computer usable medium
CN108133268B (en) Processor with memory array
US11216720B2 (en) Neural network unit that manages power consumption based on memory accesses per period
CN108805276B (en) Processor, method for operating processor and computer usable medium
CN106598545B (en) Processor and method for communicating shared resources and non-transitory computer usable medium
CN106484362B (en) Device for specifying two-dimensional fixed-point arithmetic operation by user
US10725934B2 (en) Processor with selective data storage (of accelerator) operable as either victim cache data storage or accelerator memory and having victim cache tags in lower level cache wherein evicted cache line is stored in said data storage when said data storage is in a first mode and said cache line is stored in system memory rather then said data store when said data storage is in a second mode
CN107844830B (en) Neural network unit with data size and weight size hybrid computing capability
US11226840B2 (en) Neural network unit that interrupts processing core upon condition
CN108805275B (en) Programmable device, method of operation thereof, and computer usable medium
US11221872B2 (en) Neural network unit that interrupts processing core upon condition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address
CP03 Change of name, title or address

Address after: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203

Patentee after: Shanghai Zhaoxin Semiconductor Co.,Ltd.

Address before: Room 301, 2537 Jinke Road, Zhangjiang hi tech park, Pudong New Area, Shanghai 201203

Patentee before: VIA ALLIANCE SEMICONDUCTOR Co.,Ltd.