CN108805276B - Processor, method for operating processor and computer usable medium

Info

Publication number
CN108805276B
Authority
CN
China
Legal status
Active
Application number
CN201810618974.4A
Other languages
Chinese (zh)
Other versions
CN108805276A (en)
Inventor
Douglas R. Reed
G. Glenn Henry
Terry Parks
Current Assignee
Shanghai Zhaoxin Semiconductor Co Ltd
Original Assignee
Shanghai Zhaoxin Integrated Circuit Co Ltd
Application filed by Shanghai Zhaoxin Integrated Circuit Co Ltd filed Critical Shanghai Zhaoxin Integrated Circuit Co Ltd
Priority to CN201910739737.8A (CN110443360B)
Publication of CN108805276A
Application granted
Publication of CN108805276B
Status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means


Abstract

The present invention relates to a processor, a method for operating a processor, and a computer usable medium. A first data store holds cache lines; an accelerator has a second data store that holds accelerator data or cache lines evicted from the first data store; a tag directory holds tags for cache lines stored in both the first and second data stores; and a mode indicator indicates whether the second data store is operating in a first mode, in which it holds cache lines evicted from the first data store, or in a second mode, in which it holds accelerator data. In response to a request to evict a cache line from the first data store, in the first mode the control logic writes the cache line to the second data store and updates a tag in the tag directory to indicate that the cache line is present in the second data store; in the second mode, the control logic instead writes the cache line to system memory.

Description

Processor, method for operating processor and computer usable medium
Technical Field
The present invention relates to a processor, a method for operating a processor, and a computer usable medium, and more particularly to a processor having a selective data store operable either as victim cache data storage or as accelerator memory, with victim cache tags held in a lower-level cache.
Background
Recently, Artificial Neural Networks (ANNs) have attracted renewed interest; such research is commonly referred to as deep learning, computer learning, and similar terms. The increase in the computing power of general-purpose processors has revived an interest that had subsided decades ago. Recent applications of ANNs include speech recognition and image recognition, among others. There is an increasing demand for improved performance and efficiency of the computations associated with ANNs.
Disclosure of Invention
A processor, comprising: a processing core; a first data store coupled to the processing core; an accelerator comprising a second data store for selectively holding cache lines evicted from the first data store or accelerator data processed by the accelerator; a tag directory coupled to the processing core, the tag directory to hold tags for cache lines stored in both the first data store and the second data store; a mode indicator to indicate whether the second data store is operating in a first mode, in which the second data store holds cache lines evicted from the first data store, or in a second mode, in which the second data store holds accelerator data processed by the accelerator; and control logic configured to, in response to a request to evict a cache line from the first data store: if the mode indicator indicates that the second data store is operating in the first mode, write the cache line to the second data store and update a tag in the tag directory to indicate that the cache line is present in the second data store; and if the mode indicator indicates that the second data store is operating in the second mode, write the cache line to system memory instead of to the second data store.
A method for operating a processor, the processor having: a processing core; a first data store for holding cache lines processed by the processing core; an accelerator having a second data store for selectively holding cache lines evicted from the first data store or accelerator data processed by the accelerator; and a tag directory for holding tags for cache lines stored in both the first data store and the second data store, the method comprising the steps of, in response to a request to evict a cache line from the first data store: if the second data store is operating in a first mode, in which the second data store holds cache lines evicted from the first data store, writing the cache line to the second data store and updating a tag in the tag directory to indicate that the cache line is present in the second data store; and if the second data store is operating in a second mode, in which the second data store holds accelerator data processed by the accelerator, writing the cache line to system memory instead of to the second data store.
A non-transitory computer usable medium includes a computer usable program that causes a computer to function as each component in a processor according to the present application.
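To make the mode-dependent eviction path concrete, the following C sketch models the control-logic decision described above. It is a minimal illustration under assumed data structures; the type and function names (e.g., evict_from_first_store, write_to_system_memory) are hypothetical and not part of the claimed design.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical modeling types -- not from the patent. */
typedef struct { uint64_t addr; uint8_t bytes[64]; } cache_line_t;

enum second_store_mode {
    MODE_VICTIM_CACHE,  /* first mode: holds lines evicted from the first data store */
    MODE_ACCELERATOR    /* second mode: holds accelerator data */
};

/* Control-logic response to a request to evict a line from the first data store. */
void evict_from_first_store(enum second_store_mode mode_indicator,
                            cache_line_t line,
                            cache_line_t second_store[], size_t entry,
                            bool tag_directory_present[], size_t tag_index,
                            void (*write_to_system_memory)(cache_line_t))
{
    if (mode_indicator == MODE_VICTIM_CACHE) {
        second_store[entry] = line;               /* line now resides in the second data store */
        tag_directory_present[tag_index] = true;  /* tag directory marks it present there */
    } else {
        write_to_system_memory(line);             /* second data store is owned by the accelerator */
    }
}
```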
Drawings
Fig. 1 is a block diagram illustrating a processor including a Neural Network Unit (NNU).
Fig. 2 is a block diagram illustrating the NPU of fig. 1.
FIG. 3 is a block diagram illustrating an embodiment of an arrangement of N multiplexing registers (mux-regs) of the N NPUs of the NNU of FIG. 1 to illustrate operation of the N multiplexing registers as an N-word rotator or circular shifter for a row of data words received from the data RAM of FIG. 1.
FIG. 4 is a table illustrating a program for storage in the program memory of the NNU of FIG. 1 and execution by the NNU.
FIG. 5 is a timing diagram illustrating the execution of the routine of FIG. 4 by an NNU.
FIG. 6A is a block diagram illustrating the NNU of FIG. 1 executing the routine of FIG. 4.
FIG. 6B is a flow diagram illustrating operation of the processor of FIG. 1 to perform an architectural procedure that uses NNUs to perform multiply-accumulate activation function computations (such as performed by the procedure of FIG. 4) typically associated with neurons of a hidden layer of an artificial neural network.
Fig. 7 is a block diagram illustrating the NPU of fig. 1 according to an alternative embodiment.
Fig. 8 is a block diagram illustrating the NPU of fig. 1 according to an alternative embodiment.
FIG. 9 is a table that illustrates a program for storing in the program memory of the NNU of FIG. 1 and for execution by the NNU.
FIG. 10 is a timing diagram illustrating the execution of the routine of FIG. 9 by an NNU.
FIG. 11 is a block diagram illustrating an embodiment of the NNU of FIG. 1. In the embodiment of fig. 11, the neuron element is divided into two parts, an activation function unit part and an ALU part (this part also includes a shift register part), and each activation function unit part is shared by a plurality of ALU parts.
FIG. 12 is a timing diagram illustrating the NNUs of FIG. 11 executing the routine of FIG. 4.
FIG. 13 is a timing diagram illustrating the NNUs of FIG. 11 executing the routine of FIG. 4.
FIG. 14 is a block diagram illustrating a Move To Neural Network (MTNN) architecture instruction and the operation of the architecture instruction with respect to portions of the NNUs of FIG. 1.
Fig. 15 is a block diagram illustrating a Move From Neural Network (MFNN) architecture instruction and the operation of the architecture instruction with respect to portions of the NNUs of fig. 1.
FIG. 16 is a block diagram illustrating an embodiment of the data RAM of FIG. 1.
FIG. 17 is a block diagram illustrating an embodiment of the weight RAM and buffer of FIG. 1.
FIG. 18 is a block diagram illustrating the dynamically configurable NPU of FIG. 1.
FIG. 19 is a block diagram illustrating an embodiment of an arrangement of 2N multiplexing registers of the N NPUs of the NNU of FIG. 1 to illustrate operation of the 2N multiplexing registers as a rotator for a row of data words received from the data RAM of FIG. 1, in accordance with the embodiment of FIG. 18.
FIG. 20 is a table illustrating a program for storage in and execution by the NNUs of FIG. 1 having NPUs according to the embodiment of FIG. 18.
FIG. 21 is a timing diagram illustrating an NNU executing the program of FIG. 20, where the NNU includes the NPU of FIG. 18 operating in a narrow configuration.
Fig. 22 is a block diagram illustrating the NNU of fig. 1, wherein the NNU includes the NPU of fig. 18 to execute the routine of fig. 20.
FIG. 23 is a block diagram illustrating the dynamically configurable NPU of FIG. 1 in accordance with an alternative embodiment.
FIG. 24 is a block diagram illustrating an example of a data structure used by the NNUs of FIG. 1 to perform convolution operations.
FIG. 25 is a flow diagram illustrating operation of the processor of FIG. 1 to execute an architectural program that uses the NNU to perform a convolution of a convolution kernel with the data array of FIG. 24.
FIG. 26A is a program listing of an NNU program that performs convolution on the data matrix using the convolution kernel of FIG. 24 and writes it back to the weight RAM.
FIG. 26B is a block diagram that illustrates certain fields of the control registers of the NNU of FIG. 1, according to one embodiment.
FIG. 27 is a block diagram illustrating an example of the weight RAM of FIG. 1 filled with input data on which the NNU of FIG. 1 performs a pooling operation.
FIG. 28 is a program listing of an NNU program that pools the input data matrix of FIG. 27 and writes it back to the weight RAM.
FIG. 29A is a block diagram illustrating an embodiment of the control register of FIG. 1.
FIG. 29B is a block diagram illustrating an embodiment of the control register of FIG. 1 according to an alternative embodiment.
FIG. 29C is a block diagram illustrating an embodiment in which the reciprocal value of FIG. 29A is stored as two parts, according to one embodiment.
FIG. 30 is a block diagram illustrating an embodiment of the AFU of FIG. 2 in greater detail.
FIG. 31 is an example of the operation of the AFU of FIG. 30.
FIG. 32 is a second example of the operation of the AFU of FIG. 30.
FIG. 33 is a third example of the operation of the AFU of FIG. 30.
FIG. 34 is a block diagram illustrating a more detailed portion of the processors of FIG. 1 and the NNUs of FIG. 1.
FIG. 35 is a block diagram that illustrates a processor that includes a variable rate NNU.
FIG. 36A is a timing diagram that illustrates an example of operation of a processor having NNUs that operate in a normal mode, i.e., at a master clock rate.
FIG. 36B is a timing diagram that illustrates an example of operation of a processor having NNUs that operate in a relaxed mode, i.e., at a rate less than the master clock rate.
FIG. 37 is a flow chart illustrating operation of the processor of FIG. 35.
FIG. 38 is a block diagram illustrating the sequencer of the NNU in more detail.
FIG. 39 is a block diagram illustrating certain fields of the control and status registers of an NNU.
Fig. 40 is a block diagram illustrating a processor.
Fig. 41 is a block diagram illustrating the NNUs of fig. 40, and the ring station (ring stop) of fig. 40 in greater detail.
FIG. 42 is a flow chart illustrating operation of the processor of FIG. 40 in the case where the memory array of FIG. 41 is transitioning from a cache memory mode when used as an LLC slice to an NNU mode when used as a weight/data RAM for NNUs.
FIG. 43 is a flow chart illustrating operation of the processor of FIG. 40 in the case where the memory array of FIG. 41 is transitioning from NNU mode when used as weight/data RAM for NNUs to cache memory mode when used as LLC slices.
FIG. 44 is a flow chart illustrating operation of the processor of FIG. 40 in the case where the memory array of FIG. 41 is transitioning from NNU mode when serving as weight/data RAM for NNUs to cache memory mode when serving as a victim cache.
FIG. 45 is a flow chart illustrating operation of the processor of FIG. 40 in the case where the memory array of FIG. 41 is being switched from a cache memory mode when used as a victim cache to an NNU mode when used as a weight/data RAM for NNUs.
FIG. 46 is a block diagram illustrating a processor in accordance with an alternative embodiment.
FIG. 47 is a block diagram illustrating a set in the tag directory of FIG. 46.
FIG. 48 is a flowchart illustrating operation of the processor of FIG. 46 in performing an eviction of a cache line from the L3 cache to the victim cache.
FIG. 49 is a flow diagram illustrating operation of the processor of FIG. 46 in servicing a core's load request for a cache line held in the victim cache.
FIG. 50 is a flow chart illustrating operation of the processor of FIG. 46 in the event that the selective data storage of FIG. 46 is transitioning from NNU mode when serving as weight/data RAM for NNUs to victim cache mode when serving as victim cache.
FIG. 51 is a flow chart illustrating operation of the processor of FIG. 46 in the case where the selective data storage of FIG. 46 is transitioning from a victim cache mode when serving as a victim cache to an NNU mode when serving as a weight/data RAM for the NNUs.
FIG. 52 is a block diagram illustrating an embodiment of portions of an NNU.
FIG. 53 is a block diagram illustrating the ring station of FIG. 46 in greater detail.
FIG. 54 is a block diagram illustrating the slave interface of FIG. 53 in greater detail.
Fig. 55 is a block diagram illustrating the host interface 0 of fig. 53 in more detail.
FIG. 56 is a block diagram illustrating portions of the ring station of FIG. 53 and a ring bus coupling embodiment of the NNUs.
FIG. 57 is a block diagram illustrating a ring bus coupling embodiment of an NNU.
Detailed Description
Processor with architected neural network elements
Referring now to FIG. 1, a block diagram is shown illustrating a processor 100 including a Neural Network Unit (NNU) 121. Processor 100 includes an instruction fetch unit 101, an instruction cache 102, an instruction translator 104, a rename unit 106, a reservation station 108, media registers 118, General Purpose Registers (GPRs) 116, an execution unit 112 other than an NNU 121, and a memory subsystem 114.
Processor 100 is an electronic device that functions as a Central Processing Unit (CPU) on an integrated circuit. Processor 100 receives digital data as input, processes the data according to instructions fetched from memory, and generates as output the results of operations specified by the instructions. The processor 100 may be used in a desktop computer, a mobile computer, or a tablet computer, and for purposes such as computing, word processing, multimedia display, and internet browsing. The processor 100 may also be provided in an embedded system to control a variety of devices including home appliances, mobile phones, smart phones, vehicles, industrial control devices, and the like. A CPU is an electronic circuit (i.e., "hardware") that executes computer program (also referred to as a "computer application" or "application") instructions by performing operations on data, including arithmetic operations, logical operations, and input/output operations. An Integrated Circuit (IC) is a set of electronic circuits fabricated on a small piece of semiconductor material, typically silicon. An IC is also known as a chip, microchip, or die.
Instruction fetch unit 101 controls the fetching of architectural instructions 103 from system memory (not shown) into instruction cache 102. The instruction fetch unit 101 provides to the instruction cache 102 a fetch address specifying the memory address at which the processor 100 fetches a cache line of architectural instruction bytes into the instruction cache 102. The fetch address is based on the current value of an instruction pointer (not shown), or program counter, of processor 100. Typically, the program counter is incremented sequentially by the instruction size unless a control instruction such as a branch, call, or return instruction is encountered in the instruction stream, or an exception condition such as an interrupt, trap, exception, or error occurs, in which case the program counter is updated with a non-sequential address such as a branch target address, return address, or exception vector. Generally, the program counter is updated in response to execution of instructions by execution unit 112/121. The program counter may also be updated in response to detecting an exception condition, such as instruction translator 104 encountering an instruction 103 not defined by the instruction set architecture of processor 100.
Instruction cache 102 caches architectural instructions 103 fetched from system memory coupled to processor 100. The architectural instructions 103 include Move To Neural Network (MTNN) instructions and Move From Neural Network (MFNN) instructions, which are described in more detail below. In one embodiment, the architectural instructions 103 are instructions of the x86 Instruction Set Architecture (ISA) with the addition of the MTNN and MFNN instructions. Hereinafter, an x86 ISA processor refers to a processor that, when executing the same machine language instructions, generates the same results at the instruction set architecture level as an x86 processor. However, other embodiments contemplate other instruction set architectures, such as the Advanced RISC Machines (ARM) architecture, or others.
The instruction cache 102 provides the architectural instructions 103 to the instruction translator 104, and the instruction translator 104 translates the architectural instructions 103 into micro instructions 105.
The microinstructions 105 are provided to the rename unit 106 and ultimately executed by the execution unit 112/121. The microinstructions 105 implement architectural instructions. Preferably, the instruction translator 104 includes a first portion that translates frequently executed and/or relatively less complex architectural instructions 103 into microinstructions 105. Instruction translator 104 also includes a second portion, where the second portion includes a microcode unit (not shown). The microcode unit includes a microcode memory that holds microcode instructions that implement complex and/or infrequently used instructions of the architectural instruction set. The microcode unit also includes a micro sequencer (micro-sequencer) that provides a non-architected micro program counter (micro-PC) to the microcode memory. Preferably, microcode instructions are translated into microinstructions 105 via a micro-translator (not shown). The selector selects a microinstruction 105 from either the first portion or the second portion to provide to the rename unit 106 depending on whether the microcode unit currently has control.
Rename unit 106 renames architectural registers specified in architectural instructions 103 to physical registers of processor 100. Preferably, processor 100 includes a reorder buffer (not shown). The rename unit 106 allocates entries in a reorder buffer in program order for each microinstruction 105. This enables the processor 100 to retire (retire) the microinstruction 105 and its corresponding architectural instruction 103 in program order. In one embodiment, media registers 118 have a 256-bit width and GPRs 116 have a 64-bit width. In one embodiment, the media registers 118 are x86 media registers such as advanced vector extension (AVX) registers.
In one embodiment, each entry of the reorder buffer includes storage space for the result of the microinstruction 105; further, processor 100 includes an architectural register file that includes a physical register for each architectural register (e.g., media registers 118, GPRs 116, and other architectural registers). (Preferably, there are separate register files for the media registers 118 and the GPRs 116, e.g., due to their different sizes.) For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with the reorder buffer index of the newest of the older microinstructions 105 that writes to that architectural register. When the execution unit 112/121 completes execution of the microinstruction 105, the execution unit 112/121 writes the result to the reorder buffer entry of the microinstruction 105. When a microinstruction 105 retires, a retirement unit (not shown) writes the result from the microinstruction's reorder buffer entry to the register of the physical register file associated with the architectural destination register specified by the retired microinstruction 105.
In another embodiment, processor 100 includes a physical register file that includes a greater number of physical registers than there are architectural registers, but does not include an architectural register file, and the reorder buffer entries do not include result storage space. (Preferably, there are separate physical register files for the media registers 118 and the GPRs 116, e.g., due to their different sizes.) Processor 100 also includes a pointer table with an associated pointer for each architectural register. For the operand of a microinstruction 105 that specifies an architectural destination register, the rename unit populates the destination operand field of the microinstruction 105 with a pointer to a free register in the physical register file. Rename unit 106 stalls the pipeline if no register is free within the physical register file. For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with a pointer to the register in the physical register file assigned to the newest of the older microinstructions 105 that writes to that architectural register. When the execution unit 112/121 completes execution of the microinstruction 105, the execution unit 112/121 writes the result to the register in the physical register file pointed to by the destination operand field of the microinstruction 105. When a microinstruction 105 retires, the retirement unit copies the destination operand field value of the microinstruction 105 to the pointer in the pointer table associated with the architectural destination register specified by the retired microinstruction 105.
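As an illustration of the pointer-table renaming just described, the C sketch below renames one microinstruction with one source and one destination register. It is deliberately simplified (a single map table stands in for both the speculative and retirement state), and the sizes and names (rename_uop, pointer_table) are assumptions made for illustration.

```c
#include <stdint.h>

#define NUM_ARCH_REGS 16
#define NUM_PHYS_REGS 128

static uint8_t pointer_table[NUM_ARCH_REGS];  /* architectural register -> physical register */
static uint8_t free_list[NUM_PHYS_REGS];      /* physical registers not currently mapped     */
static int     free_count;

/* Rename one microinstruction: the source reads the newest mapping of its
 * architectural register; the destination is given a free physical register.
 * Returns -1 (stall the pipeline) if no physical register is free. */
int rename_uop(int src_arch, int dst_arch, int *src_phys, int *dst_phys)
{
    if (free_count == 0)
        return -1;
    *src_phys = pointer_table[src_arch];          /* register of the newest older writer */
    *dst_phys = free_list[--free_count];          /* allocate a free physical register   */
    pointer_table[dst_arch] = (uint8_t)*dst_phys; /* simplification: updated here rather than at retirement */
    return 0;
}
```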
Reservation station 108 holds microinstructions 105 until they are ready to be issued to execution unit 112/121 for execution. The microinstructions 105 are ready to issue when all of the source operands of the microinstructions 105 are available and the execution unit 112/121 is available to execute the microinstructions 105. Execution unit 112/121 receives register source operands from a reorder buffer or an architectural register file as described in the first embodiment above, or from a physical register file as described in the second embodiment above. Furthermore, execution units 112/121 may receive register source operands directly from execution units 112/121 via a result forwarding bus (not shown). Additionally, the execution unit 112/121 may receive immediate operands specified by the microinstructions 105 from the reservation station 108. As described in more detail below, the MTNN and MFNN architecture instruction 103 includes immediate operands for specifying a function to be performed by the NNU 121, where the function is provided in one of the one or more microinstructions 105 into which the MTNN and MFNN architecture instruction 103 is translated.
Execution units 112 include one or more load/store units (not shown) that load data from memory subsystem 114 and store data to memory subsystem 114. Memory subsystem 114 preferably includes a memory management unit (not shown), which may include, for example, a translation lookaside buffer and a tablewalk unit, a level-1 data cache (and instruction cache 102), a level-2 unified cache, and a bus interface unit for interfacing processor 100 with system memory. In one embodiment, processor 100 of FIG. 1 is representative of a processing core that is one of a plurality of processing cores sharing a last level cache memory in a multi-core processor. Execution units 112 may also include integer units, media units, floating point units, and branch units.
NNU 121 includes a weight Random Access Memory (RAM) 124, a data RAM 122, N Neural Processing Units (NPUs) 126, a program memory 129, a sequencer 128, and a control and status register 127. The NPUs 126 conceptually function as neurons in a neural network. The weight RAM 124, data RAM 122, and program memory 129 may be written and read via the MTNN and MFNN architectural instructions 103, respectively. The weight RAM 124 is arranged as W rows of N weight words each, and the data RAM 122 is arranged as D rows of N data words each. Each data word and each weight word has a plurality of bits, preferably 8, 9, 12 or 16 bits. Each data word serves as the output value (sometimes also referred to as an activation value) of a neuron of a previous layer in the network, and each weight word serves as a weight associated with a connection into a neuron of the current layer of the network. Although in many applications of NNU 121 the words or operands held in weight RAM 124 are actually weights associated with connections into neurons, it should be understood that in other applications of NNU 121 the words held in weight RAM 124 are not weights, but are still referred to as "weight words" because they are stored in weight RAM 124. For example, in certain applications of NNU 121, such as the convolution example of FIGS. 24-26A or the pooling example of FIGS. 27-28, weight RAM 124 may hold non-weights, such as elements of a data matrix (e.g., image pixel data), and the like. Likewise, while in many applications of NNU 121 the words or operands held in data RAM 122 are actually output values or activation values of neurons, it will be appreciated that in other applications of NNU 121 the words held in data RAM 122 are not, but are still referred to as "data words" because they are stored in data RAM 122. For example, in certain applications of NNU 121, such as the convolution examples of FIGS. 24-26A, data RAM 122 may hold non-neuronal outputs, such as elements of a convolution kernel, and the like.
In one embodiment, the NPUs 126 and the sequencer 128 comprise combinational logic, sequential logic, state machines, or a combination thereof. An architectural instruction (e.g., MFNN instruction 1500) loads the contents of the status register 127 into one of the GPRs 116 to determine the status of NNU 121, e.g., to determine that NNU 121 has completed a command or has completed running a program from the program memory 129, or that NNU 121 is free to receive a new command or a new NNU program.
Advantageously, the number of NPUs 126 may be increased as desired, and the size of the weight RAM 124 and data RAM 122 may be expanded in width and depth accordingly. Preferably, the weight RAM 124 is larger because in a typical neural network layer there are many connections, and therefore many weights, associated with each neuron. Various embodiments are described herein relating to the size of the data words and weight words, the size of the weight RAM 124 and data RAM 122, and the number of NPUs 126. In one embodiment, an NNU 121 with a 64KB (8192 bits x 64 rows) data RAM 122, a 2MB (8192 bits x 2048 rows) weight RAM 124, and 512 NPUs 126 is implemented in Taiwan Semiconductor Manufacturing Company (TSMC) 16-nanometer technology and occupies an area of about 3.3 square millimeters.
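As a quick check of the capacities quoted above (assuming 512 NPUs and 16-bit words):

$$512 \times 16\ \text{bits} = 8192\ \text{bits} = 1\ \text{KB per row};\qquad 64\ \text{rows} \times 1\ \text{KB} = 64\ \text{KB (data RAM)};\qquad 2048\ \text{rows} \times 1\ \text{KB} = 2\ \text{MB (weight RAM)}.$$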
Sequencer 128 fetches instructions from the program memory 129 and executes them, and also generates address and control signals to provide to the data RAM 122, the weight RAM 124, and the NPUs 126. The sequencer 128 generates a memory address 123 and a read command to provide to the data RAM 122 to select one of the D rows of N data words for provision to the N NPUs 126. Sequencer 128 also generates a memory address 125 and a read command to provide to the weight RAM 124 to select one of the W rows of N weight words for provision to the N NPUs 126. The sequence of addresses 123 and 125 that the sequencer 128 generates and provides to the NPUs 126 determines the "connections" between the neurons. The sequencer 128 also generates a memory address 123 and a write command to provide to the data RAM 122 to select one of the D rows of N data words to be written from the N NPUs 126. Sequencer 128 also generates a memory address 125 and a write command to provide to the weight RAM 124 to select one of the W rows of N weight words to be written from the N NPUs 126. Sequencer 128 also generates a memory address 131 for the program memory 129 to select an NNU instruction that is provided to the sequencer 128, as described below. The memory address 131 corresponds to a program counter (not shown) that the sequencer 128 increments through sequential locations of the program memory 129, unless the sequencer 128 encounters a control instruction, such as a loop instruction (see, e.g., FIG. 26A), in which case the sequencer 128 updates the program counter to the target address of the control instruction. The sequencer 128 also generates control signals for the NPUs 126 to instruct them to perform various operations or functions, such as initialization, arithmetic/logical operations, rotate and shift operations, activation functions, and write-back operations, examples of which are described in more detail below (see, e.g., micro-operation 3418 of FIG. 34).
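A minimal C sketch of the program-counter behavior just described follows: the address advances sequentially unless a control (e.g., loop) instruction with iterations remaining redirects it to its target. The instruction encoding and names (OP_LOOP, next_program_address) are assumptions made for illustration, not the actual NNU instruction format.

```c
/* Hypothetical NNU instruction encoding for illustration only. */
typedef struct { int opcode; int target; } nnu_instr_t;
enum { OP_LOOP = 1 /* other opcodes omitted */ };

/* Compute the next program memory address 131 from the current one. */
int next_program_address(int pc, nnu_instr_t instr, int loop_iterations_left)
{
    if (instr.opcode == OP_LOOP && loop_iterations_left > 0)
        return instr.target;   /* control instruction: jump to its target address */
    return pc + 1;             /* otherwise advance to the next sequential location */
}
```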
The N NPUs 126 generate N result words 133, where the result words 133 may be written back to the rows of the weight RAM 124 or to the data RAM 122. Preferably, the weight RAM 124 and the data RAM122 are coupled directly to the N NPUs 126. More specifically, the weight RAM 124 and the data RAM122 are dedicated to the NPU 126 and are not shared by other execution units 112 of the processor 100, and these NPUs 126 are able to consume one row from one or both of the weight RAM 124 and the data RAM122 on each clock cycle in a continuous manner (preferably, in a pipelined manner). In one embodiment, the data RAM122 and weight RAM 124 are each capable of providing 8192 bits to the NPU 126 on each clock cycle. As described in more detail below, these 8192 bits may be consumed as 512 16-bit words or 1024 8-bit words.
Advantageously, the size of the data sets that can be processed by the NNU 121 is not limited by the size of the weight RAM 124 and the data RAM 122, but only by the size of the system memory, since data and weights can be moved between system memory and the weight RAM 124 and data RAM 122 using the MTNN and MFNN instructions (e.g., via the media registers 118). In one embodiment, the data RAM 122 is dual-ported, enabling data words to be written to the data RAM 122 while data words are concurrently being read from or written to the data RAM 122. In addition, the large memory hierarchy of memory subsystem 114, including the cache memories, provides very large data bandwidth for transfers between system memory and the NNU 121. Further, memory subsystem 114 preferably includes a hardware data prefetcher that tracks memory access patterns (such as loads of neural data and weights from system memory) and performs data prefetching into the cache hierarchy to facilitate high-bandwidth and low-latency transfers to the weight RAM 124 and the data RAM 122.
Although embodiments are described in which one of the operands provided to each NPU 126 is provided from a weight store and is represented as weights (the term is used in general in neural networks), it should be understood that the operands may be other types of data associated with calculations that can be accelerated by the apparatus.
Referring now to FIG. 2, a block diagram is shown illustrating the NPU 126 of FIG. 1. The NPU 126 operates to perform a number of functions or operations. In particular, the NPU 126 is advantageously configured to operate as a neuron or node in an artificial neural network to perform a classical multiply-accumulate function or operation. That is, in general, the NPU 126 (neuron) is configured to: (1) receive input values from neurons having connections to the NPU 126 (typically, but not necessarily, from an immediately preceding layer in an artificial neural network); (2) multiplying each input value by a respective weight value associated with the connection to produce a product; (3) adding all products to produce a sum; and (4) performing an activation function on the sum to produce an output of the neuron. However, rather than performing all of the multiplications associated with all of the connected inputs and then adding all of the products together as is conventional, advantageously each neuron is configured to perform a weighted multiplication operation associated with one of the connected inputs in a given clock cycle and then add (accumulate) the product with an accumulated value of the products associated with the connected inputs processed in the previous clock cycle up to that point. Assuming there are M connections to a neuron, after all M products are accumulated (taking about M clock cycles), the neuron performs an activation function on the accumulated values to produce an output or result. This has the following advantages: fewer multipliers and smaller, simpler, and faster adder circuits (e.g., 2-input adders) are needed within the neuron, as compared to adders that add all or even a subset of the products associated with all of the connected inputs. This has the following advantages: it is advantageous to implement a very large number (N) of neurons (NPUs 126) within NNUs 121, such that after about M clock cycles, NNUs 121 have produced the output of all of these large number (N) of neurons. Finally, NNUs 121, which are composed of such neurons, have the advantage of performing efficiently as an artificial neural network layer for a large number of different connection inputs. That is, as M increases or decreases for different layers, the number of clock cycles required to generate neuron outputs increases or decreases accordingly, and resources (e.g., multipliers and accumulators) are fully utilized; while in more conventional designs, some multipliers and partial adders are not utilized for smaller values of M. Thus, the embodiments described herein have the benefit of flexibility and efficiency with respect to the number of connection inputs to the neurons of NNUs 121, and provide extremely high performance.
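In conventional notation, assuming M connection inputs per neuron, the computation each NPU 126 carries out one connection per clock can be written as:

$$\mathrm{acc}_0 = 0,\qquad \mathrm{acc}_t = \mathrm{acc}_{t-1} + w_t\,x_t\ \ (t = 1,\dots,M),\qquad y = f(\mathrm{acc}_M) = f\!\left(\sum_{t=1}^{M} w_t\,x_t\right),$$

where the x_t are the connected input values, the w_t are the associated weights, and f is the activation function; one product is accumulated per clock, so the accumulation completes in approximately M clocks.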
NPU126 includes registers 205, a 2-input multiplexing register (mux-reg)208, an Arithmetic Logic Unit (ALU)204, an accumulator 202, and an Activation Function Unit (AFU) 212. Register 205 receives weight word 206 from weight RAM 124 and provides its output 203 in a subsequent clock cycle. The multiplexer register 208 selects one of the inputs 207 or 211 for storage in its register and then provided on the output 209 in a subsequent clock cycle. An input 207 receives a data word from the data RAM 122. Another input 211 receives the output 209 of the neighboring NPU 126. The NPU126 shown in fig. 2 is labeled NPU J among the N NPUs 126 of fig. 1. That is, NPU J is a representative example of N NPUs 126. Preferably, an input 211 of the multiplexing register 208 of the NPU J receives an output 209 of the multiplexing register 208 of the instance J-1 of the NPU126, and an output 209 of the multiplexing register 208 of the NPU J is provided to an input 211 of the multiplexing register 208 of the instance J +1 of the NPU 126. As such, the multiplexing registers 208 of the N NPUs 126 operate as an N-word rotator or circular shifter as a whole, as described in more detail below with respect to fig. 3. Control input 213 controls which of these two inputs is selected by the multiplexing register 208 for storage in the register and subsequent provision on output 209.
The ALU 204 has three inputs. One input receives the weight word 203 from the register 205. Another input receives the output 209 of the multiplexing register 208. The third input receives the output 217 of the accumulator 202. The ALU 204 performs arithmetic and/or logical operations on its inputs to produce a result that is provided on its output. Preferably, the arithmetic and/or logical operations performed by ALU 204 are specified by instructions stored in the program memory 129. For example, the multiply-accumulate instruction of FIG. 4 specifies a multiply-accumulate operation, i.e., the result 215 is the sum of the value 217 of the accumulator 202 and the product of the weight word 203 and the data word of the output 209 of the multiplexing register 208. Other operations that may be specified include, but are not limited to: the result 215 is the passed-through value of the multiplexing register output 209; the result 215 is the passed-through value of the weight word 203; the result 215 is zero; the result 215 is the sum of the value 217 of the accumulator 202 and the weight word 203; the result 215 is the sum of the value 217 of the accumulator 202 and the multiplexing register output 209; the result 215 is the maximum of the value 217 of the accumulator 202 and the weight word 203; and the result 215 is the maximum of the value 217 of the accumulator 202 and the multiplexing register output 209.
The ALU 204 provides an output 215 to the accumulator 202 for storage in the accumulator 202. The ALU 204 includes a multiplier 242 for multiplying the weight word 203 with the data word of the output 209 of the multiplexing register 208 to produce a product 246. In one embodiment, multiplier 242 multiplies two 16-bit operands to generate a 32-bit result. The ALU 204 also includes an adder 244 for adding a product 246 to the output 217 of the accumulator 202 to produce a sum, which is the result 215 accumulated in the accumulator 202 for storage in the accumulator 202. In one embodiment, the adder 244 adds the 32-bit result of the multiplier 242 to the 41-bit value 217 of the accumulator 202 to produce a 41-bit result. As such, the NPU 126 completes the addition of the product for the neuron required by the neural network by using the rotator aspect of the multiplexing register 208 over the course of multiple clock cycles. The ALU 204 may also include other circuit elements to perform other arithmetic/logic operations as previously described. In one embodiment, the second adder subtracts the weight word 203 from the data word at output 209 of the multiplexing register 208 to produce a difference value, which is then added by adder 244 to output 217 of the accumulator 202 to produce sum 215, which is the result of the accumulation in the accumulator 202. In this manner, the NPU 126 may complete the addition of the difference values over the course of multiple clock cycles. Preferably, the weight word 203 is the same size (in bits) as the data word 209, but may have a different binary decimal point location, as described in more detail below. Preferably, as described in more detail below, the multipliers 242 and adders 244 are integer multipliers and adders to advantageously implement the ALU 204 that is less complex, smaller, faster, and consumes less power than floating-point multipliers and adders. However, it should be understood that in other embodiments, ALU 204 performs floating point operations.
Although FIG. 2 only shows the multiplier 242 and adder 244 within the ALU 204, the ALU 204 preferably includes other elements to perform the other operations described above. For example, the ALU 204 preferably includes a comparator (not shown) for comparing the accumulator 202 with the data/weight words and a multiplexer (not shown) for selecting the larger of the two values indicated by the comparator (the maximum value) to be stored in the accumulator 202. As another example, ALU 204 preferably includes selection logic (not shown) for causing the data/weight word to skip multiplier 242 to enable adder 244 to add the data/weight word to the value 217 of accumulator 202 to produce the sum for storage in accumulator 202. These additional operations are described in more detail below (e.g., with respect to fig. 18-29A), and may be used to perform, for example, convolution operations and pooling operations.
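The per-clock behavior of the multiply-accumulate path described above can be modeled with the short C sketch below. It is illustrative only: register timing is collapsed into a single step, the accumulator is modeled as a wider integer, and the names follow the reference numerals of FIG. 2 rather than any actual implementation.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int16_t mux_reg;      /* multiplexing register 208 (its output 209)   */
    int16_t weight_reg;   /* register 205 holding weight word 203         */
    int64_t accumulator;  /* accumulator 202 (modeled wider than 41 bits) */
} npu_t;

/* One clock of the multiply-accumulate datapath: select the data RAM word 207
 * or the neighbor's rotated word 211, multiply it by the held weight word,
 * and add the product to the accumulator. */
void npu_clock(npu_t *npu, int16_t data_word_207, int16_t rotate_in_211,
               bool select_rotate_213, int16_t next_weight_word_206)
{
    int16_t data = select_rotate_213 ? rotate_in_211 : data_word_207;   /* mux-reg 208 */
    int32_t product = (int32_t)data * (int32_t)npu->weight_reg;         /* multiplier 242 */
    npu->accumulator += product;                                        /* adder 244 into 202 */
    npu->mux_reg = data;                    /* made available to NPU J+1 on the next clock */
    npu->weight_reg = next_weight_word_206; /* register 205 latches the next weight word */
}
```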
AFU 212 receives the output 217 of the accumulator 202. AFU 212 performs an activation function on the output 217 of the accumulator 202 to produce the result 133 of FIG. 1. In general, the activation function within a neuron of an intermediate layer of an artificial neural network serves to normalize the accumulated sum of products, preferably in a non-linear fashion. To "normalize" the sum, the activation function of the current neuron produces a resultant value within the range of values that other neurons connected to the current neuron expect to receive as inputs. (The normalized result is sometimes referred to as an "activation value"; as described herein, it is the output of the current node, which the receiving node multiplies by a weight associated with the connection between the output node and the receiving node to produce a product that is accumulated with other products associated with other input connections to the receiving node.) For example, the receiving/connected neurons may expect to receive as input a value between 0 and 1, in which case the outputting neuron may need to non-linearly compress and/or adjust (e.g., shift upward to convert negative values to positive values) an accumulated sum that falls outside the 0-1 range to a value within the expected range. Thus, AFU 212 performs an operation on the value 217 of the accumulator 202 to bring the result 133 within a known range. The results 133 of all N NPUs 126 may be written back to the data RAM 122 or the weight RAM 124 in parallel. Preferably, AFU 212 is configured to perform a plurality of activation functions, and one of these activation functions, selected for example from an input of the control register 127, is performed on the output 217 of the accumulator 202. The activation functions may include, but are not limited to, a step function, a rectification (correction) function, a sigmoid function, a hyperbolic tangent function, and a soft-plus function (also referred to as a smooth rectification function). The soft-plus function is the analytic function f(x) = ln(1 + e^x), i.e., the natural logarithm of the sum of 1 and e^x, where "e" is Euler's number and x is the input 217 to the function. Preferably, the activation functions may also include a pass-through function that passes through the value 217 of the accumulator 202, or a portion thereof, as described in more detail below. In one embodiment, the circuitry of AFU 212 performs the activation function in a single clock cycle. In one embodiment, AFU 212 includes tables that receive the accumulated value and, for certain activation functions (e.g., the sigmoid, hyperbolic tangent, and soft-plus functions), output a value that approximates the value the true activation function would provide.
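For reference, floating-point versions of several of the activation functions named above are shown below. These are illustrative only; AFU 212 preferably evaluates the functions with fixed-point circuitry and lookup tables rather than with library calls such as these.

```c
#include <math.h>

double softplus_af(double x) { return log1p(exp(x)); }        /* f(x) = ln(1 + e^x) */
double sigmoid_af(double x)  { return 1.0 / (1.0 + exp(-x)); }
double tanh_af(double x)     { return tanh(x); }
double rectify_af(double x)  { return x > 0.0 ? x : 0.0; }     /* rectification ("correction") function */
```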
Preferably, the width (in bits) of the accumulator 202 is greater than the width of the output 133 of AFU 212. For example, in one embodiment, the accumulator is 41 bits wide to avoid loss of precision when accumulating up to 512 32-bit products (as described in more detail below, e.g., with respect to FIG. 30), and the result 133 is 16 bits wide. In one embodiment, an example of which is described in more detail below with respect to FIG. 8, during subsequent clock cycles, different portions of the "raw" accumulator 202 output 217 value are passed through AFU 212 and written back to the data RAM 122 or the weight RAM 124. This enables the raw accumulator 202 value to be loaded back into the media registers 118 via the MFNN instruction, so that instructions executing on the other execution units 112 of the processor 100 may perform complex activation functions that AFU 212 cannot perform, such as the well-known softmax activation function (also referred to as the normalized exponential function). In one embodiment, the instruction set architecture of the processor 100 includes an instruction to perform the exponential function, commonly referred to as e^x or exp(x), which may be used to speed up the execution of the softmax activation function by the other execution units 112 of the processor 100.
In one embodiment, the NPU 126 is a pipelined design. For example, the NPU 126 may include registers of the ALU 204 (such as registers located between multipliers and adders and/or other circuitry of the ALU 204) and registers that hold the output of the AFU 212, among other things. Other embodiments of the NPU 126 are described below.
Referring now to FIG. 3, a block diagram is shown illustrating an embodiment of an arrangement of N multiplexing registers 208 of N NPUs 126 of NNU 121 of FIG. 1 to illustrate the operation of the N multiplexing registers as an N-word rotator or circular shifter for a row of data words 207 received from data RAM 122 of FIG. 1. In the embodiment of FIG. 3, N is 512, such that NNU 121 has 512 multiplexing registers 208 labeled 0 through 511 corresponding to 512 NPUs 126 as shown. Each multiplexing register 208 receives a respective data word 207 on one of the D rows of the data RAM 122. That is, multiplexing register 0 receives data word 0 in a row of data RAM 122, multiplexing register 1 receives data word 1 in a row of data RAM 122, multiplexing register 2 receives data word 2 in a row of data RAM 122, and so on, multiplexing register 511 receives data word 511 in a row of data RAM 122. Furthermore, multiplexing register 1 receives the output 209 of multiplexing register 0 on another input 211, multiplexing register 2 receives the output 209 of multiplexing register 1 on another input 211, multiplexing register 3 receives the output 209 of multiplexing register 2 on another input 211, and so on, multiplexing register 511 receives the output 209 of multiplexing register 510 on another input 211, and multiplexing register 0 receives the output 209 of multiplexing register 511 on another input 211. Each multiplexing register 208 receives a control input 213 for controlling whether the data word 207 or the toggle input 211 is selected. As described in more detail below, in one mode of operation, during a first clock cycle, the control input 213 controls each of the multiplexer registers 208 to select a data word 207 for storage in the register and subsequent provision to the ALU 204; and during a subsequent clock cycle (e.g., M-1 clock cycles as described above), the control input 213 controls each of the multiplexer registers 208 to select the toggle input 211 for storage in the register and subsequent provision to the ALU 204.
Although in the embodiment depicted in fig. 3 (and fig. 7 and 19 below), the NPU 126 is configured to rotate the value of the multiplexing register 208/705 to the right, i.e., from NPU J to NPU J +1, embodiments are contemplated (such as for the embodiments of fig. 24-26, etc.) in which the NPU 126 is configured to rotate the value of the multiplexing register 208/705 to the left, i.e., from NPU J to NPU J-1. Further, embodiments are contemplated in which the NPU 126 is configured to selectively rotate the value of the multiplexing register 208/705 to the left or right, as specified by the NNU instruction, for example.
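The rotator wiring of FIG. 3 (rotating in the NPU J to NPU J+1 direction) can be modeled with the following C sketch; the names are illustrative, and select_rotate plays the role of control input 213. Each mux-reg J takes the previous output of mux-reg J-1, and mux-reg 0 wraps around to take the output of mux-reg 511.

```c
#define N 512  /* number of NPUs / multiplexing registers in this embodiment */

/* One clock of the N-word rotator: either load a fresh data RAM row (first clock)
 * or rotate every word one NPU position (subsequent clocks). */
void mux_reg_step(short mux_reg[N], const short data_row[N], int select_rotate)
{
    short next[N];
    for (int j = 0; j < N; j++)
        next[j] = select_rotate ? mux_reg[(j + N - 1) % N]  /* input 211: output of mux-reg J-1 */
                                : data_row[j];              /* input 207: word J of the data RAM row */
    for (int j = 0; j < N; j++)
        mux_reg[j] = next[j];
}
```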
Referring now to FIG. 4, a table is shown that illustrates a program for storage in the program memory 129 of NNU 121 of FIG. 1 and execution by NNU 121. As described above, the exemplary program performs computations associated with a layer of an artificial neural network. In the table of FIG. 4, five rows and three columns are shown. Each row corresponds to an address in the program memory 129, which is indicated in the first column. The second column specifies the instruction, and the third column indicates the number of clock cycles associated with the instruction. Preferably, the clock cycle value indicates the effective number of clocks per instruction in a pipelined embodiment, rather than the latency of the instruction. As shown, because of the pipelined nature of NNU 121, each instruction has an associated one clock cycle, the exception being the instruction at address 2, which requires 511 clocks because it effectively repeats itself 511 times, as described in more detail below.
For each instruction of the program, all NPUs 126 process the instruction in parallel. That is, all N NPUs 126 execute the instruction in the first row in the same clock cycle(s), all N NPUs 126 execute the instruction in the second row in the same clock cycle(s), and so on. However, other embodiments are described below in which some instructions are executed in a partially parallel and partially sequential manner; for example, in embodiments where the NPUs 126 share an activation function unit, such as the embodiment of FIG. 11, the activation function and output instructions at addresses 3 and 4, respectively, are executed in this manner. The example of FIG. 4 assumes that one layer has 512 neurons (NPUs 126), with each neuron having 512 connection inputs from the 512 neurons of the previous layer, for a total of 256K connections. Each neuron receives a 16-bit data value from a respective connection input and multiplies the 16-bit data value by an appropriate 16-bit weight value.
The first row at address 0 (although other addresses may be specified) specifies the initialize NPU instruction. The initialization instruction clears the value of the accumulator 202. In one embodiment, the initialization instruction may also specify loading accumulator 202 with the corresponding word in a row of data RAM122 or weight RAM 124 that is addressed by the instruction. The initialization instruction also loads configuration values into control registers 127, as described in more detail below with respect to fig. 29A and 29B. For example, the width of the data word 207 and weight word 209 may be loaded, where the width may be used by the ALU 204 to determine the size of the operation performed by the circuit and may affect the result 215 stored in the accumulator 202. In one embodiment, the NPU126 includes circuitry for saturating the output 215 of the ALU 204 before the output 215 is stored in the accumulator 202, and an initialization instruction loads a configuration value into the circuitry to affect saturation. In one embodiment, the accumulator 202 may also be cleared to a zero value by so specifying in an ALU function instruction (e.g., a multiply accumulate instruction at address 1) or an output instruction (such as a write AFU output instruction at address 4).
The second row at address 1 specifies a multiply-accumulate instruction that instructs 512 NPUs 126 to load a corresponding data word from one row of data RAM 122 and a corresponding weight word from one row of weight RAM 124, and to perform a first multiply-accumulate operation on data word input 207 and weight word input 206, the first multiply-accumulate operation being accumulated with the initialized accumulator 202 at a zero value. More specifically, the instruction instructs the sequencer 128 to generate a value on the control input 213 to select the data word input 207. In the example of fig. 4, the specified row of data RAM 122 is row 17 and the specified row of weight RAM 124 is row 0, thereby instructing sequencer 128 to output a value of 17 for data RAM address 123 and a value of 0 for weight RAM address 125. Thus, 512 data words from row 17 of the data RAM 122 are provided to respective data inputs 207 of the 512 NPUs 126, and 512 weight words from row 0 of the weight RAM 124 are provided to respective weight inputs 206 of the 512 NPUs 126.
The third row at address 2 specifies a multiply-accumulate rotate instruction of count 511 that instructs each of the 512 NPUs 126 to perform 511 multiply-accumulate operations. The instruction indicates to the 512 NPUs 126 that the data word 209 input to the ALU 204 in each of the 511 multiply-accumulate operations is a round-robin value 211 from the adjacent NPU 126. That is, the instruction instructs the sequencer 128 to generate a value on the control input 213 to select the wheel value 211. Further, the instruction instructs the 512 NPUs 126 to load the respective weight values for each of the 511 multiply-accumulate operations from the "next" line of the weight RAM 124. That is, the instruction instructs sequencer 128 to increment the weight RAM address 125 by 1 relative to its value in the previous clock cycle, which in this example is row 1 for the first clock cycle of the instruction, row 2 for the next clock cycle, row 3 for the next clock cycle, and so on, and row 511 for the 511 th clock cycle. For each of these 511 multiply-accumulate operations, the product of the round-robin input 211 and the weight word input 206 is accumulated with the previous value of the accumulator 202. The 512 NPUs 126 perform 511 multiply-accumulate operations in 511 clock cycles, where each NPU 126 performs a multiply-accumulate operation on a different data word from row 17 of the data RAM 122, i.e., the data word on which the adjacent NPU 126 performed the operation in the previous cycle, and a different weight word associated with that data word, conceptually for different connected inputs of the neuron. In this example, it is assumed that the number of connection inputs of each NPU 126 (neuron) is 512, thus involving 512 data words and 512 weight words. Once the last iteration of the multiply-accumulate rotate instruction for line 2 is performed, the accumulator 202 contains the sum of the products of all 512 concatenated inputs. In one embodiment, the instruction set of the NPU 126 includes an "execute" instruction to instruct the ALU 204 to perform an ALU operation specified by the initializing NPU instruction (such as specified in the ALU function 2926 of fig. 29A), rather than having separate instructions for each type of ALU operation (e.g., multiply-accumulate, accumulator-and-weight word maximization, etc., as described above).
The fourth row at address 3 specifies an activation function instruction. The activation function instruction instructs AFU 212 to perform the specified activation function on the value 217 of the accumulator 202 to produce the result 133. The activation functions according to one embodiment are described in more detail below.
The fifth row at address 4 specifies a write AFU output instruction that instructs the 512 NPUs 126 to write back the output of AFU 212 as result 133 to one row of data RAM 122 (row 16 in this example). That is, the instruction instructs the sequencer 128 to output a data RAM address 123 with a value of 16 and a write command (as opposed to a read command in the case of the multiply-accumulate instruction at address 1). Preferably, because of the pipelined nature of the NPU, the execution of the write AFU output instruction may overlap with the execution of other instructions, such that the write AFU output instruction effectively executes within a single clock cycle.
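For illustration only, the following Python sketch models the semantics of the five-instruction program of FIG. 4 in software; it is not the hardware implementation. The row numbers (17, 0 through 511, and 16) and the 512-NPU, 512-connection sizes follow the example above, while the RAM contents and the activation function are assumed placeholders supplied by the caller.

```python
N = 512                                    # number of NPUs / words per row in this example

def run_fig4_program(data_ram, weight_ram, activation):
    """Software model of the FIG. 4 program: data_ram and weight_ram are indexable
    collections of N-word rows; activation is the function applied at address 3."""
    acc = [0] * N                          # address 0: initialize NPU -> accumulator = 0
    mux_reg = list(data_ram[17])           # address 1: load row 17 of data RAM once
    for j in range(N):                     # first multiply-accumulate uses weight RAM row 0
        acc[j] += mux_reg[j] * weight_ram[0][j]
    for r in range(1, N):                  # address 2: multiply-accumulate rotate, count 511
        mux_reg = mux_reg[-1:] + mux_reg[:-1]   # each NPU takes its neighbor's data word
        for j in range(N):
            acc[j] += mux_reg[j] * weight_ram[r][j]
    results = [activation(a) for a in acc] # address 3: activation function instruction
    data_ram[16] = results                 # address 4: write AFU output to row 16
    return data_ram
```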
Preferably, each NPU 126 is configured as a pipeline, wherein the pipeline includes various functional elements, such as multiplexing register 208 (and multiplexing register 705 of FIG. 7), ALU 204, accumulator 202, AFU 212, multiplexer 802 (of FIG. 8), row buffer 1104, and AFU 1112 (of FIG. 11), among others, some of which may themselves be pipelined. In addition to data words 207 and weight words 206, the pipeline receives instructions from the program memory 129. These instructions flow along the pipeline and control the various functional units. In an alternative embodiment, no activation function instruction is included within the program. Instead, the initialize NPU instruction specifies the activation function to be performed on the value 217 of the accumulator 202, and a value indicating the specified activation function is saved in a configuration register for later use by the AFU 212 portion of the pipeline after the last accumulator 202 value 217 has been generated, i.e., after the last iteration of the multiply-accumulate rotate instruction at address 2 is complete. Preferably, for power saving purposes, the AFU 212 portion of the pipeline is inactive until the write AFU output instruction reaches it, at which point the AFU 212 starts up and performs the activation function specified by the initialize instruction on the output 217 of the accumulator 202.
Referring now to FIG. 5, a timing diagram is shown that illustrates the execution of the program of FIG. 4 by NNU 121. Each row of the timing diagram corresponds to a successive clock cycle, as indicated in the first column. The other columns correspond to and indicate the operations of a different one of the 512 NPUs 126. For simplicity and clarity of illustration, only the operations of NPUs 0, 1, and 511 are shown.
At clock 0, each of the 512 NPUs 126 executes the initialization instruction of FIG. 4, which is illustrated in FIG. 5 by assigning a zero value to accumulator 202.
At clock 1, each of the 512 NPUs 126 executes the multiply-accumulate instruction at address 1 in FIG. 4. As shown, NPU 0 accumulates the product of word 0 of row 17 of data RAM 122 and word 0 of row 0 of weight RAM 124 with the value of accumulator 202 (i.e., zero); NPU 1 accumulates the product of word 1 of row 17 of data RAM 122 and word 1 of row 0 of weight RAM 124 with the value of accumulator 202 (i.e., zero); by analogy, NPU 511 accumulates the product of word 511 of row 17 of data RAM 122 and word 511 of row 0 of weight RAM 124 with the value of accumulator 202 (i.e., zero).
At clock 2, each of the 512 NPUs 126 executes the first iteration of the multiply-accumulate rotate instruction at address 2 in FIG. 4. As shown, NPU 0 accumulates the product of the round-robin data word 211 received from the output 209 of the multiplexing register 208 of NPU 511 (i.e., the data word 511 received from data RAM 122) and word 0 of row 1 of weight RAM 124 with the value of the accumulator 202; NPU 1 accumulates the product of the round-robin data word 211 received from output 209 of multiplexing register 208 of NPU 0 (i.e., data word 0 received from data RAM 122) and word 1 of row 1 of weight RAM 124 with the value of accumulator 202; by analogy, NPU 511 accumulates the product of the round-robin data word 211 received from output 209 of multiplexing register 208 of NPU 510 (i.e., data word 510 received from data RAM 122) and word 511 of row 1 of weight RAM 124 with the value of accumulator 202.
At clock 3, each of the 512 NPUs 126 performs a second iteration of the multiply-accumulate round-robin instruction at address 2 in FIG. 4. As shown, NPU 0 accumulates the product of the round-robin data word 211 received from the output 209 of the multiplexing register 208 of NPU 511 (i.e., the data word 510 received from the data RAM 122) and word 0 of row 2 of the weight RAM 124 with the value of the accumulator 202; NPU 1 accumulates the product of the round-robin data word 211 received from the output 209 of the multiplexing register 208 of NPU 0 (i.e., the data word 511 received from the data RAM 122) and word 1 of row 2 of the weight RAM 124 with the value of the accumulator 202; by analogy, NPU 511 accumulates the product of the round-robin data word 211 received from output 209 of multiplexing register 208 of NPU 510 (i.e., data word 509 received from data RAM 122) and word 511 of row 2 of weight RAM 124 with the value of accumulator 202. As indicated by the ellipses in FIG. 5, the next 509 clock cycles each continue in turn until clock 512.
At clock 512, each of the 512 NPUs 126 executes the 511th iteration of the multiply-accumulate round-robin instruction at address 2 in FIG. 4. As shown, NPU 0 accumulates the product of the round-robin data word 211 received from the output 209 of the multiplexing register 208 of NPU 511 (i.e., data word 1 received from data RAM 122) and word 0 of row 511 of weight RAM 124 with the value of accumulator 202; NPU 1 accumulates the product of the round-robin data word 211 received from output 209 of multiplexing register 208 of NPU 0 (i.e., data word 2 received from data RAM 122) and word 1 of row 511 of weight RAM 124 with the value of accumulator 202; by analogy, NPU 511 accumulates the product of the round-robin data word 211 received from output 209 of multiplexing register 208 of NPU 510 (i.e., data word 0 received from data RAM 122) and word 511 of row 511 of weight RAM 124 with the value of accumulator 202. In one embodiment, multiple clock cycles are required to read a data word and a weight word from data RAM 122 and weight RAM 124 to execute the multiply-accumulate instruction at address 1 in FIG. 4; however, data RAM 122, weight RAM 124, and the NPUs 126 are pipelined such that once a first multiply-accumulate operation begins (e.g., as shown during clock 1 of FIG. 5), subsequent multiply-accumulate operations begin in successive clock cycles (e.g., as shown during clocks 2-512). Preferably, the NPUs 126 may be temporarily stalled in response to an access of the data RAM 122 and/or the weight RAM 124 by architectural instructions (e.g., MTNN or MFNN instructions, described later with respect to FIGS. 14 and 15) or microinstructions translated from architectural instructions.
At clock 513, AFU 212 of each of the 512 NPUs 126 executes the Activate function instruction at address 3 in FIG. 4. Finally, at clock 514, each of the 512 NPUs 126 executes the write AFU output instruction at address 4 of fig. 4 by writing the result 133 back to the corresponding word in row 16 of the data RAM 122, i.e., writing the result 133 of NPU 0 to word 0 of the data RAM 122, writing the result 133 of NPU 1 to word 1 of the data RAM 122, and so on, until the result 133 of NPU 511 is written to word 511 of the data RAM 122. The operation described above with respect to fig. 5 is also shown in block diagram form in fig. 6A.
Referring now to FIG. 6A, a block diagram is shown that illustrates execution of the program of FIG. 4 by NNU 121 of FIG. 1. NNU 121 includes 512 NPUs 126, a data RAM 122 that receives address input 123, and a weight RAM 124 that receives address input 125. Although not shown, at clock 0, the 512 NPUs 126 execute the initialization instruction. As shown, at clock 1, the 512 16-bit data words of row 17 are read from the data RAM 122 and provided to the 512 NPUs 126. At clocks 1 through 512, the 512 16-bit weight words of rows 0 through 511, respectively, are read from the weight RAM 124 and provided to the 512 NPUs 126. Although not shown, at clock 1, the 512 NPUs 126 perform their respective multiply-accumulate operations on the loaded data words and weight words. At clocks 2 through 512, the multiplexing registers 208 of the 512 NPUs 126 collectively operate as a rotator for 512 16-bit words to rotate the row 17 data words previously loaded from the data RAM 122 to the adjacent NPU 126, and the NPUs 126 perform multiply-accumulate operations on the rotated data words and the weight words loaded from the weight RAM 124. Although not shown, at clock 513, the 512 AFUs 212 execute the activation function instruction. At clock 514, the 512 NPUs 126 write their corresponding 512 16-bit results 133 back to row 16 of the data RAM 122.
It can be seen that the number of clocks required to generate the result word (neuron output) and write back to the data RAM 122 or weight RAM 124 is approximately the square root of the number of data inputs (connections) received by the current layer of the neural network. For example, if the current layer includes 512 neurons each having 512 connections from the previous layer, the sum of these connections is 256K, and the number of clocks required to produce the results for the current layer slightly exceeds 512. Thus, NNU 121 provides extremely high performance for neural network computations.
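As a rough check of this clock-count claim, the following sketch works the arithmetic for the stated example; the equality with the square root holds only because the number of neurons equals the number of connections per neuron (and the number of NPUs) in this example.

```python
# Worked clock-count check for the example above (assumptions: 512 neurons, 512
# connections per neuron, one NPU per neuron, one connection processed per NPU per clock).
neurons = connections_per_neuron = npus = 512
total_connections = neurons * connections_per_neuron       # 262144, i.e., 256K
clocks_for_macs = total_connections // npus                 # 512 multiply-accumulate clocks
# Here 512 also equals sqrt(256K) because neurons == connections per neuron.
assert clocks_for_macs == int(total_connections ** 0.5) == 512
# A few extra clocks cover initialization, the activation function, and the write-back,
# so the layer completes in slightly more than 512 clocks (clocks 0 through 514 in FIG. 5).
```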
Referring now to FIG. 6B, a flowchart is shown illustrating operation of processor 100 of FIG. 1 to execute an architectural program that uses NNU 121 to perform multiply-accumulate-activation function computations (such as the operations performed by the program of FIG. 4) typically associated with neurons of a hidden layer of an artificial neural network. The example of FIG. 6B assumes computation of four hidden layers (denoted by the initialization of the NUM_LAYERS variable at block 602), each hidden layer having 512 neurons, each neuron fully connected to all 512 neurons of the previous layer (by means of the program of FIG. 4). However, it should be understood that the numbers of layers and neurons are chosen for illustration purposes, and that NNU 121 may be used to perform the same calculations for different numbers of hidden layers, different numbers of neurons in each layer, and layers that are not fully connected. In one embodiment, the weight value may be set to zero for neurons not present in a layer or for connections to neurons that are not present. Preferably, the architectural program writes a first set of weights to weight RAM 124 and starts NNU 121, and while NNU 121 is performing the computations associated with the first layer, the architectural program writes a second set of weights to weight RAM 124, so that NNU 121 can begin the computations for the second layer as soon as it completes the computations for the first hidden layer. Thus, the architectural program alternates between two regions of weight RAM 124 to ensure that NNU 121 is fully utilized. Flow begins at block 602.
At block 602, as shown and described with respect to FIG. 6A, the processor 100 (i.e., the architectural program running on the processor 100) writes the input values for the current hidden layer of neurons to the data RAM 122, e.g., to row 17 of the data RAM 122. Alternatively, these values may already be in row 17 of data RAM 122 as the results 133 of the operation of NNU 121 on a previous layer (e.g., a convolutional, pooling, or input layer). In addition, the architectural program initializes a variable N to a value of 1. The variable N indicates the current layer among the hidden layers being processed by NNU 121. In addition, the architectural program initializes the variable NUM_LAYERS to a value of 4 because there are four hidden layers in this example. Flow proceeds to block 604.
At block 604, as shown in FIG. 6A, the processor 100 writes the weight words for layer 1 to the weight RAM 124, e.g., to rows 0 through 511. Flow proceeds to block 606.
At block 606, the processor 100 writes the multiply-accumulate activate function program (e.g., of FIG. 4) to the program memory 129 of the NNU 121 using the MTNN instruction 1400 that specifies the function 1432 to write to the program memory 129. The processor 100 then initiates the NNU program with the MTNN instruction 1400 specifying the function 1432 to begin executing the program. Flow proceeds to decision block 608.
At decision block 608, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, flow proceeds to block 612; otherwise, flow proceeds to block 614.
At block 612, the processor 100 writes the weight words for layer N+1 to the weight RAM 124, e.g., to rows 512 through 1023. Thus, advantageously, the architectural program writes the weight words of the next layer to weight RAM 124 while NNU 121 is performing the hidden layer computations for the current layer, so that NNU 121 can begin performing the hidden layer computations for the next layer immediately upon completing the computations for the current layer, i.e., upon writing the results to data RAM 122. Flow proceeds to block 614.
At block 614, the processor 100 determines that the currently running NNU program (started at block 606 in the case of layer 1, and at block 618 in the case of layers 2 through 4) has completed. Preferably, processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of NNU 121. In an alternative embodiment, NNU 121 generates an interrupt to indicate that it has completed the multiply-accumulate-activation function layer program. Flow proceeds to decision block 616.
At decision block 616, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, flow proceeds to block 618; otherwise, flow proceeds to block 622.
At block 618, the processor 100 updates the multiply-accumulate-activation function program so that it can perform the hidden layer computations for layer N+1. More specifically, the processor 100 updates the data RAM 122 row value of the multiply-accumulate instruction at address 1 of FIG. 4 to the row of the data RAM 122 to which the results of the previous layer were written (e.g., to row 16), and also updates the output row (e.g., to row 15). Processor 100 then starts the updated NNU program. Alternatively, the program of FIG. 4 specifies the same row in the output instruction at address 4 as the row specified by the multiply-accumulate instruction at address 1 (i.e., the row read from data RAM 122). In this embodiment, the current row of input data words is overwritten (which is acceptable as long as that row of data words is not needed for some other purpose, because the row of data words has already been read into the multiplexing registers 208 and is rotated among the NPUs 126 via the N-word rotator). In this case, at block 618, the NNU program need not be updated, but only restarted. Flow proceeds to block 622.
At block 622, the processor 100 reads the results of the layer N NNU program from the data RAM 122. However, if these results are only used for the next layer, the architectural program need not read these results from the data RAM 122, but instead can retain them in the data RAM 122 for the next hidden layer computation. Flow proceeds to decision block 624.
At decision block 624, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, flow proceeds to block 626; otherwise, the flow ends.
At block 626, the architectural program increments N. Flow returns to decision block 608.
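The following Python sketch summarizes the FIG. 6B flow as software pseudocode to make the double-buffering of the weight RAM explicit; the nnu driver object and its methods (write_weight_ram, start_program, wait_until_done, and so on) are hypothetical stand-ins for the MTNN/MFNN-based operations described in the text, and the data RAM row numbers are simplified (the real program retargets the input and output rows each layer, per block 618).

```python
NUM_LAYERS = 4
WEIGHT_REGIONS = (range(0, 512), range(512, 1024))      # ping-pong halves of weight RAM 124

def run_hidden_layers(nnu, inputs, weights):
    """Sketch of the architectural program of FIG. 6B; 'weights[n]' holds layer n's weights."""
    nnu.write_data_ram(row=17, values=inputs)                        # block 602
    nnu.write_weight_ram(WEIGHT_REGIONS[0], weights[1])              # block 604
    nnu.load_program("multiply-accumulate-activate")                 # block 606: MTNN writes
    nnu.start_program()                                              # program, then starts it
    n = 1
    while True:
        if n < NUM_LAYERS:                                           # blocks 608/612: preload
            nnu.write_weight_ram(WEIGHT_REGIONS[n % 2], weights[n + 1])  # next layer's weights
        nnu.wait_until_done()                                        # block 614: MFNN of status reg
        if n < NUM_LAYERS:                                           # blocks 616/618: retarget rows
            nnu.update_program_rows(input_row=16, output_row=15)     # and restart the program
            nnu.start_program()
        results = nnu.read_data_ram(row=16)                          # block 622: layer n results
        if n == NUM_LAYERS:                                          # block 624: last layer?
            return results
        n += 1                                                       # block 626
```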
As can be determined from the example of FIG. 6B, the NPUs 126 (by virtue of the operation of the NNU program of FIG. 4) perform one read from and one write to the data RAM 122 approximately every 512 clock cycles. Further, the NPUs 126 read the weight RAM 124 approximately every clock cycle to read a row of weight words. Thus, substantially the entire bandwidth of weight RAM 124 is consumed by NNU 121 while it performs the hidden layer operations in the hybrid manner described. Further, assuming an embodiment that includes a write and read buffer (such as buffer 1704 of FIG. 17), the processor 100 writes to the buffer in parallel with the NPU 126 reads, such that buffer 1704 performs a write to weight RAM 124 approximately every 16 clock cycles to write the weight words. Thus, in a single-ported embodiment of weight RAM 124 (such as the embodiment described with respect to FIG. 17), the NPUs 126 must temporarily stall their reads of weight RAM 124 approximately every 16 clock cycles to enable buffer 1704 to write to weight RAM 124. However, in embodiments in which weight RAM 124 is dual-ported, the NPUs 126 need not stall.
Referring now to FIG. 7, a block diagram is shown illustrating the NPU 126 of FIG. 1 in accordance with an alternative embodiment. The NPU 126 of FIG. 7 is similar in many respects to the NPU 126 of FIG. 2. However, the NPU 126 of FIG. 7 additionally includes a second 2-input multiplexing register 705. The multiplexing register 705 selects one of its inputs 206 or 711 to store in its register and then provide on its output 203 on a subsequent clock cycle. Input 206 receives the weight word from weight RAM 124. The other input 711 receives the output 203 of the second multiplexing register 705 of the adjacent NPU 126. Preferably, the input 711 of the multiplexing register 705 of NPU J receives the output 203 of the multiplexing register 705 of NPU 126 instance J-1, and the output 203 of the multiplexing register 705 of NPU J is provided to the input 711 of the multiplexing register 705 of NPU 126 instance J+1. Thus, in the same manner as described above with respect to FIG. 3, the multiplexing registers 705 of the N NPUs 126 collectively operate as an N-word rotator, but for weight words rather than data words. Control input 713 controls which of the two inputs the multiplexing register 705 selects to store in its register and subsequently provide on output 203.
Including multiplexing register 208 and/or multiplexing register 705 (as well as the multiplexing registers of other embodiments, such as those shown in FIGS. 18 and 23) to effectively form a large rotator that rotates a row of data/weights received from data RAM 122 and/or weight RAM 124 has the following advantage: NNU 121 does not require the very large multiplexer that would otherwise be needed between data RAM 122 and/or weight RAM 124 in order to provide the necessary data words/weight words to the appropriate NPUs 126.
Writing back accumulator values in addition to activation function results
In some applications, it is useful for the processor 100 to receive back (e.g., to the media registers 118 via the MFNN instruction of FIG. 15) the raw accumulator 202 values 217, so that instructions executing on other execution units 112 can perform computations on these accumulator 202 values 217. For example, in one embodiment, to reduce the complexity of AFU 212, AFU 212 is not configured to perform a soft max activation function. Thus, NNU 121 can output the raw accumulator 202 values 217, or a subset thereof, to data RAM 122 or weight RAM 124, and the architectural program then reads the raw accumulator 202 values 217, or the subset thereof, from data RAM 122 or weight RAM 124 and performs computations on the raw values. However, the use of the raw accumulator 202 values 217 is not limited to performing soft max operations, and other uses are contemplated.
Referring now to FIG. 8, a block diagram is shown illustrating the NPU 126 of FIG. 1 in accordance with an alternative embodiment. The NPU 126 of FIG. 8 is similar in many respects to the NPU 126 of FIG. 2. However, the NPU 126 of FIG. 8 includes a multiplexer (mux) 802 within AFU 212, where AFU 212 has a control input 803. The width (in bits) of the accumulator 202 is greater than the width of a data word. Multiplexer 802 has a plurality of inputs that each receive a data-word-width portion of the accumulator 202 output 217. In one embodiment, the accumulator 202 is 41 bits wide, and the NPU 126 is configured to output a 16-bit result word 133; thus, for example, multiplexer 802 (or multiplexer 3032 and/or multiplexer 3037 of FIG. 30) has inputs that receive bits [15:0], bits [31:16], and bits [47:32] of the accumulator 202 output 217. Preferably, output bits not provided by the accumulator 202 (e.g., bits [47:41]) are forced to zero value bits.
In response to a write ACC instruction (e.g., one of the write ACC instructions at addresses 3 through 5 of FIG. 9, described below), sequencer 128 generates a value on control input 803 to control multiplexer 802 to select one of the words (e.g., 16 bits) of accumulator 202. Preferably, multiplexer 802 also has one or more inputs that receive the outputs of the activation function circuits (e.g., elements 3022, 3024, 3026, 3018, 3014, and 3016 of FIG. 30), which produce outputs that are a data word in width. In response to an instruction such as the write AFU output instruction at address 4 of FIG. 4, the sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the activation function circuit outputs instead of one of the words of the accumulator 202.
Referring now to FIG. 9, a table is shown illustrating a program for storage in program memory 129 of NNU 121 of FIG. 1 and execution by NNU 121. The exemplary program of FIG. 9 is similar in many respects to the program of FIG. 4. Specifically, the instructions at addresses 0 through 2 are the same. However, the instructions at addresses 3 and 4 of FIG. 4 are replaced in FIG. 9 with write ACC instructions, which instruct the 512 NPUs 126 to write back their accumulator 202 outputs 217 as results 133 to three rows of the data RAM 122 (rows 16 through 18 in this example). That is, the write ACC instruction instructs the sequencer 128 to output a data RAM address 123 with a value of 16 and a write command in a first clock cycle, a data RAM address 123 with a value of 17 and a write command in a second clock cycle, and a data RAM address 123 with a value of 18 and a write command in a third clock cycle. Preferably, the execution of the write ACC instruction may overlap with the execution of other instructions, such that the write ACC instruction effectively executes in three clock cycles, one clock cycle for each row of the data RAM 122 that is written. In one embodiment, the user specifies values in the activation function 2934 and output command 2956 fields of the control register 127 (of FIG. 29A) to accomplish writing the desired portions of the accumulator 202 to the data RAM 122 or weight RAM 124. Alternatively, the write ACC instruction may optionally write back a subset of the accumulator 202, rather than writing back the entire contents of the accumulator 202. In one embodiment, the normal form of the accumulator 202 may be written back, as described in more detail below with respect to FIGS. 29 through 31.
Referring now to FIG. 10, a timing diagram is shown illustrating the execution of the program of FIG. 9 by NNU 121. The timing diagram of FIG. 10 is similar to that of FIG. 5, and clocks 0 through 512 are the same. However, at clocks 513 through 515, the AFU 212 of each of the 512 NPUs 126 executes one of the write ACC instructions at addresses 3 through 5 of FIG. 9. Specifically, at clock 513, each of the 512 NPUs 126 writes back bits [15:0] of its accumulator 202 output 217 as result 133 to its corresponding word in row 16 of the data RAM 122; at clock 514, each of the 512 NPUs 126 writes back bits [31:16] of its accumulator 202 output 217 as result 133 to its corresponding word in row 17 of the data RAM 122; and at clock 515, each of the 512 NPUs 126 writes back bits [40:32] of its accumulator 202 output 217 as result 133 to its corresponding word in row 18 of the data RAM 122. Preferably, bits [47:41] are forced to a zero value.
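The bit slicing performed by the three write ACC instructions can be sketched as follows, assuming the 41-bit accumulator and the bit ranges given above; the example accumulator value is arbitrary and is used only to show that the three written words together carry the full accumulator contents.

```python
def acc_write_back_words(acc_value):
    """Split a 41-bit accumulator value into the three 16-bit words written to
    data RAM rows 16, 17, and 18 at clocks 513, 514, and 515 (FIG. 10)."""
    word_row16 = acc_value & 0xFFFF              # bits [15:0]  -> row 16 at clock 513
    word_row17 = (acc_value >> 16) & 0xFFFF      # bits [31:16] -> row 17 at clock 514
    word_row18 = (acc_value >> 32) & 0x01FF      # bits [40:32] -> row 18 at clock 515,
    return word_row16, word_row17, word_row18    #   with bits [47:41] forced to zero

# Example: reassembling the three words recovers the original 41-bit value.
acc = 0x1_2345_6789 & ((1 << 41) - 1)
lo, mid, hi = acc_write_back_words(acc)
assert (hi << 32) | (mid << 16) | lo == acc
```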
Shared AFU
Referring now to FIG. 11, a block diagram is shown that illustrates an embodiment of NNU 121 of FIG. 1. In the embodiment of fig. 11, the neuron element is divided into two parts, namely, an activation function unit part and an ALU part (the ALU part also includes a shift register part), and each activation function unit part is shared by a plurality of ALU parts. In FIG. 11, the ALU portion is referred to as the NPU126, and the shared active function unit portion is referred to as the AFU 1112. This is in contrast to the embodiment of fig. 2, for example, in the embodiment of fig. 2, each neuron contains its own AFU 212. Thus, for example, in one embodiment, the NPU126 (ALU portion) of the embodiment of fig. 11 includes the accumulator 202, ALU 204, multiplexing register 208, and register 205 of fig. 2, but does not include AFU 212. In the embodiment of FIG. 11, NNUs 121 include, by way of example, 512 NPUs 126; however, other embodiments having other numbers of NPUs 126 are contemplated. In the example of fig. 11, the 512 NPUs 126 are grouped into 64 groups (referred to as groups 0 through 63 in fig. 11), and each group has 8 NPUs 126.
NNU 121 also includes a row buffer 1104 and a plurality of shared AFUs 1112 coupled between the NPUs 126 and the row buffer 1104. The width (in bits) of the row buffer 1104 is the same as that of a row of the data RAM 122 or weight RAM 124, e.g., 512 words. There is one AFU 1112 per NPU 126 group, i.e., each AFU 1112 has a corresponding NPU 126 group; thus, in the embodiment of FIG. 11, there are 64 AFUs 1112 corresponding to the 64 NPU 126 groups. Each of the 8 NPUs 126 within a group shares the corresponding AFU 1112. Other embodiments are contemplated having different numbers of AFUs 1112 and different numbers of NPUs 126 in each group. For example, other embodiments are contemplated in which two, four, or sixteen NPUs 126 in a group share an AFU 1112.
The motivation for sharing AFUs 1112 is to reduce the size of NNU 121. The size reduction is obtained at the cost of some performance. That is, depending on the sharing ratio, it may take several additional clocks to produce the results 133 for the entire array of NPUs 126, as shown in FIG. 12 below, for example, in which seven additional clock cycles are required because of the 8:1 sharing ratio. However, in general, the number of additional clocks (e.g., 7) is relatively small compared to the number of clocks required to generate the accumulated sum (e.g., 512 clocks for a layer having 512 connections per neuron). Thus, the relatively small performance impact (e.g., approximately a one percent increase in computation time) may be a cost-effective tradeoff for the reduced size of NNU 121.
In one embodiment, each NPU 126 includes an AFU 212 configured to perform relatively simple activation functions, so that these simple AFUs 212 can be relatively small and can therefore be included within each NPU 126; whereas the shared, or complex, AFU 1112 performs relatively complex activation functions and is therefore significantly larger than a simple AFU 212. In such an embodiment, the additional clock cycles are only required when a complex activation function is specified that requires the shared complex AFU 1112, but not when an activation function is specified that the simple AFU 212 is configured to perform.
Referring now to fig. 12 and 13, two timing diagrams illustrating the execution of the program of fig. 4 by NNU 121 of fig. 11 are shown. The timing diagram of fig. 12 is similar to that of fig. 5, and clocks 0 to 512 are the same. However, at clock 513, the operation is different from that described in the timing diagram of FIG. 5 because the NPU 126 of FIG. 11 shares AFU 1112; that is, the NPUs 126 in a group share the AFUs 1112 associated with that group, and fig. 11 illustrates the sharing.
Each row of the timing diagram of fig. 13 corresponds to a successive clock cycle indicated in the first column. The other columns correspond to and indicate the operation of different AFUs 1112 in the 64 AFUs 1112. For simplicity and clarity of illustration, only the operations of AFUs 0, 1, and 63 are shown. The clock cycles of FIG. 13 correspond to the clock cycles of FIG. 12, but the sharing of AFU 1112 by NPU 126 is shown in a different manner. As shown in FIG. 13, at clocks 0-512, each AFU 1112 of 64 AFUs 1112 is inactive and the NPU 126 executes an initialize NPU instruction, a multiply-accumulate instruction, and a multiply-accumulate rotation instruction.
As shown in both FIG. 12 and FIG. 13, at clock 513, AFU 0 (the AFU 1112 associated with group 0) begins performing the specified activation function on the value 217 of the accumulator 202 of NPU 0 (i.e., the first NPU 126 in group 0), and the output of AFU 0 will be stored to word 0 of the row buffer 1104. Also at clock 513, each AFU 1112 begins executing the specified activation function on the accumulator 202 of the first NPU 126 in its corresponding group of NPUs 126. Thus, as shown in FIG. 13, at clock 513, AFU 0 begins performing the specified activation function on NPU 0's accumulator 202 to produce the result that will be stored to word 0 of the row buffer 1104; AFU 1 begins performing the specified activation function on the accumulator 202 of NPU 8 to produce the result that will be stored to word 8 of the row buffer 1104; and so on, AFU 63 begins performing the specified activation function on the accumulator 202 of NPU 504 to produce the result that will be stored to word 504 of the row buffer 1104.
As shown, at clock 514, AFU 0 (the AFU 1112 associated with group 0) begins performing the specified activation function on the value 217 of the accumulator 202 of NPU 1 (i.e., the second NPU 126 in group 0), and the output of AFU 0 will be stored to word 1 of the row buffer 1104. Also at clock 514, each AFU 1112 begins executing the specified activation function on the accumulator 202 of the second NPU 126 in its corresponding group of NPUs 126. Thus, as shown in FIG. 13, at clock 514, AFU 0 begins performing the specified activation function on NPU 1's accumulator 202 to produce the result that will be stored to word 1 of the row buffer 1104; AFU 1 begins performing the specified activation function on the accumulator 202 of NPU 9 to produce the result that will be stored to word 9 of the row buffer 1104; and so on, AFU 63 begins performing the specified activation function on the accumulator 202 of NPU 505 to produce the result that will be stored to word 505 of the row buffer 1104. This pattern continues until clock cycle 520, at which AFU 0 (the AFU 1112 associated with group 0) begins performing the specified activation function on the value 217 of the accumulator 202 of NPU 7 (i.e., the eighth and last NPU 126 in group 0), and the output of AFU 0 will be stored to word 7 of the row buffer 1104. Also at clock 520, each AFU 1112 begins executing the specified activation function on the accumulator 202 of the eighth NPU 126 in its corresponding group of NPUs 126. Thus, as shown in FIG. 13, at clock 520, AFU 0 begins performing the specified activation function on the accumulator 202 of NPU 7 to produce the result that will be stored to word 7 of the row buffer 1104; AFU 1 begins performing the specified activation function on the accumulator 202 of NPU 15 to produce the result that will be stored to word 15 of the row buffer 1104; and so on, AFU 63 begins performing the specified activation function on the accumulator 202 of NPU 511 to produce the result that will be stored to word 511 of the row buffer 1104.
At clock 521, once all 512 results associated with the 512 NPUs 126 have been generated and written to the row buffer 1104, the row buffer 1104 begins writing its contents to the data RAM 122 or weight RAM 124. In this manner, the AFU 1112 of each of the 64 groups of NPUs 126 performs its portion of the activation function instruction at address 3 of FIG. 4.
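A minimal sketch of the schedule of FIGS. 12 and 13 follows; timing is indicated only by comments, and the activation function and the Python representation are illustrative, not the hardware implementation.

```python
GROUPS, NPUS_PER_GROUP = 64, 8                   # 64 shared AFUs 1112, 8 NPUs 126 per group

def drain_accumulators(accumulators, activation):
    """accumulators: 512 accumulator 202 values, one per NPU, in NPU order."""
    row_buffer = [None] * (GROUPS * NPUS_PER_GROUP)
    for step in range(NPUS_PER_GROUP):           # step 0..7 corresponds to clocks 513..520
        for afu in range(GROUPS):                # all 64 AFUs 1112 operate in parallel each clock
            npu = afu * NPUS_PER_GROUP + step    # AFU 0 -> NPUs 0..7, AFU 1 -> NPUs 8..15, ...
            row_buffer[npu] = activation(accumulators[npu])
    return row_buffer                            # written to data RAM 122 or weight RAM 124 at clock 521
```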
As described in more detail below, e.g., with respect to fig. 29A-33, embodiments that share AFU 1112 between groups of ALUs 204 (such as the embodiment in fig. 11, etc.) may be particularly advantageous in conjunction with integer ALUs 204.
MTNN and MFNN architecture instructions
Referring now to FIG. 14, a block diagram is shown illustrating a Move To Neural Network (MTNN) architectural instruction 1400 and its operation with respect to portions of the NNU 121 of FIG. 1. The MTNN instruction 1400 includes an operation code (opcode) field 1402, a src1 field 1404, a src2 field 1406, a gpr field 1408, and an immediate field 1412. The MTNN instruction 1400 is an architectural instruction, i.e., the instruction is contained within the instruction set architecture of the processor 100. Preferably, the instruction set architecture associates a predetermined value of the opcode field 1402 with the MTNN instruction 1400 to distinguish the MTNN instruction 1400 from other instructions in the instruction set architecture. The opcode 1402 of the MTNN instruction 1400 may or may not include a prefix, such as is common in the x86 architecture.
The immediate field 1412 provides a value that specifies a function 1432 to the control logic 1434 of the NNU 121. Preferably, the function 1432 is provided as an immediate operand of the microinstruction 105 of FIG. 1. Functions 1432 that may be performed by NNU 121 include, but are not limited to, writing to data RAM 122, writing to weight RAM 124, writing to program memory 129, writing to control register 127, starting execution of a program in program memory 129, pausing execution of a program in program memory 129, requesting notification (e.g., an interrupt) of completion of execution of a program in program memory 129, and resetting NNU 121. Preferably, the NNU instruction set includes an instruction whose result indicates that the NNU program has completed. Alternatively, the NNU instruction set includes an explicit interrupt-generating instruction. Preferably, resetting NNU 121 includes effectively forcing NNU 121 back to a reset state (e.g., clearing the internal state machine and setting it to an idle state), except that the contents of data RAM 122, weight RAM 124, and program memory 129 remain intact. In addition, internal registers such as the accumulator 202 are not affected by the reset function and must be cleared explicitly, for example by the initialize NPU instruction at address 0 of FIG. 4. In one embodiment, function 1432 may include a direct execution function, in which the first source register contains a micro-operation (see, e.g., micro-operation 3418 of FIG. 34). The direct execution function instructs NNU 121 to directly execute the specified micro-operation. Thus, rather than writing instructions to program memory 129 and subsequently instructing NNU 121, by way of the MTNN instruction 1400 (or the MFNN instruction 1500 of FIG. 15), to execute the instructions in program memory 129, the architectural program may directly control NNU 121 to perform operations. FIG. 14 shows an example of a function 1432 that writes to the data RAM 122.
The gpr field 1408 specifies a GPR within the general purpose register file 116. In one embodiment, each GPR is 64 bits. As shown, the general purpose register file 116 provides the value from the selected GPR to NNU 121, which NNU 121 uses as address 1422. Address 1422 selects a row of the memory specified in function 1432. In the case of data RAM 122 or weight RAM 124, address 1422 additionally selects a block of data within the selected row whose size is twice that of a media register location (e.g., 512 bits). Preferably, this location is on a 512-bit boundary. In one embodiment, a multiplexer selects either address 1422 (or address 1522 in the case of the MFNN instruction 1500 described below) or the address 123/125/131 from sequencer 128 to provide to data RAM 122/weight RAM 124/program memory 129. In one embodiment, as described in more detail below, the data RAM 122 is dual-ported, enabling the NPUs 126 to read/write the data RAM 122 in parallel with the media registers 118 reading/writing the data RAM 122. In one embodiment, weight RAM 124 is also dual-ported for a similar purpose.
The src1 field 1404 and the src2 field 1406 each specify a media register in the media register file 118. In one embodiment, each media register 118 is 256 bits. As shown, the media register file 118 provides the concatenated data (e.g., 512 bits) from the selected media registers to the data RAM 122 (or weight RAM 124 or program memory 129) for writing into the selected row 1428 specified by address 1422, at the location within the selected row 1428 specified by address 1422. Advantageously, by executing a series of MTNN instructions 1400 (and MFNN instructions 1500 described below), an architectural program executing on processor 100 can populate rows of data RAM 122 and rows of weight RAM 124 and write a program, such as the programs described herein (e.g., of FIGS. 4 and 9), to program memory 129 to cause NNU 121 to perform operations on the data and weights at very high speed, thereby implementing an artificial neural network. In one embodiment, the architectural program directly controls NNU 121 rather than writing a program to program memory 129.
In one embodiment, rather than specifying two source registers (e.g., 1404 and 1406), the MTNN instruction 1400 specifies a starting source register and a number of source registers, Q. This form of the MTNN instruction 1400 instructs the processor 100 to write the media register 118 specified as the starting source register and the next Q-1 subsequent media registers 118 to the NNU 121, i.e., to the specified data RAM 122 or weight RAM 124. Preferably, the instruction translator 104 translates the MTNN instruction 1400 into as many microinstructions as are necessary to write all Q specified media registers 118. For example, in one embodiment, when the MTNN instruction 1400 specifies the starting source register as MR4 and Q as 8, the instruction translator 104 translates the MTNN instruction 1400 into four microinstructions, the first of which writes MR4 and MR5, the second of which writes MR6 and MR7, the third of which writes MR8 and MR9, and the fourth of which writes MR10 and MR11. In an alternative embodiment in which the data path from the media registers 118 to the NNU 121 is 1024 bits rather than 512 bits, the instruction translator 104 translates the MTNN instruction 1400 into two microinstructions, the first of which writes MR4 through MR7 and the second of which writes MR8 through MR11. A similar embodiment is contemplated in which the MFNN instruction 1500 specifies a starting destination register and a number of destination registers, so that each MFNN instruction 1500 can read a block of data from a row of the data RAM 122 or weight RAM 124 that is larger than a single media register 118.
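The expansion of this start-register/count form into microinstructions can be sketched as follows; the function and register naming are illustrative only, not the actual microcode of the instruction translator 104, and the example mirrors the MR4/Q=8 case described above.

```python
def expand_mtnn(start_reg, q, regs_per_uop=2):
    """Return, per microinstruction, the media registers whose contents it carries.
    Each microinstruction carries regs_per_uop 256-bit media registers (512 bits when 2)."""
    assert q % regs_per_uop == 0
    return [tuple(f"MR{start_reg + i + j}" for j in range(regs_per_uop))
            for i in range(0, q, regs_per_uop)]

# expand_mtnn(4, 8) -> [('MR4','MR5'), ('MR6','MR7'), ('MR8','MR9'), ('MR10','MR11')]
# With a hypothetical 1024-bit data path, regs_per_uop=4 yields the two-microinstruction case.
```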
Referring now to FIG. 15, a block diagram is shown illustrating a Move From Neural Network (MFNN) architectural instruction 1500 and its operation with respect to portions of the NNU 121 of FIG. 1. The MFNN instruction 1500 includes an opcode field 1502, a dst field 1504, a gpr field 1508, and an immediate field 1512. The MFNN instruction 1500 is an architectural instruction, i.e., the instruction is contained within the instruction set architecture of the processor 100. Preferably, the instruction set architecture associates a predetermined value of the opcode field 1502 with the MFNN instruction 1500 to distinguish the MFNN instruction 1500 from other instructions within the instruction set architecture. The opcode 1502 of the MFNN instruction 1500 may or may not include a prefix, such as is common in the x86 architecture.
The immediate field 1512 provides a value for specifying a function 1532 to the control logic 1434 of the NNU 121. Preferably, the function 1532 is provided as an immediate operand to the microinstruction 105 of FIG. 1. Functions 1532 that may be performed by NNU 121 include, but are not limited to, read data RAM 122, read weight RAM 124, read program memory 129, and read status register 127. Fig. 15 shows an example of a function 1532 of the read data RAM 122.
The gpr field 1508 specifies a GPR within the general purpose register file 116. As shown, the general purpose register file 116 provides the value from the selected GPR to NNU 121, which NNU 121 uses as address 1522 and, in a manner similar to address 1422 of FIG. 14, selects a row of the memory specified in function 1532; in the case of data RAM 122 or weight RAM 124, address 1522 additionally selects a block of data within the selected row whose size is that of a media register location (e.g., 256 bits). Preferably, this location is on a 256-bit boundary.
The dst field 1504 specifies a media register in the media register file 118. As shown, the media register file 118 receives data (e.g., 256 bits) into the selected media register from the data RAM 122 (or weight RAM 124 or program memory 129), the data being read from the selected row 1528 specified by address 1522, at the location within the selected row 1528 specified by address 1522.
NNU internal RAM port configuration
Referring now to FIG. 16, a block diagram illustrating an embodiment of the data RAM 122 of FIG. 1 is shown. Data RAM 122 includes a memory array 1606, a read port 1602, and a write port 1604. Memory array 1606 holds the data words and is preferably arranged as D rows of N words each, as described above. In one embodiment, memory array 1606 comprises an array of 64 horizontally arranged static RAM cells, each cell being 128 bits wide and 64 rows tall, to provide a 64KB data RAM 122 that is 8192 bits wide and has 64 rows, and the die area occupied by data RAM 122 is approximately 0.2 square millimeters. However, other embodiments are contemplated.
The read port 1602 is preferably coupled in a multiplexed fashion to the NPUs 126 and to the media registers 118. (More precisely, the media registers 118 may be coupled to the read port 1602 via a result bus, which may also provide data to the reorder buffer and/or result forwarding buses to the other execution units 112.) The NPUs 126 share the read port 1602 with the media registers 118 to read the data RAM 122. The write port 1604 is also preferably coupled in a multiplexed fashion to the NPUs 126 and to the media registers 118. The NPUs 126 share the write port 1604 with the media registers 118 to write the data RAM 122. Thus, advantageously, the media registers 118 can write to the data RAM 122 in parallel while the NPUs 126 are reading from the data RAM 122, or the NPUs 126 can write to the data RAM 122 in parallel while the media registers 118 are reading from the data RAM 122. This may advantageously provide improved performance. For example, the NPUs 126 can read the data RAM 122 (e.g., to continue performing computations) while the media registers 118 write more data words to the data RAM 122. As another example, the NPUs 126 can write computation results to the data RAM 122 while the media registers 118 read computation results from the data RAM 122. In one embodiment, the NPUs 126 can write a row of computation results to the data RAM 122 while also reading a row of data words from the data RAM 122. In one embodiment, the memory array 1606 is configured as memory blocks (banks). When the NPUs 126 access the data RAM 122, all of the banks are activated to access an entire row of the memory array 1606; whereas when the media registers 118 access the data RAM 122, only the specified banks are activated. In one embodiment, each bank is 128 bits wide and the media registers 118 are 256 bits wide, so, for example, two banks are activated on each media register 118 access. In one embodiment, one of the ports 1602/1604 is a read/write port. In one embodiment, both of the ports 1602/1604 are read/write ports.
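The banked-access behavior can be sketched as follows, under the stated assumptions (64 banks of 128 bits making up one 8192-bit row, 256-bit media registers); the function is illustrative only and is not the bank-select logic itself.

```python
ROW_BITS, BANK_BITS, MEDIA_REG_BITS = 8192, 128, 256
NUM_BANKS = ROW_BITS // BANK_BITS                      # 64 banks per row

def banks_activated(accessor, bit_offset=0):
    """Which banks of memory array 1606 are activated for a given access."""
    if accessor == "npu":                              # NPU access reads/writes an entire row
        return list(range(NUM_BANKS))
    first = bit_offset // BANK_BITS                    # media register access touches only the
    count = MEDIA_REG_BITS // BANK_BITS                # 256 / 128 = 2 banks that hold its data
    return list(range(first, first + count))

assert len(banks_activated("npu")) == 64
assert banks_activated("media", bit_offset=512) == [4, 5]
```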
An advantage of the rotator capability of the NPUs 126 described herein is that it helps to make the memory array 1606 of the data RAM 122 have significantly fewer rows, and thus be relatively much smaller, than would be required for a memory array that must continuously provide data to the NPUs 126 and retrieve results from them while the NPUs 126 are performing computations in order to ensure that the NPUs 126 are highly utilized.
Internal RAM cache
Referring now to FIG. 17, a block diagram illustrating an embodiment of the weight RAM 124 and buffer 1704 of FIG. 1 is shown. The weight RAM 124 includes a memory array 1706 and a port 1702. The memory array 1706 holds the weight words and is preferably arranged as W rows of N words each, as described above. In one embodiment, memory array 1706 comprises an array of 128 horizontally arranged static RAM cells, each cell being 64 bits wide and 2048 rows tall, to provide a 2MB weight RAM 124 that is 8192 bits wide and has 2048 rows, and the die area occupied by weight RAM 124 is approximately 2.4 square millimeters. However, other embodiments are contemplated.
The port 1702 is preferably coupled in a multiplexed fashion to the NPUs 126 and to the buffer 1704. The NPUs 126 and the buffer 1704 read from and write to the weight RAM 124 via the port 1702. The buffer 1704 is also coupled to the media registers 118 of FIG. 1, such that the media registers 118 read from and write to the weight RAM 124 through the buffer 1704. Thus, advantageously, while the NPUs 126 are reading from or writing to the weight RAM 124, the media registers 118 can in parallel write to or read from the buffer 1704 (although if the NPUs 126 are currently executing, they are preferably stalled to avoid accessing the weight RAM 124 while the buffer 1704 is accessing it). This may advantageously improve performance, particularly because the media register 118 reads and writes of the weight RAM 124 are relatively much smaller than the NPU 126 reads and writes of the weight RAM 124. For example, in one embodiment, the NPUs 126 read/write 8192 bits (one row) at a time, whereas the media registers 118 are 256 bits wide, and each MTNN instruction 1400 writes two media registers 118, i.e., 512 bits. Thus, in the case where the architectural program executes sixteen MTNN instructions 1400 to fill the buffer 1704, the NPUs 126 and the architectural program conflict for access to the weight RAM 124 less than approximately six percent of the time. In another embodiment, the instruction translator 104 translates an MTNN instruction 1400 into two microinstructions 105, each of which writes a single media register 118 to the buffer 1704, in which case the NPUs 126 and the architectural program conflict for access to the weight RAM 124 even less frequently.
In an embodiment that includes the buffer 1704, writing to the weight RAM 124 using the architectural program requires multiple MTNN instructions 1400. One or more MTNN instructions 1400 specify a function 1432 that writes specified data blocks of the buffer 1704, followed by an MTNN instruction 1400 that specifies a function 1432 instructing the NNU 121 to write the contents of the buffer 1704 into a specified row of the weight RAM 124, where the size of a data block is twice the number of bits of a media register 118 and the data blocks are naturally aligned within the buffer 1704. In one embodiment, each MTNN instruction 1400 that specifies a function 1432 to write specified data blocks of the buffer 1704 contains a bitmask having one bit corresponding to each data block of the buffer 1704. The data from the two specified source registers 118 is written to each data block of the buffer 1704 whose corresponding bit in the bitmask is set. This may be useful for repeated data values within a row of the weight RAM 124. For example, to zero out the buffer 1704 (and a subsequently written row of the weight RAM 124), the programmer may load the source registers with zero and set all bits of the bitmask. Furthermore, the bitmask enables the programmer to write only selected data blocks of the buffer 1704, thereby preserving the previous data in the other data blocks.
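A minimal model of the bitmask-controlled buffer write follows; the 16-block buffer size follows from the 8192-bit row and 512-bit blocks described above, and the function is an illustrative sketch rather than the actual hardware or instruction interface.

```python
BUFFER_BLOCKS = 16                                   # 8192-bit buffer / 512-bit data blocks

def masked_buffer_write(buffer_blocks, src_data, bitmask):
    """Overwrite only the blocks of buffer 1704 whose bit in the bitmask is set;
    src_data models the 512 bits taken from the two specified source registers."""
    for block in range(BUFFER_BLOCKS):
        if bitmask & (1 << block):
            buffer_blocks[block] = src_data
    return buffer_blocks

# Zeroing the whole buffer (and hence the next weight RAM row written from it):
buf = masked_buffer_write([None] * BUFFER_BLOCKS, src_data=0, bitmask=0xFFFF)
assert buf == [0] * BUFFER_BLOCKS
# Writing only blocks 0 and 3 preserves the previous contents of the other blocks.
buf = masked_buffer_write(buf, src_data=0x1234, bitmask=(1 << 0) | (1 << 3))
```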
In an embodiment that includes the buffer 1704, reading the weight RAM 124 using the architectural program requires multiple MFNN instructions 1500. An initial MFNN instruction 1500 specifies a function 1532 that loads the buffer 1704 from a specified row of the weight RAM 124, and then one or more MFNN instructions 1500 specify a function 1532 that reads a specified data block of the buffer 1704 into a destination register, where the size of a data block is the number of bits of a media register 118 and the data blocks are naturally aligned within the buffer 1704. Other embodiments are contemplated in which the weight RAM 124 includes multiple buffers 1704 to further reduce contention between the NPUs 126 and the architectural program for access to the weight RAM 124, by increasing the number of buffer accesses the architectural program can make while the NPUs 126 are executing, which may increase the likelihood that the buffer accesses can be performed during clock cycles in which the NPUs 126 do not need to access the weight RAM 124.
Although fig. 16 depicts a dual port data RAM 122, other embodiments are contemplated in which the weight RAM 124 is also dual port. Further, while FIG. 17 depicts a buffer for weight RAM 124, other embodiments are contemplated in which data RAM 122 also has an associated buffer similar to buffer 1704.
Dynamically configurable NPU
Referring now to FIG. 18, a block diagram is shown illustrating the dynamically configurable NPU 126 of FIG. 1. The NPU 126 of FIG. 18 is similar in many respects to the NPU 126 of FIG. 2. However, the NPU 126 of FIG. 18 can be dynamically configured to operate in one of two different configurations. In a first configuration, the NPU 126 of FIG. 18 operates similarly to the NPU 126 of FIG. 2. That is, in the first configuration (referred to herein as the "wide" configuration or "single" configuration), the ALU 204 of the NPU 126 performs operations on a single wide data word and a single wide weight word (e.g., 16 bits) to produce a single wide result. In contrast, in the second configuration (referred to herein as the "narrow" configuration or "dual" configuration), the NPU 126 performs operations on two narrow data words and two corresponding narrow weight words (e.g., 8 bits) to produce two corresponding narrow results. In one embodiment, the configuration (wide or narrow) of the NPU 126 is set by the initialize NPU instruction (e.g., at address 0 of FIG. 20, described below). Alternatively, the configuration may be set by an MTNN instruction whose function 1432 specifies that the NPU 126 is to be configured to the given configuration (wide or narrow). Preferably, configuration registers are populated by the program memory 129 instruction or the MTNN instruction that determines the configuration (wide or narrow). For example, the outputs of the configuration registers are provided to the ALU 204, the AFU 212, and the logic that generates the multiplexed register control signal 213. Generally speaking, the elements of the NPU 126 of FIG. 18 perform functions similar to their like-numbered elements of FIG. 2, and reference should be made thereto for an understanding of FIG. 18. The embodiment of FIG. 18, including its differences from FIG. 2, will now be described.
The NPU 126 of FIG. 18 includes two registers 205A and 205B, two 3-input multiplexing registers 208A and 208B, an ALU 204, two accumulators 202A and 202B, and two AFUs 212A and 212B. Each of the registers 205A/205B has half the width (e.g., 8 bits) of the register 205 of FIG. 2. Each of the registers 205A/205B receives a respective narrow weight word 206A/206B (e.g., 8 bits) from the weight RAM 124 and provides its output 203A/203B to the operand selection logic 1898 of the ALU 204 in a subsequent clock cycle. When the NPU 126 is in the wide configuration, the registers 205A/205B effectively operate together to receive a wide weight word 206A/206B (e.g., 16 bits) from the weight RAM 124, in a manner similar to the register 205 of the embodiment of FIG. 2; and when the NPU 126 is in the narrow configuration, the registers 205A/205B effectively operate independently, each receiving a narrow weight word 206A/206B (e.g., 8 bits) from the weight RAM 124, such that the NPU 126 is effectively two separate narrow NPUs. Nevertheless, the same output bits of the weight RAM 124 are coupled to and provided to the registers 205A/205B regardless of the configuration of the NPU 126. For example, the register 205A of NPU 0 receives byte 0, the register 205B of NPU 0 receives byte 1, the register 205A of NPU 1 receives byte 2, the register 205B of NPU 1 receives byte 3, and so on, such that the register 205B of NPU 511 receives byte 1023.
Each of the multiplexing registers 208A/208B has half the width (e.g., 8 bits) of the registers 208 of FIG. 2, respectively. Multiplexer register 208A selects one of its inputs 207A, 211A, and 1811A for storage in its register and provision on output 209A in a subsequent clock cycle, and multiplexer register 208B selects one of its inputs 207B, 211B, and 1811B for storage in its register and provision on output 209B in a subsequent clock cycle to operand selection logic 1898. Input 207A receives a narrow data word (e.g., 8 bits) from data RAM 122, and input 207B receives a narrow data word from data RAM 122. Where the NPU 126 is of a wide configuration, in a manner similar to the multiplexing registers 208 of the embodiment of FIG. 2, the multiplexing registers 208A/208B actually operate together to receive wide data words 207A/207B (e.g., 16 bits) from the data RAM 122; with the NPU 126 in the narrow configuration, the multiplexing registers 208A/208B operate essentially independently, each receiving a narrow data word 207A/207B (e.g., 8 bits) from the data RAM 122, such that the NPU 126 is essentially two separate narrow NPUs. However, the same output bits of the data RAM 122 are coupled to and provided to the multiplexing registers 208A/208B, regardless of the configuration of the NPU 126. For example, multiplexing register 208A of NPU 0 receives byte 0, multiplexing register 208B of NPU 0 receives byte 1, multiplexing register 208A of NPU 1 receives byte 2, multiplexing register 208B of NPU 1 receives byte 3, and so on, multiplexing register 208B of NPU 511 receives byte 1023.
Input 211A receives the output 209A of the multiplexing register 208A of the adjacent NPU 126, and input 211B receives the output 209B of the multiplexing register 208B of the adjacent NPU 126. As shown, input 1811A receives the output 209B of the multiplexing register 208B of the adjacent NPU 126, and input 1811B receives the output 209A of the multiplexing register 208A of the instant NPU 126. Among the N NPUs 126 shown in FIG. 1, the NPU 126 shown in FIG. 18 is labeled NPU J; that is, NPU J is a representative instance of the N NPUs. Preferably, the input 211A of multiplexing register 208A of NPU J receives the output 209A of multiplexing register 208A of NPU 126 instance J-1, the input 1811A of multiplexing register 208A of NPU J receives the output 209B of multiplexing register 208B of NPU 126 instance J-1, and the output 209A of multiplexing register 208A of NPU J is provided to both the input 211A of multiplexing register 208A of NPU 126 instance J+1 and the input 1811B of multiplexing register 208B of NPU 126 instance J; and the input 211B of multiplexing register 208B of NPU J receives the output 209B of multiplexing register 208B of NPU 126 instance J-1, the input 1811B of multiplexing register 208B of NPU J receives the output 209A of multiplexing register 208A of NPU 126 instance J, and the output 209B of multiplexing register 208B of NPU J is provided to both the input 1811A of multiplexing register 208A of NPU 126 instance J+1 and the input 211B of multiplexing register 208B of NPU 126 instance J+1.
Control input 213 controls which of these three inputs is selected by multiplexing registers 208A/208B for storage in the respective register and subsequent provision on the respective output 209A/209B. In the event that the NPU 126 is instructed (e.g., by a multiply-accumulate instruction at address 1 of fig. 20 as described below) to load a row from the data RAM 122, the control inputs 213 control the respective multiplexing registers 208A/208B to select respective narrow data words 207A/207B (e.g., 8 bits) from corresponding narrow words of the selected row of the data RAM 122, whether the NPU 126 is in the wide or narrow configuration.
Where the NPU 126 is instructed (e.g., by a multiply-accumulate rotate instruction at address 2 of FIG. 20, as described below) to rotate the values of the previously received data row, and the NPU 126 is in the narrow configuration, the control input 213 controls each of the multiplexing registers 208A/208B to select its respective input 1811A/1811B. In this case, the multiplexing registers 208A/208B effectively operate independently, such that the NPU 126 is effectively two separate narrow NPUs. As such, the multiplexing registers 208A and 208B of the N NPUs 126 collectively operate as a rotator for 2N narrow words, as described in more detail below with respect to FIG. 19.
Where the NPU 126 is instructed to rotate the values of a previously received data row, the control input 213 controls each of the multiplexing registers 208A/208B to select the corresponding input 211A/211B if the NPU 126 is in the wide configuration. In this case, the multiplexing registers 208A/208B effectively operate together as if the NPU 126 were a single wide NPU 126. As such, the multiplexing registers 208A and 208B of the N NPUs 126 collectively operate as a rotator for N wide words, in a manner similar to that described with respect to fig. 3.
ALU 204 includes operand selection logic 1898, wide multiplier 242A, narrow multiplier 242B, wide 2-input multiplexer 1896A, narrow 2-input multiplexer 1896B, wide adder 244A, and narrow adder 244B. Effectively, ALU 204 comprises operand selection logic 1898, a wide ALU 204A (comprising wide multiplier 242A, wide multiplexer 1896A, and wide adder 244A), and a narrow ALU 204B (comprising narrow multiplier 242B, narrow multiplexer 1896B, and narrow adder 244B). Preferably, wide multiplier 242A multiplies two wide words and is similar to multiplier 242 of FIG. 2 (e.g., a 16-bit by 16-bit multiplier). Narrow multiplier 242B multiplies two narrow words (e.g., an 8-bit by 8-bit multiplier that produces a 16-bit result). When the NPU 126 is in a narrow configuration, the wide multiplier 242A, with the help of operand selection logic 1898, effectively acts as a narrow multiplier to multiply two narrow words, such that the NPU 126 effectively operates as two narrow NPUs. Preferably, wide adder 244A, which is similar to adder 244 of FIG. 2, adds the output of wide multiplexer 1896A to output 217A of wide accumulator 202A to produce sum 215A for provision to wide accumulator 202A. Narrow adder 244B adds the output of narrow multiplexer 1896B to output 217B of narrow accumulator 202B to produce sum 215B for provision to narrow accumulator 202B. In one embodiment, narrow accumulator 202B has a width of 28 bits to avoid loss of precision when accumulating up to 1024 16-bit products. When NPU 126 is in a wide configuration, narrow multiplier 242B, narrow multiplexer 1896B, narrow adder 244B, narrow accumulator 202B, and narrow AFU 212B are preferably inactive to reduce power consumption.
As described in more detail below, operand selection logic 1898 selects operands from 209A, 209B, 203A, and 203B to provide to the other elements of the ALU 204. Preferably, operand selection logic 1898 also performs other functions, such as sign extension of signed data words and weight words. For example, if the NPU 126 is in a narrow configuration, operand selection logic 1898 sign-extends the narrow data word and weight word to the width of a wide word before providing them to wide multiplier 242A. Similarly, if the ALU 204 is instructed to pass a narrow data/weight word through (skipping wide multiplier 242A via wide multiplexer 1896A), operand selection logic 1898 sign-extends the narrow data/weight word to the width of a wide word before providing it to wide adder 244A. Preferably, logic to perform this sign extension function is also present in the ALU 204 of the NPU 126 of FIG. 2.
Wide multiplexer 1896A receives the output of wide multiplier 242A and the operand from operand selection logic 1898 and selects one of these inputs to provide to wide adder 244A, and narrow multiplexer 1896B receives the output of narrow multiplier 242B and the operand from operand selection logic 1898 and selects one of these inputs to provide to narrow adder 244B.
The operands provided by operand selection logic 1898 depend on the configuration of the NPU 126 and on the arithmetic and/or logical operation to be performed by the ALU 204, based on the function specified by the instruction being executed by the NPU 126. For example, if the instruction instructs the ALU 204 to perform a multiply-accumulate and the NPU 126 is in a wide configuration, operand selection logic 1898 provides a wide word that is the concatenation of outputs 209A and 209B to one input of wide multiplier 242A and a wide word that is the concatenation of outputs 203A and 203B to the other input, while narrow multiplier 242B is inactive, such that the NPU 126 operates as a single wide NPU 126 similar to the NPU 126 of FIG. 2. Whereas if the instruction instructs the ALU 204 to perform a multiply-accumulate and the NPU 126 is in a narrow configuration, operand selection logic 1898 provides an extended, or widened, version of narrow data word 209A to one input of wide multiplier 242A and an extended version of narrow weight word 203A to the other input; additionally, operand selection logic 1898 provides narrow data word 209B to one input of narrow multiplier 242B and narrow weight word 203B to the other input. To extend, or widen, a narrow word, operand selection logic 1898 sign-extends the narrow word if it is signed, whereas if the narrow word is unsigned, operand selection logic 1898 pads the narrow word with zero-valued upper bits.
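To make the widening behavior concrete, the following is a minimal Python sketch of the sign-extension/zero-extension choice, assuming 8-bit narrow words and 16-bit wide words; the function name `widen` is illustrative only and is not part of the hardware description.

```python
def widen(narrow_word: int, signed: bool) -> int:
    """Widen an 8-bit narrow word to 16 bits: sign-extend if the word is signed,
    otherwise zero-extend (pad the upper bits with zero)."""
    narrow_word &= 0xFF                      # keep only the 8-bit narrow word
    if signed and (narrow_word & 0x80):      # negative signed value: replicate the sign bit
        return narrow_word | 0xFF00
    return narrow_word                       # non-negative or unsigned: upper bits are zero

# Example: the 8-bit pattern 0xFD widens to 0xFFFD when treated as signed (-3),
# and to 0x00FD when treated as unsigned (253).
assert widen(0xFD, signed=True) == 0xFFFD
assert widen(0xFD, signed=False) == 0x00FD
```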
As another example, if NPU 126 is in a wide configuration and the instruction instructs ALU 204 to perform an accumulation of weight words, wide multiplier 242A is skipped and operand selection logic 1898 provides a concatenation of outputs 203A and 203B to wide multiplexer 1896A for supply to wide adder 244A. Whereas if NPU 126 is in the narrow configuration and the instruction instructs ALU 204 to perform an accumulation of weight words, wide multiplier 242A is skipped and operand selection logic 1898 provides an expanded version of output 203A to wide multiplexer 1896A for provision to wide adder 244A; and narrow multiplier 242B is skipped and operand selection logic 1898 provides an expanded version of output 203B to narrow multiplexer 1896B for provision to narrow adder 244B.
As another example, if NPU 126 is in a wide configuration and the instruction instructs ALU 204 to perform accumulation of data words, wide multiplier 242A is skipped and operand selection logic 1898 provides the concatenation of outputs 209A and 209B to wide multiplexer 1896A for supply to wide adder 244A. Whereas if NPU 126 is in the narrow configuration and the instruction instructs ALU 204 to perform accumulation of data words, wide multiplier 242A is skipped and operand selection logic 1898 provides an expanded version of output 209A to wide multiplexer 1896A for provision to wide adder 244A; and narrow multiplier 242B is skipped and operand selection logic 1898 provides an expanded version of output 209B to narrow multiplexer 1896B for provision to narrow adder 244B. The accumulation of weights/data words may help to perform averaging operations for pooling layers for certain artificial neural network applications, such as image processing.
Preferably, the NPU 126 further comprises: a second wide multiplexer (not shown) for skipping wide adder 244A in order to load wide accumulator 202A with wide data/weight words in a wide configuration or with expanded narrow data/weight words in a narrow configuration; and a second narrow multiplexer (not shown) for skipping narrow adder 244B in order to load narrow accumulator 202B with narrow data/weight words in a narrow configuration. Preferably, ALU 204 also includes wide and narrow comparator/multiplexer combinations (not shown) that receive the respective accumulator values 217A/217B and the respective multiplexer 1896A/1896B outputs to select a maximum value between the accumulator values 217A/217B and the data/weight words 209A/B/203A/B, as described in more detail below, e.g., with respect to FIGS. 27 and 28, such operations being used in pooling layers for certain artificial neural network applications. Further, operand selection logic 1898 is configured to provide operands having a value of zero (for adding zeros or for clearing accumulators) and to provide operands having a value of one (for multiplying by one).
Narrow AFU 212B receives output 217B of narrow accumulator 202B and performs an activation function thereon to produce narrow result 133B, while wide AFU 212A receives output 217A of wide accumulator 202A and performs an activation function thereon to produce wide result 133A. When NPU 126 is in the narrow configuration, wide AFU 212A accordingly considers output 217A of wide accumulator 202A and performs an activation function thereon to produce a narrow result (e.g., 8 bits), as described in more detail below, e.g., with respect to fig. 29A-30.
From the above description it can be seen that, advantageously, a single NPU 126, when in a narrow configuration, effectively operates as two narrow NPUs, thereby providing, for smaller words, roughly twice the throughput of the wide configuration. For example, assume a neural network layer having 1024 neurons, each receiving 1024 narrow inputs (and having narrow weight words) from the previous layer, resulting in one million connections. Compared to an NNU 121 having 512 NPUs 126 in a wide configuration, an NNU 121 having 512 NPUs 126 in a narrow configuration is able to handle four times the number of connections (one million connections vs. 256K connections) in approximately twice the time (about 1026 clocks vs. 514 clocks), albeit for narrow words rather than wide words.
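The comparison can be checked with simple arithmetic; the short Python sketch below merely reproduces the connection and clock counts quoted above (the "+ 2" overhead clocks are an assumption made for illustration, not a statement about the hardware).

```python
# Wide configuration: 512 wide NPUs handle a 512-neuron layer with 512 inputs each.
wide_connections = 512 * 512            # 262,144 (256K) connections
wide_clocks = 512 + 2                   # ~514 clocks: one multiply-accumulate per input plus overhead

# Narrow configuration: the same 512 NPUs act as 1024 narrow NPUs and handle a
# 1024-neuron layer with 1024 inputs each.
narrow_connections = 1024 * 1024        # ~1M connections
narrow_clocks = 1024 + 2                # ~1026 clocks

print(narrow_connections / wide_connections)   # 4.0  -> four times the connections
print(narrow_clocks / wide_clocks)             # ~2.0 -> in roughly twice the time
# Net effect: roughly double the connection throughput for narrow (8-bit) words.
```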
In one embodiment, dynamically configurable NPU 126 of fig. 18 includes a 3-input multiplexing register similar to multiplexing registers 208A and 208B in place of registers 205A and 205B to implement a rotator for a row of weight words received from weight RAM 124 in a manner somewhat similar to that described for the embodiment of fig. 7 but in the dynamically configurable manner described for fig. 18.
Referring now to FIG. 19, a block diagram is shown illustrating an embodiment of the arrangement of the 2N multiplexing registers 208A/208B of the N NPUs 126 of NNU 121 of FIG. 1 according to the embodiment of FIG. 18, illustrating their operation as a rotator for a row of data words 207 received from data RAM 122 of FIG. 1. In the embodiment of FIG. 19, N is 512, such that NNU 121 has 1024 multiplexing registers 208A/208B, labeled 0 through 511, corresponding to the 512 NPUs 126 (effectively 1024 narrow NPUs), as shown. The two narrow NPUs within each NPU 126 are labeled A and B, and within each multiplexing register 208 the designation of the corresponding narrow NPU is shown. More specifically, multiplexing register 208A of NPU 126 instance 0 is designated 0-A, multiplexing register 208B of NPU 126 instance 0 is designated 0-B, multiplexing register 208A of NPU 126 instance 1 is designated 1-A, multiplexing register 208B of NPU 126 instance 1 is designated 1-B, multiplexing register 208A of NPU 126 instance 511 is designated 511-A, and multiplexing register 208B of NPU 126 instance 511 is designated 511-B; these designations also correspond to the narrow NPUs of FIG. 21 described below.
Each multiplexing register 208A receives its respective narrow data word 207A of one of the D rows of the data RAM 122, and each multiplexing register 208B receives its respective narrow data word 207B of one of the D rows of the data RAM 122. That is, multiplexing register 0-A receives narrow data word 0 of the data RAM 122 row, multiplexing register 0-B receives narrow data word 1, multiplexing register 1-A receives narrow data word 2, multiplexing register 1-B receives narrow data word 3, and so on, up to multiplexing register 511-A, which receives narrow data word 1022, and multiplexing register 511-B, which receives narrow data word 1023. Further, multiplexing register 1-A receives on its input 211A the output 209A of multiplexing register 0-A, multiplexing register 1-B receives on its input 211B the output 209B of multiplexing register 0-B, and so on, up to multiplexing register 511-A, which receives on its input 211A the output 209A of multiplexing register 510-A, and multiplexing register 511-B, which receives on its input 211B the output 209B of multiplexing register 510-B; and multiplexing register 0-A receives on its input 211A the output 209A of multiplexing register 511-A, and multiplexing register 0-B receives on its input 211B the output 209B of multiplexing register 511-B. Finally, multiplexing register 1-A receives on its input 1811A the output 209B of multiplexing register 0-B, multiplexing register 1-B receives on its input 1811B the output 209A of multiplexing register 1-A, and so on, up to multiplexing register 511-A, which receives on its input 1811A the output 209B of multiplexing register 510-B, and multiplexing register 511-B, which receives on its input 1811B the output 209A of multiplexing register 511-A; and multiplexing register 0-A receives on its input 1811A the output 209B of multiplexing register 511-B, and multiplexing register 0-B receives on its input 1811B the output 209A of multiplexing register 0-A. Each of the multiplexing registers 208A/208B receives a control input 213 that controls whether it selects the data word 207A/207B, the rotated input 211A/211B, or the rotated input 1811A/1811B. As described in more detail below, in one mode of operation, during a first clock cycle the control input 213 controls each of the multiplexing registers 208A/208B to select the data word 207A/207B for storage in the register and subsequent provision to the ALU 204; and during subsequent clock cycles (e.g., the M-1 clock cycles described above), the control input 213 controls each of the multiplexing registers 208A/208B to select the rotated input 1811A/1811B for storage in the register and subsequent provision to the ALU 204.
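The interleaved wiring just described means that the 1024 narrow registers form a single ring. A minimal Python sketch of one narrow-rotate step is shown below, under the simplifying assumption that each register is modeled as a dictionary entry keyed by (NPU index, A/B); it is a behavioral illustration only.

```python
N = 512  # NPUs; 2N = 1024 narrow multiplexing registers in total

def rotate_once(words):
    """One narrow-rotate step (inputs 1811A/1811B selected): register J.A loads the
    previous value of register (J-1).B, and register J.B loads the previous value of
    register J.A, with register 0.A wrapping around to register 511.B."""
    rotated = {}
    for j in range(N):
        rotated[(j, 'A')] = words[((j - 1) % N, 'B')]
        rotated[(j, 'B')] = words[(j, 'A')]
    return rotated

# Load a data RAM row: register J.A holds narrow word 2J, register J.B holds word 2J+1.
words = {(j, h): 2 * j + (0 if h == 'A' else 1) for j in range(N) for h in ('A', 'B')}

words = rotate_once(words)
# After one step NPU 0-A holds word 1023 and NPU 0-B holds word 0, matching FIG. 21 at clock 2.
assert words[(0, 'A')] == 1023 and words[(0, 'B')] == 0
```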
Referring now to fig. 20, a table is shown illustrating a program stored in program memory 129 of NNU 121 of fig. 1 and executed by that NNU 121, where the NNU 121 has NPUs 126 according to the embodiment of fig. 18. The example program of fig. 20 is similar in many respects to the program of fig. 4; the differences are explained below. The initialize NPU instruction at address 0 specifies that the NPUs 126 are to be in the narrow configuration. Further, as shown, the multiply-accumulate rotate instruction at address 2 specifies a count of 1023 and takes 1023 clock cycles. This is because the example of fig. 20 assumes a layer of effectively 1024 narrow (e.g., 8-bit) neurons (narrow NPUs), each having 1024 connection inputs from the 1024 neurons of the previous layer, for a total of 1024K connections. Each neuron receives an 8-bit data value from each connection input and multiplies that 8-bit data value by an appropriate 8-bit weight value.
Referring now to FIG. 21, a timing diagram is shown illustrating the NNU 121 executing the program of FIG. 20, where the NNU 121 includes the NPUs 126 of FIG. 18 operating in the narrow configuration. The timing diagram of FIG. 21 is similar in many respects to the timing diagram of FIG. 5; the differences are explained below.
In the timing diagram of FIG. 21, the NPUs 126 are in the narrow configuration because the initialize NPU instruction at address 0 initializes them to the narrow configuration. Thus, the 512 NPUs 126 effectively operate as 1024 narrow NPUs (or neurons), which are designated in the columns as NPU 0-A and NPU 0-B (the two narrow NPUs of NPU 126 instance 0), NPU 1-A and NPU 1-B (the two narrow NPUs of NPU 126 instance 1), …, NPU 511-A and NPU 511-B (the two narrow NPUs of NPU 126 instance 511). For simplicity and clarity of illustration, only the operations of narrow NPUs 0-A, 0-B, and 511-B are shown. The rows of the timing diagram of fig. 21 include up to clock cycle 1026 because the multiply-accumulate rotate instruction at address 2 specifies a count of 1023, which requires 1023 clock cycles.
At clock 0, each of the 1024 narrow NPUs executes the initialize NPU instruction of fig. 20, i.e., the initialization instruction that assigns a zero value to accumulator 202, as illustrated in fig. 5 for the corresponding instruction of fig. 4.
At clock 1, the 1024 narrow NPUs each execute the multiply-accumulate instruction at address 1 of fig. 20. As shown, narrow NPU 0-A accumulates the product of narrow word 0 of row 17 of data RAM 122 and narrow word 0 of row 0 of weight RAM 124 with the value of accumulator 202A (i.e., zero); narrow NPU 0-B accumulates the product of narrow word 1 of row 17 of data RAM 122 and narrow word 1 of row 0 of weight RAM 124 with the value of accumulator 202B (i.e., zero); by analogy, narrow NPU 511-B accumulates the product of narrow word 1023 for row 17 of data RAM 122 and narrow word 1023 for row 0 of weight RAM 124 with the value of accumulator 202B (i.e., zero).
At clock 2, the 1024 narrow NPUs each perform the first iteration of the multiply-accumulate rotate instruction at address 2 of fig. 20. As shown, narrow NPU 0-A accumulates the product of rotated narrow data word 1811A received from output 209B of multiplexing register 208B of narrow NPU 511-B (i.e., narrow data word 1023 received from data RAM 122) and narrow word 0 of row 1 of weight RAM 124 with value 217A of accumulator 202A; narrow NPU 0-B accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of narrow NPU 0-A (i.e., narrow data word 0 received from data RAM 122) and narrow word 1 of row 1 of weight RAM 124 with value 217B of accumulator 202B; and so on, up to narrow NPU 511-B, which accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of narrow NPU 511-A (i.e., narrow data word 1022 received from data RAM 122) and narrow word 1023 of row 1 of weight RAM 124 with value 217B of accumulator 202B.
At clock 3, the 1024 narrow NPUs each perform the second iteration of the multiply-accumulate rotate instruction at address 2 of fig. 20. As shown, narrow NPU 0-A accumulates the product of rotated narrow data word 1811A received from output 209B of multiplexing register 208B of narrow NPU 511-B (i.e., narrow data word 1022 received from data RAM 122) and narrow word 0 of row 2 of weight RAM 124 with value 217A of accumulator 202A; narrow NPU 0-B accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of narrow NPU 0-A (i.e., narrow data word 1023 received from data RAM 122) and narrow word 1 of row 2 of weight RAM 124 with value 217B of accumulator 202B; and so on, up to narrow NPU 511-B, which accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of narrow NPU 511-A (i.e., narrow data word 1021 received from data RAM 122) and narrow word 1023 of row 2 of weight RAM 124 with value 217B of accumulator 202B. This continues for each of the next 1021 clock cycles, as indicated by the ellipsis in FIG. 21, until clock 1024.
At clock 1024, the 1024 narrow NPUs each execute the 1023rd iteration of the multiply-accumulate rotate instruction at address 2 of fig. 20. As shown, narrow NPU 0-A accumulates the product of rotated narrow data word 1811A received from output 209B of multiplexing register 208B of narrow NPU 511-B (i.e., narrow data word 1 received from data RAM 122) and narrow word 0 of row 1023 of weight RAM 124 with value 217A of accumulator 202A; narrow NPU 0-B accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of NPU 0-A (i.e., narrow data word 2 received from data RAM 122) and narrow word 1 of row 1023 of weight RAM 124 with value 217B of accumulator 202B; and so on, up to narrow NPU 511-B, which accumulates the product of rotated narrow data word 1811B received from output 209A of multiplexing register 208A of NPU 511-A (i.e., narrow data word 0 received from data RAM 122) and narrow word 1023 of row 1023 of weight RAM 124 with value 217B of accumulator 202B.
At clock 1025, the AFU 212A/212B of each of the 1024 narrow NPUs executes the activation function instruction at address 3 of FIG. 20. Finally, at clock 1026, each of the 1024 narrow NPUs executes the write AFU output instruction at address 4 of FIG. 20 by writing its narrow result 133A/133B back to the corresponding narrow word of row 16 of data RAM 122, i.e., the narrow result 133A of NPU 0-A is written to narrow word 0 of data RAM 122, the narrow result 133B of NPU 0-B is written to narrow word 1 of data RAM 122, and so on, up to the narrow result 133B of NPU 511-B, which is written to narrow word 1023 of data RAM 122. The operations described above with respect to FIG. 21 are also shown in block diagram form in FIG. 22.
Referring now to FIG. 22, a block diagram is shown illustrating the NNU 121 of FIG. 1, where the NNU 121 includes the NPUs 126 of FIG. 18 to execute the program of FIG. 20. The NNU 121 includes the 512 NPUs 126, i.e., 1024 narrow NPUs, the data RAM 122 receiving its address input 123, and the weight RAM 124 receiving its address input 125. Although not shown, at clock 0 the 1024 narrow NPUs execute the initialization instruction of fig. 20. As shown, at clock 1, the 1024 8-bit data words of row 17 are read out of the data RAM 122 and provided to the 1024 narrow NPUs. At clocks 1 through 1024, the 1024 8-bit weight words of rows 0 through 1023, respectively, are read out of the weight RAM 124 and provided to the 1024 narrow NPUs. Although not shown, at clock 1 the 1024 narrow NPUs perform their respective multiply-accumulate operations on the loaded data words and weight words. At clocks 2 through 1024, the 1024 multiplexing registers 208A/208B of the narrow NPUs operate as a rotator of 1024 8-bit words to rotate the previously loaded data words of row 17 of the data RAM 122 to the adjacent narrow NPUs, and the narrow NPUs perform the multiply-accumulate operation on the respective rotated data word and the respective narrow weight word loaded from the weight RAM 124. Although not shown, at clock 1025 the 1024 narrow AFUs 212A/212B execute the activation instruction. At clock 1026, the 1024 narrow NPUs write their respective 1024 8-bit results 133A/133B back to row 16 of the data RAM 122.
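The overall data flow of FIGS. 20 through 22 can be summarized behaviorally. The Python sketch below assumes data_ram and weight_ram are plain lists of rows of 1024 narrow values and uses a pass-through placeholder for the activation function; it illustrates only the ordering of the load, rotations, multiply-accumulates, and final write, not the hardware itself.

```python
def run_narrow_mac_program(data_ram, weight_ram, activate=lambda x: x):
    """Behavioral model of the FIG. 20 program: load data RAM row 17, then for weight
    RAM rows 0..1023 multiply-accumulate and rotate the data words by one narrow
    position, apply the activation function, and write the 1024 results to row 16."""
    acc = [0] * 1024                      # accumulators 202A/202B of the 1024 narrow NPUs
    data = list(data_ram[17])             # multiply-accumulate at address 1: read row 17

    for row in range(1024):               # address 1 plus the 1023 multiply-accumulate rotates
        if row > 0:
            data = data[-1:] + data[:-1]  # rotate: each narrow NPU takes its neighbor's word
        weights = weight_ram[row]
        for npu in range(1024):
            acc[npu] += data[npu] * weights[npu]

    data_ram[16] = [activate(a) for a in acc]   # addresses 3 and 4: AFU, then write row 16
```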
It may be observed, for example, that the embodiment of fig. 18 may be advantageous over the embodiment of fig. 2 because it provides the programmer the flexibility to use wide data words and weight words (e.g., 16 bits) for computations where that degree of precision is needed by the particular application being modeled, and narrow data words and weight words (e.g., 8 bits) for computations where that lesser degree of precision suffices for the application. From one perspective, for applications with narrow data, the embodiment of FIG. 18 may provide twice the throughput at the cost of the additional narrow elements (e.g., multiplexing register 208B, register 205B, narrow ALU 204B, narrow accumulator 202B, narrow AFU 212B), which increase the area of the NPU 126 by about 50% compared to the embodiment of FIG. 2.
Three-mode NPU
Referring now to FIG. 23, a block diagram is shown illustrating the NPU 126 of FIG. 1 that is dynamically configurable, according to an alternative embodiment. The NPU 126 of fig. 23 is configurable not only in the wide and narrow configurations, but also in a third configuration, referred to herein as the "funnel" configuration. The NPU 126 of fig. 23 is similar in many respects to the NPU 126 of fig. 18. However, the wide adder 244A of fig. 18 is replaced in the NPU 126 of fig. 23 by a 3-input wide adder 2344A, which receives a third addend 2399 that is an extended version of the output of the narrow multiplexer 1896B. A program for operating an NNU 121 having the NPUs 126 of fig. 23 is similar in many respects to the program of fig. 20. However, the initialize NPU instruction at address 0 initializes the NPUs 126 to the funnel configuration rather than the narrow configuration. Further, the count of the multiply-accumulate rotate instruction at address 2 is 511 rather than 1023.
When in the funnel configuration and executing a multiply-accumulate instruction such as at address 1 of fig. 20, the NPU 126 operates similarly to the narrow configuration in the following respects: it receives two narrow data words 207A/207B and two narrow weight words 206A/206B; wide multiplier 242A multiplies data word 209A and weight word 203A to produce product 246A, which wide multiplexer 1896A selects; and narrow multiplier 242B multiplies data word 209B and weight word 203B to produce product 246B, which narrow multiplexer 1896B selects. However, the wide adder 2344A adds both product 246A (selected by wide multiplexer 1896A) and product 246B/2399 (selected by narrow multiplexer 1896B) to value 217A of wide accumulator 202A, while narrow adder 244B and narrow accumulator 202B are inactive. Furthermore, when in the funnel configuration and executing a multiply-accumulate rotate instruction such as at address 2 of fig. 20, the control input 213 causes the multiplexing registers 208A/208B to rotate by two narrow words (e.g., 16 bits), that is, the multiplexing registers 208A/208B select their respective inputs 211A/211B, just as in the wide configuration. However, wide multiplier 242A multiplies data word 209A and weight word 203A to produce product 246A selected by wide multiplexer 1896A; narrow multiplier 242B multiplies data word 209B and weight word 203B to produce product 246B selected by narrow multiplexer 1896B; and wide adder 2344A adds both product 246A (selected by wide multiplexer 1896A) and product 246B/2399 (selected by narrow multiplexer 1896B) to value 217A of wide accumulator 202A, while narrow adder 244B and narrow accumulator 202B are inactive, as described above. Finally, when in the funnel configuration and executing an activation function instruction such as at address 3 of fig. 20, wide AFU 212A performs the activation function on the resulting sum 215A to produce narrow result 133A, while narrow AFU 212B is inactive. Hence, only the A narrow NPUs produce a narrow result 133A, and the narrow results 133B produced by the B narrow NPUs are invalid. Consequently, the row of results written back (e.g., row 16, as specified by the instruction at address 4 of FIG. 20) contains holes, since only the narrow results 133A are valid and the narrow results 133B are invalid. Thus, conceptually, each neuron (NPU 126 of fig. 23) processes two connection data inputs per clock cycle, i.e., it multiplies two narrow data words by their respective weights and accumulates the two products, in contrast to the embodiments of fig. 2 and fig. 18, in which each neuron processes a single connection data input per clock cycle.
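Below is a behavioral sketch of the funnel configuration, assuming 512 NPUs and Python lists of 1024 narrow data and weight words; only the per-NPU wide accumulation and the two-narrow-word rotation are modeled, and the function names are illustrative.

```python
def funnel_step(data, weights, wide_acc):
    """One funnel-configuration multiply-accumulate across 512 NPUs: wide adder 2344A
    adds both narrow products (246A and 246B) to wide accumulator 202A; the narrow
    accumulator is unused, so only one result per NPU is meaningful."""
    for j in range(512):
        a, b = 2 * j, 2 * j + 1                       # the NPU's two narrow word positions
        wide_acc[j] += data[a] * weights[a] + data[b] * weights[b]
    return wide_acc

def funnel_rotate(data):
    """A multiply-accumulate rotate in funnel mode rotates by two narrow words (16 bits),
    so each NPU receives both narrow words previously held by its neighbor."""
    return data[-2:] + data[:-2]
```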
It may be observed with respect to the embodiment of fig. 23 that the number of result words (neuron outputs) produced and written back to the data RAM 122 or weight RAM 124 is half the square root of the number of data inputs (connections) received, and that the written-back result rows have holes, i.e., every other narrow word result is invalid; more specifically, the results of the B narrow NPUs are not meaningful. Thus, the embodiment of fig. 23 is particularly efficient for neural networks having two successive layers in which, for example, the first layer has twice as many neurons as the second layer (e.g., a first layer of 1024 neurons fully connected to a second layer of 512 neurons). Furthermore, the other execution units of the processor (e.g., media units, such as x86 AVX units) may, if necessary, perform pack operations on a scattered (i.e., holey) result row to make it compact (i.e., without holes) for use in subsequent computations, while the NNU 121 performs other computations associated with other rows of the data RAM 122 and/or weight RAM 124.
Hybrid NNU operation: convolution capability and pooling capability
An advantage of the NNU 121 according to the embodiments described herein is that the NNU 121 is capable of concurrently operating in a manner similar to a coprocessor, in the sense that it executes its own internal program, and in a manner similar to an execution unit of a processor, in the sense that it executes architectural instructions issued to it (or microinstructions translated from architectural instructions). The architectural instructions are of an architectural program executed by the processor that includes the NNU 121. In this way the NNU 121 operates in a hybrid fashion, which is advantageous because it provides the ability to sustain high utilization of the NNU 121. For example, figs. 24 to 26 illustrate an operation in which the NNU 121 performs a convolution operation, in which the utilization of the NNU is high, and figs. 27 to 28 illustrate an operation in which the NNU 121 performs a pooling operation; convolution and pooling operations are needed by convolution layers, pooling layers, and other digital data computing applications, such as image processing (e.g., edge detection, sharpening, blurring, recognition/classification). However, the hybrid operation of the NNU 121 is not limited to performing convolution or pooling operations; the hybrid feature may also be used to perform other operations, such as the classic neural network multiply-accumulate and activation function operations described above with respect to figs. 4 to 13. That is, the processor 100 (more specifically, the reservation station 108) issues MTNN instructions 1400 and MFNN instructions 1500 to the NNU 121, in response to which data is written to the memory 122/124/129 and results are read from the memory 122/124 that the NNU 121 has written, while at the same time the NNU 121 reads and writes the memory 122/124/129 in response to executing the program written to the program memory 129 by the processor 100 (via MTNN instructions 1400).
Referring now to FIG. 24, a block diagram is shown illustrating an example of data structures used by the NNU 121 of FIG. 1 to perform a convolution operation. The block diagram includes a convolution kernel 2402, a data array 2404, and the data RAM 122 and weight RAM 124 of fig. 1. Preferably, the data array 2404 (e.g., of image pixels) is held in system memory (not shown) attached to the processor 100 and is loaded into the weight RAM 124 of the NNU 121 by the processor 100 executing MTNN instructions 1400. A convolution operation is an operation that convolves a first matrix with a second matrix, the second matrix being referred to herein as the convolution kernel. As used in the context of the present invention, a convolution kernel is a matrix of coefficients, which may also be referred to as weights, parameters, elements, or values. Preferably, the convolution kernel 2402 is static data of the architectural program being executed by the processor 100.
The data array 2404 is a two-dimensional array of data values, and each data value (e.g., an image pixel value) is the size of a word of the data RAM 122 or weight RAM 124 (e.g., 16 bits or 8 bits). In this example, the data values are 16-bit words and the NNU 121 is configured as 512 wide-configuration NPUs 126. Further, in an embodiment, the NPUs 126 include multiplexing registers (such as multiplexing register 705 of fig. 7) for receiving the weight words 206 from the weight RAM 124, in order to perform a collective rotator operation on a row of data values received from the weight RAM 124, as described in more detail below. In this example, the data array 2404 is a pixel array of 2560 columns by 1600 rows. As shown, when the architectural program convolves the data array 2404 with the convolution kernel 2402, it divides the data array 2404 into 20 data blocks, each data block being a 512 × 400 data matrix 2406.
In the example, the convolution kernel 2402 is a 3 × 3 matrix of coefficients, weights, parameters, or elements. The first row of coefficients is labeled C0,0, C0,1, and C0,2; the second row of coefficients is labeled C1,0, C1,1, and C1,2; and the third row of coefficients is labeled C2,0, C2,1, and C2,2. For example, a convolution kernel that may be used to perform edge detection has the following coefficients: 0, 1, 0, 1, -4, 1, 0, 1, 0. As another example, a convolution kernel that may be used to Gaussian-blur an image has the following coefficients: 1, 2, 1, 2, 4, 2, 1, 2, 1. In this case, a divide is typically performed on the final accumulated value, where the divisor is the sum of the absolute values of the elements of the convolution kernel 2402 (16 in this example). As another example, the divisor is the number of elements of the convolution kernel 2402. As a further example, the divisor is a value that compresses the convolution results back into a desired range of values, and is determined from the values of the elements of the convolution kernel 2402, the desired range, and the range of the input values of the matrix being convolved.
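For illustration, the following Python/NumPy sketch applies a 3 × 3 kernel with an optional divisor in the manner described above (edge pixels are simply skipped); it is a software reference for the arithmetic, not a model of the NNU hardware, and the names used are assumptions.

```python
import numpy as np

EDGE_DETECT   = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]])
GAUSSIAN_BLUR = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]])   # divisor 16 = sum of |elements|

def convolve3x3(image, kernel, divisor=1):
    """Sum the element-wise products of each 3x3 neighborhood with `kernel` and divide
    the final accumulated value by `divisor`."""
    rows, cols = image.shape
    out = np.zeros((rows - 2, cols - 2), dtype=np.int64)
    for r in range(rows - 2):
        for c in range(cols - 2):
            out[r, c] = (image[r:r+3, c:c+3] * kernel).sum() // divisor
    return out

# Usage: blur a small random 8-bit image with the Gaussian kernel and its divisor of 16.
blurred = convolve3x3(np.random.randint(0, 256, (16, 16)), GAUSSIAN_BLUR, divisor=16)
```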
As shown in fig. 24 and described in more detail with respect to fig. 25, the architectural program writes the coefficients of the convolution kernel 2402 to the data RAM 122. Preferably, every word of each of nine (the number of elements of the convolution kernel 2402) consecutive rows of the data RAM 122 is written with a different element of the convolution kernel 2402, in row-major order. That is, as shown, each word of one row is written with the first coefficient C0,0; the next row is written with the second coefficient C0,1; the next row is written with the third coefficient C0,2; the next row is written with the fourth coefficient C1,0; and so on, such that each word of the ninth row is written with the ninth coefficient C2,2. To convolve a data matrix 2406 of a data block of the data array 2404, the NPUs 126 repeatedly read, in order, the nine rows of the data RAM 122 holding the coefficients of the convolution kernel 2402, as described in more detail below, particularly with respect to fig. 26.
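A small sketch of this layout follows, assuming 512 NPUs so that each data RAM row holds 512 copies of one coefficient; the function name is illustrative only.

```python
def layout_kernel_rows(kernel_3x3, row_width=512):
    """Return the nine data RAM rows holding the convolution kernel: row i is filled
    entirely with the i-th coefficient in row-major order (C0,0 then C0,1, ..., C2,2),
    so every NPU reads the same coefficient from a given row."""
    coefficients = [c for row in kernel_3x3 for c in row]   # row-major flatten
    return [[c] * row_width for c in coefficients]

rows = layout_kernel_rows([[0, 1, 0], [1, -4, 1], [0, 1, 0]])
assert len(rows) == 9 and rows[4][0] == -4   # the fifth row is filled with C1,1
```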
As shown in fig. 24 and described in more detail with respect to fig. 25, the architectural program writes the values of a data matrix 2406 to the weight RAM 124. When the NNU program performs the convolution, it writes the result matrix back to the weight RAM 124. Preferably, as described in more detail below with respect to fig. 25, the architectural program writes a first data matrix 2406 to the weight RAM 124 and starts the NNU 121, and while the NNU 121 is convolving the first data matrix 2406 with the convolution kernel 2402, the architectural program writes a second data matrix 2406 to the weight RAM 124, so that the NNU 121 can begin the convolution of the second data matrix 2406 as soon as it has completed the convolution of the first data matrix 2406. In this way the architectural program alternates between two regions of the weight RAM 124 in order to keep the NNU 121 fully utilized. Thus, the example of fig. 24 shows a first data matrix 2406A, corresponding to a first data block occupying rows 0 to 399 of the weight RAM 124, and a second data matrix 2406B, corresponding to a second data block occupying rows 500 to 899 of the weight RAM 124. Furthermore, as shown, the NNU 121 writes the convolution results back to rows 900 to 1299 and rows 1300 to 1699 of the weight RAM 124, and the architectural program subsequently reads these results out of the weight RAM 124. The data values of the data matrix 2406 held in the weight RAM 124 are labeled "Dx,y", where "x" is the weight RAM 124 row number and "y" is the word, or column, number of the weight RAM 124. Thus, for example, the data word 511 in row 399, which is received by the multiplexing register 705 of NPU 511, is labeled D399,511 in fig. 24.
Referring now to FIG. 25, a flowchart is shown illustrating the operation of the processor 100 of FIG. 1 as it executes an architectural program that uses the NNU 121 to perform a convolution of the convolution kernel 2402 with the data array 2404 of FIG. 24. Flow begins at block 2502.
At block 2502, the processor 100 (i.e., the architectural program running on the processor 100) writes the convolution kernel 2402 of fig. 24 to the data RAM 122 in the manner shown and described with respect to fig. 24. In addition, the architectural program initializes a variable N to a value of 1. The variable N denotes the current data block of the data array 2404 being processed by the NNU 121. In addition, the architectural program initializes a variable NUM_CHUNKS to a value of 20. Flow proceeds to block 2504.
At block 2504, the processor 100 writes the data matrix 2406 of data block 1 to the weight RAM 124 (e.g., data matrix 2406A of data block 1), as shown in fig. 24. Flow proceeds to block 2506.
At block 2506, the processor 100 writes the convolution program to the program memory 129 of the NNU 121 using the MTNN instruction 1400 that specifies the function 1432 to write to the program memory 129. The processor 100 then initiates an NNU convolution procedure using the MTNN instruction 1400 that specifies the function 1432 that initiates execution of the procedure. An example of an NNU convolution procedure is described in more detail below with respect to fig. 26A. Flow proceeds to decision block 2508.
At decision block 2508, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to block 2512; otherwise, flow proceeds to block 2514.
At block 2512, the processor 100 writes the data matrix 2406 for data block N+1 to the weight RAM 124 (e.g., data matrix 2406B of data block 2), as shown in FIG. 24. Thus, advantageously, while the NNU 121 is performing the convolution on the current data block, the architectural program writes the data matrix 2406 of the next data block to the weight RAM 124, so that the NNU 121 can immediately begin performing the convolution on the next data block once the convolution of the current data block is complete (i.e., written to the weight RAM 124). Flow proceeds to block 2514.
At block 2514, the processor 100 determines that the currently running NNU program (beginning at block 2506 in the case of data block 1 and beginning at block 2518 in the case of data blocks 2-20) has completed. Preferably, processor 100 makes this determination by executing MFNN instruction 1500 to read status register 127 of NNU 121. In an alternative embodiment, NNU 121 generates an interrupt, thereby indicating that it has completed the convolution procedure. Flow proceeds to decision block 2516.
At decision block 2516, the architectural program determines whether the value of the variable N is less than NUM_CHUNKS. If so, flow proceeds to block 2518; otherwise, flow proceeds to block 2522.
At block 2518, the processor 100 updates the convolution program so that it can convolve data block N+1. More specifically, the processor 100 updates the weight RAM 124 row value of the initialize NPU instruction at address 0 to the first row of the data matrix 2406 (e.g., to row 0 for data matrix 2406A or row 500 for data matrix 2406B), and also updates the output row (e.g., to row 900 or 1300). The processor 100 then starts the updated NNU convolution program. Flow proceeds to block 2522.
At block 2522, processor 100 reads the results of the NNU convolution procedure for data block N from weight RAM 124. Flow proceeds to decision block 2524.
At decision block 2524, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to block 2526; otherwise, flow ends.
At block 2526, the architectural program increments N by one. Flow returns to decision block 2508.
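The flow of blocks 2502 through 2526 amounts to a double-buffered driver loop. The Python sketch below captures that structure; the helper functions (write_data_ram, write_weight_ram, start_nnu_program, wait_until_done, read_results) are hypothetical placeholders standing in for the MTNN/MFNN instruction sequences described above, and the region bookkeeping is an assumption made for illustration.

```python
NUM_CHUNKS = 20

def convolve_data_array(chunks, kernel):
    """Driver loop of FIG. 25: write the kernel once, then ping-pong data blocks into
    weight RAM so the next block is loaded while the NNU convolves the current one.
    All helper functions are placeholders for the architectural-instruction sequences."""
    write_data_ram(kernel)                       # block 2502: kernel into the data RAM
    write_weight_ram(chunks[0], region=0)        # block 2504: data block 1
    start_nnu_program(region=0)                  # block 2506: write and start the NNU program

    results = []
    for n in range(1, NUM_CHUNKS + 1):
        if n < NUM_CHUNKS:                       # block 2512: preload data block N+1
            write_weight_ram(chunks[n], region=n % 2)
        wait_until_done()                        # block 2514: poll status register 127 via MFNN
        if n < NUM_CHUNKS:                       # block 2518: update rows and restart for N+1
            start_nnu_program(region=n % 2)
        results.append(read_results(n))          # block 2522: read data block N's results
    return results
```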
Referring now to FIG. 26A, a program listing of an NNU program that performs the convolution of a data matrix 2406 with the convolution kernel 2402 of FIG. 24 and writes the results back to the weight RAM 124 is shown. The program loops a number of times through a loop body of instructions at addresses 1 through 9. The initialize NPU instruction at address 0 specifies the number of times each NPU 126 executes the loop body, which in the example of fig. 26A has a loop count value of 400, corresponding to the number of rows in a data matrix 2406 of fig. 24, and the loop instruction at the end of the loop (at address 10) decrements the current loop count value and, if the result is non-zero, causes control to return to the top of the loop body (i.e., to the instruction at address 1). The initialize NPU instruction also clears the accumulator 202 to zero. Preferably, the loop instruction at address 10 also clears the accumulator 202 to zero. Alternatively, as described above, the multiply-accumulate instruction at address 1 may specify clearing the accumulator 202 to zero.
For each execution of the loop body of the program, the 512 NPUs 126 concurrently perform 512 convolutions of the 3 × 3 convolution kernel 2402 with 512 respective 3 × 3 sub-matrices of the data matrix 2406. The convolution is the sum of the nine products of an element of the convolution kernel 2402 and the corresponding element of the respective sub-matrix. In the embodiment of fig. 26A, the origin (central element) of each of the 512 respective 3 × 3 sub-matrices is the data word Dx+1,y+1 of fig. 24, where y (the column number) is the NPU 126 number and x (the row number) is the current weight RAM 124 row number read by the multiply-accumulate instruction at address 1 of the program of fig. 26A (the row number is initialized by the initialize NPU instruction at address 0, incremented at each of the multiply-accumulate instructions at addresses 3 and 5, and updated by the decrement instruction at address 9). Thus, for each loop of the program, the 512 NPUs 126 compute 512 convolutions and write the 512 convolution results back to a specified row of the weight RAM 124. In this description, edge handling is ignored for simplicity, although it should be noted that using the collective rotation feature of the NPUs 126 will cause two of the columns to wrap from one vertical edge of the data matrix 2406 (e.g., of the image, in the case of image processing) to the other vertical edge (e.g., from the left edge to the right edge, or vice versa). The loop body will now be described.
The instruction at address 1 is a multiply-accumulate instruction that specifies row 0 of the data RAM 122 and implicitly uses the current weight RAM 124 row, which is preferably held in the sequencer 128 (and is initialized to zero by the instruction at address 0 for the first pass through the loop body). That is, the instruction at address 1 causes each NPU 126 to read its respective word from row 0 of the data RAM 122, read its respective word from the current weight RAM 124 row, and perform a multiply-accumulate operation on the two words. Thus, for example, NPU 5 multiplies C0,0 and Dx,5 (where "x" is the current weight RAM 124 row), adds the result to the value 217 of the accumulator 202, and writes the sum back to the accumulator 202.
The instruction at address 2 is a multiply-accumulate instruction that specifies incrementing the data RAM 122 row (i.e., to row 1) and then reading the row at the incremented address of the data RAM 122. The instruction also specifies rotating the value in the multiplexing register 705 of each NPU 126 to the adjacent NPU 126, which in this case is the row of data matrix 2406 values just read from the weight RAM 124 in response to the instruction at address 1. In the embodiments of figs. 24 to 26, the NPUs 126 are configured to rotate the values of the multiplexing registers 705 to the left, i.e., from NPU J to NPU J-1, rather than from NPU J to NPU J+1 as described above with respect to figs. 3, 7, and 19. It should be appreciated that, in an embodiment in which the NPUs 126 are configured to rotate to the right, the architectural program may write the coefficient values of the convolution kernel 2402 to the data RAM 122 in a different order (e.g., rotated about its center column) to accomplish a similar convolution result. Further, the architectural program may perform additional pre-processing of the convolution kernel 2402 (e.g., transposition) as needed. Further, the instruction specifies a count value of 2. Thus, the instruction at address 2 causes each NPU 126 to read its respective word from row 1 of the data RAM 122, receive the rotated word into the multiplexing register 705, and perform a multiply-accumulate operation on the two words. Because the count value is 2, the instruction also causes each NPU 126 to repeat the foregoing operation. That is, the sequencer 128 increments the data RAM 122 row address 123 (i.e., to row 2), and each NPU 126 reads its respective word from row 2 of the data RAM 122, receives the rotated word into the multiplexing register 705, and performs a multiply-accumulate operation on the two words. Thus, for example, assuming the current weight RAM 124 row is 27, after executing the instruction at address 2, NPU 5 will have accumulated into its accumulator 202 the product of C0,1 and D27,6 and the product of C0,2 and D27,7. Thus, after the completion of the instructions at addresses 1 and 2, the product of C0,0 and D27,5, the product of C0,1 and D27,6, and the product of C0,2 and D27,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body.
The instructions at addresses 3 and 4 perform a similar operation to the instructions at addresses 1 and 2, although, by virtue of the weight RAM 124 row increment indicator, they operate on the next row of the weight RAM 124 and on the next three rows of the data RAM 122 (i.e., rows 3 through 5). That is, for example, for NPU 5, after the completion of the instructions at addresses 1 through 4, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, and the product of C1,2 and D28,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body.
The instructions at addresses 5 and 6 perform a similar operation to the instructions at addresses 3 and 4, although they operate on the next row of the weight RAM 124 and on the next three rows of the data RAM 122 (i.e., rows 6 through 8). That is, for example, for NPU 5, after the completion of the instructions at addresses 1 through 6, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, the product of C1,2 and D28,7, the product of C2,0 and D29,5, the product of C2,1 and D29,6, and the product of C2,2 and D29,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body. That is, after the completion of the instructions at addresses 1 through 6, and assuming the weight RAM 124 row at the beginning of the loop body was 27, NPU 5, for example, will have convolved the convolution kernel 2402 with the following 3 × 3 sub-matrix:
D27,5  D27,6  D27,7
D28,5  D28,6  D28,7
D29,5  D29,6  D29,7
More generally, after the completion of the instructions at addresses 1 through 6, each of the 512 NPUs 126 will have convolved the convolution kernel 2402 with the following 3 × 3 sub-matrix:
Dr,n      Dr,n+1      Dr,n+2
Dr+1,n    Dr+1,n+1    Dr+1,n+2
Dr+2,n    Dr+2,n+1    Dr+2,n+2
where r is the row address value of the weight RAM 124 at the beginning of the loop body, and n is the number of the NPU 126.
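In software terms, the value each NPU accumulates over one pass of the loop body can be written as the short Python sketch below, with column indices wrapping to model the rotator's wraparound at the vertical edges noted above; the function name and parameters are illustrative.

```python
def npu_convolution(D, C, r, n, num_npus=512):
    """Value accumulated by NPU n during one pass of the FIG. 26A loop body: the 3x3
    sub-matrix of data rows r..r+2 and columns n..n+2 (columns wrapping modulo the
    NPU count, matching the rotator) convolved with kernel C."""
    total = 0
    for i in range(3):
        for j in range(3):
            total += C[i][j] * D[r + i][(n + j) % num_npus]
    return total
```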
The instruction at address 7 passes the value 217 of the accumulator 202 through the AFU 212. The pass-through function passes a word whose size in bits (16 bits in this example) is the size of the words read from the data RAM 122 and the weight RAM 124. Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below. Alternatively, rather than specifying a pass-through activation function, a divide activation function may be specified, which divides the value 217 of the accumulator 202 by a divisor, such as described herein with respect to figs. 29A and 30, e.g., using one of the "dividers" 3014/3016 of fig. 30. For example, in the case of a convolution kernel 2402 having a coefficient such as the one-sixteenth coefficient of the Gaussian blur kernel described above, the activation function instruction at address 7 may specify a divide activation function (e.g., divide by 16) rather than a pass-through function. Alternatively, the architectural program may perform the divide-by-16 on the coefficients of the convolution kernel 2402 before writing them to the data RAM 122 and adjust the location of the binary point for the convolution kernel 2402 values accordingly, e.g., using the data binary point 2922 of fig. 29A described below.
The instruction at address 8 writes the output of AFU 212 to the row in weight RAM 124 specified by the current value of the output row register initialized by the instruction at address 0 and incremented on each pass through the loop by means of an increment indicator within the instruction.
As can be determined from the example of fig. 24-26 with the 3 x 3 convolution kernel 2402, the NPU 126 reads the weight RAM 124 about every three clock cycles to read the rows of the data matrix 2406, and writes the convolution result matrix to the weight RAM 124 about every 12 clock cycles. Further, assuming an embodiment including write and read buffers such as buffer 1704 of fig. 17, processor 100 reads and writes weight RAM 124 in parallel with NPU 126 reading and writing, such that buffer 1704 performs one write and one read to weight RAM 124 approximately every 16 clock cycles to write data matrix 2406 and read convolution results matrix, respectively. Thus, approximately half of the bandwidth of weight RAM 124 is consumed by the hybrid approach used by NNU 121 to perform convolution kernel operations. Although the present example contains a 3 × 3 convolution kernel 2402, other sizes of convolution kernels may be employed, such as 2 × 2, 4 × 4, 5 × 5, 6 × 6, 7 × 7, 8 × 8, etc. matrices, in which case the NNU procedure will change. In the case of a large convolution kernel, because the count of the round-robin version of the multiply-accumulate instruction is large (e.g., at addresses 2, 4, and 6 of the program of FIG. 26A, and the additional instructions needed for the large convolution kernel), the percentage of time that NPU 126 reads weight RAM 124 is small, and thus the percentage of bandwidth of weight RAM 124 that is consumed is also small.
Alternatively, rather than writing the convolution results back to different rows of the weight RAM 124 (e.g., rows 900 to 1299 and 1300 to 1699), the architectural program configures the NNU program to overwrite rows of the input data matrix 2406 once those rows are no longer needed. For example, in the case of the 3 × 3 convolution kernel, rather than writing the data matrix 2406 to rows 0 to 399 of the weight RAM 124, the architectural program writes the data matrix 2406 to rows 2 to 401, and the NNU program is configured to write the convolution results to the weight RAM 124 beginning at row 0 and incrementing the row on each pass through the loop body. In this way, the NNU program only overwrites rows that are no longer needed. For example, after the first pass through the loop body (or, more precisely, after the execution of the instruction at address 1, which loads row 0 of the weight RAM 124), the data of row 0 may be overwritten, whereas the data of rows 1 to 3 are needed for the second pass through the loop body and therefore are not overwritten by the first pass; similarly, after the second pass through the loop body, the data of row 1 may be overwritten, whereas the data of rows 2 to 4 are needed for the third pass through the loop body and are not overwritten by the second pass; and so on. In such an embodiment, the height of each data matrix 2406 (data block) may be larger (e.g., 800 rows), resulting in fewer data blocks.
Alternatively, rather than writing the convolution results back to the weight RAM 124, the architectural program configures the NNU program to write the convolution results back to rows of the data RAM 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program reads the results from the data RAM 122 as the NNU 121 writes them (e.g., using the address of the most recently written data RAM 122 row 2606 of fig. 26B, described below). This alternative may be advantageous in an embodiment in which the weight RAM 124 is single-ported and the data RAM 122 is dual-ported.
From the operation of the NNU 121 according to the embodiment of FIGS. 24-26A, it may be observed that each execution of the program of FIG. 26A takes approximately 5000 clock cycles, and consequently the convolution of the entire 2560 × 1600 data array 2404 of FIG. 24 takes approximately 100,000 clock cycles, considerably fewer than the number of clock cycles required to perform a similar task by conventional methods.
Referring now to FIG. 26B, a block diagram is shown illustrating certain fields of the status register 127 of the NNU 121 of FIG. 1, according to one embodiment. The status register 127 includes: a field 2602 indicating the address of the weight RAM 124 row most recently written by the NPUs 126; a field 2606 indicating the address of the data RAM 122 row most recently written by the NPUs 126; a field 2604 indicating the address of the weight RAM 124 row most recently read by the NPUs 126; and a field 2608 indicating the address of the data RAM 122 row most recently read by the NPUs 126. This enables the architectural program executing on the processor 100 to determine the progress of the NNU 121 as it reads and/or writes the data RAM 122 and/or the weight RAM 124. With this capability, together with the option of overwriting the input data matrix as described above (or of writing the results to the data RAM 122 as described above), the data array 2404 of FIG. 24 may be processed as 5 data blocks of 512 × 1600 rather than 20 data blocks of 512 × 400, for example, as follows. The processor 100 writes the first 512 × 1600 data block into the weight RAM 124 beginning at row 2 and starts the NNU program (which has a loop count of 1600 and an initialized weight RAM 124 output row of 0). As the NNU 121 executes the NNU program, the processor 100 monitors the location/address of the weight RAM 124 output in order to (1) read (using MFNN instructions 1500) the rows of the weight RAM 124 that have valid convolution results written by the NNU 121 (beginning at row 0), and (2) write the second 512 × 1600 data matrix 2406 (beginning at row 2) over the valid convolution results once they have already been read, so that when the NNU 121 completes the NNU program on the first 512 × 1600 data block, the processor 100 can immediately update the NNU program as needed and start it again to process the second 512 × 1600 data block. This process is repeated three more times for the remaining three 512 × 1600 data blocks to achieve high utilization of the NNU 121.
Advantageously, in one embodiment, the AFU 212 has the ability to efficiently perform an effective division of the value 217 of the accumulator 202, as described in more detail below with respect to figs. 29A, 29B, and 30. For example, an activation function NNU instruction that divides the value 217 of the accumulator 202 by 16 may be used for the Gaussian blur matrix described above.
Although the convolution kernel 2402 used in the example of fig. 24 is a small static convolution kernel applied to the entire data array 2404, in other embodiments the convolution kernel may be a large matrix having unique weights associated with the different data values of the data array 2404, such as is common in convolutional neural networks. When the NNU 121 is used in this manner, the architectural program may swap the locations of the data matrix and the convolution kernel, i.e., place the data matrix in the data RAM 122 and the convolution kernel in the weight RAM 124, and the number of rows that can be processed by a given execution of the NNU program may be relatively smaller.
Referring now to FIG. 27, a block diagram is shown illustrating an example of the weight RAM 124 of FIG. 1 populated with input data on which the NNU 121 of FIG. 1 performs a pooling operation. The pooling operation, performed by a pooling layer of an artificial neural network, reduces the dimensionality of a matrix of input data (e.g., an image or a convolved image) by taking sub-regions, or sub-matrices, of the input matrix and computing the maximum or average value of each sub-matrix; these maximum or average values become the result, or pooled, matrix. In the example of figs. 27 and 28, the pooling operation computes the maximum value of each sub-matrix. Pooling operations are particularly useful in artificial neural networks that perform object classification or detection, for example. Generally, a pooling operation effectively reduces the size of the input matrix by a factor of the number of elements in the examined sub-matrix, and in particular reduces the input matrix in each dimension by the number of elements in the corresponding dimension of the sub-matrix. In the example of fig. 27, the input data is a 512 × 1600 matrix of wide words (e.g., 16 bits) stored in rows 0 through 1599 of the weight RAM 124. In fig. 27, the words are labeled by their row and column position; e.g., the word in row 0 and column 0 is labeled D0,0; the word in row 0 and column 1 is labeled D0,1; the word in row 0 and column 2 is labeled D0,2; and so on, up to the word in row 0 and column 511, which is labeled D0,511. Similarly, the word in row 1 and column 0 is labeled D1,0; the word in row 1 and column 1 is labeled D1,1; the word in row 1 and column 2 is labeled D1,2; and so on, up to the word in row 1 and column 511, which is labeled D1,511; continuing in this manner, the word in row 1599 and column 0 is labeled D1599,0; the word in row 1599 and column 1 is labeled D1599,1; the word in row 1599 and column 2 is labeled D1599,2; and so on, up to the word in row 1599 and column 511, which is labeled D1599,511.
Referring now to FIG. 28, a program listing of an NNU program that performs the pooling operation on the input data matrix of FIG. 27 and writes the results back to weight RAM 124 is shown. In the example of fig. 28, the pooling operation computes the maximum value of each 4 x 4 sub-matrix of the input data matrix. The program loops through the loop body of instructions at addresses 1 to 10 a number of times. The initialize NPU instruction at address 0 specifies the number of times each NPU 126 executes the loop body; in the example of fig. 28 the loop count value is 400, and the loop instruction at the end of the loop (address 11) decrements the current loop count value and, if the result is non-zero, returns control to the top of the loop body (i.e., to the instruction at address 1). The input data matrix in weight RAM 124 is effectively treated by the NNU program as 400 mutually exclusive groups of four adjacent rows, namely rows 0-3, rows 4-7, rows 8-11, and so on through rows 1596-1599. Each group of four adjacent rows includes 128 4 x 4 sub-matrices, namely the 4 x 4 sub-matrices of elements formed by the intersection of the four rows of the group and four adjacent columns (i.e., columns 0-3, columns 4-7, columns 8-11, and so on through columns 508-511). Of the 512 NPUs 126, every fourth NPU 126 (i.e., 128 of them) performs a pooling operation on its corresponding 4 x 4 sub-matrix, while the other three quarters of the NPUs 126 are unused. More specifically, NPUs 0, 4, 8, and so on through NPU 508 each perform a pooling operation on their respective 4 x 4 sub-matrix, whose leftmost column number corresponds to the NPU number and whose bottom row corresponds to the current weight RAM 124 row value, which is initialized to zero by the initialize instruction at address 0 and incremented by 4 on each iteration of the loop body, as described in more detail below. The 400 iterations of the loop body correspond to the number of 4 x 4 sub-matrix groups in the input data matrix of fig. 27 (i.e., the 1600 rows of the input data matrix divided by 4). The initialize NPU instruction also clears accumulator 202. Preferably, the loop instruction at address 11 also clears accumulator 202. Alternatively, the maxwacc instruction at address 1 specifies that accumulator 202 be cleared.
For each iteration of the loop body of the program, the 128 NPUs 126 that are used concurrently perform 128 pooling operations on the 128 respective 4 x 4 sub-matrices of the current four-row group of the input data matrix. More specifically, each pooling operation determines the maximum-valued element of the sixteen elements of its 4 x 4 sub-matrix. In the embodiment of FIG. 28, for each NPU y of the 128 NPUs 126 used, the lower-left element of its 4 x 4 sub-matrix is element Dx,y of FIG. 27, where x is the current weight RAM 124 row number at the beginning of the loop body, which is read by the maxwacc instruction at address 1 of the program of FIG. 28 (this row number is also initialized by the initialize NPU instruction at address 0 and incremented each time the maxwacc instructions at addresses 3, 5, and 7 are executed). Thus, for each iteration of the loop, the 128 NPUs 126 used write back to the designated row of weight RAM 124 the respective maximum-valued elements of the 128 respective 4 x 4 sub-matrices of the current row group. The loop body is described below.
At address 1 is a maxwacc instruction that implicitly uses the current weight RAM 124 row, which is preferably held within sequencer 128 (and is initialized to zero by the instruction at address 0 for the first pass through the loop body). The instruction at address 1 causes each NPU 126 to read its corresponding word from the current row of weight RAM 124, compare the word to the accumulator 202 value 217, and store the maximum of the two values in accumulator 202. Thus, for example, NPU 8 determines the maximum of the accumulator 202 value 217 and data word Dx,8 (where "x" is the current weight RAM 124 row) and writes the maximum back to accumulator 202.
At address 2 is a maxwacc instruction that specifies that the value within the multiplexing register 705 of each NPU 126 (in this case, the row of input data matrix values just read from weight RAM 124 in response to the instruction at address 1) be rotated to the adjacent NPU 126. In the embodiment of fig. 27-28, the NPUs 126 are configured to rotate the multiplexing register 705 values to the left, i.e., from NPU J to NPU J-1, as described above with respect to fig. 24-26. Further, the instruction specifies a count value of 3. Thus, the instruction at address 2 causes each NPU 126 to receive the rotated word into its multiplexing register 705 and determine the maximum of the rotated word and the accumulator 202 value 217, and then to repeat this operation two more times. That is, each NPU 126 receives the rotated word into its multiplexing register 705 three times and determines the maximum of the rotated word and the accumulator 202 value 217 each time. Thus, for example, assuming the current weight RAM 124 row at the beginning of the loop body is 36, and taking NPU 8 as an example, after executing the instructions at addresses 1 and 2, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop body and the four weight RAM 124 words D36,8, D36,9, D36,10 and D36,11.
The operations performed by the maxwacc instructions at addresses 3 and 4 are similar to those performed by the instructions at addresses 1 and 2; however, by virtue of the weight RAM 124 row increment indicator, the maxwacc instructions at addresses 3 and 4 operate on the next row of weight RAM 124. That is, assuming the current weight RAM 124 row at the beginning of the loop body is 36, and taking NPU 8 as an example, after completing the instructions at addresses 1 through 4, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop body and the eight weight RAM 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10 and D37,11.
The operations performed by the maxwacc instructions at addresses 5 through 8 are similar to those performed by the instructions at addresses 3 and 4; however, the instructions at addresses 5 through 8 operate on the next two rows of weight RAM 124. That is, assuming the current weight RAM 124 row at the beginning of the loop body is 36, and taking NPU 8 as an example, after completing the instructions at addresses 1 through 8, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop body and the sixteen weight RAM 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10, D37,11, D38,8, D38,9, D38,10, D38,11, D39,8, D39,9, D39,10 and D39,11. That is, after completing the instructions at addresses 1 through 8, NPU 8 will have determined the maximum of the following 4 x 4 sub-matrix:
D36,8   D36,9   D36,10   D36,11
D37,8   D37,9   D37,10   D37,11
D38,8   D38,9   D38,10   D38,11
D39,8   D39,9   D39,10   D39,11
More specifically, upon completion of the instructions at addresses 1 through 8, each of the 128 NPUs 126 used will have determined the maximum of the following 4 x 4 sub-matrix:
Dr,n     Dr,n+1     Dr,n+2     Dr,n+3
Dr+1,n   Dr+1,n+1   Dr+1,n+2   Dr+1,n+3
Dr+2,n   Dr+2,n+1   Dr+2,n+2   Dr+2,n+3
Dr+3,n   Dr+3,n+1   Dr+3,n+2   Dr+3,n+3
where r is the row address value of weight RAM 124 at the beginning of the loop body and n is the NPU 126 number.
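The combined effect of the loop-body instructions at addresses 1 through 8 (read a row, take maxima, rotate three times, repeat for four rows) can be modeled in a few lines; this is a behavioral sketch under the assumptions of fig. 27-28 (accumulators cleared to zero and non-negative data words), not the hardware itself.

```python
import numpy as np

def run_loop_body(weight_ram, r, num_npus=512):
    """Model the maxwacc instructions at addresses 1-8 for rows r..r+3."""
    acc = np.zeros(num_npus, dtype=np.int64)      # accumulators cleared, as at address 0
    for row in range(r, r + 4):
        mux = weight_ram[row].astype(np.int64)    # maxwacc: read the current row
        acc = np.maximum(acc, mux)
        for _ in range(3):                        # maxwacc rotate, count = 3
            mux = np.roll(mux, -1)                # NPU J receives NPU J+1's word
            acc = np.maximum(acc, mux)
    return acc  # acc[n], for n = 0, 4, 8, ..., 508, is the 4 x 4 sub-matrix maximum
```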
The instruction at address 9 causes the accumulator 202 value 217 to be passed through AFU 212. The pass-through function passes a word whose size (in bits) equals the size of the words read from weight RAM 124 (in this example, 16 bits). Preferably, the user may specify the output format, e.g., how many of the output bits are fraction bits, as described in more detail below.
The instruction at address 10 writes the value 217 of accumulator 202 to the row in weight RAM 124 specified by the current value of the output row register, which is initialized by the instruction at address 0 and incremented each time through the loop body by an increment indicator within the instruction. More specifically, the instruction at address 10 writes a wide word (e.g., 16 bits) of accumulator 202 into weight RAM 124. Preferably, the instruction writes the 16 bits as specified by the output binary point 2916, as described in more detail below with respect to fig. 29A and 29B.
It can be seen that the result rows written to weight RAM 124 by the iterations of the loop body contain holes of invalid data. That is, wide words 1 through 3, 5 through 7, 9 through 11, and so on through wide words 509 through 511 of result 133 are invalid, or unused. In one embodiment, AFU 212 includes a multiplexer that enables packing of the results into adjacent words of a row buffer (such as row buffer 1104 of fig. 11) for writing back to the output weight RAM 124 row. Preferably, the activation function instruction specifies the number of words in each hole, and this number is used to control the multiplexer's packing of the results. In one embodiment, the number of words in a hole may be specified as a value from 2 to 6 in order to pack the output of pooling of 3 x 3, 4 x 4, 5 x 5, 6 x 6, or 7 x 7 sub-matrices. Alternatively, the architectural program executing on processor 100 reads the resulting sparse (i.e., having holes) result rows from weight RAM 124 and performs the packing function using other execution units 112, such as a media unit using architectural pack instructions, e.g., x86 SSE instructions. Advantageously, in a concurrent manner similar to that described above and exploiting the hybrid nature of NNU 121, the architectural program executing on processor 100 may read status register 127 to monitor the most recently written row of weight RAM 124 (e.g., field 2602 of fig. 26B) in order to read a resulting sparse row, pack it, and write it back to the same row of weight RAM 124, so that it is ready to be used as an input data matrix for the next layer of the neural network, such as a convolutional layer or a classic neural network layer (i.e., multiply-accumulate layer). Further, although the embodiments described herein perform the pooling operation on 4 x 4 sub-matrices, the NNU program of fig. 28 may be modified to perform the pooling operation on sub-matrices of other sizes, such as 3 x 3, 5 x 5, 6 x 6, or 7 x 7 sub-matrices.
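For reference, packing a sparse result row as just described is a simple gather; the sketch below assumes the hole-size convention above (hole equals sub-matrix width minus one) and illustrates only the architectural-program alternative, not the AFU 212 multiplexer.

```python
def pack_row(sparse_row, hole=3):
    """Keep one valid word, then skip `hole` invalid words, across the whole row."""
    return [sparse_row[i] for i in range(0, len(sparse_row), hole + 1)]

row = [7, 0, 0, 0, 9, 0, 0, 0, 5, 0, 0, 0]   # valid results at columns 0, 4, 8
print(pack_row(row))                          # [7, 9, 5]
```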
It can also be seen that the resulting number of rows written into weight RAM 124 is one quarter of the number of rows of the input data matrix. Finally, in this example, the data RAM 122 is not used. However, alternatively, the pooling operation may be performed using the data RAM 122, rather than using the weight RAM 124.
In the examples of fig. 27 and 28, the pooling operation calculates the maximum value of the sub-region. However, the program of FIG. 28 can be modified to calculate the average of the sub-regions, for example, by replacing the maxwacc instruction with a sumwacc instruction (adding the weight word to the value 217 of the accumulator 202) and changing the Activate function instruction at address 9 to divide the accumulated result (preferably via reciprocal multiplication as described below) by the number of elements (16 in this example) of each sub-region.
From the operation of NNU 121 according to the embodiment of fig. 27 and 28, it can be seen that each time the program of fig. 28 is executed, pooling operations are performed on the entire 512 x 1600 data matrix of fig. 27 with approximately 6000 clock cycles, which may be significantly less than the number of clock cycles required to perform similar tasks in a conventional manner.
Alternatively, rather than writing the results back to weight RAM 124, the architectural program configures the NNU program to write the results of the pooling operation back to rows of data RAM 122, and the architectural program reads the results from data RAM 122 as NNU 121 writes them (e.g., using the address of the most recently written data RAM 122 row, field 2606 of FIG. 26B). Such an alternative may be advantageous in an embodiment in which weight RAM 124 is single-ported and data RAM 122 is dual-ported.
Fixed-point arithmetic with user-supplied binary point, full-precision fixed-point accumulation, user-specified reciprocal value, random rounding of the accumulator value, and selectable activation/output functions
Generally speaking, hardware units that perform arithmetic within digital computing devices are typically divided into "integer" units and "floating-point" units, because they perform arithmetic operations on integers and floating-point numbers, respectively. A floating-point number has a magnitude (or mantissa) and an exponent, and typically a sign. For example, assume the two floating-point numbers 0.111 x 10^29 and 0.81 x 10^31 are multiplied (a decimal, or base-10, example is used here, although floating-point units most commonly work with base-2 floating-point numbers). The floating-point unit automatically takes care of multiplying the mantissas, adding the exponents, and then normalizing the result back to a value of 0.8991 x 10^59. As another example, assume the same two floating-point numbers are added. The floating-point unit automatically takes care of aligning the binary points of the mantissas before adding them, to produce a sum with a value of 0.81111 x 10^31.
However, the complexity associated with floating-point units and the consequent increases in size, power consumption, clocks per instruction, and/or cycle time are well known. Indeed, for this reason many devices (e.g., embedded processors, microcontrollers, and relatively low-cost and/or low-power microprocessors) do not include a floating-point unit. As may be observed from the example above, some of the complexities of floating-point units include: logic that performs the exponent calculations associated with floating-point addition and multiplication/division (i.e., an adder to add/subtract the exponents of the operands to produce the resulting exponent value for floating-point multiplication/division, and a subtractor to determine the difference of the operand exponents to determine the binary point alignment shift amount for floating-point addition); a shifter that accomplishes the binary point alignment of the mantissas for floating-point addition; and a shifter that normalizes the floating-point result. Furthermore, floating-point units typically require logic to perform rounding of the floating-point result, logic to convert between integer and floating-point formats and between different floating-point precision formats (e.g., extended precision, double precision, single precision, half precision), leading-zero and leading-one detectors, and logic to handle special floating-point numbers, such as denormal values, not-a-number values, and infinities.
In addition, there is the disadvantage that design verification of the correctness of a floating-point unit becomes significantly more complex because of the larger numeric space that must be verified, which can lengthen the product development cycle and time to market. Furthermore, as described above, floating-point arithmetic implies the storage and use of separate mantissa and exponent fields for each floating-point number involved in the computation, which may increase the amount of storage required and/or reduce precision given an equal amount of storage used to store integers. Many of these disadvantages are avoided by using an integer unit that performs arithmetic operations on integers.
Programmers frequently write programs that process fractional numbers, i.e., numbers that are not whole numbers. Such programs may execute on processors that do not have a floating-point unit, or on processors that have one but for which the integer instructions executed by the integer unit of the processor are faster. To take advantage of the potential performance advantages associated with integer units, programmers employ well-known fixed-point arithmetic on fixed-point numbers. Such programs include instructions that execute on integer units to process integer data. The software knows that the data is fractional and includes instructions (e.g., alignment shifts) that perform operations on the integer data to handle the fact that the data is actually fractional. Essentially, the fixed-point software manually performs some or all of the functions that a floating-point unit performs.
As used herein, a "fixed point" number (or value or operand or input or output) is a number whose stored bits are understood to contain bits representing the fractional part of the fixed point number (referred to herein as "fractional bits"). The fixed-point number of storage bits is contained within a memory or register, such as an 8-bit or 16-bit word within the memory or register. In addition, the stored bits of the fixed-point number are all used to represent a magnitude, and in some cases, one of the bits is used to represent a sign, but the fixed-point number has no stored bit used to represent an exponent of the number. Further, the number of decimal places or binary decimal point positions of the fixed-point number is specified at the time of storage, which is different from the storage bits of the fixed-point number, and the number of decimal places or binary decimal point positions is indicated in a shared or global manner for a fixed-point number set to which the fixed-point number belongs (e.g., a set of input operands, a set of accumulated values, or a set of output results of an array of processing units, etc.).
Advantageously, in the embodiments described herein, the ALUs are integer units, but the activation function units include fixed-point arithmetic hardware assist, or acceleration. This allows the ALU portions to be smaller and faster, which facilitates having more ALUs within a given space on the die. This implies more neurons per unit of die space, which is particularly advantageous in a neural network unit.
Furthermore, advantageously, in contrast to floating-point numbers, which require exponent storage bits for each individual number, embodiments are described in which the indication of how many storage bits are fraction bits is held for an entire set of numbers in a single, shared storage that globally indicates the number of fraction bits for all the numbers of the set (e.g., the set of inputs to a series of operations, the set of accumulated values of a series of operations, the set of outputs). Preferably, the user of the NNU is able to specify the number of fraction storage bits for the set of numbers. Thus, it should be understood that although in many contexts (e.g., common mathematics) the term "integer" refers to a signed whole number, i.e., a number without a fractional portion, in the present context the term "integer" may refer to numbers that have a fractional portion. Furthermore, in the present context the term "integer" is intended to distinguish from floating-point numbers, for which a portion of the bits of their individual storage space are used to represent an exponent of the floating-point number. Similarly, an integer arithmetic operation, such as an integer multiply or add or compare performed by an integer unit, assumes the operands do not have an exponent; therefore, the integer elements of the integer unit, such as an integer multiplier, integer adder, or integer comparator, do not include logic to deal with exponents, e.g., they do not shift mantissas to align binary points for addition or comparison operations and do not add exponents for multiply operations.
Additionally, embodiments described herein include a large hardware integer accumulator to accumulate a large series of integer operations (e.g., on the order of 1000 multiply-accumulates) without loss of precision. This enables the NNU to avoid dealing with floating-point numbers while retaining full precision in the accumulated values, without saturating them or producing inaccurate results due to overflow. Once the series of integer operations has accumulated a result into the full-precision accumulator, the fixed-point hardware assist performs the necessary scaling and saturating to convert the full-precision accumulated value to an output value, using the user-specified indications of the number of fraction bits of the accumulated value and the desired number of fraction bits of the output value, as described in more detail below.
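The conversion step at the end of that pipeline (accumulate at full precision, then scale to the output binary point and saturate to the output word size) can be sketched as follows; the parameter values are illustrative, and rounding is omitted here because it is treated separately below.

```python
ACC_FRAC = 10     # fraction bits of the accumulated value (e.g., data 7 + weight 3)
OUT_FRAC = 7      # user-specified fraction bits of the output value
OUT_BITS = 16     # output word size ("standard size" for a narrow configuration)

def accumulate(pairs) -> int:
    """Full-precision integer multiply-accumulate: nothing is discarded or saturated."""
    return sum(d * w for d, w in pairs)

def to_output(acc: int) -> int:
    """Scale the accumulator to the output binary point, then saturate to OUT_BITS."""
    shifted = acc >> (ACC_FRAC - OUT_FRAC)               # align binary points
    hi, lo = (1 << (OUT_BITS - 1)) - 1, -(1 << (OUT_BITS - 1))
    return max(lo, min(hi, shifted))                     # clamp only at the very end

acc = accumulate((d, d + 1) for d in range(512))
print(to_output(acc))
```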
Additionally, as described in more detail below, the activation function unit may preferably selectively perform random rounding of the accumulator value when compressing it from its full-precision form for use as an input to an activation function or for being passed through. Finally, the NPU may selectively accept instructions to apply different activation functions and/or to output the accumulator value in a number of different forms, as dictated by the different needs of a given layer of a neural network.
Referring now to FIG. 29A, a block diagram illustrating an embodiment of the control register 127 of FIG. 1 is shown. The control register 127 may include a plurality of control registers 127. As shown, control register 127 includes the following fields: configuration 2902, signed data 2912, signed weights 2914, data binary decimal points 2922, weight binary decimal points 2924, ALU functions 2926, rounding controls 2932, activation functions 2934, reciprocals 2942, shift amounts 2944, output RAM 2952, output binary decimal points 2954, and output commands 2956. The control register 127 value may be written using both the MTNN instruction 1400 and instructions of the NNU program (such as initialization instructions).
The value of configuration 2902 specifies whether the NNU 121 is a narrow configuration, a wide configuration, or a funnel configuration, as described above. Configuration 2902 means the size of the input words received from data RAM 122 and weight RAM 124. In the narrow and funnel configurations, the size of the input word is narrow (e.g., 8 or 9 bits), while in the wide configuration, the size of the input word is wide (e.g., 12 or 16 bits). Further, the configuration 2902 means the size of the output result 133 that is the same as the input word size.
Signed data value 2912 indicates that the data words received from data RAM 122 are signed values if true and unsigned values if false. Signed weight value 2914 indicates that the weight words received from weight RAM124 are signed values if true and unsigned values if false.
The data binary point 2922 value indicates the location of the binary point for the data words received from data RAM 122. Preferably, the data binary point 2922 value indicates the number of bit positions from the right at which the binary point is located. In other words, data binary point 2922 indicates how many of the least significant bits of the data word are fraction bits, i.e., to the right of the binary point. Similarly, the weight binary point 2924 value indicates the location of the binary point for the weight words received from weight RAM 124. Preferably, when the ALU function 2926 is a multiply-accumulate or an output accumulator, NPU 126 determines the number of bits to the right of the binary point of the value held in accumulator 202 to be the sum of the data binary point 2922 and the weight binary point 2924. Thus, for example, if the data binary point 2922 value is 5 and the weight binary point 2924 value is 3, the value in accumulator 202 has 8 bits to the right of the binary point. When the ALU function 2926 is a sum/maximum of the accumulator and a data/weight word, or a pass-through of a data/weight word, NPU 126 determines the number of bits to the right of the binary point of the value held in accumulator 202 to be the data binary point 2922 or the weight binary point 2924, respectively. In an alternate embodiment, described below with respect to fig. 29B, rather than specifying an individual data binary point 2922 and weight binary point 2924, a single accumulator binary point 2923 is specified.
ALU function 2926 specifies the function performed by the ALU 204 of NPU 126. As described above, ALU functions 2926 may include, but are not limited to: multiply data word 209 and weight word 203 and accumulate the product with accumulator 202; sum accumulator 202 and weight word 203; sum accumulator 202 and data word 209; maximum of accumulator 202 and data word 209; maximum of accumulator 202 and weight word 203; output accumulator 202; pass through data word 209; pass through weight word 203; output zero. In one embodiment, the ALU function 2926 is specified by an NNU initialize instruction and used by ALU 204 in response to an execute instruction (not shown). In one embodiment, the ALU function 2926 is specified by individual NNU instructions, such as the multiply-accumulate and maxwacc instructions described above.
The rounding control 2932 specifies the form of rounding used by the rounder 3004 (of FIG. 30). In one embodiment, the rounding modes that may be specified include, but are not limited to: no rounding, round to nearest, and random rounding. Preferably, processor 100 includes a random bit source 3003 (of FIG. 30) that generates random bits 3005, which are sampled and used to perform the random rounding in order to reduce the likelihood of a rounding bias. In one embodiment, when the round bit is one and the sticky bit is zero, NPU 126 rounds up if the sampled random bit 3005 is true and does not round up if the random bit 3005 is false. In one embodiment, the random bit source 3003 generates the random bits 3005 by sampling random electrical characteristics of the processor 100, such as thermal noise across a semiconductor diode or resistor, although other embodiments are contemplated.
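The rounding behavior just described (truncate, round up when more than half an LSB is lost, and break exact ties with a sampled random bit) can be written out directly; this is a behavioral sketch of that rule, not the rounder 3004 circuit.

```python
import random

def round_shift_random(value: int, shift: int) -> int:
    """Right-shift `value` by `shift` bits, rounding the discarded bits as described above."""
    if shift <= 0:
        return value
    kept = value >> shift
    round_bit = (value >> (shift - 1)) & 1
    sticky = (value & ((1 << (shift - 1)) - 1)) != 0
    if round_bit and sticky:
        return kept + 1                        # more than half an LSB was discarded
    if round_bit and not sticky:
        return kept + random.getrandbits(1)    # exact tie: round up only if the random bit is 1
    return kept                                # less than half an LSB was discarded: truncate
```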
The activation function 2934 specifies the function applied to the accumulator 202 value 217 to produce the output 133 of NPU 126. As described above and in more detail below, the activation functions 2934 include, but are not limited to: an S-type (sigmoid) function; a hyperbolic tangent function; a soft addition function; a correction (rectify) function; division by a specified power of two; multiplication by a user-specified reciprocal value to accomplish an effective division; pass-through of the full accumulator; and pass-through of the accumulator as a standard size, as described in more detail below. In one embodiment, the activation function is specified by an NNU activation function instruction. Alternatively, the activation function is specified by the initialize instruction and applied in response to an output instruction (e.g., the write AFU output instruction at address 4 of FIG. 4); in such an embodiment, the activation function instruction at address 3 of FIG. 4 is subsumed into the output instruction.
The reciprocal 2942 value specifies the value that is multiplied with the value 217 of the accumulator 202 to effect division of the value 217 of the accumulator 202. That is, the user specifies the reciprocal 2942 value as the reciprocal of the divisor that is actually desired. This is useful, for example, in conjunction with convolution or pooling operations as described herein. Preferably, the user specifies the reciprocal 2942 value as two parts, as described in more detail below with respect to FIG. 29C. In one embodiment, the control register 127 includes a field (not shown) that enables a user to specify one of a plurality of built-in divisor values to divide by, the size of which is equivalent to the size of a conventional convolution kernel, such as 9, 25, 36, or 49. In such embodiments, AFU 212 may store the reciprocal of these built-in divisors for multiplication by accumulator 202 value 217.
The shift amount 2944 specifies the number of bits by which a shifter of AFU 212 right-shifts the accumulator 202 value 217 to accomplish a division by a power of two. This may also be useful in conjunction with convolution kernels whose size is a power of two.
The value of output RAM 2952 specifies which of data RAM 122 and weight RAM 124 is to receive output result 133.
The value of output binary point 2954 indicates the position of the binary point of output result 133. Preferably, the value of output binary point 2954 indicates the number of bit positions from the right at which the binary point of output result 133 is located. In other words, the output binary point 2954 indicates how many of the least significant bits of the output result 133 are fraction bits, i.e., to the right of the binary point. AFU 212 performs rounding, compression, saturation, and size conversion based on the value of output binary point 2954 (and, in most cases, also based on the value of data binary point 2922, the value of weight binary point 2924, the value of activation function 2934, and/or the value of configuration 2902).
The output command 2956 controls various aspects of the output result 133. In one embodiment, AFU 212 employs the notion of a standard size, which is twice the width (in bits) specified by configuration 2902. Thus, for example, if configuration 2902 implies that the size of the input words received from data RAM 122 and weight RAM 124 is 8 bits, the standard size is 16 bits; in another example, if configuration 2902 implies that the size of the input words received from data RAM 122 and weight RAM 124 is 16 bits, the standard size is 32 bits. As described herein, the size of the accumulator 202 is larger (e.g., the narrow accumulator 202B is 28 bits and the wide accumulator 202A is 41 bits) in order to preserve the full precision of the intermediate computations (e.g., 1024 and 512 NNU multiply-accumulate instructions, respectively). Consequently, the accumulator 202 value 217 is larger (in bits) than the standard size, and AFU 212 (e.g., the CCS 3008 described below with respect to fig. 30) compresses the accumulator 202 value 217 down to a value of the standard size for most values of the activation function 2934 (except the pass-through of the full accumulator). A first predetermined value of the output command 2956 instructs AFU 212 to perform the specified activation function 2934 to produce an internal result the same size as the original input words (i.e., half the standard size) and to output the internal result as the output result 133. A second predetermined value of the output command 2956 instructs AFU 212 to perform the specified activation function 2934 to produce an internal result twice the size of the original input words (i.e., the standard size) and to output the lower half of the internal result as the output result 133; and a third predetermined value of the output command 2956 instructs AFU 212 to output the upper half of the standard-size internal result as the output result 133. A fourth predetermined value of the output command 2956 instructs AFU 212 to output the raw least-significant word (whose width is specified by configuration 2902) of the accumulator 202 as the output result 133; a fifth predetermined value instructs AFU 212 to output the raw middle-significant word of the accumulator 202 as the output result 133; and a sixth predetermined value instructs AFU 212 to output the raw most-significant word of the accumulator 202 as the output result 133, as described above with respect to fig. 8-10. As described above, outputting the full accumulator 202 size or the standard-size internal result may be advantageous, for example, in enabling other execution units 112 of processor 100 to perform an activation function such as the soft-max activation function.
Although the fields of FIG. 29A (as well as FIGS. 29B and 29C) are described as being located in control register 127, in other embodiments, one or more fields may be located in other portions of NNU 121. Preferably, a number of fields may be included in the NNU instruction itself and decoded by the sequencer 128 to generate the micro-operations 3416 (of FIG. 34) for controlling the ALU 204 and/or the AFU 212. In addition, these fields may be included within a micro-operation 3414 (of FIG. 34) stored in the media register 118, the micro-operation 3414 controlling the ALU 204 and/or AFU 212. In such embodiments, the use of an initialize NNU instruction may be minimized, and in other embodiments, the initialize NNU instruction is eliminated.
As described above, NNU instructions can specify that ALU operations be performed on memory operands (e.g., words from data RAM 122 and/or weight RAM 124) or on a rotated operand (e.g., from multiplexing registers 208/705). In one embodiment, an NNU instruction may also specify an operand that is the registered output of an activation function (e.g., register output 3038 of fig. 30). Additionally, as described above, an NNU instruction can specify incrementing the current row address of data RAM 122 or weight RAM 124. In one embodiment, the NNU instruction may specify an immediate signed integer delta value that is added to the current row to accomplish incrementing or decrementing by a value other than one.
Referring now to FIG. 29B, a block diagram is shown illustrating an embodiment of the control register 127 of FIG. 1, according to an alternative embodiment. The control register 127 of fig. 29B is similar to the control register 127 of fig. 29A; however, control register 127 of FIG. 29B includes an accumulator binary point 2923. Accumulator binary point 2923 represents the binary point location of accumulator 202. Preferably, the value of the accumulator binary point 2923 represents the number of bit positions from the right side of the binary point position. In other words, the accumulator binary point 2923 represents how many of the least significant bits of the accumulator 202 are decimal bits, i.e., to the right of the binary point. In this embodiment, accumulator binary point 2923 is explicitly specified, rather than implicitly determined as described above for the embodiment of fig. 29A.
Referring now to FIG. 29C, a block diagram illustrating an embodiment of the reciprocal 2942 of FIG. 29A stored in two parts is shown, according to one embodiment. The first part 2962 is a shift value that indicates the number of suppressed leading zeros 2962 of the true reciprocal value that the user wants multiplied by the accumulator 202 value 217. The number of leading zeros is the number of consecutive zeros immediately to the right of the binary point. The second part 2964 is the leading-zero-suppressed reciprocal 2964 value, i.e., the true reciprocal value with all of the leading zeros removed. In one embodiment, the number of suppressed leading zeros 2962 is stored as 4 bits, and the leading-zero-suppressed reciprocal 2964 value is stored as an 8-bit unsigned value.
To illustrate by way of example, assume the user wants the accumulator 202 value 217 to be multiplied by the reciprocal of 49. The binary representation of the reciprocal of 49, represented with 13 fraction bits, is 0.0000010100111, which has five leading zeros. In this case, the user populates the number of suppressed leading zeros 2962 with a value of 5 and the leading-zero-suppressed reciprocal 2964 with a value of 10100111. After the reciprocal multiplier "divider A" 3014 (of fig. 30) multiplies the accumulator 202 value 217 by the leading-zero-suppressed reciprocal 2964 value, it right-shifts the resulting product by the number of suppressed leading zeros 2962. Such an embodiment may advantageously accomplish high precision while representing the reciprocal 2942 value with a relatively small number of bits.
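A small sketch of this effective division, using the 1/49 figures above; the helper name and the integer scaling are illustrative, and the result is approximate because only eight reciprocal bits are kept.

```python
def divide_by_reciprocal(acc: int, suppressed_zeros: int, reciprocal: int,
                         reciprocal_bits: int = 8) -> int:
    """Multiply by the leading-zero-suppressed reciprocal, then shift the product right."""
    product = acc * reciprocal
    return product >> (reciprocal_bits + suppressed_zeros)

# 1/49 = 0.0000010100111b (13 fraction bits): 5 suppressed leading zeros, then 10100111b.
print(divide_by_reciprocal(49 * 1000, suppressed_zeros=5, reciprocal=0b10100111))  # ~998
```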
Referring now to FIG. 30, a block diagram is shown that illustrates an embodiment of AFU 212 of FIG. 2 in greater detail. AFU 212 includes: control register 127 of FIG. 1; a positive mode converter (PFC) and an Output Binary Point Aligner (OBPA) 3002 for receiving the value 217 of the accumulator 202; a rounder 3004 for receiving the value 217 of the accumulator 202 and an indication of the number of bits out of which the OBPA3002 is shifted; a random bit source 3003, as described above, for generating random bits 3005; a first multiplexer 3006 for receiving outputs of the PFC and OBPA3002 and an output of the rounder 3004; a normal-size compressor (CCS) and saturator 3008 for receiving an output of the first multiplexer 3006; a bit selector and saturator 3012 to receive the outputs of CCS and saturator 3008; a corrector 3018 for receiving the outputs of CCS and saturator 3008; a reciprocal multiplier 3014 for receiving the output of CCS and saturator 3008; a right shifter 3016 for receiving the outputs of CCS and saturator 3008; a tanh module 3022 for receiving the output of the bit selector and saturator 3012; an S-type block 3024 for receiving the output of the bit selector and saturator 3012; a soft-summing block 3026 for receiving the output of the bit selector and saturator 3012; a second multiplexer 3032 for receiving the hyperbolic tangent module 3022, the sigmoid module 3024, the soft-addition module 3026, the corrector 3018, the reciprocal multiplier 3014, the output of the right shifter 3016, and the pass-through normal-size output 3028 of the CCS and saturator 3008; a symbol recoverer 3034 for receiving the output of the second multiplexer 3032; a magnitude converter and saturator 3036 to receive the output of the symbol recoverer 3034; a third multiplexer 3037 for receiving the output of the magnitude converter and saturator 3036 and the output 217 of the accumulator; and an output register 3038 which receives the output of the multiplexer 3037 and whose output is the result 133 of fig. 1.
The PFC and OBPA 3002 receive the accumulator 202 value 217. Preferably, the accumulator 202 value 217 is a full-precision value, as described above. That is, the accumulator 202 has a sufficient number of storage bits to hold an accumulated value that is the sum, generated by integer adder 244, of a series of products generated by integer multiplier 242, without discarding any bits of the individual products of multiplier 242 or of the sums of adder 244, so that no precision is lost. Preferably, the accumulator 202 has at least a sufficient number of bits to hold the maximum number of product accumulations that NNU 121 is programmable to perform. For example, referring to the program of FIG. 4, the maximum number of product accumulations that NNU 121 is programmable to perform in the wide configuration is 512, and the accumulator 202 bit width is 41. As another example, referring to the program of FIG. 20, the maximum number of product accumulations that NNU 121 is programmable to perform in the narrow configuration is 1024, and the accumulator 202 bit width is 28. Generally speaking, the full-precision accumulator 202 has at least Q bits, where Q is the sum of M and log2(P), where M is the bit width of the integer product of multiplier 242 (e.g., 16 bits for the narrow multiplier 242, or 32 bits for the wide multiplier 242) and P is the maximum allowable number of integer products that may be accumulated into the accumulator 202. Preferably, the maximum number of product accumulations is specified via a programming specification to the programmer of NNU 121. In one embodiment, sequencer 128 enforces a maximum count value of, for example, 511 for a multiply-accumulate NNU instruction (e.g., the instruction at address 2 of FIG. 4), under the assumption of one previous multiply-accumulate instruction (e.g., the instruction at address 1 of FIG. 4) that loads a row of data/weight words 206/207 from data/weight RAM 122/124.
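Plugging the example figures above into Q = M + ceil(log2(P)) gives the accumulator widths quoted in the text; a two-line check:

```python
import math

def min_accumulator_bits(product_bits: int, max_products: int) -> int:
    """Minimum full-precision accumulator width: Q = M + ceil(log2(P))."""
    return product_bits + math.ceil(math.log2(max_products))

print(min_accumulator_bits(32, 512))    # 41: matches the 41-bit wide accumulator
print(min_accumulator_bits(16, 1024))   # 26: fits within the 28-bit narrow accumulator
```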
Advantageously, including an accumulator 202 with a bit width large enough to accumulate the maximum allowed number of full-precision values simplifies the design of the ALU 204 portion of NPU 126. In particular, it alleviates the need for logic to saturate the sums generated by integer adder 244, which a smaller accumulator would require because it would overflow, and which would also require keeping track of the accumulator's binary point location to determine whether an overflow has occurred in order to know whether saturation is needed. To illustrate by example the problem with a design that has a non-full-precision accumulator and instead includes saturating logic to handle overflows of the non-full-precision accumulator, assume the following.
(1) The range of data word values is between 0 and 1 and all of the storage bits are used to store fraction bits. The range of weight word values is between -8 and +8 and all but three of the storage bits are used to store fraction bits. And the range of accumulated values input to the hyperbolic tangent activation function is between -8 and +8 and all but three of the storage bits are used to store fraction bits.
(2) The bit width of the accumulator is not full precision (e.g., only the bit width of the product).
(3) If the accumulator were full precision, the final accumulated value would be somewhere between -8 and +8 (e.g., +4.2); however, the products before some "point A" in the series tend much more frequently to be positive, whereas the products after point A tend much more frequently to be negative.
In this situation, an inaccurate result (i.e., a result other than +4.2) might be obtained. This is because, at some point before point A, the accumulator may be saturated to the maximum value of +8 when it should have held a larger value, e.g., +8.2, causing the remaining +0.2 to be lost. The accumulator could even remain at the saturated value for more of the product accumulations, resulting in the loss of even more positive value. Thus, the final value of the accumulator could be a smaller number than it would be if the accumulator had a full-precision bit width (i.e., less than +4.2).
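The effect is easy to reproduce numerically; the sketch below compares a full-precision sum with a toy accumulator that clamps to the range [-8, +8] on every step, using a product sequence that is positive before "point A" and negative after it (the values are made up purely for illustration).

```python
def saturating_add(acc: float, x: float, lo: float = -8.0, hi: float = 8.0) -> float:
    """Model a too-small accumulator that clamps to [lo, hi] after every addition."""
    return max(lo, min(hi, acc + x))

products = [1.5] * 8 + [-1.0] * 8     # true sum is 12.0 - 8.0 = +4.0

full = sum(products)                  # full-precision accumulation
sat = 0.0
for p in products:
    sat = saturating_add(sat, p)      # clips at +8 part-way through the positive run

print(full, sat)                      # 4.0 0.0 -- the clamped accumulator loses the excess
```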
PFC 3002 converts the value 217 of accumulator 202 to positive if it is negative and generates an additional bit to indicate whether the original value is positive or negative, which additional bit is passed down the pipeline of AFU 212 along with the value. Conversion to positive simplifies subsequent operation of AFU 212. For example, this operation allows only positive values to be input to the hyperbolic tangent module 3022 and the S-type module 3024, and thus these modules may be simplified. Further, the rounder 3004 and the saturator 3008 are simplified.
The OBPA 3002 shifts, or scales, the positive-form value rightward to align it with the output binary point 2954 specified in the control register 127. Preferably, the OBPA 3002 calculates the shift amount as the difference obtained by subtracting the number of output fraction bits (e.g., specified by output binary point 2954) from the number of fraction bits of the accumulator 202 value 217 (e.g., specified by accumulator binary point 2923, or by the sum of data binary point 2922 and weight binary point 2924). Thus, for example, if the accumulator 202 binary point 2923 is 8 (as in the embodiment above) and the output binary point 2954 is 3, the OBPA 3002 shifts the positive-form value right by 5 bits to produce the result provided to multiplexer 3006 and rounder 3004.
The rounder 3004 performs rounding on the value 217 of the accumulator 202. Preferably, the rounder 3004 generates a rounded version of the positive value generated by the PFC and OBPA3002 and provides the rounded version to the multiplexer 3006. The rounder 3004 performs rounding in accordance with the rounding control 2932 described above, which rounding control 2932 may include random rounding using the random bits 3005, as described in the context herein. Multiplexer 3006 selects one of its multiple inputs (i.e., the normal value from PFC and OBPA3002 or the rounded version from rounder 3004) based on rounding control 2932 (which may include random rounding as described herein) and provides the selected value to CCS and saturator 3008. Preferably, the multiplexer 3006 selects the output of the PFC and OBPA3002 if the rounding control 2932 specifies that rounding is not to be performed, and otherwise selects the output of the rounder 3004. In other embodiments contemplated, AFU 212 performs additional rounding. For example, in one embodiment, when the bit selector 3012 compresses the output bits of the CCS and saturator 3008 (as described below), the bit selector 3012 rounds off based on the missing low order bits. For another example, in one embodiment, the product of reciprocal multiplier 3014 (described below) is rounded. For another example, in one embodiment, the size converter 3036 rounds when converting to an appropriate output size (as described below), which may involve losing the lower order bits when rounding is determined.
The CCS 3008 compresses the multiplexer 3006 output value to the standard size. Thus, for example, if NPU 126 is in a narrow or funnel configuration 2902, the CCS 3008 compresses the 28-bit multiplexer 3006 output value to 16 bits; whereas if NPU 126 is in a wide configuration 2902, the CCS 3008 compresses the 41-bit multiplexer 3006 output value to 32 bits. However, before compressing to the standard size, if the pre-compressed value is larger than the maximum value expressible in the standard form, the saturator 3008 saturates the pre-compressed value to the maximum value expressible in the standard form. For example, if any of the bits of the pre-compressed value to the left of the most significant standard-form bit has a value of 1, the saturator 3008 saturates to the maximum value (e.g., to all ones).
Preferably, the tanh module 3022, the S-type module 3024, and the soft summing module 3026 all contain look-up tables, such as Programmable Logic Arrays (PLAs), Read Only Memories (ROMs), combinational logic gates, and the like. In one embodiment, to simplify and reduce the size of these modules 3022/3024/3026, the modules are provided with input values having the form 3.4, i.e. three integer bits and four decimal bits, i.e. the input value has four bits on the right side of the binary point and three bits on the left side of the binary point. These values are chosen because at the extreme end of the input value range (-8, +8) in the form of 3.4, the output value is asymptotically near its minimum/maximum value. However, other embodiments are contemplated in which the binary point is placed at a different location, for example, in a 4.3 or 2.5 format. The bit selector 3012 selects bits satisfying the 3.4 format criteria in the output of the CCS and saturator 3008, which involves compression, i.e., some bits are lost because the standard format has a larger number of bits. However, before selecting/compressing the output values of CCS and saturator 3008, if the pre-compression value is greater than the maximum value that the 3.4 form can express, saturator 3012 saturates the pre-compression value to the maximum value that the 3.4 form can express. For example, if any bit of the pre-compression value that is to the left of the most significant bit of the 3.4 form has a value of 1, saturator 3012 saturates to a maximum value (e.g., saturates to all 1's).
The hyperbolic tangent module 3022, the S-type module 3024, and the soft-sum module 3026 perform their respective activation functions (as described above) on the 3.4-form value output by the CCS and saturator 3008 to produce a result. Preferably, the result of the hyperbolic tangent module 3022 and the S-type module 3024 is a 7-bit result in 0.7 form, i.e., zero integer bits and seven fraction bits, so that the output value has seven bits to the right of the binary point. Preferably, the result of the soft-sum module 3026 is a 7-bit result in 3.4 form, i.e., in the same form as the input to the module 3026. Preferably, the outputs of the hyperbolic tangent module 3022, the S-type module 3024, and the soft-sum module 3026 are extended to standard form (e.g., with leading zeros added as necessary) and aligned so as to have the binary point specified by the output binary point 2954 value.
Corrector 3018 produces a rectified version of the output value of the CCS and saturator 3008. That is, if the output value of the CCS and saturator 3008 (whose sign is piped down along with the value, as described above) is negative, the corrector 3018 outputs a value of zero; otherwise, the corrector 3018 outputs its input value. Preferably, the output of the corrector 3018 is in standard form and has the binary point specified by the output binary point 2954 value.
Reciprocal multiplier 3014 multiplies the output of CCS and saturator 3008 with a user-specified reciprocal value specified in reciprocal value 2942 to produce its product of standard size, which is effectively the quotient of the output of CCS and saturator 3008 and a divisor of the reciprocal 2942 value. Preferably, the output of reciprocal multiplier 3014 is in standard form and has a binary point specified by the value of output binary point 2954.
Right shifter 3016 shifts the output of CCS and saturator 3008 by the user-specified number of bits specified in shift magnitude 2944 to produce its standard-sized quotient. Preferably, the output of right shifter 3016 is in standard form and has a binary point specified by the value of output binary point 2954.
The multiplexer 3032 selects the appropriate input as specified by the activation function 2934 value and provides the selection to the symbol recoverer 3034, which converts the positive-form output of the multiplexer 3032 back to a negative form, e.g., to two's-complement form, if the original accumulator 202 value 217 was a negative value.
The magnitude converter 3036 converts the output of the symbol recoverer 3034 to the appropriate size based on the output command 2956 value described above with respect to fig. 29A. Preferably, the output of the symbol recoverer 3034 has a binary point specified by the output binary point 2954 value. Preferably, for the first predetermined value of the output command 2956, the magnitude converter 3036 discards the bits of the upper half of the output of the symbol recoverer 3034. Furthermore, if the output of the symbol recoverer 3034 is positive and exceeds the maximum value expressible in the word size specified by configuration 2902, or is negative and is less than the minimum value expressible in that word size, the saturator 3036 saturates its output to the maximum/minimum value expressible in the word size, respectively. For the second and third predetermined values, the magnitude converter 3036 passes the output of the symbol recoverer 3034 through.
Multiplexer 3037 selects either the output of size converter and saturator 3036 or output 217 of accumulator 202 to provide to output register 3038 based on output command 2956. More specifically, for a first predetermined value and a second predetermined value of output command 2956, multiplexer 3037 selects the lower-order word (whose size is specified by configuration 2902) of the output of size converter and saturator 3036. For a third predetermined value, multiplexer 3037 selects the high order word of the output of size converter and saturator 3036. For a fourth predetermined value, multiplexer 3037 selects the lower word of value 217 of the original accumulator 202; for a fifth predetermined value, multiplexer 3037 selects the middle word of value 217 of the original accumulator 202; and for a sixth predetermined value, multiplexer 3037 selects the upper word of the original accumulator 202 value 217. As described above, AFU 212 preferably fills in zeros in the upper bits of the upper word of value 217 of original accumulator 202.
Referring now to FIG. 31, an example of the operation of AFU 212 of FIG. 30 is shown. As shown, the configuration 2902 is set to the narrow configuration of the NPUs 126. The signed data 2912 and signed weight 2914 values are true. Additionally, the data binary point 2922 value indicates that the binary point for the data RAM 122 words is located such that there are 7 bits to the right of the binary point, and an example value of the first data word received by one of the NPUs 126 is shown as 0.1001110. Further, the weight binary point 2924 value indicates that the binary point for the weight RAM 124 words is located such that there are 3 bits to the right of the binary point, and an example value of the first weight word received by one of the NPUs 126 is shown as 00001.010.
The 16-bit product of the first data word and the first weight word (which is accumulated with the initial zero value of accumulator 202) is shown as 000000.1100001100. Because the data binary point 2922 is 7 and the weight binary point 2924 is 3, the implied accumulator 202 binary point is located such that there are 10 bits to the right of the binary point. In the case of the narrow configuration, the accumulator 202 is 28 bits wide in the exemplary embodiment. In the example, the accumulator 202 value 217 after all of the ALU operations (e.g., all 1024 multiply-accumulates of FIG. 20) are performed is shown as 000000000000000001.1101010100.
The output binary point 2954 value indicates that the output binary point is located such that there are 7 bits to the right of the binary point. Therefore, after passing through the OBPA 3002 and the CCS 3008, the accumulator 202 value 217 is scaled, rounded, and compressed to the standard form value 000000001.1101011. In this example, the output binary point location indicates 7 fraction bits, and the accumulator 202 binary point location indicates 10 fraction bits. Therefore, the OBPA 3002 calculates a difference of 3 and scales the accumulator 202 value 217 by shifting it right 3 bits. This is indicated in fig. 31 by the loss of the 3 least significant bits (binary 100) of the accumulator 202 value 217. Further, in this example, the rounding control 2932 value indicates that random rounding is used, and it is assumed in the example that the sampled random bit 3005 is true. Consequently, according to the description above, the least significant bit is rounded up, because the round bit of the accumulator 202 value 217 (the most significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) is 1, and the sticky bit (the Boolean OR of the 2 least significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) is 0.
In this example, the activation function 2934 indicates that a sigmoid function is to be used. Consequently, the bit selector 3012 selects the bits of the standard-form value such that the input to the S-type module 3024 has three integer bits and four fraction bits, as described above, i.e., the value 001.1101, as shown. The S-type module 3024 outputs a value that is put in standard form, which is the value 000000000.1101110 shown.
Output command 2956 of this example specifies a first predetermined value, namely the word size represented by configuration 2902, in this case a narrow word (8 bits). Thus, the size converter 3036 converts the standard S-type output value into an 8-bit quantity with an implied binary point positioned 7 bits to the right of the binary point, producing an output value 01101110 as shown.
Referring now to FIG. 32, a second example of the operation of AFU 212 of FIG. 30 is shown. The example of FIG. 32 illustrates the operation of AFU 212 where activation function 2934 represents passing the value 217 of accumulator 202 at a standard size. As shown, the configuration 2902 is set to a narrow configuration of the NPU 126.
In this example, the accumulator 202 is 28 bits wide, and the accumulator 202 binary point is located such that there are 10 bits to the right of the binary point (either because the sum of the data binary point 2922 and the weight binary point 2924 is 10 according to one embodiment, or because the accumulator binary point 2923 is explicitly specified as having a value of 10 according to an alternate embodiment, as described above). In this example, FIG. 32 shows the accumulator 202 value 217 after all of the ALU operations are performed, which is 000001100000011011.1101111010.
In this example, the value of the output binary point 2954 indicates that the output binary point is positioned 4 bits to the right of the binary point. Thus, after passing through OBPA 3002 and CCS 3008, as shown, the accumulator 202 value 217 is saturated and compressed to a standard form value 111111111111.1111, which is received by multiplexer 3032 as a standard size pass value 3028.
In this example, two output command 2956 values are shown. The first output command 2956 specifies the second predetermined value, i.e., output the lower word of the standard-form size. Since the size indicated by configuration 2902 is a narrow word (8 bits), which implies a standard size of 16 bits, the size converter 3036 selects the lower 8 bits of the standard-size pass-through value 3028 to produce the 8-bit value 11111111, as shown. The second output command 2956 specifies the third predetermined value, i.e., output the upper word of the standard-form size. Consequently, the size converter 3036 selects the upper 8 bits of the standard-size pass-through value 3028 to produce the 8-bit value 11111111, as shown.
Referring now to FIG. 33, a third example of the operation of AFU 212 of FIG. 30 is shown. The example of FIG. 33 illustrates the operation of AFU 212 where activation function 2934 indicates that the entire original accumulator 202 value 217 is to be passed through. As shown, configuration 2902 is set to the wide configuration (e.g., 16-bit input word) of NPU 126.
In this example, the accumulator 202 is 41 bits wide and there are 8 bits to the right of the binary point of the accumulator 202 (either because, as described above, the sum of the data binary point 2912 and the weight binary point 2914 is 8 according to one embodiment, or because the accumulator binary point 2923 is explicitly specified as having a value of 8 according to another embodiment). In this example, FIG. 33 shows the value 217 of the accumulator 202 after all the ALU operations have been performed, namely 001000000000000000001100000011011.11011110.
In this example, three output commands 2956 are shown. First output command 2956 specifies a fourth predetermined value, the lower word of the output original accumulator 202 value; a second output command 2956 specifies a fifth predetermined value, the middle word of the original accumulator 202 value is output; and a third output command 2956 specifies a sixth predetermined value, i.e., the upper word of the original accumulator 202 value is output. Since the size indicated by configuration 2902 is a wide word (16 bits), fig. 33 shows that in response to first output command 2956, multiplexer 3037 selects a 16-bit value 0001101111011110; in response to the second output command 2956, the multiplexer 3037 selects the 16-bit value 0000000000011000; and in response to the third output command 2956, the multiplexer 3037 selects the 16-bit value 0000000001000000.
As described above, the NNU 121 advantageously performs operations on integer data rather than floating-point data. This advantageously simplifies each NPU 126, or at least the ALU 204 portion. For example, the ALU 204 need not include the adder that would be required in a floating-point implementation to add the exponents of the operands of the multiplier 242. Similarly, the ALU 204 need not include the shifter that would be required in a floating-point implementation to align the binary points of the addends of the adder 234. Those skilled in the art will appreciate that floating-point units are generally very complex; these are merely examples of simplifications to the ALU 204, and the integer embodiments described herein, with hardware fixed-point assist that enables the user to specify the relevant binary points, enjoy other simplifications as well. The fact that the ALUs 204 are integer units may advantageously result in smaller (and faster) NPUs 126 than floating-point embodiments, which further facilitates the integration of a large array of NPUs 126 into the NNU 121. Portions of the AFU 212 handle scaling and saturation of the accumulator 202 value 217 based on the, preferably user-specified, number of fractional bits desired in the accumulated value and the number of fractional bits desired in the output value. Advantageously, any additional complexity and attendant increase in size, power and/or time within the fixed-point hardware assist of the AFU 212 may be amortized by sharing the AFU 212 among the ALU 204 portions, because the number of AFUs 1112 may be reduced in a shared embodiment, as described with respect to the embodiment of FIG. 11.
Advantageously, the embodiments described herein enjoy many of the benefits associated with the reduced complexity of hardware integer arithmetic units (as compared to floating-point arithmetic units), while still providing arithmetic operations on fractional numbers, i.e., numbers with a binary point. Floating-point arithmetic has the advantage that it provides arithmetic operations for data whose individual values may fall anywhere within a very wide range (limited in practice only by the size of the exponent range, which may be very large); that is, each floating-point number has its own potentially unique exponent value. However, the embodiments described herein recognize and take advantage of the fact that there are certain applications in which the input data are highly parallelized and their values lie within a relatively narrow range, such that the "exponent" of all the parallel values can be the same. Thus, these embodiments enable the user to specify the binary point location once for all the input values and/or the accumulated values. Similarly, by recognizing and taking advantage of similar range characteristics of the parallel outputs, these embodiments enable the user to specify the binary point location once for all the output values. An artificial neural network is one example of such an application, although embodiments of the invention may also be used to perform computations for other applications. By specifying the binary point location once for the inputs rather than for each individual input number, the embodiments may use storage more efficiently (e.g., require less memory) than a floating-point implementation and/or achieve greater precision for a similar amount of memory, since the bits that would be used for an exponent in a floating-point implementation can instead be used to specify more precision in the magnitude.
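As a concrete illustration of specifying the binary point once for all parallel values, the following hedged sketch (not the patent's hardware; names and widths are illustrative) accumulates products of 8-bit data words and 8-bit weight words whose binary point positions are specified once, so the accumulator's binary point is simply the sum of the two:

#include <stdint.h>
#include <stdio.h>

#define DATA_FRAC_BITS   3                    /* fractional bits in every data word   */
#define WEIGHT_FRAC_BITS 7                    /* fractional bits in every weight word */
#define ACC_FRAC_BITS    (DATA_FRAC_BITS + WEIGHT_FRAC_BITS)

int main(void)
{
    int8_t  data[4]    = { 12, -5, 7, 3 };    /* each interpreted as value / 2^3 */
    int8_t  weights[4] = { 100, 64, -32, 8 }; /* each interpreted as value / 2^7 */
    int32_t acc = 0;                          /* wide accumulator, 10 fractional bits */
    for (int i = 0; i < 4; i++)
        acc += (int32_t)data[i] * (int32_t)weights[i];
    /* No per-value exponents are stored; all 8 bits of each word hold magnitude. */
    printf("accumulator = %f\n", acc / (double)(1 << ACC_FRAC_BITS));
    return 0;
}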
Advantageously, the embodiments recognize the potential loss of precision (e.g., overflow or loss of less significant fractional bits) that may be experienced while performing accumulations over a large series of integer operations, and provide a solution, primarily in the form of an accumulator large enough to avoid loss of precision.
Direct execution of NNU micro-operations
Referring now to FIG. 34, a block diagram illustrating partial details of the processor 100 and the NNU 121 of FIG. 1 is shown. The NNU 121 includes the pipeline stages 3401 of the NPUs 126. The pipeline stages 3401, separated by stage registers, include combinatorial logic, such as boolean logic gates, multiplexers, adders, multipliers, comparators, and so forth, that implement the operations of the NPUs 126 as described herein. The pipeline stages 3401 receive a micro-operation 3418 from a multiplexer 3402. The micro-operation 3418 flows down the pipeline stages 3401 and controls their combinatorial logic. The micro-operation 3418 is a collection of bits. Preferably, the micro-operation 3418 includes the bits of the memory address 123 of the data RAM 122, the bits of the memory address 125 of the weight RAM 124, the bits of the memory address 131 of the program memory 129, the control signals 213/713 of the multiplexing registers 208/705, the control signals 803 of the multiplexer 802, and many of the fields of the control register 127 (e.g., of FIGS. 29A-29C), among others. In one embodiment, the micro-operation 3418 includes approximately 120 bits. The multiplexer 3402 receives micro-operations from three different sources and selects one of them as the micro-operation 3418 to be provided to the pipeline stages 3401.
One source of micro-operations for multiplexer 3402 is sequencer 128 of FIG. 1. The sequencer 128 decodes NNU instructions received from the program memory 129 and in response generates micro-operations 3416 that are provided to a first input of the multiplexer 3402.
The second source of micro-operations for the multiplexer 3402 is a decoder 3404 that receives microinstructions 105 from the reservation stations 108 of FIG. 1 and operands from the GPRs 116 and the media registers 118. Preferably, as described above, the microinstructions 105 are generated by the instruction translator 104 in response to translating MTNN instructions 1400 and MFNN instructions 1500. The microinstructions 105 may include an immediate field that specifies a particular function (specified by an MTNN instruction 1400 or MFNN instruction 1500), such as starting and stopping execution of a program in the program memory 129, directly executing a micro-operation from the media registers 118, or reading/writing a memory of the NNU as described above. The decoder 3404 decodes the microinstruction 105 and in response generates a micro-operation 3412 that is provided to the second input of the multiplexer 3402. Preferably, for some functions 1432/1532 of the MTNN instruction 1400/MFNN instruction 1500, such as writing to the control register 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, waiting for a program in the program memory 129 to complete execution, reading from the status register 127, and resetting the NNU 121, the decoder 3404 need not generate a micro-operation 3412 to send down the pipeline 3401.
The third source of micro-operations for multiplexer 3402 is the media register 118 itself. Preferably, as described above with respect to fig. 14, the MTNN instruction 1400 may specify a function to instruct the NNU 121 to directly execute the micro-operation 3414 provided from the media register 118 to the third input of the multiplexer 3402. Directly executing the micro-operations 3414 provided by the architectural media registers 118 may be particularly useful for testing (e.g., built-in self-test (BIST)) and debugging the NNU 121.
Preferably, the decoder 3404 generates a mode indicator 3422 for controlling the selection of the multiplexer 3402. When the MTNN instruction 1400 specifies a function to start running the program from the program memory 129, the decoder 3404 generates the mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3416 from the sequencer 128 until an error occurs or until the decoder 3404 encounters the MTNN instruction 1400 that specifies a function to stop running the program from the program memory 129. When the MTNN instruction 1400 specifies a function to instruct the NNU 121 to directly execute the micro-operations 3414 provided from the media registers 118, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operations 3414 from the specified media registers 118. Otherwise, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3412 from the decoder 3404.
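The source selection just described may be summarized with the following hedged sketch (a software model only; the types and names are illustrative, and the real micro-operation 3418 is a roughly 120-bit hardware signal bundle):

#include <stdint.h>

typedef struct { uint8_t bits[15]; } uop_t;   /* ~120-bit micro-operation (model) */

enum uop_source { SRC_DECODER, SRC_SEQUENCER, SRC_MEDIA_REG };

/* Models multiplexer 3402: the mode indicator (generated by the decoder)
 * selects which of the three candidate micro-operations feeds the pipeline. */
static uop_t select_uop(enum uop_source mode,
                        uop_t from_decoder,     /* MTNN/MFNN-derived micro-operation    */
                        uop_t from_sequencer,   /* program running from program memory  */
                        uop_t from_media_reg)   /* direct execution for test/debug      */
{
    switch (mode) {
    case SRC_SEQUENCER: return from_sequencer;
    case SRC_MEDIA_REG: return from_media_reg;
    default:            return from_decoder;
    }
}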
Variable rate neural network unit
The following situation may exist: the NNU 121 runs a program and then sits idle waiting for the processor 100 to do something it needs to do before the NNU 121 can run its next program. For example, assume that, in a situation similar to that described with respect to FIGS. 3 through 6A, the NNU 121 runs a multiply-accumulate activation function program (which may also be referred to as a feed-forward neural network layer program) two or more times in succession. It may take the processor 100 significantly longer to write the 512KB of weight values that the NNU program will use on its next run into the weight RAM 124 than it takes the NNU 121 to run the program. In other words, the NNU 121 may run the program in a relatively short amount of time and then sit idle while the processor 100 finishes writing the next weight values into the weight RAM 124 for the next program run. This situation is illustrated visually in FIG. 36A, which is described in more detail below. In this case, it may be advantageous to run the NNU 121 at a slower rate and take longer to run the program, so that the energy the NNU 121 consumes to run the program is spread out over a longer period of time, which may tend to keep the NNU 121, and therefore the processor 100, at a lower temperature. This situation is referred to as mitigation mode and is illustrated visually in FIG. 36B, which is described in more detail below.
Referring now to FIG. 35, a block diagram illustrating a processor 100 having a variable rate NNU 121 is shown. The processor 100 is similar in many respects to the processor 100 of FIG. 1, and elements with the same reference numerals are identical. The processor 100 of FIG. 35 also includes clock generation logic 3502 coupled to the functional units of the processor 100, namely the instruction fetch unit 101, instruction cache 102, instruction translator 104, rename unit 106, reservation stations 108, NNU 121, other execution units 112, memory subsystem 114, general purpose registers 116 and media registers 118. The clock generation logic 3502 includes a clock generator, such as a phase-locked loop (PLL), that generates a clock signal having a master clock rate, or frequency. For example, the master clock rate may be 1GHz, 1.5GHz, 2GHz, and so forth. The clock rate is the number of cycles of the clock signal per second, e.g., the number of oscillations between high and low states. Preferably, the clock signal has a balanced duty cycle, i.e., high for half the cycle and low for the other half; alternatively, the clock signal has an unbalanced duty cycle in which the clock signal is in the high state longer than it is in the low state, or vice versa. Preferably, the PLL is configurable to generate the master clock signal at a plurality of clock rates. Preferably, the processor 100 includes a power management module that automatically adjusts the master clock rate based on various factors, including the dynamically sensed operating temperature, the utilization of the processor 100, and commands from system software (e.g., operating system, BIOS) that indicate desired performance and/or power-saving metrics. In one embodiment, the power management module includes microcode of the processor 100.
The clock generation logic 3502 also includes a clock distribution network, or clock tree. The clock tree distributes the master clock signal to the functional units of the processor 100, namely clock signal 3506-1 to the instruction fetch unit 101, clock signal 3506-2 to the instruction cache 102, clock signal 3506-10 to the instruction translator 104, clock signal 3506-9 to the rename unit 106, clock signal 3506-8 to the reservation stations 108, clock signal 3506-7 to the NNU 121, clock signal 3506-4 to the other execution units 112, clock signal 3506-3 to the memory subsystem 114, clock signal 3506-5 to the general purpose registers 116 and clock signal 3506-6 to the media registers 118, which are collectively referred to as clock signals 3506, as shown in FIG. 35. The clock tree includes nodes, or wires, that transmit the master clock signals 3506 to their respective functional units. Furthermore, the clock generation logic 3502 preferably includes clock buffers that regenerate the master clock signal (particularly for more distant nodes) and/or raise the voltage level of the master clock signal as needed to provide cleaner clock signals. Additionally, each functional unit may also include its own sub-clock tree, as needed, that regenerates and/or boosts the corresponding master clock signal 3506 it receives.
The NNU 121 includes clock reduction logic 3504 that receives a mitigation indicator 3512, receives the master clock signal 3506-7 and, in response, generates a secondary clock signal. The secondary clock signal has a clock rate that is either the same as the master clock rate or, when in mitigation mode, reduced relative to the master clock rate by an amount programmed into the mitigation indicator 3512, thereby potentially providing a thermal benefit. The clock reduction logic 3504 is similar in many respects to the clock generation logic 3502 in that it has a clock distribution network, or tree, that distributes the secondary clock signal to the various blocks of the NNU 121, namely clock signal 3508-1 to the array of NPUs 126, clock signal 3508-2 to the sequencer 128 and clock signal 3508-3 to the interface logic 3514, which are referred to collectively or individually as secondary clock signals 3508. Preferably, as shown with respect to FIG. 34, the NPUs 126 include a plurality of pipeline stages 3401 that include pipeline stage registers that receive the secondary clock signal 3508-1 from the clock reduction logic 3504.
The NNU 121 also includes interface logic 3514 that receives the master clock signal 3506-7 and the secondary clock signal 3508-3. The interface logic 3514 is coupled between the lower portions of the front end of the processor 100 (e.g., the reservation stations 108, media registers 118 and general purpose registers 116) and the various blocks of the NNU 121, namely the clock reduction logic 3504, the data RAM 122, the weight RAM 124, the program memory 129 and the sequencer 128. The interface logic 3514 includes a data RAM buffer 3522, a weight RAM buffer 3524, the decoder 3404 of FIG. 34 and the mitigation indicator 3512. The mitigation indicator 3512 holds a value that specifies how fast the array of NPUs 126 will execute the NNU program instructions. Preferably, the mitigation indicator 3512 specifies a divisor value N by which the clock reduction logic 3504 divides the master clock signal 3506-7 to generate the secondary clock signal 3508, such that the rate of the secondary clock signal is 1/N. Preferably, the value of N may be programmed to any of a plurality of different predetermined values to cause the clock reduction logic 3504 to generate the secondary clock signal 3508 at a corresponding plurality of different rates that are all less than the master clock rate.
In one embodiment, the clock reduction logic 3504 includes a clock divider circuit that divides the master clock signal 3506-7 by the value of the mitigation indicator 3512. In one embodiment, the clock reduction logic 3504 includes clock gates (e.g., AND gates) that gate the master clock signal 3506-7 with an enable signal that is true only once every N cycles of the master clock signal 3506-7. For example, the enable signal may be generated by a circuit that includes a counter that counts up to N. When accompanying logic detects that the counter output matches N, the logic generates a true pulse on the secondary clock signal 3508 and resets the counter. Preferably, the value of the mitigation indicator 3512 is programmable by an architectural instruction, such as the MTNN instruction 1400 of FIG. 14. Preferably, as described in more detail with respect to FIG. 37, the architectural program running on the processor 100 programs the mitigation value into the mitigation indicator 3512 just before it instructs the NNU 121 to start running the NNU program.
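The clock-gating style of clock reduction described above can be modeled in software as follows (a hedged illustration only, not the hardware): an enable pulse is produced once every N master clock cycles, so the gated secondary clock runs at 1/N of the master rate.

#include <stdio.h>

int main(void)
{
    const int N = 4;                 /* value programmed into the mitigation indicator */
    int counter = 0;
    for (int master_cycle = 0; master_cycle < 16; master_cycle++) {
        int enable = (counter == N - 1);        /* true once every N master cycles */
        counter = enable ? 0 : counter + 1;     /* reset after generating the pulse */
        if (enable)
            printf("secondary clock pulse at master cycle %d\n", master_cycle);
    }
    return 0;
}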
The weight RAM buffer 3524 is coupled between the weight RAM 124 and the media registers 118 for buffering transfers of data between them. Preferably, the weight RAM buffer 3524 is similar to one or more of the embodiments of the buffer 1704 of FIG. 17. Preferably, the portion of the weight RAM buffer 3524 that receives data from the media registers 118 is clocked by the master clock signal 3506-7 at the master clock rate, and the portion of the weight RAM buffer 3524 that receives data from the weight RAM 124 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the master clock rate depending upon the value programmed into the mitigation indicator 3512 (i.e., depending upon whether the NNU 121 is operating in mitigation mode or normal mode). In one embodiment, as described above with respect to FIG. 17, the weight RAM 124 is single-ported and is accessible in an arbitrated fashion both by the media registers 118 via the weight RAM buffer 3524 and by the NPUs 126 or the row buffer 1104 of FIG. 11. In an alternative embodiment, as described above with respect to FIG. 16, the weight RAM 124 is dual-ported, and each port is accessible in parallel both by the media registers 118 via the weight RAM buffer 3524 and by the NPUs 126 or the row buffer 1104.
Likewise, the data RAM buffer 3522 is coupled between the data RAM 122 and the media registers 118 for buffering transfers of data between them. Preferably, the data RAM buffer 3522 is similar to one or more of the embodiments of the buffer 1704 of FIG. 17. Preferably, the portion of the data RAM buffer 3522 that receives data from the media registers 118 is clocked by the master clock signal 3506-7 at the master clock rate, and the portion of the data RAM buffer 3522 that receives data from the data RAM 122 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the master clock rate depending upon the value programmed into the mitigation indicator 3512 (i.e., depending upon whether the NNU 121 is operating in mitigation mode or normal mode). In one embodiment, as described above with respect to FIG. 17, the data RAM 122 is single-ported and is accessible in an arbitrated fashion both by the media registers 118 via the data RAM buffer 3522 and by the NPUs 126 or the row buffer 1104 of FIG. 11. In an alternative embodiment, as described above with respect to FIG. 16, the data RAM 122 is dual-ported, and each port is accessible in parallel both by the media registers 118 via the data RAM buffer 3522 and by the NPUs 126 or the row buffer 1104.
Preferably, the interface logic 3514 includes the data RAM buffer 3522 and the weight RAM buffer 3524 to provide synchronization between the master clock domain and the secondary clock domain, regardless of whether the data RAM 122 and/or the weight RAM 124 are single-ported or dual-ported. Preferably, the data RAM 122, the weight RAM 124 and the program memory 129 each comprise static RAM (SRAM) with respective read enable, write enable and memory select signals.
As described above, the NNU 121 is an execution unit of the processor 100. An execution unit is a functional unit of a processor that executes the microinstructions into which architectural instructions are translated (such as the microinstructions 105 into which the architectural instructions 103 of FIG. 1 are translated) or that executes the architectural instructions 103 themselves. An execution unit receives operands from the general purpose registers of the processor, such as the GPRs 116 and the media registers 118. An execution unit, in response to executing a microinstruction or architectural instruction, generates a result that may be written to a general purpose register. Examples of the architectural instructions 103 are the MTNN instruction 1400 and the MFNN instruction 1500 described with respect to FIGS. 14 and 15, respectively. The microinstructions implement the architectural instructions; more specifically, the collective execution by the execution unit of the one or more microinstructions into which an architectural instruction is translated performs the operation specified by the architectural instruction on the inputs specified by the architectural instruction to produce the result defined by the architectural instruction.
Referring now to FIG. 36A, a timing diagram illustrating an example of the operation of the processor 100 with the NNU 121 operating in normal mode, i.e., at the master clock rate, is shown. In the timing diagram, time progresses from left to right. The processor 100 runs the architectural program at the master clock rate. More specifically, the front end of the processor 100 (e.g., the instruction fetch unit 101, instruction cache 102, instruction translator 104, rename unit 106 and reservation stations 108) fetches, decodes and issues architectural instructions to the NNU 121 and the other execution units 112 at the master clock rate.
Initially, the architectural program executes an architectural instruction (e.g., an MTNN instruction 1400) that the front end of the processor 100 issues to the NNU 121 to instruct the NNU 121 to start running its NNU program in the program memory 129. Prior to this, the architectural program executed an architectural instruction to write into the mitigation indicator 3512 a value that specifies the master clock rate, i.e., to place the NNU 121 in normal mode. More specifically, the value programmed into the mitigation indicator 3512 causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at the master clock rate of the master clock signal 3506. Preferably, in this case, the clock buffers of the clock reduction logic 3504 simply buffer the master clock signal 3506. Also prior to this, the architectural program executed architectural instructions to write the data RAM 122 and the weight RAM 124 and to write the NNU program into the program memory 129. In response to the MTNN instruction 1400 that starts the NNU program, the NNU 121 starts running the NNU program at the master clock rate, since the mitigation indicator 3512 was programmed with the master rate value. After starting the NNU 121 running, the architectural program continues executing architectural instructions at the master clock rate, including primarily writing and/or reading the data RAM 122 and the weight RAM 124, in preparation for the next instance, or invocation, or run of the NNU program.
As shown in the example of FIG. 36A, the NNU 121 finishes running the NNU program in significantly less time (e.g., one-fourth of the time) than the architectural program takes to finish writing/reading the data RAM 122 and the weight RAM 124. For example, the NNU 121 may take approximately 1000 clock cycles to run the NNU program, whereas the architectural program takes approximately 4000 clock cycles to run, both at the master clock rate. Consequently, the NNU 121 sits idle for the remainder of the time, which in this example is a significant amount of time, approximately 3000 cycles of the master clock rate. As shown in the example of FIG. 36A, this pattern continues another time, and possibly many more times, depending upon the size and configuration of the neural network. Because the NNU 121 may be a relatively large and transistor-dense functional unit of the processor 100, it may generate a significant amount of heat, particularly when running at the master clock rate.
Referring now to FIG. 36B, a timing diagram illustrating an example of the operation of the processor 100 with the NNU 121 operating in mitigation mode, i.e., at a rate less than the master clock rate, is shown. The timing diagram of FIG. 36B is similar in many respects to the timing diagram of FIG. 36A, in that the processor 100 runs the architectural program at the master clock rate. And it is assumed in this example that the architectural program and the NNU program of FIG. 36B are the same as those of FIG. 36A. However, prior to starting the NNU program, the architectural program executes an MTNN instruction 1400 that programs the mitigation indicator 3512 with a value that causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at a secondary clock rate that is less than the master clock rate. That is, the architectural program puts the NNU 121 in the mitigation mode of FIG. 36B rather than the normal mode of FIG. 36A. Consequently, the NPUs 126 execute the NNU program at the secondary clock rate, which in mitigation mode is less than the master clock rate. In this example, assume the mitigation indicator 3512 is programmed with a value that specifies the secondary clock rate to be one-fourth the master clock rate. As a result, as may be seen by comparing FIGS. 36A and 36B, the NNU 121 takes four times longer to run the NNU program in mitigation mode than it does in normal mode, so that the amount of time the NNU 121 sits idle is relatively short. Consequently, the energy used to run the NNU program is consumed by the NNU 121 of FIG. 36B over a period that is approximately four times as long as when the NNU 121 ran the program in normal mode in FIG. 36A. Accordingly, the NNU 121 running the NNU program in FIG. 36B generates heat at approximately one-fourth the rate of FIG. 36A, which may have the thermal benefits described herein.
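Using the illustrative cycle counts above, a reasonable way for the architectural program to pick the mitigation divisor is simply the ratio of the two durations; the sketch below is a hedged illustration (the figures are the example's, not measurements, and the patent defines no such API):

#include <stdio.h>

int main(void)
{
    const long nnu_cycles  = 1000;   /* NNU program run time at the master clock rate */
    const long arch_cycles = 4000;   /* architectural program weight/data write time  */
    long n = arch_cycles / nnu_cycles;          /* candidate secondary-clock divisor  */
    if (n < 1)
        n = 1;                                  /* never request a rate above the master rate */
    printf("mitigation divisor N = %ld\n", n);  /* 4: NNU busy for roughly the whole period   */
    return 0;
}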
Referring now to FIG. 37, a flowchart illustrating operation of the processor 100 of FIG. 35 is shown. The operations illustrated by this flowchart are in many respects identical to the operations described above with respect to fig. 35, 36A, and 36B. Flow begins at block 3702.
At block 3702, the processor 100 executes the MTNN instruction 1400 to write the weights to the weight RAM 124 and to write the data to the data RAM 122. Flow proceeds to block 3704.
At block 3704, the processor 100 executes an MTNN instruction 1400 to program the mitigation indicator 3512 with a value that specifies a rate lower than the master clock rate, i.e., to place the NNU 121 in mitigation mode. Flow proceeds to block 3706.
At block 3706, in a manner similar to that shown in FIG. 36B, the processor 100 executes an MTNN instruction 1400 to instruct the NNU 121 to start running the NNU program. Flow proceeds to block 3708.
At block 3708, the NNU 121 begins running the NNU program. In parallel, the processor 100 executes the MTNN instruction 1400 to write new weights to the weight RAM 124 (and possibly new data to the data RAM 122), and/or executes the MFNN instruction 1500 to read results from the data RAM 122 (and possibly results from the weight RAM 124). Flow proceeds to block 3712.
At block 3712, the processor 100 executes the MFNN instruction 1500 (e.g., reading the status register 127) to detect that the NNU 121 has finished running its program. Assuming the architectural program selected a good value for the mitigation indicator 3512, the amount of time it takes the NNU 121 to run the NNU program will be approximately the same as the amount of time it takes the processor 100 to execute the portion of the architectural program that accesses the weight RAM 124 and/or the data RAM 122, as shown in FIG. 36B. Flow proceeds to block 3714.
At block 3714, the processor 100 executes an MTNN instruction 1400 to program the mitigation indicator 3512 with a value that specifies the master clock rate, i.e., to place the NNU 121 in normal mode. Flow proceeds to block 3716.
At block 3716, in a manner similar to that shown in FIG. 36A, the processor 100 executes an MTNN instruction 1400 to instruct the NNU 121 to start running the NNU program. Flow proceeds to block 3718.
At block 3718, the NNU 121 begins running the NNU program in the normal mode. Flow ends at block 3718.
As may be observed from the above, running the NNU program in mitigation mode spreads out, relative to normal mode (i.e., at the master clock rate of the processor), the time over which the NNU runs the program, which may provide a thermal benefit. More specifically, the devices (e.g., transistors, capacitors, wires) will likely operate at lower temperatures while the NNU runs the program in mitigation mode because the NNU produces at a slower rate the heat that is dissipated by the NNU (e.g., the semiconductor devices, metal layers and underlying substrate) and the surrounding package and cooling solution (e.g., heat sinks, fans). This, in general, also lowers the temperature of the devices in other portions of the processor die. The lower operating temperature of the devices, in particular their junction temperatures, may have the benefit of less leakage current. Furthermore, since the amount of current drawn per unit time is smaller, inductive noise and IR drop noise may be reduced. Still further, the lower temperature has a positive effect on the negative-bias temperature instability (NBTI) and positive-bias temperature instability (PBTI) of the MOSFETs of the processor, thereby increasing the reliability and/or lifetime of the devices and, consequently, of the processor part. The lower temperature may also mitigate Joule heating and electromigration in the metal layers of the processor.
Communication mechanism between architectural and non-architectural programs regarding NNU shared resources
As described above, taking FIGS. 24 through 28 and FIGS. 35 through 37 as examples, the data RAM 122 and the weight RAM 124 are shared resources. Both the NPUs 126 and the front end of the processor 100 share the data RAM 122 and the weight RAM 124. More specifically, both the NPUs 126 and the front end of the processor 100 (e.g., the media registers 118) write to and read from the data RAM 122 and the weight RAM 124. In other words, the architectural program running on the processor 100 shares the data RAM 122 and the weight RAM 124 with the NNU program running on the NNU 121, and, as described above, in some situations this requires flow control between the architectural program and the NNU program. This resource sharing also applies to some extent to the program memory 129, because the architectural program writes it and the sequencer 128 reads it. The embodiments described herein provide a high-performance solution for controlling the flow of access to the shared resources between the architectural program and the NNU program.
In the embodiments described herein, the NNU program is also referred to as a non-architectural program, the NNU instructions are also referred to as non-architectural instructions, and the NNU instruction set (also referred to above as the NPU instruction set) is also referred to as a non-architectural instruction set. The non-architectural instruction set is different from the architectural instruction set. In embodiments in which the processor 100 includes an instruction translator 104 for translating architectural instructions into microinstructions, the non-architectural instruction set is also different from the microinstruction set.
Referring now to FIG. 38, a block diagram illustrating the sequencer 128 of the NNU 121 in greater detail is shown. As described above, the sequencer 128 provides the memory address 131 to the program memory 129 to select a non-architectural instruction that is provided to the sequencer 128. As shown in FIG. 38, the memory address 131 is held in a program counter 3802 of the sequencer 128. The sequencer 128 generally increments through sequential addresses of the program memory 129 unless it encounters a non-architectural control instruction, such as a loop or branch instruction, in which case the sequencer 128 updates the program counter 3802 to the target address of the control instruction, i.e., to the address of the non-architectural instruction at the target of the control instruction. Thus, the address 131 held in the program counter 3802 specifies the address in the program memory 129 of the non-architectural instruction of the non-architectural program currently being fetched for execution by the NPUs 126. Advantageously, the value of the program counter 3802 may be obtained by the architectural program via the NNU program counter field 3912 of the status register 127, as described below with respect to FIG. 39. This enables the architectural program to decide where to read/write data with respect to the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program.
The sequencer 128 also includes a loop counter 3804 that is used in conjunction with non-architectural loop instructions, such as the loop-to-1 instruction at address 10 of FIG. 26A and the loop-to-1 instruction at address 11 of FIG. 28. In the examples of FIGS. 26A and 28, the loop counter 3804 is loaded with a value specified in the non-architectural initialize instruction at address 0, e.g., the value 400. Each time the sequencer 128 encounters the loop instruction and jumps to the target instruction (e.g., the multiply-accumulate instruction at address 1 of FIG. 26A or the maxwacc instruction at address 1 of FIG. 28), the sequencer 128 decrements the loop counter 3804. Once the loop counter 3804 reaches zero, the sequencer 128 proceeds to the next sequential non-architectural instruction. In an alternative embodiment, the loop counter 3804 is loaded with a loop count value specified in the loop instruction the first time the loop instruction is encountered, which obviates the need to initialize the loop counter 3804 via a non-architectural initialize instruction. Thus, the value of the loop counter 3804 indicates how many more times the loop body of the non-architectural program is to be executed. Advantageously, the value of the loop counter 3804 may be obtained by the architectural program via the loop count 3914 field of the status register 127, as described below with respect to FIG. 39. This enables the architectural program to decide where to read/write data with respect to the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program. In one embodiment, the sequencer 128 includes three additional loop counters to accommodate nested loops in the non-architectural program, and the values of the other three loop counters may also be read via the status register 127. A bit in the loop instruction indicates which of the four loop counters is used by that loop instruction.
The sequencer 128 also includes an iteration counter 3806. The iteration counter 3806 is used in conjunction with non-architectural instructions such as the multiply-accumulate instruction at address 2 of FIGS. 4, 9, 20 and 26A and the maxwacc instruction at address 2 of FIG. 28, which will be referred to hereafter as "execute" instructions. In the above examples, the execute instructions specify an iteration count of 511, 1023, 2 and 3, respectively. When the sequencer 128 encounters an execute instruction that specifies a non-zero iteration count, the sequencer 128 loads the iteration counter 3806 with the specified value. Additionally, the sequencer 128 generates an appropriate micro-operation 3418 to control the logic in the NPU 126 pipeline stages 3401 of FIG. 34 for execution and decrements the iteration counter 3806. If the iteration counter 3806 is greater than zero, the sequencer 128 again generates the appropriate micro-operation 3418 to control the logic in the NPUs 126 and decrements the iteration counter 3806. The sequencer 128 continues in this fashion until the iteration counter 3806 reaches zero. Thus, the value of the iteration counter 3806 indicates how many more times the operation specified by the non-architectural execute instruction (e.g., multiply-accumulate of, or maximum of, the accumulator and a data/weight word) remains to be performed. Advantageously, the value of the iteration counter 3806 may be obtained by the architectural program via the iteration count 3916 field of the status register 127, as described below with respect to FIG. 39. This enables the architectural program to decide where to read/write data with respect to the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program.
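The interplay of the loop counter 3804 and iteration counter 3806 described above can be modeled with the following hedged sketch (a software illustration only; counts and structure are simplified):

#include <stdio.h>

int main(void)
{
    int loop_counter = 3;            /* e.g., loaded by the initialize instruction */
    while (1) {
        /* "execute" instruction at the top of the loop body: repeat its
         * micro-operation while the iteration counter remains non-zero.          */
        int iteration_counter = 2;   /* iteration count specified by the instruction */
        do {
            printf("issue micro-operation 3418\n");
        } while (--iteration_counter > 0);
        /* loop instruction at the end of the body: decrement and jump back
         * to the target until the loop counter reaches zero.                     */
        if (--loop_counter > 0)
            continue;
        break;                       /* fall through to the next sequential instruction */
    }
    return 0;
}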
Referring now to FIG. 39, a block diagram illustrating certain fields of the control and status register 127 of the NNU 121 is shown. As described above with respect to FIG. 26B, the fields include the address 2602 of the weight RAM row most recently written by the NPUs 126 executing the non-architectural program, the address 2604 of the weight RAM row most recently read by the NPUs 126 executing the non-architectural program, the address 2606 of the data RAM row most recently written by the NPUs 126 executing the non-architectural program, and the address 2608 of the data RAM row most recently read by the NPUs 126 executing the non-architectural program. Additionally, the fields include an NNU program counter 3912, a loop count 3914 and an iteration count 3916. As described above, the architectural program may read the status register 127 (e.g., with the MFNN instruction 1500) into the media registers 118 and/or the general purpose registers 116, including the NNU program counter 3912, loop count 3914 and iteration count 3916 field values. The value of the program counter 3912 reflects the value of the program counter 3802 of FIG. 38. The value of the loop count 3914 reflects the value of the loop counter 3804. The value of the iteration count 3916 reflects the value of the iteration counter 3806. In one embodiment, the sequencer 128 updates the program counter 3912, loop count 3914 and iteration count 3916 field values each time it modifies the program counter 3802, loop counter 3804 or iteration counter 3806, so that the field values are current when read by the architectural program. In another embodiment, when the NNU 121 executes an architectural instruction that reads the status register 127, the NNU 121 simply obtains the values of the program counter 3802, loop counter 3804 and iteration counter 3806 and provides them back in response to the architectural instruction (e.g., into a media register 118 or general purpose register 116).
As may be observed from the above, the values of the fields of the status register 127 of FIG. 39 may be characterized as information about the progress of the non-architectural program during its execution by the NNU. Certain specific aspects of the non-architectural program's progress have been described above, such as the program counter 3802 value, the loop counter 3804 value, the iteration counter 3806 value, the most recently written/read 2602/2604 weight RAM 124 address 125, and the most recently written/read 2606/2608 data RAM 122 address 123. The architectural program executing on the processor 100 may read the non-architectural program progress values of FIG. 39 from the status register 127 and use the information to make decisions, e.g., via architectural instructions such as compare and branch instructions. For example, the architectural program decides which rows to write/read data/weights to/from with respect to the data RAM 122 and/or the weight RAM 124 in order to control the flow of data into and out of the data RAM 122 or the weight RAM 124, particularly for overlapping runs over large data sets and/or of different non-architectural instructions. Examples of such decisions made by the architectural program are described herein.
For example, as described above with respect to FIG. 26A, the architectural program configures the non-architectural program to write the results of the convolutions back to rows of the data RAM 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program reads the results from the data RAM 122 as the NNU 121 writes them, using the address 2606 of the most recently written data RAM 122 row.
For another example, as described above with respect to FIG. 26B, the architectural program uses the information from the fields of the status register 127 of FIG. 39 to determine the progress of the non-architectural program in performing the convolution of the data array 2404 of FIG. 24 in five chunks of 512 x 1600. The architectural program writes the first 512 x 1600 chunk of the 2560 x 1600 data array 2404 into the weight RAM 124 and starts the non-architectural program, which has a loop count of 1600 and an initialized weight RAM 124 output row of 0. As the NNU 121 executes the non-architectural program, the architectural program reads the status register 127 to determine the most recently written weight RAM 124 row 2602 so that it may read the valid convolution results written by the non-architectural program and overwrite them with the next 512 x 1600 chunk once it has read them, so that when the NNU 121 completes the non-architectural program on the first 512 x 1600 chunk, the processor 100 can immediately update the non-architectural program as needed and start it again to process the next 512 x 1600 chunk.
For another example, assume the architectural program has the NNU 121 perform a series of classic neural network multiply-accumulate activation function operations in which the weights are stored in the weight RAM 124 and the results are written back to the data RAM 122. In this case, once the non-architectural program has read a row of the weight RAM 124, it will not read it again. So, the architectural program may be configured to begin overwriting the weights in the weight RAM 124 with new weights for the next instance of execution of the non-architectural program (e.g., for the next neural network layer) once the current weights have been read/used by the non-architectural program. In this case, the architectural program reads the status register 127 to obtain the address 2604 of the most recently read weight RAM row in order to decide where it may write the new set of weights into the weight RAM 124.
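As a hedged illustration of the decision just described, the sketch below polls the most-recently-read weight RAM row (field 2604) and only overwrites rows the non-architectural program has already consumed; the read_weight_ram_row_read() and write_weight_ram_row() helpers are hypothetical stand-ins for MFNN/MTNN accesses and do not exist as a real API.

#include <stdbool.h>

extern int  read_weight_ram_row_read(void);                  /* hypothetical MFNN wrapper: field 2604 */
extern void write_weight_ram_row(int row, const void *data); /* hypothetical MTNN wrapper             */

/* Attempts to write the next layer's weights into rows [first_row, last_row],
 * touching only rows the NNU program has already read; returns true once the
 * whole range could be refilled. A real implementation would also remember
 * which rows it has already refilled and select per-row data accordingly.     */
static bool refill_weights(int first_row, int last_row, const void *next_weights)
{
    int already_read = read_weight_ram_row_read();   /* progress of the NNU program */
    for (int row = first_row; row <= last_row && row <= already_read; row++)
        write_weight_ram_row(row, next_weights);
    return already_read >= last_row;
}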
For another example, assume the architectural program knows that the non-architectural program includes an execute instruction with a large iteration count, such as the non-architectural multiply-accumulate instruction at address 2 of FIG. 20. In this case, the architectural program may need to know the iteration count 3916 in order to know approximately how many more clock cycles it will take to complete the non-architectural instruction so that the architectural program can decide which of two or more courses of action to take. For example, if the time is long, the architectural program may relinquish control to another architectural program, such as the operating system. Similarly, assume the architectural program knows that the non-architectural program includes a loop body with a large loop count, such as the non-architectural program of FIG. 28. In this case, the architectural program may need to know the loop count 3914 in order to know approximately how many more clock cycles it will take to complete the non-architectural program so that the architectural program can decide which of two or more courses of action to take.
For another example, assume the architectural program has the NNU 121 perform a pooling operation similar to that described with respect to FIGS. 27 and 28, in which the data to be pooled is stored in the weight RAM 124 and the results are written back to the weight RAM 124. However, assume that, unlike the examples of FIGS. 27 and 28, the results are written back to the upper 400 rows of the weight RAM 124, e.g., rows 1600 to 1999. In this case, once the non-architectural program has read four rows of the weight RAM 124 that it pools, it will not read them again. So, the architectural program may be configured to begin overwriting the data in the weight RAM 124 with new data once the current four rows have been read/used by the non-architectural program (e.g., with the weights for the next instance of execution of the non-architectural program, for example, a non-architectural program that performs classic multiply-accumulate activation function operations on the pooled data). In this case, the architectural program reads the status register 127 to obtain the address 2604 of the most recently read weight RAM row in order to decide where it may write the new set of weights into the weight RAM 124.
Dual use of memory arrays as NNU memory and cache memory
Referring now to FIG. 40, a block diagram illustrating a processor 4000 is shown. The processor 4000 includes a plurality of ring stations 4004 connected to one another in a bidirectional fashion to form a ring bus 4024. The embodiment of FIG. 40 includes six ring stations, denoted 4004-0, 4004-1, 4004-2, 4004-3, 4004-M and 4004-U. The processor 4000 includes four core complexes 4012, referred to individually as core complex 0 4012-0, core complex 1 4012-1, core complex 2 4012-2 and core complex 3 4012-3, which are coupled to the ring bus 4024 by the four ring stations 4004-0, 4004-1, 4004-2 and 4004-3, respectively. The processor 4000 also includes an uncore portion 4016, which includes the ring station 4004-U that couples the uncore 4016 to the ring bus 4024. Finally, the processor 4000 includes a dynamic random access memory (DRAM) controller 4018 and the NNU 121, which are coupled to the ring bus 4024 by the ring station 4004-M. As described in more detail below, the NNU 121 includes a memory array 4152 (see FIG. 41) that may be employed either as memory used by the array of NPUs 126 of the NNU 121 (e.g., as the weight RAM 124 of FIG. 1) or as cache memory shared by the core complexes 4012, e.g., as a victim cache or as a last-level cache (LLC) slice. Although the example of FIG. 40 includes four core complexes 4012, other embodiments with different numbers of core complexes 4012 are contemplated.
The uncore 4016 includes a bus controller 4014 that controls access by the processor 4000 to a system bus 4022 to which peripheral devices, such as video controllers, disk controllers, peripheral bus controllers (e.g., PCI-E), and so forth, may be coupled. In one embodiment, the system bus 4022 is the well-known V4 bus. The uncore 4016 may also include other functional units, such as a power management unit and private RAM (e.g., non-architectural memory used by the microcode of the cores 4002).
The DRAM controller 4018 controls DRAM (e.g., asynchronous DRAM or synchronous DRAM (SDRAM), such as double data rate synchronous DRAM, direct Rambus DRAM or reduced-latency DRAM) that serves as system memory. The core complexes 4012, the uncore 4016 and the NNU 121 access the system memory via the ring bus 4024. More specifically, the NNU 121 reads the weights and data of a neural network from system memory into the memory array 4152 and writes neural network results from the memory array 4152 to system memory via the ring bus 4024. Additionally, when operating as a victim cache (see 4006-4 of FIG. 41), the memory array 4152, under the control of the cache control logic 4108 (see FIG. 41), evicts cache lines to system memory. Further, when operating as an LLC slice (see 4006-4 of FIG. 41), the memory array 4152 and cache control logic 4108 fill cache lines from system memory and write back and evict cache lines to system memory.
The four core complexes 4012 include respective LLC slices 4006-0, 4006-1, 4006-2 and 4006-3, each of which is coupled to a ring station 4004 and which are referred to individually generically as LLC slice 4006 or collectively as LLC slices 4006. Each core 4002 includes a cache memory, such as a level-2 (L2) cache 4008 coupled to the ring station 4004. Each core 4002 may also include a level-1 cache (not shown). In one embodiment, the cores 4002 are x86 instruction set architecture (ISA) cores, although other embodiments are contemplated in which the cores 4002 are cores of another ISA, e.g., ARM, SPARC, MIPS.
As shown in FIG. 40, the LLC slices 4006-0, 4006-1, 4006-2 and 4006-3 collectively form an LLC 4005 of the processor 4000 shared by the core complexes 4012. Each LLC slice 4006 includes a memory array and cache control logic similar to the memory array 4152 and cache control logic 4108 of FIG. 41. As described in more detail below, a mode indicator (e.g., mode input 4199 of FIG. 41) may be set so that the memory array 4152 of the NNU 121 operates as an additional (e.g., fifth or ninth) slice 4006-4 of the LLC 4005. The memory array 4152 (and cache control logic 4108 of FIG. 41) that selectively constitutes the additional LLC slice 4006-4 is also referred to as the NNU LLC slice 4006-4. In one embodiment, the memory array 4152 and each LLC slice 4006 comprise a 2MB memory array, although other embodiments with different sizes are contemplated. Furthermore, embodiments in which the sizes of the memory array 4152 and of the LLC slices 4006 are different are contemplated. Preferably, the LLC 4005 is inclusive of the L2 caches 4008 and any other caches in the cache hierarchy (e.g., L1 caches).
The ring bus 4024, or ring 4024, is a scalable bidirectional interconnect that facilitates communication between coherent components, including the DRAM controller 4018, the uncore 4016 and the LLC slices 4006. The ring 4024 comprises two unidirectional rings, each of which further comprises five sub-rings: Request, for transmitting most types of request packets, including loads; Snoop, for transmitting snoop request packets; Acknowledge, for transmitting response packets; Data, for transmitting data packets and certain request items, including writes; and Credit, for sending and obtaining credits in remote queues. Each node attached to the ring 4024 is connected via a ring station 4004, which contains queues for sending and receiving packets on the ring 4024. A queue is either an egress queue that initiates requests onto the ring 4024 on behalf of an attached component for receipt by a remote ingress queue, or an ingress queue that receives requests from the ring 4024 for forwarding to an attached component. Before an egress queue initiates a request onto the ring, it first obtains a credit on the Credit ring from the remote destination ingress queue. This ensures that the remote ingress queue has resources available to process the request when it arrives. When an egress queue wishes to send a transaction packet onto the ring 4024, it may only do so if doing so does not preempt an incoming packet that is ultimately destined for a remote node. When an incoming packet arrives at a ring station 4004 from either direction, the packet's destination ID is interrogated to determine whether this ring station 4004 is the packet's final destination. If the destination ID is not equal to the node ID of the ring station 4004, the packet proceeds to the next ring station 4004 on the subsequent clock. Otherwise, the packet leaves the ring 4024 on the same clock to be consumed by whichever ingress queue is implicated by the packet's transaction type.
In general, the LLC 4005 comprises N LLC slices 4006, where each of the N slices 4006 is responsible for caching a distinct approximately 1/N of the physical address space of the processor 4000, as determined by a hashing algorithm, or simply hash. The hash is a function that takes a physical address as input and selects the appropriate LLC slice responsible for caching that physical address. When a request must be made to the LLC 4005, either from a core 4002 or from a snooping agent, the request must be sent to the appropriate LLC slice 4006, i.e., the one responsible for caching the physical address of the request. The appropriate LLC slice 4006 is determined by applying the hash to the physical address of the request.
The hashing algorithm is a surjective function whose domain is the set of physical addresses, or a subset thereof, and whose range is the number of LLC slices 4006 currently included, more specifically, the set of indices of the LLC slices 4006, e.g., 0 through 7 in the case of eight LLC slices 4006. The function may be computed by examining an appropriate subset of the physical address bits. For example, in a system with eight LLC slices 4006, the output of the hashing algorithm may simply be PA[10:8], i.e., three of the physical address bits, namely bits 8 through 10. In another embodiment in which the number of LLC slices 4006 is 8, the output of the hash is a logical function of other address bits, e.g., three bits generated as {PA[17], PA[14], PA[12]^PA[10]^PA[9]}.
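For example, the simple power-of-2 hash mentioned above can be written as follows (a hedged sketch; the function name is illustrative):

#include <stdint.h>
#include <stdio.h>

/* Slice selection for eight LLC slices: the slice index is simply PA[10:8]. */
static unsigned slice_for_pa_8(uint64_t pa)
{
    return (unsigned)((pa >> 8) & 0x7);
}

int main(void)
{
    printf("%u\n", slice_for_pa_8(0x12345ull));   /* bits 10:8 of 0x12345 -> 3 */
    return 0;
}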
All requestors of the LLC 4005 must use the same hashing algorithm at any time that LLC 4005 caching is being performed. Because the hash dictates where addresses are cached and where snoops are sent during operation, the hash is changed only through coordination among all the cores 4002, the LLC slices 4006 and the snooping agents. As described in more detail below with respect to FIGS. 42 and 43, updating the hashing algorithm essentially comprises: (1) synchronizing all the cores 4002 to prevent new cacheable accesses; (2) performing a write-back invalidate of all the LLC slices 4006 currently included in the LLC 4005, which causes modified cache lines to be written back to system memory and all cache lines to be invalidated (the write-back invalidate may be a selective write-back invalidate, described below, in which only those cache lines whose addresses are hashed to a different slice by the new hashing algorithm than by the old hashing algorithm are evicted, i.e., invalidated and, if modified, written back before being invalidated); (3) broadcasting a hash update message to each core 4002 and snoop source, which instructs them to change to the new hash (from the inclusive hash to the exclusive hash, or vice versa, as described below); (4) updating the mode input 4199 of the select logic 4158 (see FIG. 41) that controls access to the memory array 4152; and (5) resuming execution with the new hashing algorithm.
The hashing algorithms described above are useful when the number N of LLC slices 4006 is 8, a power of 2, and may easily be modified to accommodate other powers of 2, e.g., PA[9:8] for 4 slices or PA[11:8] for 16 slices. However, depending upon whether the NNU LLC slice 4006-4 is included in the LLC 4005 (and depending upon the number of core complexes 4012), N may or may not be a power of 2. Therefore, as described below with respect to FIGS. 42 and 43, at least two different hashes are used, depending upon the number of LLC slices 4006. That is, a first hash, referred to as the inclusive hash, is used when the NNU LLC slice 4006-4 is included in the LLC 4005, and a second hash, referred to as the exclusive hash, is used when the NNU LLC slice 4006-4 is not included in the LLC 4005.
One hashing algorithm outputs the remainder of PA[45:6] divided by N. This hash has the advantage of distributing the physical addresses among the N LLC slices 4006 in a substantially balanced fashion even when N is not a power of 2 (assuming a relatively uniform distribution of physical addresses). When N is a power of 2, the remainder operation may be accomplished by simply outputting the lower log2(N) bits of PA[45:6]. However, when N is not a power of 2, the remainder operation may disadvantageously require an integer division.
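A hedged sketch of this remainder-based hash follows (illustrative only); when N is a power of 2 the modulo reduces to masking off the low log2(N) bits, but otherwise it implies an integer division:

#include <stdint.h>
#include <stdio.h>

/* Slice index = PA[45:6] mod N. */
static unsigned slice_by_remainder(uint64_t pa, unsigned n_slices)
{
    uint64_t pa_45_6 = (pa >> 6) & ((1ull << 40) - 1);   /* extract bits 45..6 */
    return (unsigned)(pa_45_6 % n_slices);
}

int main(void)
{
    printf("%u\n", slice_by_remainder(0x123456789abcull, 9));   /* N = 9 slices */
    return 0;
}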
Another hash, which approximates the remainder of PA divided by N but uses a smaller subset of the physical address bits, is defined as follows; it can be implemented more efficiently in hardware when N is not a power of 2 (in this case, when N = 9):
[calc_hash() pseudocode listing, reproduced in the original as an image, not shown here]
In calc_hash(), PA[11:8] is used for the hash, but PA[10:8] is used instead if PA[11:8] > 8, where PA[10:8] is guaranteed to be within the range [0,7]. It can be seen that calc_hash() allocates physical addresses among the LLC slices 4006 in a somewhat less balanced manner than the remainder of PA[45:6] divided by N (again assuming a relatively uniform distribution of physical addresses), i.e., slices 0-6 have a probability of about 1/8, while slices 7 and 8 have a probability of about 1/16. It should be noted, however, that even the hashing algorithm that takes the remainder of PA[45:6] divided by N does not produce a perfectly balanced distribution, because PA[45:6] represents a domain whose number of elements is a power of 2 while N is not a power of 2.
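The calc_hash() listing itself appears only as an image in the original; the following sketch is an assumed reconstruction from the prose description above, and the exact mapping in the original figure may differ slightly:

def calc_hash(pa):
    # first hash into a power-of-2 range using PA[11:8] (values 0..15)
    idx = (pa >> 8) & 0xF
    if idx > 8:
        # the result does not correspond to an existing slice (0..8), so
        # fall back to PA[10:8], which is guaranteed to lie in [0, 7]
        idx = (pa >> 8) & 0x7
    return idx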
In general, when N is not a power of 2, the hashing algorithm first hashes the physical address to a range of 2^P possible results, where P = ceiling(log2(N)), and then maps any result of the first hash that is greater than or equal to N (i.e., a result that does not correspond to an existing slice 4006) to an output that is less than N (i.e., to an existing slice 4006).
Another hashing algorithm that is relatively efficiently implemented in hardware is defined as follows:
[calc_hash_2() pseudocode listing, reproduced in the original as an image, not shown here]
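Since the listing is omitted above, the following sketch is an assumption constructed only to match the distribution described in the next paragraph (slices 0-7 at about 7/64 each, slice 8 at about 8/64), using six address bits assumed here to be PA[13:8]; the bit selection in the original figure may differ:

def calc_hash_2(pa):
    v = (pa >> 8) & 0x3F      # six address bits, assumed to be PA[13:8]
    if v >= 56:
        return 8              # 8 of the 64 values map to the NNU slice: 8/64
    return v & 0x7            # the remaining 56 values spread evenly: 7/64 per slice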
calc _ hash _2() is able to distribute physical addresses between LLC slices 4006 in a relatively more balanced manner than Calc _ hash () (again assuming a relatively even distribution of physical addresses), i.e., slices 0-7 have a probability of about 7/64, while slice 8 has a probability of about 8/64 or 1/8. In other embodiments, the calc _ hash () and calc _ hash _2() algorithms may be modified to allocate physical addresses in an even relatively more balanced manner by using an even greater number of physical address bits.
As described herein, embodiments advantageously employ two different hashing algorithms: one that does not include the memory array 4152 as an LLC slice 4006, and one that does. For example, in a processor 4000 with 8 core complexes 4012 and their corresponding LLC slices 4006, the hash that does not include the memory array 4152 may be PA[10:8], and the hash that includes the memory array 4152 may be calc_hash_2(PA), described above, which maps to 9 different slices. In one embodiment, the two hashing algorithms may advantageously be designed to support selective write back invalidation as an optimization, i.e., performing a write back invalidation of only those cache lines that the inclusive and exclusive hashing algorithms hash to different slices. In some embodiments, as described in more detail below with respect to the blocks of FIGS. 42 and 43, for the transition from the inclusive hash to the exclusive hash this requires that only the NNU LLC slice 4006-4 be write back invalidated (i.e., the non-NNU LLC slices 4006 need not be write back invalidated); and for the transition from the exclusive hash to the inclusive hash, only those cache lines in the non-NNU LLC slices 4006 whose addresses the inclusive and exclusive hashing algorithms hash to different slices need to be evicted. An embodiment in which the exclusive hash is PA[10:8] or the like and the inclusive hash is calc_hash_2 or the like is such an embodiment.
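Using the slice_hash() and calc_hash_2() sketches above as the exclusive and inclusive hashes, respectively, the following illustrative check (again an assumption, not the patent's own code) shows the property that supports selective write back invalidation: going from the inclusive hash to the exclusive hash, only lines in the NNU slice change slice, whereas going the other way roughly 1/8 of the lines of each non-NNU slice must be evicted:

import random
from collections import Counter

def eviction_fractions(hash_old, hash_new, samples=100000):
    # for each slice under the old hash, the fraction of its lines whose
    # slice differs under the new hash (and so must be evicted)
    total, moved = Counter(), Counter()
    for _ in range(samples):
        pa = random.getrandbits(46) & ~0x3F    # a cache-line-aligned address
        old = hash_old(pa)
        total[old] += 1
        if hash_new(pa) != old:
            moved[old] += 1
    return {s: round(moved[s] / total[s], 3) for s in sorted(total)}

# inclusive -> exclusive (FIG. 42): only slice 8, the NNU slice, shows a
# nonzero fraction, so the non-NNU slices need no evictions
print(eviction_fractions(calc_hash_2, slice_hash))
# exclusive -> inclusive (FIG. 43): each non-NNU slice loses about 1/8 of its lines
print(eviction_fractions(slice_hash, calc_hash_2))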
Referring now to FIG. 41, a block diagram is shown illustrating NNU 121 and ring station 4004-M of FIG. 40 in greater detail. NNU 121 of FIG. 41 is similar in many respects to the embodiments of NNU 121 described above, but also includes cache control logic 4108, selection logic 4158, and a memory array 4152, which may be the memory array of the weight RAM 124 or the data RAM 122 of NNU 121. Although not shown in FIG. 41, NNU 121 also includes the program memory 129, sequencer 128, data RAM 122, and weight RAM 124 of FIG. 1. NNU 121 also includes the array of NPUs 126 of FIG. 1 and the control/status register 127, as shown in FIG. 41. As described above, e.g., with respect to FIG. 34, the NPUs 126 of the array each include a multi-stage pipeline for processing instructions and data. A first stage of the NPU pipeline 126 provides data to the select logic 4158 for writing to the memory array 4152, and a second stage receives data from the memory array 4152. In one embodiment, the pipeline 126 includes ten stages, of which the sixth stage receives data from the memory array 4152 and the ninth stage provides data to the selection logic 4158 for writing to the memory array 4152.
The memory array 4152 is coupled to the NPU array pipeline 126. Select logic 4158 provides the input to the memory array 4152 and is controlled by a mode input 4199 that specifies the mode. Preferably, the mode 4199 input is the output of a bit of the control/status register 127 that is written to change the mode 4199 between cache memory mode and NNU mode. The mode indicates whether the memory array 4152 is operating in NNU mode or in cache memory mode. When operating in NNU mode, the memory array 4152 functions as the weight RAM 124 of NNU 121. (Although the memory array 4152 is referred to throughout as being used as the weight RAM 124, it may alternatively be used as the data RAM 122.) When operating in cache memory mode, however, the memory array 4152 is used as a cache memory. Two embodiments of the cache memory mode are described: in a first embodiment, the memory array 4152 is used as a slice 4006 of the LLC 4005 shared by the core complexes 4012, and in a second embodiment, the memory array 4152 is used as a victim cache shared by the core complexes 4012. When the mode control 4199 indicates NNU mode, the selection logic 4158 selects the data provided by the NPU array pipeline 126 for writing into the memory array 4152, and the memory array 4152 provides its output data to the NPU array pipeline 126, such that the memory array 4152 functions as the weight RAM 124 of NNU 121. In contrast, when the mode control 4199 indicates cache memory mode, the selection logic 4158 selects the data provided by the data pipeline 4146 of the cache control logic 4108 for writing into the memory array 4152, and the memory array 4152 provides its output data to the data pipeline 4146 of the cache control logic 4108. In this manner, the memory array 4152 serves as a cache memory shared by the cores 4002, e.g., as a victim cache or as LLC slice 4006-4. Preferably, the memory array of the larger of the two RAMs 122/124 is used in cache memory mode. Further, embodiments are contemplated in which the memory arrays of both the weight RAM 124 and the data RAM 122 are used as a cache memory shared by the cores 4002.
Preferably, the data bus that provides data from the data pipeline 4146 to the memory array 4152 is 64 bytes wide (e.g., the size of a cache line), and the data bus that provides data from the memory array 4152 to the NPU array pipeline 126 is a number of words equal to the number of NPUs 126 in the array, e.g., 1024 words. Likewise, the data bus that provides data from the NPU array pipeline 126 to the memory array 4152 is a number of words equal to the number of NPUs 126 in the array. Preferably, the bus between the select logic 4158 and the memory array 4152 includes an address bus, a write data bus, RD/WR controls, and cache line enables (CLE) that indicate which of the 16 cache lines (assuming, for example, a 1024-byte wide memory array and 64-byte cache lines) is being accessed. In the case of a write from the NPU array pipeline 126 to the memory array 4152, typically all of the CLEs will be true, because typically all of the NPUs 126 write an entire row of the weight RAM 124. Select logic (not shown) uses the CLEs to select the appropriate memory blocks of the memory array 4152 to enable when data is written to or read from the memory array 4152.
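As a small illustration of the cache line enables (the exact encoding is an assumption), the CLE for a 1024-byte memory row holding sixteen 64-byte cache lines could be derived as a one-hot value:

ROW_BYTES = 1024
LINE_BYTES = 64
LINES_PER_ROW = ROW_BYTES // LINE_BYTES    # 16 cache lines per memory row

def cache_line_enable(pa):
    # select which of the 16 blocks within the addressed row is accessed
    block = (pa % ROW_BYTES) // LINE_BYTES
    return 1 << block                      # one-hot, 16-bit CLE

# a write from the NPU array pipeline touches an entire row, so all
# sixteen enables are asserted
CLE_ALL = (1 << LINES_PER_ROW) - 1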
In the embodiment of FIG. 41, when operating in NNU mode, architectural programs executing on the cores 4002 preferably access NNU 121 via the ring bus 4024 as a peripheral device, rather than accessing NNU 121 as an execution unit of a core (as in the embodiments described above with respect to FIGS. 1-35, etc.). Preferably, the interface logic 3514 (not shown in FIG. 41) and control/status register 127 of FIG. 35 are coupled to ring station 4004-M, which enables the cores 4002 to read and write the control/status register 127 and to read and write the data RAM 122, weight RAM 124, and program memory 129 via the interface logic 3514 using architectural load/store instructions (rather than the MTNN 1400 and MFNN 1500 instructions of FIGS. 14 and 15). In addition, data/weight words may be transferred between system memory and the data RAM 122/weight RAM 124 via direct memory access (DMA) transfers. Finally, embodiments are contemplated in which NNU 121 itself executes load/store instructions to transfer data/weights between system memory and the data RAM 122/weight RAM 124. Preferably, the operating system manages NNU 121 as a global resource shared by the various processes of the system running on the different cores 4002, and the operating system requires a process to obtain ownership of NNU 121 before using it. Preferably, the operating system controls the mode 4199 in which NNU 121 operates, and more particularly the manner in which the memory array 4152 operates, as described in more detail below with respect to FIGS. 42-45. In one embodiment, the memory array 4152 is a 2 MB static RAM array, although other embodiments with larger or smaller sizes are contemplated.
The cache control logic 4108 is coupled to ring station 4004-M and to the select logic 4158 and the memory array 4152. The cache control logic 4108 includes a tag pipeline 4144 coupled to ring station 4004-M, a data pipeline 4146 coupled to ring station 4004-M, and a tag/MESI/LRU array 4142 coupled to the tag pipeline 4144. The cache control logic 4108 also includes an external interface 4147 comprising a fill queue 4122, a snoop queue 4124, and an arbiter 4136 that arbitrates between the fill queue 4122 and the snoop queue 4124 for access to the tag pipeline 4144 and the data pipeline 4146. The cache control logic 4108 also includes a core interface 4148 comprising a load queue 4112, an eviction queue 4114, a query queue 4116, and an arbiter 4138 that arbitrates among the load queue 4112, the eviction queue 4114, and the query queue 4116 for access to the tag pipeline 4144 and the data pipeline 4146. An arbiter 4132 arbitrates between the external interface 4147 and the core interface 4148 for access to the tag pipeline 4144, and an arbiter 4134 arbitrates between the external interface 4147 and the core interface 4148 for access to the data pipeline 4146. In one embodiment, the cache control logic 4108 also includes a state machine that performs a write back invalidation operation on the memory array 4152 in response to a request to do so (e.g., a write back invalidation request from a core 4002). The state machine likewise performs an invalidation operation on the memory array 4152 in response to a request to do so (e.g., an invalidation request from a core 4002). To perform the invalidation operation, the state machine updates the state in the tag/MESI array 4142 of each cache line in the memory array 4152 to invalid and resets the replacement information in the LRU array 4142 for each set of the memory array 4152.
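The invalidation operation performed by the state machine may be sketched as follows, with the tag/MESI and LRU structures modeled as plain lists (a hypothetical representation, not the hardware interface):

def invalidate_operation(mesi, lru, num_sets, num_ways):
    # mark every cache line in the memory array invalid and reset the
    # per-set replacement information
    for s in range(num_sets):
        for w in range(num_ways):
            mesi[s][w] = "I"
        lru[s] = 0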
The tag pipeline 4144 receives requests and tag updates from the arbiter 4132 and provides cache line status and responses to ring station 4004-M as well as to the external interface 4147 and the core interface 4148. The data pipeline 4146 receives requests and data from the arbiter 4134 and provides data to the selection logic 4158 and to ring station 4004-M. When in cache mode, the tag/MESI array 4142 stores the tags and states of the cache lines stored in the memory array 4152. Preferably, when in cache mode, the memory array 4152 functions as a set associative memory, and the LRU array 4142 stores cache line replacement information used to determine which way of the selected set to replace.
The fill queue 4122 handles both new allocations to the memory array 4152 (reload requests) and evictions from the memory array 4152 (victim requests). In the case of a victim request, the fill queue 4122 requests access to the tag pipeline 4144 to determine which cache line (if any) needs to be evicted, and requests access to the data pipeline 4146 to read modified data out of the memory array 4152 for writing out to system memory. In the case of a reload request, the fill queue 4122 requests access to the tag pipeline 4144 to write the address of the newly allocated cache line to the tag array 4142 and set the initial MESI state in the MESI array 4142, and requests access to the data pipeline 4146 to write the new data to the memory array 4152. Snoop queue 4124 handles snoops originating from system bus 4022. The snoop queue 4124 requests access to the tag pipeline 4144 to determine the status of the cache line specified in the snoop request, and requests access to the data pipeline 4146 to read modified data (if any) from the memory array 4152 in response to the snoop request. In the event that a load misses in all of the lower level caches (including the L3 cache 4005 in the case that the memory array 4152 and cache control logic 4108 are being used as victim caches 4006-4, as described in more detail below), the load queue 4112 handles loads from the L2 cache 4008 of the core 4002 (as well as loads from other caches, such as L1 data and L1 instructions, in embodiments where the L2 cache 4008 does not include other caches). The load queue 4112 requests access to the tag pipeline 4144 to determine whether the specified cache line is present in the memory array 4152, and requests access to the data pipeline 4146 to read specified data from the memory array 4152 to write to the requesting core 4002 via the ring bus 4024. The eviction queue 4114 handles evictions from the core's L2 cache 4008. When acting as the victim cache 4006-4, the eviction queue 4114 requests access to the tag pipeline 4144 to write the address of the cache line evicted from the lower level cache memory to the tag array 4142 and to set the initial MESI state in the MESI array 4142. When used as LLC slice 4006, eviction queue 4114 requests access to the tag pipeline 4144 to update the MESI state in the MESI array 4142 if the evicted cache line is modified. The eviction queue 4114 also requests access to the data pipeline 4146 to write the evicted cache line to the memory array 4152. The query queue 4116 handles snoops to the core's L2 cache 4008. The query queue 4116 requests access to the tag pipeline 4144 to update the MESI status to "modified" after the core 4002 responds to a snoop as "modified" and requests access to the data pipeline 4146 to write the modified cache line from the snoop response to the memory array 4152. In one embodiment, the cache control logic 4108 includes a core interface 4148 for each core 4002. In one embodiment, each core interface 4148 includes separate data/instruction load, eviction, and query queues 4112, 4114, 4116, respectively, for loading, evicting, and snooping separate data/instruction caches of the core 4002.
Referring now to FIG. 42, a flowchart is shown illustrating the operation of processor 4000 of FIG. 40 in the case where the memory array 4152 of FIG. 41 is being transitioned from the cache memory mode, in which it is used as an LLC slice 4006, to the NNU mode, in which it is used as the weight RAM 124/data RAM 122 of NNU 121. As described above, NNU 121 has a large amount of memory, e.g., in one embodiment the weight RAM 124 is 2 MB. FIGS. 42 and 43 describe an embodiment that enables the memory array 4152 to be used as an additional LLC slice 4006, advantageously enabling the LLC 4005 to be significantly increased in size (e.g., by 25%) when NNU 121 is not being used as a neural network unit by any process running on the system. More specifically, FIG. 42 describes a method for transitioning the memory array 4152 from operating as an LLC slice 4006 to operating as the weight RAM 124 of NNU 121. Advantageously, the cache control logic 4108, memory array 4152, and selection logic 4158 of FIG. 41 collectively function as the LLC slice 4006-4 in the embodiments of FIGS. 42 and 43 when the mode 4199 is set to cache memory mode. Flow begins at block 4202.
At block 4202, a request is made to transition from using the memory array 4152 as a slice 4006 of the LLC 4005 to using the memory array 4152 as the weight RAM 124 of NNU 121. Preferably, the transition is controlled by the operating system running on processor 4000. For example, an application running on processor 4000 requests use of NNU 121 from the operating system, and the operating system detects that the memory array 4152 is currently serving as an LLC slice 4006 and therefore needs to transition from cache memory mode to NNU mode. Flow proceeds to block 4204.
At block 4204, in response to the transition request at block 4202, the operating system synchronizes all of the cores 4002. That is, the operating system causes the cores 4002 to stop fetching architectural instructions and to stop accessing memory. More specifically, this suspends access to the LLC 4005, which currently includes the memory array 4152. In one embodiment, the operating system executes an architectural instruction (e.g., x86 WRMSR) on each core 4002 indicating that the core 4002 should synchronize. In an alternative embodiment, the operating system executes an instruction on one of the cores 4002, and in response that core 4002 signals each of the other cores 4002, e.g., via microcode, to synchronize. Flow proceeds to block 4206.
At block 4206, a write back invalidation is performed on LLC 4005. In one embodiment, the core 4002 microcode preferably requests a write back invalidation in response to execution of an architectural instruction of the operating system. The write back invalidation writes back the modified cache lines (if any) and invalidates all cache lines of all LLC slices 4006 (including NNU LLC slices 4006-4). In an alternative embodiment, write back invalidation is selective. In general, selective write back invalidation means operation according to the following pseudo code:
for each slice:   // 0 to N-1, where N is the current number of slices (including the NNU slice)
for each cacheline in slice:
if exclusive_hash(cacheline address)!=slice:
evict cacheline
Of course, when the slice is the NNU slice (e.g., slice 8 in the case of 8 non-NNU slices 4006 plus one NNU LLC slice 4006-4), exclusive_hash(cacheline address) will never equal the slice, because the exclusive hash never returns the index of the NNU LLC slice 4006-4; therefore all cache lines in the NNU LLC slice 4006-4 will be evicted, i.e., written back if modified and then invalidated. Depending on the inclusive and exclusive hashes employed, the number of cache lines that need to be evicted from the non-NNU LLC slices 4006 varies. For example, assume the exclusive and inclusive hashes are both PA[45:6] % N, but with a different N for each hash, i.e., N is smaller for the exclusive hash than for the inclusive hash; say N is 9 for the inclusive hash and 8 for the exclusive hash. In this case, a significant portion (e.g., about 88%) of the cache lines in the non-NNU LLC slices 4006 need to be evicted, and simply performing a write back invalidation of all cache lines of all LLC slices 4006 may be just as efficient or more so. Conversely, for another example, assume the exclusive hash is PA[10:8] and the inclusive hash is calc_hash_2, described above. In this case, none of the cache lines in the non-NNU LLC slices 4006 need to be evicted at the transition from the inclusive hash to the exclusive hash (i.e., the transition made in FIG. 42). Flow proceeds to block 4208.
At block 4208, the hashing algorithm used to hash the physical address of the cache line to LLC slice 4006 is updated to not include memory array 4152 as slice 4006 of LLC 4005, as described above. Preferably, a hash update message is broadcast to each core 4002 and snooping source to change its hashing algorithm to an exclusive hash, i.e., a hash that does not include NNU LLC slice 4006-4. Flow proceeds to block 4212.
At block 4212, the mode 4199 is updated to indicate the NNU mode such that the selection logic 4158 makes the memory array 4152 available as the weight RAM 124 to be accessed by the NPU pipeline 126 and architectural programs executing on the core 4002. In one embodiment, an operating system (e.g., a device driver) executes an architectural instruction on one of cores 4002, wherein the architectural instruction writes to NNU 121 control/status register 127 to update bits for controlling mode 4199, thereby changing mode 4199 from cache mode to NNU mode. The architectural instructions may be writes to I/O space or memory store instructions for making memory mapped I/O writes to the control/status registers 127. Flow proceeds to block 4214.
At block 4214, the cores 4002 resume operation, i.e., they are no longer synchronized, but begin fetching and executing architectural instructions, which may include accessing memory. In one embodiment, the operating system executes an architectural instruction on each core 4002 instructing the core 4002 to resume operation. In an alternative embodiment, the operating system executes an instruction on one of the cores 4002, and in response that core 4002 signals each of the other cores 4002, e.g., via microcode, to resume operation. Flow ends at block 4214.
Referring now to FIG. 43, a flowchart is shown illustrating the operation of processor 4000 of FIG. 40 in the case where the memory array 4152 of FIG. 41 is being transitioned from the NNU mode, in which it is used as the weight RAM 124/data RAM 122 of NNU 121, to the cache memory mode, in which it is used as an LLC slice 4006. Flow begins at block 4302.
At block 4302, a request is made to transition from using the memory array 4152 as the weight RAM 124 of NNU 121 to using the memory array 4152 as a slice 4006 of the LLC 4005. Preferably, the transition is controlled by the operating system running on processor 4000. For example, an application running on processor 4000 notifies the operating system that it is no longer using NNU 121 and no other application is requesting use of NNU 121; the operating system detects that the memory array 4152 is currently serving as the weight RAM 124 and therefore needs to transition from NNU mode to cache memory mode. Flow proceeds to block 4304.
At block 4304, in response to the transition request at block 4302, the operating system synchronizes all of the cores 4002 in the manner described above with respect to block 4204. More specifically, this suspends access to the LLC 4005, which does not currently include the memory array 4152. Flow proceeds to block 4306.
At block 4306, a write back invalidation is performed on LLC 4005. The write back invalidation writes back the modified cache lines (if any) and invalidates all cache lines of all LLC slices 4006 (not including NNU LLC slices 4006-4, since they are not currently included in LLC 4005). In an alternative embodiment, write back invalidation is selective. In general, selective write back invalidation means operation according to the following pseudo code:
for each slice:   // 0 to N-1, where N is the current number of slices (excluding the NNU slice)
for each cacheline in slice:
if inclusive_hash(cacheline address)!=slice:
evict cacheline
In the transition from the exclusive hash to the inclusive hash in FIG. 43, the slice is never the NNU slice, and thus no cache lines in the NNU LLC slice 4006-4 will be evicted. Depending on the inclusive and exclusive hashes employed, the number of cache lines that need to be evicted from the non-NNU LLC slices 4006 varies. For example, assume the exclusive and inclusive hashes are both PA[45:6] % N, but with a different N for each hash, i.e., N is smaller for the exclusive hash than for the inclusive hash; say N is 9 for the inclusive hash and 8 for the exclusive hash. In this case, a significant portion (e.g., about 88%) of the cache lines in the non-NNU LLC slices 4006 need to be evicted, and simply performing a write back invalidation of all cache lines of all LLC slices 4006 may be just as efficient or more so. Conversely, assume the exclusive hash is PA[10:8] and the inclusive hash is calc_hash_2, described above. In this case, only a relatively small portion (e.g., about 12%) of the cache lines in the non-NNU LLC slices 4006 need to be evicted. Flow proceeds to block 4308.
At block 4308, the hashing algorithm used to hash the physical address of the cache line to LLC slice 4006 is updated to include memory array 4152 as slice 4006 of LLC 4005, as described above. That is, a hash update message is broadcast to each core 4002 and snooping source to change its hash algorithm to an inclusive hash, i.e., a hash that includes NNU LLC slice 4006-4. Flow proceeds to block 4311.
At block 4311, the cache control logic 4108 invalidates the memory array 4152 by updating the state in the MESI array 4142 to invalid for all cache lines as described above. Preferably, cache control logic 4108 also resets the replacement information in LRU array 4142. In one embodiment, microcode of core 4002 requests NNU LLC slice 4006-4 to perform an invalidate request, where cache control logic 4108 performs an invalidate in response. Flow proceeds to block 4312.
At block 4312, the mode 4199 is updated to indicate a cache memory mode such that the selection logic 4158 makes the memory array 4152 available as the LLC slice 4006. Flow proceeds to block 4314.
At block 4314, the cores 4002 resume operation, i.e., they are no longer synchronized, but begin fetching and executing architectural instructions, which may include accessing memory, as described for block 4214. Flow ends at block 4314.
Referring now to FIG. 44, a flowchart is shown illustrating the operation of processor 4000 of FIG. 40 in the case where the memory array 4152 of FIG. 41 is being transitioned from the NNU mode, in which it is used as the weight RAM 124/data RAM 122 of NNU 121, to the cache memory mode, in which it is used as the victim cache 4006-4. A victim cache is a cache memory that holds only cache lines evicted by lower-level caches in the cache hierarchy of processor 4000. For example, the L2 cache 4008 and/or the L1 data/instruction caches are lower-level caches. Additionally, in embodiments where the memory array 4152 may be used as the victim cache 4006-4, the LLC 4005 is considered a level-3 (L3) cache and a lower-level cache relative to the victim cache 4006-4, and the L3 cache 4005 evicts cache lines to the victim cache 4006-4. The victim cache provides data when an address hits in it (e.g., in response to a load request or a snoop request). In one embodiment, the L3 cache 4005 is inclusive of the L2 caches 4008 and the lower-level caches of the core complexes 4012, and the victim cache provides hit data for allocation into the L3 cache 4005, which in turn provides the data to the L2 cache 4008, which in turn provides the data to the lower-level caches. In another embodiment, the L3 cache 4005, the L2 caches 4008, and the lower-level caches of the core complexes 4012 are not inclusive, and the victim cache provides hit data for allocation directly into the caches at each level. Advantageously, when the mode 4199 is set to cache memory mode, the cache control logic 4108, memory array 4152, and selection logic 4158 of FIG. 41 collectively function as the victim cache 4006-4 in the embodiments of FIGS. 44 and 45. In one embodiment, the victim cache 4006-4 may operate as a write-back cache, which caches modified cache lines evicted to it, or as a write-through cache, which does not cache modified cache lines evicted to it but instead forwards them on to system memory. As can be seen from FIG. 45 (and more particularly blocks 4506, 4508, and 4512), the write-through victim cache 4006-4 has the advantage of a faster transition to using the memory array 4152 as the weight RAM 124 in NNU mode, while the write-back victim cache 4006-4 may have the advantage of higher overall cache efficiency for processor 4000. Preferably, the victim cache 4006-4 is configurable in either a write-back mode or a write-through mode. Flow begins at block 4402.
At block 4402, a request is made to transition from using the memory array 4152 as the weight RAM 124 of NNU 121 to using the memory array 4152 as the victim cache 4006-4 shared by the core complexes 4012. The transition is preferably controlled by the operating system running on processor 4000, in a manner similar to that described above with respect to block 4302. Flow proceeds to block 4404.
At block 4404, the mode 4199 is updated to indicate the cache memory mode such that the selection logic 4158 makes the memory array 4152 available as the victim cache 4006-4. Flow proceeds to block 4406.
At block 4406, the bus controller 4014 is informed to begin directing snoops to the victim cache 4006-4, and the lower-level caches are informed to begin directing load requests and eviction requests to the victim cache 4006-4. Flow proceeds to block 4408.
At block 4408, the victim cache 4006-4 begins caching victim data. In the embodiment of fig. 41, eviction queue 4114 receives a request (e.g., eviction) to evict a cache line from a lower level cache (e.g., L3 cache 4005, L2 cache 4008, and/or L1D/L1I cache). In response, the victim cache 4006-4 allocates the evicted cache line into the memory array 4152. Flow proceeds to block 4412.
At block 4412, the victim cache 4006-4 receives requests to access data and responds with the data if the address of the request hits in the victim cache 4006-4. In the embodiment of FIG. 41, the snoop queue 4124 and the load queue 4112 receive the requests. More specifically, the victim cache 4006-4 is snooped so that it writes back and invalidates a cache line that another caching agent is reading. In addition, the victim cache 4006-4 receives load requests from the lower-level caches for loads that missed in those caches. If a request hits in the victim cache 4006-4, the victim cache 4006-4 provides the hit data to the requester. Flow ends at block 4412.
Referring now to FIG. 45, a flowchart is shown illustrating the operation of processor 4000 of FIG. 40 in the case where memory array 4152 of FIG. 41 is being switched from a cache memory mode when used as victim cache 4006-4 to an NNU mode when used as weight RAM 124/data RAM 122 of NNU 121. As described above, the victim cache 4006-4 can be operating as a write-through cache or as a write-back cache. Flow begins at block 4502.
At block 4502, a request is made to transition from using the memory array 4152 as the victim cache 4006-4 to using the memory array 4152 as the weight RAM 124 of NNU 121. Preferably, the transition is controlled by the operating system running on processor 4000. For example, an application running on processor 4000 requests use of NNU 121 from the operating system, and the operating system detects that the memory array 4152 is currently serving as the victim cache 4006-4 and therefore needs to transition from cache memory mode to NNU mode. Flow proceeds to block 4504.
At block 4504, the lower-level caches are told to stop directing evictions to the victim cache 4006-4. Flow proceeds to decision block 4506.
At decision block 4506, if the victim cache 4006-4 is being used as a write-through cache, flow proceeds to block 4512; otherwise, flow proceeds to block 4508.
At block 4508, the cache control logic 4108 performs a write back invalidation of the victim cache 4006-4. That is, the victim cache 4006-4 writes back all of its modified cache lines to system memory and then invalidates all of its cache lines (by updating the state in the MESI array 4142 to invalid for all cache lines; preferably, the cache control logic 4108 also resets the replacement information in the LRU array 4142). Preferably, cache control logic 4108 continues to respond to load requests and snoop requests while performing write back invalidation. Flow proceeds to block 4514.
At block 4512, the cache control logic 4108 performs an invalidation operation on the memory array 4152. That is, the cache control logic 4108 invalidates all cache lines of the memory array 4152. It can be seen that if the victim cache 4006-4 is operating as a write-through cache, the transition to using the memory array 4152 as the weight RAM 124 can advantageously be faster than if the victim cache 4006-4 is operating as a write-back cache, because no write back of modified cache lines (as at block 4508) needs to be performed (i.e., only the invalidation at block 4512 is needed), which can be a substantial savings if the victim cache 4006-4 is relatively large. Flow proceeds to block 4514.
At block 4514, the lower-level caches are told to stop directing load requests to the victim cache 4006-4, and the bus controller 4014 is told to stop directing snoops to the victim cache 4006-4. Flow proceeds to block 4516.
At block 4516, the mode 4199 is updated to indicate the NNU mode, as described for block 4212, such that the selection logic 4158 makes the memory array 4152 available as the weight RAM 124 to be accessed by the NPU pipeline 126 and architectural programs executing on the core 4002. Flow proceeds to block 4518.
At block 4518, the cache control logic 4108 stops caching victim data (which it began doing at block 4408). In addition, the cache control logic 4108 forwards any subsequent load, eviction, or snoop requests it receives to the bus controller 4014. Finally, the memory array 4152 may serve as the weight RAM 124 to be accessed by the NPU pipeline 126 and by architectural programs executing on the cores 4002. Flow ends at block 4518.
Embodiments will now be described in which the memory array of NNU 121 (e.g., the weight RAM 124 or the data RAM 122) is used as the memory that holds the cache lines of a victim cache for data processed by the core complexes 4012 (e.g., to service cache line evictions from the L3 cache 4005). These embodiments are similar to those described above with respect to FIGS. 40-41 and 44-45, except that in the embodiments described below the tags (and cache line states and replacement information) are advantageously stored in the tag directory of the L3 cache 4005 (rather than in a separate structure in NNU 121, such as the tag/MESI/LRU array 4142 of FIG. 41), and the control logic of the L3 cache 4005 is enhanced to handle requests to the victim cache (e.g., loads, evictions, and snoops) and to perform the necessary cache line transfers between the NNU 121 memory array (hereinafter the selective data store 4652) and the L3 cache 4005 and/or system memory.
Referring now to fig. 46, a block diagram is shown illustrating a processor 4000 in accordance with an alternative embodiment. Processor 4000 includes a ring bus 4024 to which core 4002 and L3 caches 4005, NNUs 121, and DRAM controllers 4018 are coupled through ring stations 4004-0, 4004-N, and 4004-M, respectively. The system memory 4618 is coupled to the DRAM controller 4018. NNU 121 includes selective data store 4652. The L3 cache 4005 includes an L3 data store 4606, a tag directory 4642, and control logic 4644 including a mode indicator 4699. Logically, when operating in a victim cache mode, the processor 4000 includes a victim cache 4602, the victim cache 4602 including a selective data store 4652, and associated portions of the tag directory 4642 and control logic 4644 of the L3 cache 4005, as described below. Preferably, selective data store 4652 is weight RAM 124 and/or data RAM 122 of NNU 121. Preferably, as described in more detail below, mode 4699 is updated by the operating system (e.g., a device driver) to indicate a victim cache mode or NNU mode, depending on whether core 4002 has requested the use of NNU 121 as a neural network element.
The processor 4000 of fig. 46 is similar in many respects to the processor 4000 of fig. 40, and like-numbered elements are similar, with the differences described below. More specifically, processor 4000 may be placed in a first mode in which selective data store 4652 of NNU 121 is used to hold cache lines processed by core 4002 for use as victim cache 4602, or processor 4000 may be placed in a second mode to use selective data store 4652 to hold data (e.g., neural network weights or data) processed by NNU 121. As described in more detail below, the mode is indicated by a mode indicator 4699, where the control logic 4644 checks the mode indicator 4699 to determine what actions it should take. The victim cache 4602 serves the core 4002 in a number of ways similar to the victim caches 4006-4 of FIGS. 41, 44, and 45, with the differences described below. Advantageously, however, the tag of the victim cache 4602 is maintained in the tag directory 4642 of the L3 cache 4005. This is in contrast to the embodiment of FIG. 41, for example, where the tag directory is part of the NNU 121 in the embodiment of FIG. 41, such as tag array 4142. In addition, the control logic 4644 of the L3 cache 4005 controls most of the operation of the victim cache 4602 of FIG. 46. This is in contrast to the embodiment of FIG. 41, for example, where cache control logic 4108 of NNU 121 controls victim cache 4006-4 in the embodiment of FIG. 41. Preferably, tag directory 4642 also maintains the state (e.g., MESI state) of victim cache 4602 and replacement information (e.g., LRU information), as described in more detail below with respect to FIG. 47.
Although fig. 46 shows an L3 cache 4005 with a single L3 data store 4606 and a single tag directory 4642, the L3 data store 4606 may include a plurality of L3 data stores 4606 (e.g., including one L3 data store 4606 for each L3 slice 4006), and a plurality of tag directories 4642 (e.g., including one tag directory 4642 for each L3 slice 4006); further, each L3 slice 4006 may include its own instance of control logic 4644. As described above, such embodiments may employ a hashing algorithm that hashes the addresses of the cache lines to determine which slice 4006 of the plurality of slices 4006 will hold the cache line. However, embodiments are contemplated in which processor 4000 is a single core processor and L3 cache 4005 is not sliced, such that L3 data store 4606, tag directory 4642 and control logic 4644 are also not sliced. As described in more detail below, in a sliced embodiment, the selective data store 4652 is logically sliced into P portions (where P is the number of slices 4006 of the L3 cache 4005) such that each portion of the selective data store 4652 is associated with a respective slice 4006 of the L3 cache 4005.
The L3 cache 4005 is a set associative cache memory whose L3 data store 4606 has L bytes of data storage arranged as S sets and Y ways. For example, assuming each cache line is 64 bytes, the L3 cache 4005 may have L = 8 MB of data storage arranged as S = 8192 sets and Y = 16 ways. The selective data store 4652 is a memory having M bytes of data storage. Control logic 4644 logically accesses the selective data store 4652 as a set associative cache memory arranged as S sets and X ways, where X is the product of Y and the quotient of M divided by L. For example, with the exemplary L3 cache 4005 values above, the selective data store 4652 may have M = 4 MB of data storage and have X = 16 × (4 MB / 8 MB) = 8 ways. Accordingly, the tag directory 4642 is arranged as S sets and Z ways, where Z is the sum of X and Y. As described below with respect to the embodiment of FIG. 47, with the above exemplary values the tag directory 4642 has Z = 16 + 8 = 24 ways. In a sliced embodiment having P L3 data store slices 4606, each L3 data store slice 4606 has J sets and Y ways, where J is the quotient of S divided by P. With the above exemplary values (where P = 4), each L3 data store slice 4606 has J = 8192 / 4 = 2048 sets and Y = 16 ways.
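The relationships among L, M, S, Y, and P and the derived quantities X, Z, and J may be captured in a short sketch, shown here with the example values from this paragraph:

def victim_cache_geometry(L, M, S, Y, P):
    X = Y * M // L     # logical ways of the victim cache in selective data store 4652
    Z = X + Y          # ways of the tag directory 4642
    J = S // P         # sets per L3 data store slice 4606
    return X, Z, J

MB = 1 << 20
# 8 MB L3 data store, 4 MB selective data store, 8192 sets, 16 ways, 4 slices
print(victim_cache_geometry(8 * MB, 4 * MB, 8192, 16, 4))   # -> (8, 24, 2048)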
In the embodiment of FIG. 46, DRAM controller 4018, and thus system memory, is coupled to ring bus 4024 by ring stations 4004-M that are separate from ring stations 4004-N via which NNU 121 is coupled to ring bus 4024. However, other embodiments are contemplated (such as the embodiment of fig. 41) in which DRAM controller 4018 and NNU 121 are coupled to ring bus 4024 by the same ring station 4004.
Although an embodiment is described herein in which the NNU 121 includes a selective data store 4652 that may serve as the victim cache 4602, the selective data store 4652 may be part of another accelerator that is attached to the L3 cache 4005 and that is accessible by the L3 cache 4005. For example, embodiments are contemplated in which the accelerator is an encryption/decryption unit, a compression/decompression unit, a multimedia encoder/decoder unit, or a database indexing unit.
Preferably, control logic 4644 includes a tag pipeline, a data pipeline, and queues (e.g., a fill queue, a snoop queue, a load queue, an eviction queue, a query queue, etc.). The control logic 4644 performs conventional functions associated with the L3 cache 4005, such as filling cache lines from system memory into the L3 data store 4606, processing snoop requests for the L3 data store 4606, loading data from the L3 data store 4606 into the core complex 4012, evicting cache lines from the L3 data store 4606, and generating query requests for low-level caches, etc. In addition, as described in more detail below, control logic 4644 is enhanced to perform similar functions for victim cache 4602. Many of the enhanced functions are similar to those described with respect to cache control logic 4108 of FIG. 41. For example, control logic 4644 causes a cache line to be provided from selective data store 4652 to core 4002 (e.g., in response to a load request); in addition, control logic 4644 evicts a cache line from the L3 data store 4606 to the selective data store 4652 and updates the tag directory 4642 accordingly; in addition, control logic 4644 causes the cache line to be written from selective data store 4652 to system memory 4618 and updates tag directory 4642 accordingly, e.g., to write back the modified cache line; also, control logic 4644 handles snoops relating to valid cache lines held in selective data storage 4652.
Referring now to FIG. 47, a block diagram is shown illustrating the collection 4700 in the tag directory 4642 of FIG. 46. The tag directory set 4700 includes an L3 tag 4702, a victim cache tag 4704, and replacement information 4706. The number of sets in the tag directory 4642 corresponds to the number of sets in the L3 data store 4606, and also corresponds to the number of logical sets in the selective data store 4652. That is, the tag directory 4642 includes respective tag directory sets 4700 corresponding to respective sets in the L3 cache 4005, and also corresponding to respective sets in the victim cache 4602. For example, in an embodiment where the L3 cache 4005 is arranged as 2048 sets and the victim cache 4602 is also logically arranged as 2048 sets, the tag directory 4642 also has 2048 sets. The number of L3 tags 4702 corresponds to the number of ways in the L3 cache 4005, e.g., 16 tags in an embodiment where the L3 cache 4005 has 16 ways. The number of victim cache tags 4704 corresponds to the number of ways in the victim cache 4602, e.g., 8 tags in an embodiment where the victim cache 4602 logically has 8 ways. Preferably, the entry or store used by each L3 tag 4702 stores not only the tag (i.e., the associated address bits), but also the state (e.g., MESI state) of the corresponding cache line in the L3 data store 4606; likewise, the store entry for each victim cache tag 4704 stores not only the tag, but also the state of the corresponding cache line in the selective data store 4652.
Control logic 4644 reads replacement information 4706 to determine which way to replace in the event that a cache line of L3 cache 4005 or victim cache 4602 needs to be replaced. For example, replacement information 4706 may be Least Recently Used (LRU) information that enables control logic 4644 to track LRU ways within a set. When the cache lines of the L3 data store 4606 and the selective data store 4652 are accessed, control logic 4644 updates the replacement information 4706. Preferably, replacement information 4706 includes separate replacement information for L3 data store 4606 and selective data store 4652. In one embodiment, replacement information 4706 is maintained in a separate storage array from the storage array used to maintain marker 4702/4704.
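A minimal sketch of one tag directory set 4700, with hypothetical field names and the example way counts above (16 L3 ways and 8 victim cache ways):

from dataclasses import dataclass, field

@dataclass
class TagEntry:
    tag: int = 0        # address bits identifying the cache line
    state: str = "I"    # MESI state: "M", "E", "S", or "I"

@dataclass
class TagDirectorySet:
    l3_tags: list = field(default_factory=lambda: [TagEntry() for _ in range(16)])
    victim_tags: list = field(default_factory=lambda: [TagEntry() for _ in range(8)])
    l3_lru: int = 0       # replacement information for the L3 data store ways
    victim_lru: int = 0   # replacement information for the selective data store ways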
It can be seen that adding victim cache flag 4704 and additional replacement information 4706, along with control logic 4644 enhancements to support victim cache 4602, can advantageously represent a relatively modest increase in the size of L3 cache 4005 in exchange for the removal of corresponding hardware in NNU 121 (e.g., cache control logic 4108 of fig. 41), and can reduce complexity.
Referring now to FIG. 48, a flowchart is shown illustrating the operation of the processor 4000 of FIG. 46 in evicting a cache line from the L3 cache 4005 to the victim cache 4602. Flow begins at block 4802.
At block 4802, the L3 cache 4005 receives a request to evict a cache line held in the L3 data store 4606. Flow proceeds to decision block 4804.
At decision block 4804, control logic 4644 checks for mode 4699. If mode 4699 indicates victim cache mode, flow proceeds to block 4806; otherwise (i.e., mode 4699 indicates NNU mode), flow proceeds to block 4808.
At block 4806, control logic 4644 reads the specified cache line from the L3 data store 4606 and writes it to the selective data store 4652. In addition, control logic 4644 updates the tag directory 4642 to indicate that the cache line is present in the selective data store 4652 (i.e., present in the victim cache 4602), e.g., with a shared or modified MESI state. Preferably, to write a cache line to the selective data store 4652, control logic 4644 generates a request to a queue (e.g., a fill queue) in its ring station 4004-0, which in turn generates a slave memory transaction on ring bus 4024 to a slave memory queue in ring station 4004-N of NNU 121. The transaction includes an address that enables NNU 121 to determine the location within the selective data store 4652 to which it is to write the cache line. In one embodiment, the address sent in the slave memory transaction is a memory address within a portion of the system address space to which the selective data store 4652 is mapped, such as the PCI address space. In another embodiment, the address sent in the transaction may be in a private address space, i.e., not the system address space. For example, the address may be a local address of the selective data store 4652, e.g., from 0 to one less than the number of cache-line-sized blocks held in the selective data store 4652. As another example, the address may specify a row of the selective data store 4652 and a block index within the row. In either embodiment, the control logic 4644 calculates the address based on the set in the L3 data store 4606 from which the cache line is being evicted, the way in the tag directory 4642 that is being updated to indicate that the cache line is present in the selective data store 4652, and, in the case of the sliced embodiment, the index of the slice 4606 from which the cache line is being evicted. In yet another embodiment, the address is a tuple that specifies a way index and a set index and, in the case of the sliced embodiment, a slice index, and NNU 121 computes a row of the selective data store 4652 and a block index within the row from the address tuple. In such an embodiment, since the L3 cache 4005 is utilizing the selective data store 4652 as the victim cache 4602 (i.e., is in victim cache mode), NNU 121 knows that the L3 cache 4005 sent the transaction that included the address tuple and that it needs to compute the row and block indices from the address tuple. In one embodiment, the transaction includes a flag indicating the mode; in another embodiment, NNU 121 also includes a mode indicator that is updated when the L3 cache 4005 mode indicator 4699 is updated (e.g., similar to mode indicator 4199 of FIG. 41). In one embodiment, NNU 121 uses the cache line enables (described above with respect to FIG. 41) to select a cache line or block of the selective data store 4652 to enable the write to the selective data store 4652. In another embodiment, NNU 121 performs a read-modify-write operation to write the evicted cache line to the selective data store 4652. Flow ends at block 4806.
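One possible linearization of the (slice, set, way) tuple into an address for the slave memory transaction, together with the NNU-side decode into a row and block index, is sketched below; the names and the mapping itself are assumptions, since the exact encoding is left to the implementation:

LINE_BYTES = 64
ROW_BYTES = 1024        # assumed width of a selective data store row

def selective_store_offset(slice_idx, set_idx, way_idx, sets_per_slice=2048, ways=8):
    # byte offset of the cache-line-sized block within selective data store 4652
    block = (slice_idx * sets_per_slice + set_idx) * ways + way_idx
    return block * LINE_BYTES

def decode_offset(offset):
    # NNU-side decode of the offset into a memory row and a block index within it
    return offset // ROW_BYTES, (offset % ROW_BYTES) // LINE_BYTES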
At block 4808, control logic 4644 reads the specified cache line from L3 data storage 4606 and writes it to system memory 4618. In addition, control logic 4644 updates tag directory 4642 to a MESI state indicating that the cache line is no longer present in L3 cache 4005, e.g., is invalid. Preferably, to write a cache line to system memory 4618, control logic 4644 generates a request for a queue (e.g., a fill queue) in its ring station 4004-0, which in turn generates a slave memory transaction on ring bus 4024 for a slave memory queue in ring station 4004-M of DRAM controller 4018. Flow ends at block 4808.
Referring now to FIG. 49, a flowchart is shown illustrating the operation of processor 4000 of FIG. 46 in connection with a load request from a core 4002 for a cache line that may be held in the victim cache 4602. Flow begins at block 4902.
At block 4902, the L3 cache 4005 receives a request from the core 4002 to load data from a specified memory address. Flow proceeds to block 4904.
At block 4904, control logic 4644 reads the tag directory set 4700 to which the memory address refers from the tag directory 4642 and checks its L3 tag 4702 and victim cache tag 4704. Flow proceeds to decision block 4906.
At decision block 4906, if the control logic 4644 determines from the L3 tag 4702 that the memory address hit in the L3 cache 4005, then flow proceeds to block 4908; otherwise, flow proceeds to decision block 4912.
At block 4908, control logic 4644 reads the specified cache line from L3 data store 4606 and provides it to core 4002. Flow ends at block 4908.
At decision block 4912, if the control logic 4644 determines from the victim cache tag 4704 that the memory address hit the victim cache 4602, flow proceeds to block 4914; otherwise, flow proceeds to decision block 4916.
At block 4914, control logic 4644 reads the specified cache line from the selective data store 4652 and provides it to the core 4002. Preferably, to read a cache line from the selective data store 4652, control logic 4644 generates a request to a queue (e.g., a fill queue) in its ring station 4004-0, which in turn generates a slave load transaction on ring bus 4024 to a slave load queue in ring station 4004-N of NNU 121. The transaction includes an address that enables NNU 121 to determine the location within the selective data store 4652 from which it is to read the cache line. Preferably, control logic 4644 generates the address in a manner similar to one of the embodiments described above with respect to block 4806. Additionally, in one embodiment, the control logic 4644 swaps the cache line read from the selective data store 4652 (cache line A) with another cache line (cache line B) in the L3 data store 4606 (i.e., in the L3 cache 4005). Swapping may be advantageous because a hit of a cache line in the victim cache 4602 may indicate that cache line A is likely to be accessed sooner and/or more frequently than cache line B, and moving cache line A into the L3 cache 4005 may reduce the latency of subsequent accesses. Interchanging cache line A, read from the selective data store 4652, with cache line B, held in the L3 data store 4606, means: (1) writing the tag and state (e.g., MESI state) of line B to the tag in the tag directory 4642 corresponding to the location occupied by line A; (2) writing the tag and state (e.g., MESI state) of line A to the tag in the tag directory 4642 corresponding to the location previously occupied by line B; (3) reading line B from the L3 data store 4606; (4) reading line A from the selective data store 4652 and writing it to the L3 data store 4606; and (5) writing line B to the selective data store 4652. Swapping a cache line may additionally involve updating the replacement information 4706 (e.g., LRU information) maintained in the tag directory set 4700 corresponding to the set involved. In an alternative embodiment, control logic 4644 does not interchange the cache lines, but only provides the cache line read from the selective data store 4652 to the core 4002. Flow ends at block 4914.
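Steps (1) through (5) above may be summarized by the following sketch; the helpers (read, write, get, put, update_lru) are hypothetical stand-ins, with way_a indexing the victim cache ways and way_b the L3 ways of the same set:

def swap_victim_hit(l3, victim, tags, set_idx, way_a, way_b):
    tag_a, tag_b = tags.get(set_idx, way_a), tags.get(set_idx, way_b)
    line_b = l3.read(set_idx, way_b)        # (3) read line B from the L3 data store
    line_a = victim.read(set_idx, way_a)    # read line A, the victim cache hit
    tags.put(set_idx, way_a, tag_b)         # (1) B's tag and state into A's location
    tags.put(set_idx, way_b, tag_a)         # (2) A's tag and state into B's location
    l3.write(set_idx, way_b, line_a)        # (4) write line A into the L3 data store
    victim.write(set_idx, way_a, line_b)    # (5) write line B into the selective data store
    tags.update_lru(set_idx)                # refresh replacement information 4706
    return line_a                           # data provided to the requesting core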
At block 4916, control logic 4644 reads the specified cache line from system memory 4618 and provides it to core 4002. Flow ends at block 4916.
Referring now to FIG. 50, a flowchart is shown illustrating the operation of processor 4000 of FIG. 46 in the case where the selective data store 4652 of FIG. 46 is being transitioned from the NNU mode, in which it is used as the weight RAM 124/data RAM 122 of NNU 121, to the victim cache mode, in which it is used as the victim cache 4602. A victim cache is a cache memory that holds only cache lines evicted by lower-level caches in the cache hierarchy of processor 4000. For example, the L2 cache 4008 and/or the L1 data/instruction caches are lower-level caches. Additionally, in embodiments where the selective data store 4652 may be used as the victim cache 4602, the LLC 4005 is considered a level-3 (L3) cache and a lower-level cache relative to the victim cache 4602, and the L3 cache 4005 evicts cache lines to the victim cache 4602. The victim cache provides data when an address hits in it (e.g., in response to a load request or a snoop request). In one embodiment, the L3 cache 4005 is inclusive of the L2 caches 4008 and the lower-level caches of the core complexes 4012, and the victim cache 4602 provides hit data for allocation into the L3 cache 4005, which in turn provides the data to the L2 cache 4008, which in turn provides the data to the lower-level caches. In another embodiment, the L3 cache 4005, the L2 caches 4008, and the lower-level caches of the core complexes 4012 are not inclusive, and the victim cache provides hit data for allocation directly into the caches at each level. Advantageously, when the mode 4699 is set to victim cache mode, the control logic 4644, the selective data store 4652, and the tag directory 4642 of FIG. 46 collectively function as the victim cache 4602 in the embodiments of FIGS. 50 and 51. In one embodiment, the victim cache 4602 may operate as a write-back cache, which caches modified cache lines evicted to it, or as a write-through cache, which does not cache modified cache lines evicted to it but instead forwards them on to system memory 4618. As can be seen from FIG. 51 (and more particularly blocks 5106, 5108, and 5112), the write-through victim cache 4602 has the advantage of a faster transition to using the selective data store 4652 as the weight RAM 124 in NNU mode, while the write-back victim cache 4602 may have the advantage of higher overall cache efficiency for processor 4000. Preferably, the victim cache 4602 is configurable in either a write-back mode or a write-through mode. Flow begins at block 5002.
At block 5002, a request is made to transition from using the selective data store 4652 as the weight RAM 124 of NNU 121 to using the selective data store 4652 as the victim cache 4602 shared by the core complexes 4012. The transition is preferably controlled by the operating system running on processor 4000, in a manner similar to that described above with respect to block 4302. Flow proceeds to block 5004.
At block 5004, the mode 4699 is updated to indicate the victim cache mode to cause the control logic 4644 to use the selective data store 4652 as the victim cache 4602. Flow proceeds to block 5006.
At block 5006, the L3 cache 4005 is notified to begin directing eviction requests to the victim cache 4602. Flow proceeds to block 5008.
At block 5008, the victim cache 4602 begins (e.g., in the manner described above with respect to fig. 48) caching the victim data. Flow proceeds to block 5012.
At block 5012, the L3 cache 4005 detects an address in the tag directory 4642 that hits the victim cache tag 4704 of FIG. 47 and sends a request to the NNU 121 for the cache line of interest from the selective data store 4652 (e.g., in the manner described above with respect to FIG. 49). Flow ends at block 5012.
Referring now to FIG. 51, a flowchart is shown illustrating the operation of processor 4000 of FIG. 46 in the event that selective data storage 4652 of FIG. 46 is being transitioned from a victim cache mode when used as victim cache 4602 to an NNU mode when used as weight RAM 124/data RAM 122 of NNU 121. As described above, the victim cache 4602 may be acting as a write-through cache or as a write-back cache. Flow begins at block 5102.
At block 5102, a transition is requested from using the selective data store 4652 as the victim cache 4602 to using the selective data store 4652 as the weight RAM 124 of the NNU 121. Preferably, the transition is controlled by an operating system running on the processor 4000. For example, an application running on the processor 4000 requests use of the NNU 121 from the operating system, and the operating system detects that the selective data store 4652 is currently acting as the victim cache 4602 and therefore needs to be switched from victim cache mode to NNU mode. Flow proceeds to block 5104.
At block 5104, the L3 cache 4005 is notified to stop directing eviction requests to the victim cache 4602. Flow proceeds to decision block 5106.
At decision block 5106, if the victim cache 4602 is operating as a write-through cache, flow proceeds to block 5112; otherwise, flow proceeds to block 5108.
At block 5108, the control logic 4644 performs a write-back-invalidate of the victim cache 4602. That is, the victim cache 4602 writes back all of its modified cache lines to the system memory 4618 and then invalidates all of its cache lines (by updating the state in the tag directory 4642 to invalid for all cache lines; preferably, the control logic 4644 also resets the replacement information in the tag directory 4642). In one embodiment, the control logic 4644 reads the modified cache lines from the selective data store 4652 and then writes them to the system memory 4618. Alternatively, the control logic 4644 sends a command to the NNU 121 (e.g., to the CSR 127) that specifies the address of a modified cache line within the selective data store 4652, specifies a system memory address, and instructs the NNU 121 to write the modified cache line directly to the system memory 4618. Preferably, the control logic 4644 continues to respond to load requests and snoop requests while performing the write-back-invalidate. Flow proceeds to block 5116.
At block 5112, the control logic 4644 performs an invalidate of the selective data store 4652. That is, the control logic 4644 updates the tag directory 4642 to invalidate all cache lines of the selective data store 4652. As may be observed, if the victim cache 4602 is operating as a write-through cache, the write-back of modified cache lines at block 5108 need not be performed (only the invalidate of block 5112 is required), so the transition to using the selective data store 4652 as the weight RAM 124 can advantageously be faster than when the victim cache 4602 is operating as a write-back cache, which can be a substantial savings if the victim cache 4602 is relatively large. Flow proceeds to block 5116.
At block 5116, the mode 4699 is updated to indicate the NNU mode, as described with respect to block 4212, so that the control logic 4644 no longer uses the selective data store 4652 as a victim cache and the selective data store 4652 can instead be used as the weight RAM 124 accessed by the NPU 126 pipeline and by architectural programs executing on the cores 4002. Flow proceeds to block 5118.
At block 5118, the selective data store 4652 is used as the weight RAM 124, which may be accessed by the NPU 126 pipeline and by architectural programs executing on the cores 4002. Flow ends at block 5118.
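Continuing the sketch above, the reverse transition of FIG. 51 can be expressed as follows. Again, the identifiers and callback functions are hypothetical; the write-back-invalidate and invalidate operations are those of blocks 5108 and 5112.

```c
/* Illustrative sketch of the victim-cache-mode to NNU-mode transition of FIG. 51,
 * building on the hypothetical structure defined in the previous sketch. */
static void switch_to_nnu_mode(struct selective_data_store *s,
                               int is_write_back,          /* victim cache policy */
                               void (*writeback_invalidate)(void),
                               void (*invalidate)(void))
{
    s->l3_evictions_enabled = 0;   /* block 5104: stop directing evictions          */
    if (is_write_back)
        writeback_invalidate();    /* block 5108: write back modified lines, then
                                      invalidate all lines in the tag directory     */
    else
        invalidate();              /* block 5112: invalidate only (faster path)     */
    s->mode = MODE_NNU;            /* block 5116: selective data store is now usable
                                      as the weight RAM 124 (block 5118)            */
}
```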
Referring now to FIG. 52, a block diagram illustrating an embodiment of a portion of the NNU 121 is shown. The NNU 121 includes a move unit 5802, a move register 5804, data multiplexing registers 208, weight multiplexing registers 705, NPUs 126, multiplexers 5806, output units 5808, and an output register 1104. The data multiplexing register 208 and the weight multiplexing register 705 are the same as described above, but are modified to additionally receive inputs from the move register 5804 and from additional neighboring NPUs 126. In one embodiment, in addition to the output 209 from NPU J+1 as described above, the data multiplexing register 208 receives the outputs 209 from NPUs J-1 and J-4 on input 211; likewise, in addition to the output 203 from NPU J+1 as described above, the weight multiplexing register 705 also receives the outputs 203 from NPUs J-1 and J-4 on input 711. The output register 1104 is the same as the buffer referred to above as the row buffer 1104 and the output buffer 1104. The output unit 5808 is the same in many respects as the activation function units 212/1112 described above in that it may include an activation function (e.g., sigmoid, hyperbolic tangent, rectify, softplus); however, the output units 5808 preferably also include a re-quantization unit for re-quantizing the value of the accumulator 202. The NPU 126 is the same as described above in many respects. As described above, different embodiments are contemplated in which the data word width and weight word width may be of various sizes (e.g., 8 bits, 9 bits, 12 bits, or 16 bits), and multiple word sizes may be supported by a given embodiment (e.g., 8 bits and 16 bits). However, a representative embodiment is shown for the following figures, in which the data word width and weight word width held in the memories 122/124, the move register 5804, the multiplexing registers 208/705, and the output register 1104 are 8-bit words, i.e., bytes.
FIG. 52 shows a cross-section of the NNU 121. For example, the illustrated NPU 126 is representative of the array of NPUs 126 (such as described above). The representative NPU 126 is referred to as NPU [J] 126 of the N NPUs 126, where J is between 0 and N-1. As mentioned above, N is a large number and is preferably a power of 2. As described above, N may be 512, 1024, or 2048. In one embodiment, N is 4096. Because of the large number of NPUs 126 in the array, it is advantageous for each NPU 126 to be as small as possible to keep the size of the NNU 121 within desired limits and/or to accommodate more NPUs 126 to increase the speed of the neural-network-related computations performed by the NNU 121.
Further, although the move unit 5802 and the move register 5804 are each N bytes wide, only a portion of the move register 5804 is shown. In particular, the portion of the move register 5804 denoted move register [J] 5804 is shown, whose output 5824 provides its byte to the multiplexing registers 208/705 of NPU [J] 126. Furthermore, although the output 5822 of the move unit 5802 provides N bytes (to the memories 122/124 and the move register 5804), only byte J is shown being loaded into the move register [J] 5804, which then provides byte J on output 5824 to the data multiplexing register 208 and the weight multiplexing register 705.
Further, although the NNU 121 includes multiple output units 5808, only a single output unit 5808 is shown in FIG. 52, namely the output unit 5808 that operates on the accumulator outputs 217 of the multiple NPUs 126 of the NPU group that includes NPU [J] 126 (such as the NPU groups described above with respect to FIG. 11). The output unit 5808 is referred to as output unit [J/4] because, in the embodiment of FIG. 52, each output unit 5808 is shared by a group of four NPUs 126. Likewise, although the NNU 121 includes multiple multiplexers 5806, only a single multiplexer 5806 is shown in FIG. 52, namely the multiplexer 5806 that receives the accumulator outputs 217 of the multiple NPUs 126 of the NPU group that includes NPU [J] 126. Likewise, the multiplexer 5806 is referred to as multiplexer [J/4] because the multiplexer 5806 selects one of the four accumulator 202 outputs 217 to provide to output unit [J/4] 5808.
Finally, although the output register 1104 is N bytes wide, only a single 4-byte segment (denoted output register [J/4] 1104) is shown in FIG. 52; this 4-byte segment receives from output unit [J/4] 5808 the four quantized bytes generated from the four NPUs 126 of the NPU group that includes NPU [J] 126. All N bytes of the output 133 of the output register 1104 are provided to the move unit 5802, but only the four bytes of the 4-byte segment of output register [J/4] 1104 are shown in FIG. 52. In addition, the four bytes of the 4-byte segment of output register [J/4] 1104 are provided as inputs to the multiplexing registers 208/705.
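The sharing of one output unit, one multiplexer, and one 4-byte output register segment per group of four NPUs implies a simple index mapping from NPU index J to group index J/4. The following sketch illustrates that mapping only; the function names, the per-lane loop, and the re-quantization callback are hypothetical and are not asserted to match the hardware sequencing.

```c
/* Illustrative index arithmetic for the NPU grouping of FIG. 52: each output
 * unit [J/4] 5808 and output register segment [J/4] 1104 serves NPUs
 * 4*(J/4) .. 4*(J/4)+3.  Identifiers are hypothetical. */
#include <stdint.h>

static inline uint32_t npu_group(uint32_t j) { return j / 4; }   /* [J/4] */

/* Sketch: output unit [group] re-quantizes the four accumulators 202 of its
 * group into the four bytes of output register segment [group] 1104. */
static void output_unit_group(const uint64_t acc[], uint8_t out_reg[],
                              uint32_t group, uint8_t (*requantize)(uint64_t))
{
    for (uint32_t lane = 0; lane < 4; lane++)
        out_reg[group * 4 + lane] = requantize(acc[group * 4 + lane]);
}
```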
Although the multiplexing registers 208/705 are shown as being distinct from the NPUs 126 in fig. 52, there is a corresponding pair of multiplexing registers 208/705 associated with each NPU 126, and these multiplexing registers 208/705 may be considered part of the NPUs 126 as described above with respect to fig. 2 and 7.
The output 5822 of the move unit 5802 is coupled to the move register 5804, the data RAM 122, and the weight RAM 124, each of which can be written by the output 5822. The output 5822 of the move unit 5802, the move register 5804, the data RAM 122, and the weight RAM 124 are all N bytes wide (e.g., N is 4096). The move unit 5802 receives N quantized bytes from each of five different sources and selects one of them as its input: the data RAM 122, the weight RAM 124, the move register 5804, the output register 1104, and an immediate value. Preferably, the move unit 5802 comprises a plurality of multiplexers interconnected so as to be able to perform operations on its input to produce its output 5822, which operations will now be described.
The operations performed by the move unit 5802 on its input include: passing the input through to the output; rotating the input by a specified amount; and extracting and compacting specified bytes of the input. The operation is specified in a MOVE instruction fetched from the program memory 129. In one embodiment, the rotation amounts that may be specified are 8, 16, 32, and 64 bytes. In one embodiment, the rotation direction is to the left, but other embodiments are contemplated in which the rotation direction is to the right or in either direction. In one embodiment, the extract-and-compact operation is performed within input blocks of a predetermined size. The block size is specified by the MOVE instruction. In one embodiment, the predetermined block sizes are 16, 32, and 64 bytes, and the blocks lie on alignment boundaries of the specified block size. Thus, for example, when the MOVE instruction specifies a block size of 32, the move unit 5802 extracts the specified bytes within each 32-byte block of the N input bytes (e.g., if N is 4096, there are 128 blocks) and compacts them within the corresponding 32-byte block (preferably at one end of the block). In one embodiment, the NNU 121 also includes an N-bit mask register (not shown) associated with the move register 5804. A MOVE instruction specifying a load-mask-register operation may specify a row of the data RAM 122 or weight RAM 124 as its source. In response to the MOVE instruction specifying a load-mask-register operation, the move unit 5802 extracts bit 0 of each of the N words of the RAM row and stores the N bits into the corresponding bits of the N-bit mask register. During execution of a subsequent MOVE instruction that writes to the move register 5804, the bits of the mask are used as a write enable/disable for the corresponding bytes of the move register 5804. In an alternative embodiment, a 64-bit mask is specified by an INITIALIZE instruction and loaded into the mask register prior to executing a MOVE instruction that specifies the extract-and-compact function; in response to the MOVE instruction, the move unit 5802 extracts the bytes within each block (e.g., of the 128 blocks) specified by the 64-bit mask stored in the mask register. In an alternative embodiment, the MOVE instruction that specifies the extract-and-compact operation also specifies a stride and an offset; in response to the MOVE instruction, the move unit 5802 extracts every stride-th byte within each block, starting with the byte specified by the offset, and compacts the extracted bytes together. For example, if the MOVE instruction specifies a stride of 3 and an offset of 2, the move unit 5802 extracts every third byte, starting at byte 2, within each block.
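The stride/offset form of the extract-and-compact operation can be illustrated with a short software model. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the handling of the unextracted destination bytes (zeroing them) is an assumption, since the text above does not specify it.

```c
/* Illustrative software model of the MOVE unit's extract-and-compact operation
 * with a stride and an offset, applied independently within each aligned block
 * of the N-byte input.  Identifiers are hypothetical. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void extract_and_compact(const uint8_t *in, uint8_t *out, size_t n,
                                size_t block_size, size_t stride, size_t offset)
{
    memset(out, 0, n);                      /* assumption: unextracted bytes zeroed */
    for (size_t blk = 0; blk < n; blk += block_size) {
        size_t dst = blk;                   /* compact at the low end of the block  */
        for (size_t src = blk + offset; src < blk + block_size; src += stride)
            out[dst++] = in[src];
    }
}

/* Matching the example in the text: block_size = 32, stride = 3, offset = 2
 * extracts bytes 2, 5, 8, ... of each 32-byte block and packs them together. */
```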
Referring now to fig. 53, a block diagram is shown that illustrates the ring station 4004-N of fig. 46 in greater detail. The ring station 4004-N includes a slave interface 6301, a first master interface 6302-0, referred to as master interface 0, and a second master interface 6302-1, referred to as master interface 1. Master interface 06302-0 and master interface 16302-1 are referred to generically as master interface 6302 individually or collectively as master interface(s) 6302. The ring station 4004-N further comprises three arbiters 6362, 6364 and 6366 coupled to respective buffers 6352, 6354 and 6356 providing outgoing Requests (REQ), DATA (DATA) and Acknowledgements (ACK), respectively, on the first unidirectional ring 4024-0 of the ring bus 4024; these three arbiters 6362, 6364 and 6366 receive incoming Requests (REQ), DATA (DATA) and Acknowledgements (ACK), respectively, on first unidirectional ring 4024-0. The ring station 4004-N comprises three additional arbiters 6342, 6344 and 6346 coupled to respective additional buffers 6332, 6334 and 6336 providing outgoing Requests (REQ), DATA (DATA) and Acknowledgements (ACK), respectively, on the second unidirectional ring 4024-1 of the ring bus 4024; these three arbiters 6342, 6344 and 6346 receive incoming Requests (REQ), DATA (DATA) and Acknowledgements (ACK), respectively, on second unidirectional ring 4024-1. The request, data, and acknowledge subrings of each unidirectional ring of ring bus 4024 are described above. The listening and credit subrings are not shown, but slave 6301 and master 6302 interfaces are also coupled to the listening and credit subrings.
The slave interface 6301 includes a load queue 6312 and a store queue 6314; the master interface 0 6302-0 includes a load queue 6322 and a store queue 6324; and the master interface 1 6302-1 includes a load queue 6332 and a store queue 6334. The load queue 6312 of the slave interface 6301 receives and queues requests from both unidirectional rings 4024-0 and 4024-1 of the ring bus 4024 and provides queued data to each of the respective arbiters 6364 and 6344 of the ring bus 4024. The store queue 6314 of the slave interface 6301 receives and queues data from both directions of the ring bus 4024 and provides acknowledgements to each of the respective arbiters 6366 and 6346 of the ring bus 4024. The load queue 6322 of the master interface 0 6302-0 receives data from the second unidirectional ring 4024-1 and provides queued requests to the arbiter 6362 of the first unidirectional ring 4024-0. The store queue 6324 of the master interface 0 6302-0 receives acknowledgements from the second unidirectional ring 4024-1 and provides queued data to the arbiter 6364 of the first unidirectional ring 4024-0. The load queue 6332 of the master interface 1 6302-1 receives data from the first unidirectional ring 4024-0 and provides queued requests to the arbiter 6342 of the second unidirectional ring 4024-1. The store queue 6334 of the master interface 1 6302-1 receives acknowledgements from the first unidirectional ring 4024-0 and provides queued data to the arbiter 6344 of the second unidirectional ring 4024-1. The load queue 6312 of the slave interface 6301 provides queued requests to the NNU 121 and receives data from the NNU 121. The store queue 6314 of the slave interface 6301 provides queued requests and data to the NNU 121 and receives acknowledgements from the NNU 121. The load queue 6322 of the master interface 0 6302-0 receives and queues requests from the NNU 121 and provides data to the NNU 121. The store queue 6324 of the master interface 0 6302-0 receives and queues requests and data from the NNU 121 and provides acknowledgements to the NNU 121. The load queue 6332 of the master interface 1 6302-1 receives and queues requests from the NNU 121 and provides data to the NNU 121. The store queue 6334 of the master interface 1 6302-1 receives and queues requests and data from the NNU 121 and provides acknowledgements to the NNU 121.
In general, slave interface 6301 receives requests made by core 4002 to load data from NNU121 (received by load queue 6312) and requests made by core 4002 to store data to NNU121 (received by store queue 6314), but slave interface 6301 may also receive requests from other ring bus 4024 agents such as L3 cache 4005, where the L3 cache 4005 reads/writes cache lines relative to selective data store 4652 (e.g., weight RAM 124 or data RAM 122) when acting as victim cache 4602 as described above. For example, via slave interface 6301, core 4002 may: write control data and read status data with respect to control/status register 127; write instructions to program memory 129; write/read data/weights with respect to data RAM 122 and weight RAM 124; and writes control words to bus controller memory 6636 to program DMA controller 6602 (see fig. 56) of NNU 121. More specifically, in embodiments where NNU121 is located on ring bus 4024 rather than being an execution unit of core 4002, core 4002 may write to control/status register 127 to instruct NNU121 to perform similar operations as described for MTNN instruction 1400 of fig. 14, and may read from control/status register 127 to instruct NNU121 to perform similar operations as described for MFNN instruction 1500 of fig. 15. The list of operations includes, but is not limited to: starting execution of the program in program memory 129, suspending execution of the program in program memory 129, requesting notification (e.g., an interrupt) of completion of execution of the program in program memory 129, resetting NNU121, writing a DMA base register, and writing a strobe (strobe) address to cause a row buffer to be written to or read from data/weight RAM 122/124. In addition, slave interface 6301 may generate interrupts (e.g., PCI interrupts) to each core 4002 at the request of NNU 121. Preferably, the sequencer 128 instructs the slave interface 6301 to generate an interrupt, for example in response to decoding an instruction fetched from the program memory 129. Alternatively, DMAC 6602 may instruct the slave interface 6301 to generate an interrupt, for example, in response to completing a DMA operation (e.g., after writing a data word that is the result of a neural network layer computation from the data RAM 122 to system memory). In one embodiment, the interrupt comprises a vector, such as an 8-bit x86 interrupt vector, or the like. Preferably, a flag in a control word read by DMAC 6602 from bus control memory 6636 specifies whether DMAC 6602 indicates that the slave interface 6301 generates an interrupt upon completion of the DMA operation.
Typically, the NNU 121 generates, via the master interface 6302, requests to write data to system memory (received by the store queues 6324/6334) and requests to read data from system memory (received by the load queues 6322/6332) (e.g., via the DRAM controller 4018), but the master interface 6302 may also receive requests from the NNU 121 to read/write data with respect to other ring bus 4024 agents. For example, the NNU 121 may transfer data/weights from system memory to the data RAM 122 and weight RAM 124, and may transfer data from the data RAM 122 and weight RAM 124 to system memory, via the master interface 6302. The NNU 121 may also generate a request via the master store queues 6324/6334 to write a cache line to the system memory 4618 (e.g., to write back a modified cache line per block 5108 of FIG. 51).
Preferably, the various entities of NNU 121 accessible via ring bus 4024 (such as data RAM122, weight RAM 124, program memory 129, bus control memory 6636, and control/status registers 127, etc.) are memory mapped into system memory space. In one embodiment, the accessible NNU 121 entities are memory mapped via Peripheral Component Interconnect (PCI) configuration registers of the PCI configuration protocol, which is well known.
An advantage of having two master interfaces 6302 for the ring stations 4004-N is that it enables NNUs 121 to transmit and/or receive simultaneously with respect to both system memory (via DRAM controller 4018) and the various L3 slices 4006, or alternatively in parallel with respect to system memory at twice the bandwidth of an embodiment having a single master interface.
In one embodiment, the data RAM 122 is 64 KB, arranged as 16 rows of 4 KB each, thus requiring 4 bits to specify its row address; the weight RAM 124 is 8 MB, arranged as 2K rows of 4 KB each, thus requiring 11 bits to specify its row address; the program memory 129 is 8 KB, arranged as 1K rows of 64 bits each, thus requiring 10 bits to specify its row address; the bus control memory 6636 is 1 KB, arranged as 128 rows of 64 bits each, thus requiring 7 bits to specify its row address; and each of the queues 6312/6314/6322/6324/6332/6334 includes 16 entries, thus requiring 4 bits to specify the index of an entry. In addition, the data sub-ring of each unidirectional ring of the ring bus 4024 has a width of 64 bytes. Accordingly, a 64-byte portion is referred to herein as a block, a block of data, etc. (data may be used generically to refer to both data and weights). Thus, the rows of the data RAM 122 or weight RAM 124, while not addressable at the block level according to one embodiment, are each subdivided into 64 blocks; in addition, the data/weight write buffers 6612/6622 (of FIG. 56) and the data/weight read buffers 6614/6624 (of FIG. 56) are each also subdivided into 64 blocks of 64 bytes each and are addressable at the block level; thus, 6 bits are required to specify the address of a block within a row/buffer. These sizes are assumed in the following description for ease of illustration; however, various other embodiments having different sizes are contemplated.
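The address-width arithmetic above can be checked mechanically. The following sketch merely verifies the stated example sizes; the macro and function names are hypothetical.

```c
/* Sanity check of the row/block address widths stated above, assuming the
 * example sizes (4 KB rows, 64-byte blocks, 64-bit program/bus-control words). */
#include <assert.h>

#define ROW_BYTES   4096   /* one row of data RAM 122 / weight RAM 124        */
#define BLOCK_BYTES   64   /* width of the ring bus 4024 data sub-ring        */

static void check_address_widths(void)
{
    assert((64u * 1024) / ROW_BYTES == 16);          /* data RAM 122: 4 row bits      */
    assert((8u * 1024 * 1024) / ROW_BYTES == 2048);  /* weight RAM 124: 11 row bits   */
    assert((8u * 1024) / 8 == 1024);                 /* program memory 129: 10 bits   */
    assert((1u * 1024) / 8 == 128);                  /* bus control mem 6636: 7 bits  */
    assert(ROW_BYTES / BLOCK_BYTES == 64);           /* 6 bits select a block in a row*/
}
```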
Referring now to FIG. 54, a block diagram is shown illustrating the slave interface 6301 of FIG. 53 in greater detail. The slave interface 6301 includes a load queue 6312, a store queue 6314, arbiters 6342, 6344, 6346, 6362, 6364, and 6366, and buffers 6332, 6334, 6336, 6352, 6354, and 6356, which are coupled to the ring bus 4024 of fig. 53. FIG. 54 also includes other requesters 6472 (e.g., master interface 06302-0) that generate requests to arbiter 6362 and other requesters 6474 (e.g., master interface 16302-1) that generate requests to arbiter 6342.
The slave load queue 6312 includes a queue of entries 6412 coupled to a request arbiter 6416 and a data arbiter 6414. In the illustrated embodiment, the queue includes 16 entries 6412. Each entry 6412 includes storage for an address, a source identifier, a direction, a transaction identifier, and the data block associated with the request. The address specifies the location within the NNU 121 from which the requested data is to be loaded for return to the requesting ring bus 4024 agent (e.g., a core 4002). The address may specify a block location within the control/status registers 127, the data RAM 122, or the weight RAM 124. Where the address specifies a block location within the data RAM 122/weight RAM 124, in one embodiment the upper bits specify a row of the data RAM 122/weight RAM 124, and the lower bits (e.g., 6 bits) specify the block within the specified row. Preferably, the lower bits are used to control the data/weight read buffer multiplexer 6615/6625 to select the appropriate block within the data/weight read buffer 6614/6624 (see FIG. 56). The source identifier specifies the requesting ring bus 4024 agent. The direction specifies on which of the two unidirectional rings 4024-0 or 4024-1 the data is to be sent back to the requesting agent. The transaction identifier is specified by the requesting agent and is returned by the ring station 4004-N to the requesting agent along with the requested data.
Each entry 6412 also has an associated state. A finite state machine (FSM) updates the state. In one embodiment, the FSM operates as follows. When the load queue 6312 detects a load request destined for it on the ring bus 4024, the load queue 6312 allocates an available entry 6412 and fills the allocated entry 6412, and the FSM updates the state of the allocated entry 6412 to request-NNU. The request arbiter 6416 arbitrates among the request-NNU entries 6412. When the allocated entry 6412 wins arbitration and is sent as a request to the NNU 121, the FSM marks the entry 6412 as NNU-data-pending. When the NNU 121 responds with the requested data, the load queue 6312 loads the data into the entry 6412 and marks the entry 6412 as request-data-ring. The data arbiter 6414 arbitrates among the request-data-ring entries 6412. When the entry 6412 wins arbitration and the data is sent on the ring bus 4024 to the ring bus 4024 agent that requested the data, the FSM marks the entry 6412 as available and a credit is issued on the credit ring.
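The per-entry state machine just described can be summarized as follows. This is a minimal sketch: the enumerators track the states named above, while the event names and the function itself are hypothetical.

```c
/* Illustrative sketch of the per-entry FSM of the slave load queue 6312. */
enum slave_load_state {
    ENTRY_AVAILABLE,      /* entry free                                    */
    REQUEST_NNU,          /* waiting to win arbitration toward the NNU     */
    NNU_DATA_PENDING,     /* request sent; waiting for data from the NNU   */
    REQUEST_DATA_RING     /* data received; waiting to win the data ring   */
};

enum slave_load_event {
    LOAD_REQ_RECEIVED, SENT_TO_NNU, NNU_DATA_RETURNED, DATA_SENT_ON_RING
};

static enum slave_load_state slave_load_fsm(enum slave_load_state s,
                                            enum slave_load_event e)
{
    switch (s) {
    case ENTRY_AVAILABLE:   return e == LOAD_REQ_RECEIVED ? REQUEST_NNU       : s;
    case REQUEST_NNU:       return e == SENT_TO_NNU       ? NNU_DATA_PENDING  : s;
    case NNU_DATA_PENDING:  return e == NNU_DATA_RETURNED ? REQUEST_DATA_RING : s;
    case REQUEST_DATA_RING: return e == DATA_SENT_ON_RING ? ENTRY_AVAILABLE   : s;
    }
    /* On the transition back to ENTRY_AVAILABLE, a credit is issued on the
     * credit ring, as described above. */
    return s;
}
```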
The slave store queue 6314 includes a queue of entries 6422 coupled to a request arbiter 6426 and an acknowledge arbiter 6424. In the illustrated embodiment, the queue includes 16 entries 6422. Each entry 6422 includes storage for an address, a source identifier, and the data associated with the request. The address specifies the location within the NNU 121 to which the data provided by the requesting ring bus 4024 agent (e.g., a core 4002) is to be stored. The address may specify a block location within the control/status registers 127, the data RAM 122, or the weight RAM 124, a location within the program memory 129, or a location within the bus control memory 6636. Where the address specifies a block location within the data RAM 122/weight RAM 124, in one embodiment the upper bits specify a row of the data RAM 122/weight RAM 124, and the lower bits (e.g., 6 bits) specify the block within the specified row. Preferably, the lower bits are used to control the data/weight demultiplexer 6611/6621 to select the appropriate block to be written within the data/weight write buffer 6612/6622 (see FIG. 56). The source identifier specifies the requesting ring bus 4024 agent.
Each entry 6422 also has an associated state. A finite state machine (FSM) updates the state. In one embodiment, the FSM operates as follows. When the store queue 6314 detects a store request destined for it on the ring bus 4024, the store queue 6314 allocates an available entry 6422 and fills the allocated entry 6422, and the FSM updates the state of the allocated entry 6422 to request-NNU. The request arbiter 6426 arbitrates among the request-NNU entries 6422. When the entry 6422 wins arbitration and is sent to the NNU 121 along with its data, the FSM marks the entry 6422 as NNU-acknowledge-pending. When the NNU 121 responds with an acknowledgement, the FSM marks the entry 6422 as request-acknowledge-ring. The acknowledge arbiter 6424 arbitrates among the request-acknowledge-ring entries 6422. When the entry 6422 wins arbitration and an acknowledgement is sent on the acknowledge ring to the ring bus 4024 agent that requested the store, the FSM marks the entry 6422 as available and a credit is issued on the credit ring. The store queue 6314 also receives a wr busy signal from the NNU 121, which indicates that the store queue 6314 should not make requests to the NNU 121 until the wr busy signal is no longer asserted.
Referring now to FIG. 55, a block diagram is shown illustrating the master interface 0 6302-0 of FIG. 53 in greater detail. Although FIG. 55 shows the master interface 0 6302-0, it is also representative of the details of the master interface 1 6302-1 of FIG. 53, and they are therefore referred to generically as the master interface 6302. The master interface 6302 includes a load queue 6322, a store queue 6324, the arbiters 6362, 6364, and 6366, and the buffers 6352, 6354, and 6356 coupled to the ring bus 4024 of FIG. 53. FIG. 55 also shows other acknowledge requestors 6576 (e.g., the slave interface 6301) that present acknowledge requests to the arbiter 6366.
The master interface 6302 also includes an arbiter 6534 (not shown in FIG. 53) that receives requests from the load queue 6322 and from other requestors 6572 (e.g., the DRAM controller 4018, in embodiments in which the NNU 121 and the DRAM controller 4018 share the ring station 4004-N) and presents the winning request to the arbiter 6362 of FIG. 53. The master interface 6302 also includes a buffer 6544 that receives from the ring bus 4024 the data associated with an entry 6512 of the load queue 6322 and provides it to the NNU 121. The master interface 6302 also includes an arbiter 6554 (not shown in FIG. 53) that receives data from the store queue 6324 and from other requestors 6574 (e.g., the DRAM controller 4018, in embodiments in which the NNU 121 and the DRAM controller 4018 share the ring station 4004-N) and presents the winning data to the arbiter 6364 of FIG. 53. The master interface 6302 also includes a buffer 6564 that receives from the ring bus 4024 the acknowledgement associated with an entry 6522 of the store queue 6324 and provides it to the NNU 121.
The load queue 6322 includes a queue of entries 6512 coupled to an arbiter 6514. In the illustrated embodiment, the queue includes 16 entries 6512. Each entry 6512 includes storage for an address and a destination identifier. The address specifies an address (46 bits in one embodiment) in the ring bus 4024 address space (e.g., a system memory location). The destination identifier specifies the ring bus 4024 agent (e.g., system memory) from which the data is to be loaded.
The load queue 6322 receives a master load request from the NNU 121 (e.g., from the DMAC 6602) to load data from a ring bus 4024 agent (e.g., system memory) into the data RAM 122 or weight RAM 124. The master load request specifies a destination identifier, a ring bus address, and the index of the load queue 6322 entry 6512 to use. When the load queue 6322 receives a master load request from the NNU 121, the load queue 6322 fills the indexed entry 6512, and the FSM updates the entry 6512 state to requestor-credit. When the load queue 6322 obtains a credit from the credit ring to send a request for the data to the destination ring bus 4024 agent (e.g., system memory), the FSM updates the state to requestor-request-ring. The arbiter 6514 arbitrates among the requestor-request-ring entries 6512 (and the arbiter 6534 arbitrates between the load queue 6322 and the other requestors 6572). When the entry 6512 is granted the request ring, the request is sent on the request ring to the destination ring bus 4024 agent (e.g., system memory), and the FSM updates the state to data-ring-pending. When the ring bus 4024 responds with the data (e.g., from system memory), the data is received in the buffer 6544 and provided to the NNU 121 (e.g., to the data RAM 122, weight RAM 124, program memory 129, or bus control memory 6636), and the FSM updates the entry 6512 state to available. Preferably, the index of the entry 6512 is included within the data packet to enable the load queue 6322 to determine which entry 6512 the data packet is associated with. The load queue 6322 preferably provides the entry 6512 index to the NNU 121 along with the data to enable the NNU 121 to determine which entry 6512 the data is associated with and to enable the NNU 121 to reuse the entry 6512.
The master store queue 6324 includes a queue of entries 6522 coupled to an arbiter 6524. In the illustrated embodiment, the queue includes 16 entries 6522. Each entry 6522 includes storage for an address, a destination identifier, a data field for holding the data to be stored, and a coherency flag. The address specifies an address in the ring bus 4024 address space (e.g., a system memory location). The destination identifier specifies the ring bus 4024 agent (e.g., system memory) into which the data is to be stored. The coherency flag is sent with the data to the destination agent. If the coherency flag is set, it instructs the DRAM controller 4018 to snoop the L3 cache 4005 and the victim cache 4602 and invalidate the copy of the cache line if present. Otherwise, the DRAM controller 4018 writes the data to system memory without snooping the L3 cache 4005 and the victim cache 4602.
The store queue 6324 receives a master store request from the NNU 121 (e.g., from the DMAC 6602) to store data from the data RAM 122 or weight RAM 124 to a ring bus 4024 agent (e.g., system memory). The master store request specifies the destination identifier, the ring bus address, the index of the store queue 6324 entry 6522 to use, and the data to store. When the store queue 6324 receives a master store request from the NNU 121, the store queue 6324 fills the indexed entry 6522, and the FSM updates the entry 6522 state to requestor-credit. When the store queue 6324 obtains a credit from the credit ring to send the data to the destination ring bus 4024 agent (e.g., system memory), the FSM updates the state to requestor-data-ring. The arbiter 6524 arbitrates among the requestor-data-ring entries 6522 (and the arbiter 6554 arbitrates between the store queue 6324 and the other requestors 6574). When the entry 6522 is granted the data ring, the data is sent on the data ring to the destination ring bus 4024 agent (e.g., system memory), and the FSM updates the state to acknowledge-ring-pending. When the ring bus 4024 responds with an acknowledgement of the data (e.g., from system memory), the acknowledgement is received in the buffer 6564. The store queue 6324 then provides the acknowledgement to the NNU 121 to inform the NNU 121 that the store has been performed, and the FSM updates the entry 6522 state to available. Preferably, the store queue 6324 does not have to arbitrate to provide acknowledgements to the NNU 121 (e.g., in embodiments in which a DMAC 6602 exists for each store queue 6324, as in the embodiment of FIG. 56). However, in embodiments in which the store queue 6324 must arbitrate to provide acknowledgements, the FSM updates the entry 6522 state to requestor-NNU-completion when the ring bus 4024 responds with an acknowledgement, and updates the entry 6522 state to available once the entry 6522 wins arbitration and the acknowledgement is provided to the NNU 121. Preferably, the index of the entry 6522 is included within the acknowledgement packet received from the ring bus 4024, which enables the store queue 6324 to determine which entry 6522 the acknowledgement packet is associated with. The store queue 6324 provides the entry 6522 index to the NNU 121 along with the acknowledgement to enable the NNU 121 to determine which entry 6522 the acknowledgement is associated with and to enable the NNU 121 to reuse the entry 6522.
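The contents of a master store queue entry and its state progression can be sketched as a data structure. The field and state names below are hypothetical and merely mirror the description above; they are not asserted to match the hardware encoding.

```c
/* Illustrative sketch of a master store queue 6324 entry and its state
 * progression: requestor-credit -> requestor-data-ring -> acknowledge-ring-
 * pending -> available. */
#include <stdbool.h>
#include <stdint.h>

enum master_store_state {
    MS_AVAILABLE, MS_REQUESTOR_CREDIT, MS_REQUESTOR_DATA_RING, MS_ACK_RING_PENDING
};

struct master_store_entry {
    uint64_t ring_bus_addr;   /* e.g., a system memory location (46 bits)       */
    uint8_t  dest_id;         /* destination ring bus agent, e.g., DRAM ctrl    */
    uint8_t  data[64];        /* one 64-byte block to be stored                 */
    bool     coherent;        /* if set, DRAM controller 4018 snoops the L3
                                 cache 4005 and victim cache 4602 before writing*/
    enum master_store_state state;
};
```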
Referring now to FIG. 56, a block diagram is shown illustrating portions of the ring station 4004-N of FIG. 53 and a ring-bus-coupled embodiment of the NNU 121. The slave interface 6301, the master interface 0 6302-0, and the master interface 1 6302-1 of the ring station 4004-N are shown. The ring-bus-coupled embodiment of the NNU 121 of FIG. 56 includes embodiments of the data RAM 122, weight RAM 124, program memory 129, sequencer 128, and control/status registers 127 described in detail above. The ring-bus-coupled embodiment of the NNU 121 is similar in many respects to the execution unit embodiments described above, and for the sake of brevity those aspects will not be re-described. The ring-bus-coupled embodiment of the NNU 121 also includes the elements described in FIG. 52, e.g., the move unit 5802, move register 5804, multiplexing registers 208/705, NPUs 126, multiplexers 5806, output units 5808, and output register 1104. The NNU 121 further includes a first direct memory access controller (DMAC 0) 6602-0, a second direct memory access controller (DMAC 1) 6602-1, a bus control memory 6636, data demultiplexers 6611, data write buffers 6612, a data RAM multiplexer 6613, data read buffers 6614, data read buffer multiplexers 6615, weight demultiplexers 6621, weight write buffers 6622, a weight RAM multiplexer 6623, weight read buffers 6624, weight read buffer multiplexers 6625, a slave multiplexer 6691, a master 0 multiplexer 6693, and a master 1 multiplexer 6692. In one embodiment, there are three each of the data demultiplexer 6611, data write buffer 6612, data read buffer 6614, data read buffer multiplexer 6615, weight demultiplexer 6621, weight write buffer 6622, weight read buffer 6624, and weight read buffer multiplexer 6625, associated respectively with the slave interface 6301, the master interface 0 6302-0, and the master interface 1 6302-1 of the ring bus 4024. In one embodiment, the three instances of the data demultiplexer 6611, data write buffer 6612, data read buffer 6614, data read buffer multiplexer 6615, weight demultiplexer 6621, weight write buffer 6622, weight read buffer 6624, and weight read buffer multiplexer 6625 respectively associated with the slave interface 6301, the master interface 0 6302-0, and the master interface 1 6302-1 are paired to support data transfer in a double-buffered manner.
The data demultiplexers 6611 are coupled to receive data blocks from the slave interface 6301, the master interface 0 6302-0, and the master interface 1 6302-1, respectively. The data demultiplexers 6611 are further coupled to the data write buffers 6612; the data write buffers 6612 are coupled to the data RAM multiplexer 6613; the data RAM multiplexer 6613 is coupled to the data RAM 122; the data RAM 122 is coupled to the data read buffers 6614; the data read buffers 6614 are coupled to the respective data read buffer multiplexers 6615; and the data read buffer multiplexers 6615 are coupled to the slave multiplexer 6691, the master 0 multiplexer 6693, and the master 1 multiplexer 6692, respectively. The slave multiplexer 6691 is coupled to the slave interface 6301, the master 0 multiplexer 6693 is coupled to the master interface 0 6302-0, and the master 1 multiplexer 6692 is coupled to the master interface 1 6302-1. The weight demultiplexers 6621 are coupled to receive data blocks from the slave interface 6301, the master interface 0 6302-0, and the master interface 1 6302-1, respectively. The weight demultiplexers 6621 are further coupled to the weight write buffers 6622; the weight write buffers 6622 are coupled to the weight RAM multiplexer 6623; the weight RAM multiplexer 6623 is coupled to the weight RAM 124; the weight RAM 124 is coupled to the weight read buffers 6624; the weight read buffers 6624 are coupled to the respective weight read buffer multiplexers 6625; and the weight read buffer multiplexers 6625 are coupled to the slave multiplexer 6691, the master 0 multiplexer 6693, and the master 1 multiplexer 6692, respectively. The data RAM multiplexer 6613 and the weight RAM multiplexer 6623 are also coupled to the output register 1104 and the move register 5804. The data RAM 122 and the weight RAM 124 are also coupled to the move unit 5802, and to the data multiplexing registers 208 and the weight multiplexing registers 705 of the NPUs 126, respectively. The control/status registers 127 are coupled to the slave interface 6301. The bus control memory 6636 is coupled to the slave interface 6301, the sequencer 128, the DMAC 0 6602-0, and the DMAC 1 6602-1. The program memory 129 is coupled to the slave interface 6301 and the sequencer 128. The sequencer 128 is coupled to the program memory 129, the bus control memory 6636, the NPUs 126, the move unit 5802, and the output units 5808. The DMAC 0 6602-0 is also coupled to the master interface 0 6302-0, and the DMAC 1 6602-1 is also coupled to the master interface 1 6302-1.
The data write buffer 6612, data read buffer 6614, weight write buffer 6622, and weight read buffer 6624 are the width of the data RAM 122 and weight RAM 124, i.e., the width of the NPU 126 array, which is generally referred to herein as N. Thus, for example, in one embodiment there are 4096 NPUs 126 and the width of the data write buffer 6612, data read buffer 6614, weight write buffer 6622, and weight read buffer 6624 is 4096 bytes, although other embodiments are contemplated in which N is a value other than 4096. The data RAM 122 and weight RAM 124 are written an entire N-word row at a time. The output register 1104, the move register 5804, and the data write buffer 6612 write to the data RAM 122 via the data RAM multiplexer 6613, which selects one of them to write a row of words to the data RAM 122. The output register 1104, the move register 5804, and the weight write buffer 6622 write to the weight RAM 124 via the weight RAM multiplexer 6623, which selects one of them to write a row of words to the weight RAM 124. Control logic (not shown) controls the data RAM multiplexer 6613 to arbitrate among the data write buffer 6612, the move register 5804, and the output register 1104 for access to the data RAM 122, and controls the weight RAM multiplexer 6623 to arbitrate among the weight write buffer 6622, the move register 5804, and the output register 1104 for access to the weight RAM 124. The data RAM 122 and weight RAM 124 are also read an entire N-word row at a time. The NPUs 126, the move unit 5802, and the data read buffer 6614 read a row of words from the data RAM 122. The NPUs 126, the move unit 5802, and the weight read buffer 6624 read a row of words from the weight RAM 124. The control logic also controls the NPUs 126 (the data multiplexing registers 208 and weight multiplexing registers 705), the move unit 5802, and the data read buffer 6614 to determine which of them, if any, reads the row of words output by the data RAM 122. In one embodiment, the micro-operation 3418 described with respect to FIG. 34 or FIG. 57 may include at least some of the control signals that control the data RAM multiplexer 6613, the weight RAM multiplexer 6623, the NPUs 126, the move unit 5802, the move register 5804, the output register 1104, the data read buffer 6614, and the weight read buffer 6624.
The data write buffer 6612, data read buffer 6614, weight write buffer 6622, and weight read buffer 6624 are addressable in block-size-aligned blocks. Preferably, the block size of the data write buffer 6612, data read buffer 6614, weight write buffer 6622, and weight read buffer 6624 matches the width of the data sub-ring of the ring bus 4024. This makes the ring bus 4024 well suited to reading/writing the data/weight RAM 122/124 as follows. In general, the ring bus 4024 performs a block-size write to each block of the data write buffer 6612, and once all of the blocks of the data write buffer 6612 are filled, the data write buffer 6612 writes its N-word contents to an entire row of the data RAM 122. Likewise, the ring bus 4024 performs a block-size write to each block of the weight write buffer 6622, and once all of the blocks of the weight write buffer 6622 are filled, the weight write buffer 6622 writes its N-word contents to an entire row of the weight RAM 124. Conversely, an entire N-word row is read from the data RAM 122 into the data read buffer 6614; the ring bus 4024 then performs block-size reads from each block of the data read buffer 6614. Similarly, an entire N-word row is read from the weight RAM 124 into the weight read buffer 6624; the ring bus 4024 then performs block-size reads from each block of the weight read buffer 6624. Although the data RAM 122 and the weight RAM 124 are represented as dual-ported memories in FIG. 56, they are preferably single-ported memories, such that a single data RAM 122 port is shared by the data RAM multiplexer 6613 and the data read buffer 6614, and a single weight RAM 124 port is shared by the weight RAM multiplexer 6623 and the weight read buffer 6624. Thus, an advantage of the full-row read/write arrangement is that it enables the data RAM 122 and weight RAM 124 to be smaller by having a single port (in one embodiment the weight RAM 124 is 8 MB and the data RAM 122 is 64 KB), while accesses by the ring bus 4024 consume less of the data RAM 122 and weight RAM 124 bandwidth than if individual blocks were written and read, thus freeing more bandwidth for the NPUs 126, output register 1104, move register 5804, and move unit 5802 to access N-word-wide rows. However, other embodiments are contemplated in which individual blocks of the weight RAM 124 and data RAM 122 may be written/read, e.g., to facilitate their use as the selective data store 4652 of the victim cache 4602, during which time the NPUs 126, output register 1104, move register 5804, and move unit 5802 are not accessing the selective data store 4652.
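The write-buffer discipline described above (fill 64 blocks individually, then write one full row) can be modeled in software. The structure, function names, and the bitmask used to track filled blocks are hypothetical; the sketch only illustrates the fill-then-flush behavior.

```c
/* Illustrative sketch of the full-row write-buffer discipline: the ring bus
 * fills the 64 blocks of a data/weight write buffer one block at a time, and
 * the buffer is written to the RAM as a whole N-byte row only once every
 * block has been filled. */
#include <stdint.h>
#include <string.h>

#define BLOCKS_PER_ROW 64
#define BLOCK_BYTES    64
#define ROW_BYTES      (BLOCKS_PER_ROW * BLOCK_BYTES)   /* N = 4096 */

struct row_write_buffer {
    uint8_t  bytes[ROW_BYTES];
    uint64_t filled;                        /* one bit per 64-byte block */
};

static void write_block(struct row_write_buffer *b, unsigned blk,
                        const uint8_t block[BLOCK_BYTES],
                        void (*write_row)(const uint8_t row[ROW_BYTES]))
{
    memcpy(&b->bytes[blk * BLOCK_BYTES], block, BLOCK_BYTES);
    b->filled |= 1ull << blk;
    if (b->filled == ~0ull) {               /* all 64 blocks present      */
        write_row(b->bytes);                /* single full-row RAM write  */
        b->filled = 0;
    }
}
```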
The control/status register 127 is provided to the slave interface 6301. The slave multiplexer 6691 receives the output of the data read buffer multiplexer 6615 associated with the slave interface 6301 and the output of the weight read buffer multiplexer 6625 associated with the slave interface 6301 and selects one of them to provide to the slave interface 6301. In this manner, the slave load queue 6312 receives data for responding to load requests made by the slave interface 6301 to the control/status register 127, the data RAM122, or the weight RAM 124. The master 0 multiplexer 6693 receives the output of the data read buffer multiplexer 6615 associated with the master interface 06302-0 and the output of the weight read buffer multiplexer 6625 associated with the master interface 06302-0 and selects one of them to provide to the master interface 06302-0. In this manner, the master interface 06302-0 receives data for responding to store requests made by the master interface 06302-0 store queue 6324. The primary 1 multiplexer 6692 receives the output of the data read buffer multiplexer 6615 associated with the primary interface 16302-1 and the output of the weight read buffer multiplexer 6625 associated with the primary interface 16302-1 and selects one of them to provide to the primary interface 16302-1. In this manner, the master interface 16302-1 receives data for responding to a store request made by the master interface 16302-1 store queue 6324. If the slave interface 6301 load queue 6312 requests a read from the data RAM122, the slave multiplexer 6691 selects the output of the data read buffer multiplexer 6615 associated with the slave interface 6301; whereas if the slave interface 6301 load queue 6312 requests a read from the weight RAM 124, the slave multiplexer 6691 selects the output of the weight read buffer multiplexer 6625 associated with the slave interface 6301. Similarly, if the primary interface 06302-0 store queue requests to read data from the data RAM122, the primary 0 multiplexer 6693 selects the output of the data read buffer multiplexer 6615 associated with the primary interface 06302-0; whereas if the primary interface 06302-0 store queue requests to read data from weight RAM 124, the primary 0 multiplexer 6693 selects the output of the weight read buffer multiplexer 6625 associated with the primary interface 06302-0. Finally, if the primary interface 16302-1 store queue requests to read data from the data RAM122, the primary 1 multiplexer 6692 selects the output of the data read buffer multiplexer 6615 associated with the primary interface 16302-1; whereas if primary interface 16302-1 store queue requests to read data from weight RAM 124, primary 1 multiplexer 6692 selects the output of weight read buffer multiplexer 6625 associated with primary interface 16302-1. Thus, the ring bus 4024 agent (e.g., core 4002) may read from the control/status register 127, data RAM122, or weight RAM 124 via the slave interface 6301 load queue 6312. In addition, the ring bus 4024 agent (e.g., core 4002) may write to the control/status register 127, data RAM122, weight RAM 124, program memory 129, or bus control memory 6636 via the slave interface 6301 store queue 6314. More specifically, core 4002 can write a program (e.g., a program that performs full join, convolution, pooling, LSTM, or other recurrent neural network layer computations) to program memory 129 and then to control/status register 127 to start the program. 
Further, the core 4002 may write control words to the bus control memory 6636 to cause the DMAC 6602 to perform DMA operations between the data RAM122 or weight RAM 124 and the ring bus 4024 agent (e.g., system memory or L3 cache 4005). The sequencer 128 may also write control words to the bus control memory 6636 to cause the DMAC 6602 to perform DMA operations between the data RAM122 or the weight RAM 124 and the ring bus 4024 agent. Finally, as described in more detail below, the DMAC 6602 may perform DMA operations to perform transfers between the ring bus 4024 agent (e.g., system memory or L3 cache 4005) and the data/weight RAM 122/124.
The slave interface 6301, master interface 0 6302-0, and master interface 1 6302-1 are each coupled to provide data blocks to their respective data demultiplexers 6611 and weight demultiplexers 6621. Arbitration logic (not shown) arbitrates for access to the data RAM 122 among the output register 1104, the move register 5804, and the data write buffers 6612 of the slave interface 6301, master interface 0 6302-0, and master interface 1 6302-1, and arbitrates for access to the weight RAM 124 among the output register 1104, the move register 5804, and the weight write buffers 6622 of the slave interface 6301, master interface 0 6302-0, and master interface 1 6302-1. In one embodiment, the write buffers 6612/6622 take precedence over the output register 1104 and move register 5804, and the slave interface 6301 takes precedence over the master interfaces 6302. In one embodiment, each data demultiplexer 6611 has 64 outputs (each output preferably 64 bytes wide) coupled to the 64 blocks of the respective data write buffer 6612. The data demultiplexer 6611 provides a received block on the output coupled to the appropriate block of the data write buffer 6612. Likewise, each weight demultiplexer 6621 has 64 outputs (each output preferably 64 bytes wide) coupled to the 64 blocks of the respective weight write buffer 6622. The weight demultiplexer 6621 provides a received block on the output coupled to the appropriate block of the weight write buffer 6622.
When the slave store queue 6314 provides a data block to its data/weight demultiplexer 6611/6621, the slave store queue 6314 also provides, as a control input to the data/weight demultiplexer 6611/6621, the address of the appropriate block of the data/weight write buffer 6612/6622 to be written. The block address is the lower six bits of the address held in the entry 6422, which was specified by the ring bus 4024 agent (e.g., a core 4002 or the control logic 4644) that generated the slave store transaction. Conversely, when the slave load queue 6312 requests a data block from its data/weight read buffer multiplexer 6615/6625, the slave load queue 6312 also provides, as a control input to the data/weight read buffer multiplexer 6615/6625, the address of the appropriate block of the data/weight read buffer 6614/6624 to be read. The block address is the lower six bits of the address held in the entry 6412, which was specified by the ring bus 4024 agent (e.g., a core 4002 or the control logic 4644) that generated the slave load transaction. Preferably, a core 4002 may perform a slave store transaction via the slave interface 6301 (e.g., to a predetermined ring bus 4024 address) to cause the NNU 121 to write the contents of the data/weight write buffer 6612/6622 to the data/weight RAM 122/124; conversely, a core 4002 may perform a slave store transaction via the slave interface 6301 (e.g., to a predetermined ring bus 4024 address) to cause the NNU 121 to read a row of the data/weight RAM 122/124 into the data/weight read buffer 6614/6624.
When a master interface 6302 load queue 6322/6332 provides a data block to its data/weight demultiplexer 6611/6621, the load queue 6322/6332 also provides the index of the entry 6512 to the corresponding DMAC 6602 that issued the load request to the load queue 6322/6332. To transfer an entire 4 KB of data from system memory to a row of the data/weight RAM 122/124, the DMAC 6602 must generate 64 master load requests to the load queue 6322/6332. The DMAC 6602 logically divides the 64 master load requests into four groups of sixteen requests each. The DMAC 6602 issues the sixteen requests within a group to the corresponding 16 entries 6512 of the load queue 6322/6332. The DMAC 6602 maintains a state associated with each entry 6512 index. The state indicates which of the four groups is currently using the entry to load its data block. Thus, when the DMAC 6602 receives an entry 6512 index from the load queue 6322/6332, the logic of the DMAC 6602 constructs the block address by concatenating the group number with the index and provides the constructed block address as a control input to the data/weight demultiplexer 6611/6621.
Conversely, when a master interface 6302 store queue 6324/6334 requests a data block from its data/weight read buffer multiplexer 6615/6625, the store queue 6324/6334 also provides the index of the entry 6522 to the corresponding DMAC 6602 that issued the store request to the store queue 6324/6334. To transfer an entire 4 KB of data from a row of the data/weight RAM 122/124 to system memory, the DMAC 6602 must generate 64 master store requests to the store queue 6324/6334. The DMAC 6602 logically divides the 64 master store requests into four groups of sixteen requests each. The DMAC 6602 issues the sixteen requests within a group to the corresponding 16 entries 6522 of the store queue 6324/6334. The DMAC 6602 maintains a state associated with each entry 6522 index. The state indicates which of the four groups is currently using the entry to store its data block. Thus, when the DMAC 6602 receives an entry 6522 index from the store queue 6324/6334, the logic of the DMAC 6602 constructs the block address by concatenating the group number with the index and provides the constructed block address as a control input to the data/weight read buffer multiplexer 6615/6625.
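The block-address construction described above (concatenating a 2-bit group number with a 4-bit entry index to form a 6-bit block address) is shown in the sketch below; the function name is hypothetical.

```c
/* Illustrative sketch of the DMAC 6602 block-address construction: to move a
 * 4 KB row, 64 requests are issued as four groups of sixteen, one request per
 * queue entry, and the 6-bit block address is formed by concatenating the
 * 2-bit group number with the 4-bit entry index. */
#include <stdint.h>

static inline uint8_t dma_block_address(uint8_t group, uint8_t entry_index)
{
    /* group: 0..3, entry_index: 0..15 -> block address 0..63 */
    return (uint8_t)(((group & 0x3) << 4) | (entry_index & 0xF));
}
```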
Referring now to FIG. 57, a block diagram illustrating a ring-bus-coupled embodiment of the NNU 121 is shown. FIG. 57 is similar in some respects to FIG. 34, and like-numbered elements are alike. Like FIG. 34, FIG. 57 illustrates the ability of the NNU 121 to receive micro-operations from multiple sources for provision to its pipeline. However, in the embodiment of FIG. 57, the NNU 121 is coupled to the cores 4002 via the ring bus 4024 as in FIG. 46; the differences will now be described.
In the embodiment of FIG. 57, the multiplexer 3402 receives micro-operations from five different sources. The multiplexer 3402 provides the selected micro-operation 3418 to the NPU 126 pipeline stages 3401, the data RAM 122 and weight RAM 124, the move unit 5802, and the output units 5808 to control them, as described above. As described with respect to FIG. 34, the first source is the sequencer 128, which produces the micro-operation 3416. The second source is a version of the decoder 3404 of FIG. 34 modified to receive the data blocks of store requests stored by a core 4002 via the slave interface 6301 store queue 6314. As described above with respect to FIG. 34, the data block may include information similar to a microinstruction translated from an MTNN instruction 1400 or MFNN instruction 1500. The decoder 3404 decodes the data block and in response produces a micro-operation 3412. One example is a micro-operation 3412 produced in response to a request to write data to the data/weight RAM 122/124 received from the slave interface 6301 store queue 6314, or in response to a request to read data from the data/weight RAM 122/124 received from the slave interface 6301 load queue 6312. The third source is data blocks of store requests stored by a core 4002 via the slave interface 6301 store queue 6314 in which the core 4002 includes a micro-operation 3414 that the NNU 121 directly executes, as described above with respect to FIG. 34. Preferably, the core 4002 stores to different memory-mapped addresses within the ring bus 4024 address space to enable the decoder 3404 to distinguish the second and third sources of micro-operations. The fourth source is a micro-operation 7217 generated by the DMAC 6602. The fifth source is a no-op micro-operation 7219, in response to which the NNU 121 maintains its state.
In one embodiment, five sources have a priority scheme performed by decoder 3404, with direct micro-operations 3414 having the highest priority; the micro-operations 3412 generated by the decoder 3404 in response to the slave store operation of the slave interface 6301 have a second highest priority; the micro-operation 7217 generated by DMAC 6602 has the next highest priority; micro-operations 3416 generated by sequencer 128 have the next highest priority; and the no-op micro-operation is the default (i.e., lowest priority), the source selected by multiplexer 3402 when no other source requests. According to one embodiment, when DMAC 6602 or slave interface 6301 needs to access data RAM 122 or weight RAM 124, it takes precedence over the program running on sequencer 128, and decoder 3404 pauses sequencer 128 until DMAC 6602 and slave interface 6301 have completed their accesses.
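The fixed priority among the five micro-operation sources described above can be summarized as a selection function. The enumerator and function names are hypothetical; the selection is performed in hardware by the decoder 3404 and multiplexer 3402.

```c
/* Illustrative sketch of the fixed priority among the five micro-operation
 * sources of FIG. 57. */
enum uop_source {
    SRC_DIRECT_3414,      /* direct micro-operation stored by a core        */
    SRC_DECODER_3412,     /* produced by decoder 3404 from slave stores     */
    SRC_DMAC_7217,        /* produced by the DMAC 6602                      */
    SRC_SEQUENCER_3416,   /* produced by the program in program memory 129  */
    SRC_NOP_7219          /* default: no-op, NNU maintains its state        */
};

static enum uop_source select_uop_source(int direct_valid, int decoder_valid,
                                         int dmac_valid, int sequencer_valid)
{
    if (direct_valid)    return SRC_DIRECT_3414;    /* highest priority */
    if (decoder_valid)   return SRC_DECODER_3412;
    if (dmac_valid)      return SRC_DMAC_7217;
    if (sequencer_valid) return SRC_SEQUENCER_3416;
    return SRC_NOP_7219;                            /* lowest priority  */
}
```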
Although embodiments have been described above in which bytes are stored in weight RAM 124 and data RAM 122, other embodiments are contemplated in which the word size may be a different size (e.g., 9 bits, 12 bits, or 16 bits).
While various embodiments of the present invention have been described herein, these embodiments have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant computer arts that various changes in form and detail can be made therein without departing from the scope of the invention. For example, software may, for example, support the function, fabrication, modeling, simulation, description, and/or testing, etc., of the apparatus and methods described herein. This may be accomplished using a general programming language (e.g., C, C + +), a Hardware Description Language (HDL) including Verilog HDL, VHDL, and the like, or other available programs. Such software can be disposed on any known computer usable medium such as magnetic tape, semiconductor, magnetic disk, optical disk (e.g., CD-ROM, DVD-ROM, etc.), network, wired or other communications medium, and the like. Embodiments of the apparatus and methods described herein may be included in a semiconductor intellectual property core, such as a processor core (e.g., embodied or specified in HDL) and transformed to hardware by the fabrication of integrated circuits. Furthermore, the apparatus and methods described herein may also be embodied as a combination of hardware and software. Thus, the present invention should not be limited by any of the exemplary embodiments described herein, but should be defined only in accordance with the following claims and their equivalents. In particular, the present invention may be implemented within a processor device, which may be used in a general purpose computer. Finally, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiment as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the scope of the invention as defined by the appended claims.
Cross Reference to Related Applications
This application is related to the following U.S. non-provisional applications, each of which is incorporated herein by reference in its entirety.
[Table of related U.S. non-provisional application numbers and filing dates rendered as an image in the original publication; not reproduced here.]
Each of the above non-provisional applications claims priority based on the following U.S. provisional applications, each of which is incorporated herein by reference in its entirety.
[Table of U.S. provisional application numbers and filing dates rendered as an image in the original publication; not reproduced here.]
The present application is also related to the following U.S. non-provisional applications, each of which is incorporated herein by reference in its entirety.
[Table of related U.S. non-provisional application numbers and filing dates rendered as an image in the original publication; not reproduced here.]
The present application is also related to the following U.S. non-provisional applications, each of which is incorporated herein by reference in its entirety.
[Table of related U.S. non-provisional application numbers and filing dates rendered as an image in the original publication; not reproduced here.]

Claims (13)

1. A processor, comprising:
a processing core;
a first data store coupled to the processing core, the first data store to hold cache lines processed by the processing core;
an accelerator, comprising:
a second data store for selectively maintaining:
a cache line evicted from the first data store, and
accelerator data processed by the accelerator;
a tag directory coupled to the processing core, the tag directory to hold tags for cache lines stored in both the first data store and the second data store;
a mode indicator to indicate whether the second data store is operating in a first mode in which the second data store holds cache lines evicted from the first data store or a second mode in which the second data store holds accelerator data processed by the accelerator; and
control logic configured to, in response to a request to evict a cache line from the first data store:
if the mode indicator indicates that the second data store is operating in the first mode, writing the cache line to the second data store and updating a tag in the tag directory to indicate that the cache line is present in the second data store; and
if the mode indicator indicates that the second data store is operating in the second mode, writing the cache line to system memory instead of to the second data store.
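By way of illustration only (not part of the claim language), a minimal, runnable C sketch of the eviction behavior recited above; every type, variable, and function name here is an assumption:

    #include <stdio.h>
    #include <stdint.h>

    /* First mode: the second data store acts as a victim cache for the first
     * data store.  Second mode: the second data store belongs to the accelerator. */
    typedef enum { MODE_VICTIM = 1, MODE_ACCELERATOR = 2 } store_mode_t;

    /* Stand-ins for the control logic's write paths and the tag directory. */
    static void write_second_store(uint64_t addr)   { printf("line 0x%llx -> second data store\n", (unsigned long long)addr); }
    static void update_tag_directory(uint64_t addr) { printf("tag 0x%llx -> present in second data store\n", (unsigned long long)addr); }
    static void write_system_memory(uint64_t addr)  { printf("line 0x%llx -> system memory\n", (unsigned long long)addr); }

    /* Handle a request to evict a cache line from the first data store. */
    static void evict_line(store_mode_t mode_indicator, uint64_t addr)
    {
        if (mode_indicator == MODE_VICTIM) {
            write_second_store(addr);
            update_tag_directory(addr);
        } else {
            write_system_memory(addr);   /* second data store is not touched */
        }
    }

    int main(void)
    {
        evict_line(MODE_VICTIM, 0x1000);      /* first mode  */
        evict_line(MODE_ACCELERATOR, 0x2000); /* second mode */
        return 0;
    }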
2. The processor of claim 1,
the control logic is further configured to, in response to a load request from the processing core specifying a memory address:
determining from the tag directory whether a second cache line referenced by the memory address is present in the first data store or the second data store;
reading the second cache line from the first data store and providing the second cache line to the processing core if the second cache line is present in the first data store;
in the event that the second cache line is present in the second data store and absent from the first data store, reading the second cache line from the second data store and providing the second cache line to the processing core; and
in the event that the second cache line is present in neither the first data store nor the second data store, reading the second cache line from the system memory and providing the second cache line to the processing core.
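Again purely as an illustration (all names are assumptions, not claim language), a C sketch of the three-level load lookup recited in claim 2:

    #include <stdint.h>

    typedef enum { HIT_FIRST_STORE, HIT_SECOND_STORE, MISS_BOTH } lookup_result_t;

    /* Illustrative stubs only; a real design would probe the tag directory and arrays. */
    static lookup_result_t tag_directory_lookup(uint64_t addr) { return (lookup_result_t)(addr % 3); }
    static uint64_t read_first_store(uint64_t addr)   { return addr ^ 0x1; }
    static uint64_t read_second_store(uint64_t addr)  { return addr ^ 0x2; }
    static uint64_t read_system_memory(uint64_t addr) { return addr ^ 0x3; }

    /* Load path: first data store, then second data store, then system memory. */
    static uint64_t load_for_core(uint64_t addr)
    {
        switch (tag_directory_lookup(addr)) {
        case HIT_FIRST_STORE:  return read_first_store(addr);
        case HIT_SECOND_STORE: return read_second_store(addr);   /* victim-cache hit */
        default:               return read_system_memory(addr);  /* miss everywhere  */
        }
    }

Claim 3 additionally swaps the line read from the second data store with a line held in the first data store; this sketch omits that exchange.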
3. The processor of claim 2,
the control logic is further configured to, in response to a load request from the processing core specifying the memory address and in the event that the second cache line is present in the second data store and absent from the first data store:
interchanging the second cache line read from the second data store with a third cache line maintained in the first data store.
4. The processor of claim 1,
to transition from the first mode to the second mode:
the control logic invalidates all cache lines of the second data store; and
the control logic stops, in response to a request to evict a cache line from the first data store, writing the cache line to the second data store and updating a tag in the tag directory to indicate that the cache line is present in the second data store.
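As an illustrative sketch only (the line count and all names are assumptions), the mode transition of claim 4 amounts to invalidating the second data store and disabling further victim writes:

    #define SECOND_STORE_LINES 4096                       /* illustrative line count */

    static int second_store_valid[SECOND_STORE_LINES];    /* per-line valid bits     */
    static int victim_writes_enabled = 1;                 /* control-logic enable    */

    static void enter_second_mode(void)
    {
        for (int i = 0; i < SECOND_STORE_LINES; i++)
            second_store_valid[i] = 0;     /* invalidate all cache lines         */
        victim_writes_enabled = 0;         /* evictions now go to system memory  */
    }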
5. The processor of claim 1, further comprising:
a ring bus to which the processing core, the first data store, the system memory, and the accelerator are coupled,
wherein, in the event that the mode indicator indicates that the second data store is operating in the first mode, the control logic writes a cache line to the second data store via the ring bus in response to a request to evict a cache line from the first data store, and
in the event that the mode indicator indicates that the second data store is operating in the second mode, the control logic writes a cache line to the system memory rather than to the second data store via the ring bus in response to a request to evict a cache line from the first data store.
6. The processor of claim 1,
the processing core is one of P processing cores for processing cache lines held in the first data store and the second data store, P being greater than 1,
the first data store comprises P data memory slices coupled to the P processing cores, respectively,
the control logic logically accesses the second data store as P portions corresponding to the P data memory slices of the first data store, and
in response to a request to evict a cache line from the first data store when the mode indicator indicates that the second data store is operating in the first mode, the control logic writes the cache line to the portion, of the P portions of the second data store, that corresponds to the data memory slice, of the P data memory slices, from which the cache line is evicted.
7. The processor of claim 1,
the first data store is L bytes of memory,
the second data store is an M byte memory,
the first data store is arranged as an associative memory of Y ways,
X is the product of Y and the quotient of M divided by L,
the first data store is arranged as an associative memory of S sets, and
the control logic logically accesses the second data store as an associative memory of S sets by X ways.
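A worked numeric example of the claim-7 geometry may help; the sizes below are assumptions chosen only to make the relationships concrete, not values taken from the claim:

    #include <stdio.h>

    int main(void)
    {
        const unsigned long long line = 64;          /* bytes per cache line (assumed)          */
        const unsigned long long L    = 4ULL << 20;  /* first data store: 4 MB (assumed)        */
        const unsigned long long M    = 8ULL << 20;  /* second data store: 8 MB (assumed)       */
        const unsigned long long Y    = 16;          /* ways of the first data store (assumed)  */

        const unsigned long long S = L / (line * Y); /* sets of the first data store            */
        const unsigned long long X = Y * (M / L);    /* X = Y * (M / L)                         */

        printf("S = %llu sets, X = %llu ways\n", S, X);               /* 4096 sets, 32 ways     */
        printf("S * X * line = %llu bytes (equals M)\n", S * X * line);
        return 0;
    }

With these numbers the control logic sees the second data store as 4096 sets of 32 ways, whose total capacity (4096 x 32 x 64 bytes) is exactly the M bytes of the second data store.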
8. The processor of claim 1,
the first data store is L bytes of memory,
the second data store is an M byte memory,
the first data store is arranged as an associative memory of Y ways,
X is the product of Y and the quotient of M divided by L,
the tag directory is arranged as an associative memory of Z ways, and
Z is the sum of X and Y.
9. The processor of claim 8,
the first data store and the tag directory are each further arranged as an associative memory of S sets, and
the control logic accesses the second data store by sending to the second data store an address calculated based at least on the following indices: an index of the set, of the S sets of the first data store, from which the cache line is evicted; and an index of the way, of the X ways of the tag directory, whose tag is updated to indicate that the cache line is present in the second data store.
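A minimal C sketch of the claim-9 address formation follows; the way-major packing and the line size are assumptions, not dictated by the claim:

    #include <stdint.h>

    #define LINE_BYTES 64u   /* assumed cache-line size */

    /* Form the second-data-store address from the index of the evicting set
     * (0 .. S-1) and the index of the selected way among the X ways of the
     * tag directory (0 .. X-1).  Each way is assumed to own a contiguous
     * S-line region of the second data store. */
    static uint64_t second_store_address(uint32_t set_index,
                                         uint32_t x_way_index,
                                         uint32_t num_sets /* S */)
    {
        return ((uint64_t)x_way_index * num_sets + set_index) * LINE_BYTES;
    }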
10. The processor of claim 9,
the processing core is one of P processing cores for processing cache lines held in the first data store and the second data store, P being greater than 1,
the first data store comprises P data memory slices coupled to the P processing cores, respectively,
the P data memory slices are each arranged as an associative memory of J sets by Y ways,
J is the quotient of S divided by P, and
the control logic accesses the second data store by sending to the second data store an address calculated based at least on the following indices: an index of the set, of the J sets of the data memory slice, from which the cache line is evicted; an index of the way, of the X ways of the tag directory, whose tag is updated to indicate that the cache line is present in the second data store; and an index of the data memory slice.
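The claim-10 variant folds in the slice index; again this is only a sketch under the same assumed packing and line size:

    #include <stdint.h>

    #define LINE_BYTES 64u   /* assumed cache-line size */

    /* J = S / P sets per data memory slice; the global set index is recovered
     * from the slice index and the slice-local set index. */
    static uint64_t second_store_address_sliced(uint32_t j_set_index, /* 0 .. J-1 */
                                                uint32_t x_way_index, /* 0 .. X-1 */
                                                uint32_t slice_index, /* 0 .. P-1 */
                                                uint32_t j_sets,      /* J        */
                                                uint32_t num_slices)  /* P        */
    {
        uint32_t s_sets     = j_sets * num_slices;                 /* S = J * P */
        uint32_t global_set = slice_index * j_sets + j_set_index;  /* 0 .. S-1  */
        return ((uint64_t)x_way_index * s_sets + global_set) * LINE_BYTES;
    }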
11. The processor of claim 8,
the tag directory also holds information used by the control logic to determine which of the X ways to replace when the control logic writes a cache line to the second data store in response to a request to evict the cache line from the first data store if the mode indicator indicates that the second data store is operating in the first mode.
12. The processor of claim 8,
the second data store is physically arranged such that an array of N processing units of the accelerator reads a row of N bytes from the second data store per clock cycle, where N is at least 1024.
13. The processor of claim 1,
the accelerator is a neural processing unit, NPU, and
the accelerator data are neural network weights processed by the NPU.
CN201810618974.4A 2017-06-16 2018-06-15 Processor, method for operating processor and computer usable medium Active CN108805276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910739737.8A CN110443360B (en) 2017-06-16 2018-06-15 Method for operating a processor

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762521244P 2017-06-16 2017-06-16
US62/521,244 2017-06-16

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201910739737.8A Division CN110443360B (en) 2017-06-16 2018-06-15 Method for operating a processor

Publications (2)

Publication Number Publication Date
CN108805276A CN108805276A (en) 2018-11-13
CN108805276B true CN108805276B (en) 2020-09-22

Family

ID=64086305

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201810618974.4A Active CN108805276B (en) 2017-06-16 2018-06-15 Processor, method for operating processor and computer usable medium
CN201910739737.8A Active CN110443360B (en) 2017-06-16 2018-06-15 Method for operating a processor

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910739737.8A Active CN110443360B (en) 2017-06-16 2018-06-15 Method for operating a processor

Country Status (1)

Country Link
CN (2) CN108805276B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11829778B2 (en) 2020-06-04 2023-11-28 Samsung Electronics Co., Ltd. Method for enhancing performance of electronic device

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109550249B (en) * 2018-11-28 2022-04-29 腾讯科技(深圳)有限公司 Target object control method, device and equipment
CN111314270B (en) * 2018-12-12 2022-09-30 上海领甲数据科技有限公司 Data encryption and decryption method based on validity period uniform distribution symmetric algorithm
CN109976707B (en) * 2019-03-21 2023-05-05 西南交通大学 Automatic generation method of variable bit-width multiplier
KR102351087B1 (en) * 2019-06-04 2022-01-14 주식회사 딥엑스 Data management device supporting high speed artificial neural network operation with caching data based on data locality of artificial neural networks
CN110717588B (en) * 2019-10-15 2022-05-03 阿波罗智能技术(北京)有限公司 Apparatus and method for convolution operation
US11436144B2 (en) * 2020-04-10 2022-09-06 Micron Technology, Inc. Cache memory addressing
CN114116547B (en) * 2021-11-12 2024-03-26 成都立思方信息技术有限公司 Reconfigurable electronic countermeasure equipment simulator architecture

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1231444A (en) * 1998-02-17 1999-10-13 国际商业机器公司 Cache relative agreement containing suspension state with exact mode and non-exact mode
WO2002071211A3 (en) * 2000-11-20 2003-09-25 Zucotto Wireless Inc Data processor having multiple operating modes
CN101349996A (en) * 2007-07-20 2009-01-21 英特尔公司 Technique for preserving cached information during a low power mode
EP2508984A1 (en) * 2011-04-07 2012-10-10 VIA Technologies, Inc. Load multiple and store multiple instructions in a microprocessor that emulates banked registers
CN102934046A (en) * 2010-05-11 2013-02-13 超威半导体公司 Method and apparatus for cache control
CN106326148A (en) * 2015-07-01 2017-01-11 三星电子株式会社 Data processing system and operation method therefor
CN106445468A (en) * 2015-10-08 2017-02-22 上海兆芯集成电路有限公司 Direct execution of execution unit for loading micro-operation of framework cache file by employing framework instruction of processor

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5956703A (en) * 1995-07-28 1999-09-21 Delco Electronics Corporation Configurable neural network integrated circuit
US6782463B2 (en) * 2001-09-14 2004-08-24 Intel Corporation Shared memory array
US7737994B1 (en) * 2003-09-26 2010-06-15 Oracle America, Inc. Large-kernel convolution using multiple industry-standard graphics accelerators
US8719807B2 (en) * 2006-12-28 2014-05-06 Intel Corporation Handling precompiled binaries in a hardware accelerated software transactional memory system
CN101682621B (en) * 2007-03-12 2014-07-09 思杰系统有限公司 Systems and methods for cache operations
JP5184824B2 (en) * 2007-06-15 2013-04-17 キヤノン株式会社 Arithmetic processing apparatus and method
US8788891B2 (en) * 2012-06-14 2014-07-22 International Business Machines Corporation Bitline deletion
CN104714898B (en) * 2013-12-16 2018-08-21 深圳市国微电子有限公司 A kind of distribution method and device of Cache


Also Published As

Publication number Publication date
CN108805276A (en) 2018-11-13
CN110443360A (en) 2019-11-12
CN110443360B (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN108133269B (en) Processor having memory array operable as cache memory or neural network unit memory
CN108133267B (en) Processor having memory array operable as last level cache slice or neural network unit memory
CN108133268B (en) Processor with memory array
CN111680790B (en) Neural network unit
CN108805276B (en) Processor, method for operating processor and computer usable medium
US10725934B2 (en) Processor with selective data storage (of accelerator) operable as either victim cache data storage or accelerator memory and having victim cache tags in lower level cache wherein evicted cache line is stored in said data storage when said data storage is in a first mode and said cache line is stored in system memory rather then said data store when said data storage is in a second mode
US11029949B2 (en) Neural network unit
US11216720B2 (en) Neural network unit that manages power consumption based on memory accesses per period
CN108804139B (en) Programmable device, method of operation thereof, and computer usable medium
CN106484362B (en) Device for specifying two-dimensional fixed-point arithmetic operation by user
CN108805275B (en) Programmable device, method of operation thereof, and computer usable medium
US11226840B2 (en) Neural network unit that interrupts processing core upon condition
US11221872B2 (en) Neural network unit that interrupts processing core upon condition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203

Patentee after: Shanghai Zhaoxin Semiconductor Co.,Ltd.

Address before: Room 301, 2537 Jinke Road, Zhangjiang hi tech park, Pudong New Area, Shanghai 201203

Patentee before: VIA ALLIANCE SEMICONDUCTOR Co.,Ltd.