CN111832716A - Processor with a memory having a plurality of memory cells


Info

Publication number
CN111832716A
Authority
CN
China
Prior art keywords
weight
ifm
value
zero
ofm
Prior art date
Legal status
Pending
Application number
CN202010306599.7A
Other languages
Chinese (zh)
Inventor
Lei Wang (王磊)
Ilia Ovsiannikov (伊利亚·奥夫相尼科夫)
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Priority claimed from US 16/446,610 (published as US 2019/0392287 A1)
Application filed by Samsung Electronics Co Ltd
Publication of CN111832716A

Classifications

    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06F7/523: Multiplying only
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Neurology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Complex Calculations (AREA)

Abstract

A processor is disclosed. The processor includes a register, a non-zero weight value selector, and a multiplier. The register stores a first set of weight values and a second set of weight values. Each set of weight values includes at least one weight value, and each weight value of the first set of weight values corresponds to a weight value of the second set of weight values. The non-zero weight value selector selects a non-zero weight value that is either a weight value in the first set of weight values or a non-zero weight value in the second set of weight values corresponding to a weight value in the first set of weight values. The multiplier multiplies the selected non-zero weight value by the activation value corresponding to the selected non-zero weight value to form an output product value.
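The following is a minimal Python sketch, not part of the patent text, illustrating the selection and multiplication described in the abstract; the function names and the two-lane pairing of weights and activations are illustrative assumptions.

```python
# Illustrative sketch only (not the patent's circuit): select a non-zero weight
# from a pair of corresponding weights and multiply it with its activation.

def select_nonzero_weight(w_first, w_second):
    """Prefer a non-zero weight from the first set; otherwise take the
    corresponding non-zero weight from the second set. Returns (weight, lane)."""
    if w_first != 0:
        return w_first, 0
    if w_second != 0:
        return w_second, 1
    return 0, None            # both corresponding weights are zero


def output_product(w_first, w_second, act_first, act_second):
    weight, lane = select_nonzero_weight(w_first, w_second)
    if lane is None:
        return 0
    activation = act_first if lane == 0 else act_second
    return weight * activation


if __name__ == "__main__":
    # The zero weight in the first set is skipped; the corresponding non-zero
    # weight 3 from the second set is multiplied with its activation instead.
    print(output_product(0, 3, act_first=7, act_second=5))   # -> 15
```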

Description

Processor with a memory having a plurality of memory cells
This application is a continuation-in-part of U.S. patent application No. 16/446,610, entitled "Neural Processor," filed on June 19, 2019, and claims priority to and the benefit of the following applications: (i) U.S. provisional application No. 62/689,008, entitled "One-Way Neural Processor Accelerator Architecture," filed on June 22, 2018; (ii) U.S. provisional application No. 62/798,297, entitled "One-Way NPU," filed on January 29, 2019; (iii) U.S. provisional application No. 62/841,590, entitled "Mixed-Precision NPU Block with Depthwise Convolution," filed on May 1, 2019; (iv) U.S. provisional application No. 62/841,606, entitled "Mixed-Precision Neural Processing Unit Block," filed on May 1, 2019; (v) U.S. provisional application No. 62/835,496, entitled "Hardware Channel-Parallel Data Compression/Decompression," filed on April 17, 2019; and (vi) U.S. provisional application No. 62/841,819, entitled "Mixed-Precision Compression," filed on May 1, 2019. The entire contents of each of the above-identified applications are incorporated herein by reference.
Technical Field
One or more aspects according to embodiments of the present disclosure relate to a processing circuit, and more particularly, to a processing circuit for performing a combination of multiplication and addition.
Background
In operation, a neural network may perform tensor operations (e.g., tensor multiplications and convolutions) that involve a large number of multiplications and additions. If these operations are performed by a general-purpose central processing unit, or even by a graphics processing unit (which may be better suited to such tasks), they may be relatively slow and may incur a relatively high energy cost per operation. Especially in small devices (e.g., mobile or handheld devices) that may have a tightly constrained power budget, the power consumption associated with the use of a general-purpose central processing unit or a graphics processing unit can be a significant drawback.
Accordingly, there is a need for an improved processing circuit for neural network computations.
Disclosure of Invention
According to some embodiments of the present disclosure, there is provided a processor comprising: a first block, a second block, a memory, and a bus, wherein the bus is connected to the memory, the first block, and the second block, and the first block comprises: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the first block being configured to perform a convolution of an array of activations with a kernel of weights, the step of performing the convolution comprising, in order: forming a tensor product of the kernel and a first subarray of the array of activations; forming a tensor product of the kernel and a second subarray of the array of activations, the second subarray being offset from the first subarray by n array elements in a first direction, n being a positive integer; and forming a tensor product of the kernel and a third subarray of the array of activations, the third subarray being offset from the second subarray by one array element in a second direction perpendicular to the first direction.
In some embodiments, the step of performing the convolution further comprises, in order: after forming the tensor product of the kernel and the third subarray, forming a tensor product of the kernel and a fourth subarray of the array of activations, the fourth subarray being offset from the third subarray by m array elements in a third direction opposite to the first direction, m being a positive integer; and forming a tensor product of the kernel and a fifth subarray of the array of activations, the fifth subarray being offset from the fourth subarray by one array element in the second direction.
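As an illustration of the traversal order just described, the origins of the first five kernel windows can be written out as follows. This is a sketch for exposition only; the function name, the mapping of the first direction to +x and the second direction to +y, and the default values of n and m are assumptions, not taken from the patent.

```python
# Illustrative sketch (assumption, not the patent's implementation): the window
# origins of the first five tensor products in the zig-zag traversal above.

def first_five_origins(n=1, m=1, start=(0, 0)):
    x, y = start
    origins = [(x, y)]          # first subarray
    x += n
    origins.append((x, y))      # second: offset n elements in the first direction
    y += 1
    origins.append((x, y))      # third: offset one element in the second direction
    x -= m
    origins.append((x, y))      # fourth: offset m elements opposite to the first direction
    y += 1
    origins.append((x, y))      # fifth: offset one element in the second direction
    return origins


if __name__ == "__main__":
    print(first_five_origins(n=1, m=1))   # -> [(0, 0), (1, 0), (1, 1), (0, 1), (0, 2)]
```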
In some embodiments, m is equal to n.
In some embodiments, n is equal to 1.
In some embodiments, the step of performing the convolution further comprises, in order: after forming the product of the kernel and the first subarray, forming n-1 products of the kernel and n-1 corresponding subarrays of the array of activations, the subarray in the kth product of the n-1 products being offset from the first subarray by k+1 array elements in the first direction.
In some embodiments, the processor further comprises: a cache connected to the activation buffer and configured to supply activations to the activation buffer, the cache having a size sufficient to store H + (H + n) × (W - 1) - 1 activations, wherein H is the size of the kernel in the first direction and W is the size of the kernel in the second direction.
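A quick numerical check of this cache-size expression may help; the 3 × 3 kernel and n = 1 below are assumed example values, not values taken from the patent.

```python
# Illustrative check of the cache-size expression H + (H + n) * (W - 1) - 1.

def cache_size(H, W, n):
    """Number of activations the cache must hold for an H x W kernel and step n."""
    return H + (H + n) * (W - 1) - 1


if __name__ == "__main__":
    # For a 3 x 3 kernel with n = 1: 3 + 4 * 2 - 1 = 10 activations.
    print(cache_size(H=3, W=3, n=1))   # -> 10
```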
In some embodiments, the activation buffer is configured to include: a first queue connected to the first multiplier and a second queue connected to the second multiplier, the first queue including a first register and a second register adjacent to the first register, the first register being an output register of the first queue, the first block being further configured to: in a first state, multiply a first weight, in the first multiplier, by an activation from the output register of the first queue, and, in a second state, multiply the first weight, in the first multiplier, by an activation from the second register of the first queue.
In some embodiments, in the second state, the output register of the first queue contains a zero.
In some embodiments, the processor further comprises: a first adder configured to be connected to an output of the first multiplier and an output of the second multiplier in a first state, and to add a product received from the output of the first multiplier and a product received from the output of the second multiplier.
In some embodiments, the processor further comprises: a second adder configured to be connected to the output of the first multiplier in a second state.
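The behavior described in the preceding paragraphs can be sketched as a simplified Python model; the list indexing and the return convention below are illustrative assumptions, not the patent's circuit. When the output register of the first queue holds a zero activation, the first multiplier uses the adjacent register instead and its product is routed to the second adder rather than the first.

```python
# Simplified model (an assumption for illustration, not the patent's hardware)
# of the first multiplier's zero-skip behaviour.

def first_multiplier_cycle(first_weight, first_queue):
    """first_queue[0] is the output register, first_queue[1] the adjacent register.

    Returns (product, target_adder): in the first state the output register is
    used and the product goes to the first adder; in the second state the
    output register holds zero, so the adjacent register is used and the
    product goes to the second adder.
    """
    if first_queue[0] != 0:                                   # first state
        return first_weight * first_queue[0], "first adder"
    return first_weight * first_queue[1], "second adder"      # second state


if __name__ == "__main__":
    print(first_multiplier_cycle(3, [5, 4]))   # -> (15, 'first adder')
    print(first_multiplier_cycle(3, [0, 4]))   # -> (12, 'second adder')
```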
According to some embodiments of the present disclosure, there is provided a method for performing a calculation using a processing circuit, the processing circuit comprising: a first block, a second block, a memory, and a bus connected to the memory, the first block, and the second block, the first block comprising: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the method comprising: performing a convolution of an array of activations with a kernel of weights, the step of performing the convolution comprising, in order: forming a tensor product of the kernel and a first subarray of the array of activations; forming a tensor product of the kernel and a second subarray of the array of activations, the second subarray being offset from the first subarray by n array elements in a first direction, n being a positive integer; and forming a tensor product of the kernel and a third subarray of the array of activations, the third subarray being offset from the second subarray by one array element in a second direction perpendicular to the first direction.
In some embodiments, the step of performing the convolution further comprises, in order: after forming the tensor product of the kernel and the third subarray, forming a tensor product of the kernel and a fourth subarray of the array of activations, the fourth subarray being offset from the third subarray by m array elements in a third direction opposite to the first direction, m being a positive integer; and forming a tensor product of the kernel and a fifth subarray of the array of activations, the fifth subarray being offset from the fourth subarray by one array element in the second direction.
In some embodiments, m is equal to n.
In some embodiments, n is equal to 1.
In some embodiments, the step of performing the convolution further comprises, in order: after forming the product of the kernel and the first subarray, forming n-1 products of the kernel and n-1 corresponding subarrays of the array of activations, the subarray in the kth product of the n-1 products being offset from the first subarray by k+1 array elements in the first direction.
In some embodiments, the processing circuit further comprises: a cache connected to the activation buffer and configured to supply activations to the activation buffer, the cache having a size sufficient to store H + (H + n) × (W - 1) - 1 activations, wherein H is the size of the kernel in the first direction and W is the size of the kernel in the second direction.
In some embodiments, the activation buffer is configured to include: a first queue connected to the first multiplier and a second queue connected to the second multiplier, the first queue including a first register and a second register adjacent to the first register, the first register being an output register of the first queue, the first block being further configured to: in a first state, multiply a first weight, in the first multiplier, by an activation from the output register of the first queue, and, in a second state, multiply the first weight, in the first multiplier, by an activation from the second register of the first queue.
In some embodiments, in the second state, the output register of the first queue contains a zero.
In some embodiments, the processor further comprises a first adder, and the method further comprises: connecting the first adder to an output of the first multiplier and an output of the second multiplier in the first state, and adding, by the first adder, a product received from the output of the first multiplier and a product received from the output of the second multiplier.
According to some embodiments of the present disclosure, there is provided a method for computing using an apparatus for processing, the apparatus for processing comprising: a first block, a second block, a memory, and a bus connected to the memory, the first block, and the second block, the first block comprising: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the method comprising: performing a convolution of an array of activations with a kernel of weights, the step of performing the convolution comprising, in order: forming a tensor product of the kernel and a first subarray of the array of activations; forming a tensor product of the kernel and a second subarray of the array of activations, the second subarray being offset from the first subarray by n array elements in a first direction, n being a positive integer; and forming a tensor product of the kernel and a third subarray of the array of activations, the third subarray being offset from the second subarray by one array element in a second direction perpendicular to the first direction.
According to some embodiments of the present disclosure, there is provided a processor comprising: a first block, a second block, a memory, and a bus, wherein the bus is connected to the memory, the first block, and the second block, and the first block comprises: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the processor being configured to perform a first convolution of an array of activations with a first kernel of weights, the step of performing the first convolution comprising: broadcasting a first subarray of the array of activations to the first block and the second block; forming a first tensor product, the first tensor product being a tensor product of a first subarray of the first kernel of weights and the first subarray of the array of activations; storing the first tensor product in the memory; broadcasting a second subarray of the array of activations to the first block and the second block; forming a second tensor product, the second tensor product being a tensor product of a second subarray of the first kernel of weights and the second subarray of the array of activations; and adding the first tensor product and the second tensor product.
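A small numerical sketch of this split-and-accumulate pattern follows. It is illustrative only; the use of NumPy, the 2 × 3 shapes, and the way the kernel is split into two subarrays are assumptions, not the patent's dataflow.

```python
# Illustrative sketch (assumption, not the patent's dataflow): a convolution
# split into two partial tensor products formed from broadcast subarrays of the
# activations, with the first partial result parked in memory and added to the
# second when it becomes available.

import numpy as np

def split_convolution(kernel, act_sub1, act_sub2):
    k1, k2 = np.split(kernel, 2, axis=0)       # two subarrays of the kernel
    partial = (k1 * act_sub1).sum()            # first tensor product
    memory = partial                           # "storing the first tensor product in the memory"
    second = (k2 * act_sub2).sum()             # second tensor product
    return memory + second                     # add the two tensor products


if __name__ == "__main__":
    kernel = np.arange(1, 7).reshape(2, 3)     # 2x3 kernel of weights
    acts = np.arange(10, 16).reshape(2, 3)     # matching activations
    print(split_convolution(kernel, acts[:1], acts[1:]))   # -> 280
    print((kernel * acts).sum())                           # same result in one step
```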
In some embodiments, the first block further comprises: a weight decompression unit configured to: decompress a data word encoding a plurality of weights in compressed form, to extract a first weight and a second weight; input the first weight to the first weight register; and input the second weight to the second weight register.
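For illustration only, one way such a decompression step might look is sketched below; the 16-bit word format with two packed signed 8-bit weights is an assumption, not the encoding used by the patent.

```python
# Illustrative sketch (the packing format is an assumption, not the patent's):
# a data word holding two weights in compressed form, unpacked into the two
# weight registers.

def decompress_word(word: int):
    """Unpack a 16-bit word into two signed 8-bit weights (low byte first)."""
    def to_signed8(b):
        return b - 256 if b >= 128 else b
    first_weight = to_signed8(word & 0xFF)
    second_weight = to_signed8((word >> 8) & 0xFF)
    return first_weight, second_weight


if __name__ == "__main__":
    w1, w2 = decompress_word(0x03FE)   # low byte 0xFE -> -2, high byte 0x03 -> 3
    print(w1, w2)                      # -> -2 3
```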
In some embodiments, the first block is further configured to: perform a second convolution of the array of activations with a second kernel of weights, the step of performing the second convolution comprising, in order: forming a tensor product of a first portion of the second kernel and the first subarray of the array of activations, the first portion of the second kernel including weights stored in the first weight register; forming a tensor product of a second portion of the second kernel and the first subarray of the array of activations, the second portion of the second kernel including weights stored in the second weight register; and forming a tensor product of the first portion of the second kernel and the second subarray of the array of activations, the first portion of the second kernel including the weights stored in the first weight register.
In some embodiments, the activation buffer is configured to include: a first queue connected to the first multiplier and a second queue connected to the second multiplier, the first queue including a first register and a second register adjacent to the first register, the first register being an output register of the first queue, the first block being further configured to: in a first state, multiply a first weight, in the first multiplier, by an activation from the output register of the first queue, and, in a second state, multiply the first weight, in the first multiplier, by an activation from the second register of the first queue.
In some embodiments, in the second state, the output register of the first queue contains a zero.
In some embodiments, the processor further comprises: a first adder configured to be connected to an output of the first multiplier and an output of the second multiplier in a first state, and to add the product received from the output of the first multiplier and the product received from the output of the second multiplier.
In some embodiments, the processor further comprises: a second adder configured to be connected to the output of the first multiplier in a second state.
In some embodiments, the processor further comprises: a first accumulator connected to the first adder and a second accumulator connected to the second adder, the first accumulator including a register and configured to add the sum received from the first adder and a value in the register of the first accumulator to form an accumulated value of the first accumulator and store the accumulated value of the first accumulator in the register of the first accumulator in a first state.
In some embodiments, the second accumulator includes a register and is configured to, in the second state, add the sum received from the second adder to a value in the register of the second accumulator to form an accumulated value of the second accumulator and store the accumulated value of the second accumulator in the register of the second accumulator.
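A trivial register-level sketch of the accumulate-and-store behavior described for the two accumulators is given below; it is an illustrative assumption (real hardware would be clocked and of fixed width), not the patent's register design.

```python
# Illustrative sketch (not the patent's RTL): an accumulator that adds an
# incoming sum from its adder to the value already held in its register and
# stores the result back in that register.

class Accumulator:
    def __init__(self):
        self.register = 0

    def accumulate(self, adder_sum):
        self.register += adder_sum
        return self.register


if __name__ == "__main__":
    first_accumulator, second_accumulator = Accumulator(), Accumulator()
    first_accumulator.accumulate(12)    # first state: sum from the first adder
    second_accumulator.accumulate(7)    # second state: sum from the second adder
    print(first_accumulator.register, second_accumulator.register)   # -> 12 7
```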
In some embodiments, the processor further comprises: an activation zero-skip control circuit configured to: determine whether the output register of the first queue contains a zero, and cause the first block to operate in the second state in response to determining that the output register of the first queue contains a zero.
According to some embodiments of the present disclosure, there is provided a method for performing a calculation using a processing circuit, the processing circuit comprising: a first block, a second block, a memory, and a bus connected to the memory, the first block, and the second block, the first block comprising: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the method comprising: performing a first convolution of an array of activations with a first kernel of weights, the step of performing the first convolution comprising: broadcasting a first subarray of the array of activations to the first block and the second block; forming a first tensor product, the first tensor product being a tensor product of a first subarray of the first kernel of weights and the first subarray of the array of activations; storing the first tensor product in the memory; broadcasting a second subarray of the array of activations to the first block and the second block; forming a second tensor product, the second tensor product being a tensor product of a second subarray of the first kernel of weights and the second subarray of the array of activations; and adding the first tensor product and the second tensor product.
In some embodiments, the first block further comprises a weight decompression unit, and the method further comprises: decompressing, by the weight decompression unit, a data word encoding a plurality of weights in compressed form, to extract a first weight and a second weight; inputting the first weight to the first weight register; and inputting the second weight to the second weight register.
In some embodiments, the method further comprises: performing a second convolution of the array of activations with a second kernel of weights, the step of performing the second convolution comprising, in order: forming a tensor product of a first portion of the second kernel and the first subarray of the array of activations, the first portion of the second kernel including weights stored in the first weight register; forming a tensor product of a second portion of the second kernel and the first subarray of the array of activations, the second portion of the second kernel including weights stored in the second weight register; and forming a tensor product of the first portion of the second kernel and the second subarray of the array of activations, the first portion of the second kernel including the weights stored in the first weight register.
In some embodiments, the activation buffer is configured to include: a first queue connected to the first multiplier and a second queue connected to the second multiplier, the first queue including a first register and a second register adjacent to the first register, the first register being an output register of the first queue, the first block being further configured to: in a first state, multiply a first weight, in the first multiplier, by an activation from the output register of the first queue, and, in a second state, multiply the first weight, in the first multiplier, by an activation from the second register of the first queue.
In some embodiments, in the second state, the output register of the first queue contains a zero.
In some embodiments, the processing circuit further comprises a first adder, and the method further comprises: connecting the first adder to an output of the first multiplier and an output of the second multiplier in the first state, and adding, by the first adder, a product received from the output of the first multiplier and a product received from the output of the second multiplier.
In some embodiments, the processing circuit further comprises a second adder, the method further comprising connecting the second adder to the output of the first multiplier in the second state.
In some embodiments, the processing circuit further comprises: a first accumulator coupled to the first adder and a second accumulator coupled to the second adder, the first accumulator including a register, the method further comprising: in a first state, the sum received from the first adder is added to the value in the register of the first accumulator by the first accumulator to form an accumulated value of the first accumulator, and the accumulated value of the first accumulator is stored in the register of the first accumulator by the first accumulator.
In some embodiments, the second accumulator comprises a register, the method further comprising: in the second state, the sum received from the second adder is added to the value in the register of the second accumulator by the second accumulator to form an accumulated value of the second accumulator, and the accumulated value of the second accumulator is stored in the register of the second accumulator by the second accumulator.
According to some embodiments of the present disclosure, there is provided a method for computing using an apparatus for processing, the apparatus for processing comprising: a first block, a second block, a memory, and a bus connected to the memory, the first block, and the second block, the first block comprising: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the method comprising: performing a first convolution of an array of activations with a first kernel of weights, the step of performing the first convolution comprising: broadcasting a first subarray of the array of activations to the first block and the second block; forming a first tensor product, the first tensor product being a tensor product of the first subarray of the first kernel of weights and the first subarray of the array of activations; storing the first tensor product in the memory; broadcasting a second subarray of the array of activations to the first block and the second block; forming a second tensor product, the second tensor product being a tensor product of a second subarray of the first kernel of weights and the second subarray of the array of activations; and adding the first tensor product and the second tensor product.
According to some embodiments of the present disclosure, there is provided a processor comprising: a first block, a second block, a memory, an input bus, and an output bus, wherein the input bus is connected to the memory, the first block, and the second block, and the first block comprises: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the first block being configured to perform a first convolution of an array of activations with a kernel of weights; the memory comprises: a first memory bank set and a second memory bank set; the input bus comprises: a first segmented bus for broadcasting data in a first direction, and a second segmented bus for broadcasting data in a second direction opposite to the first direction; the first segmented bus comprises: a first switch block and a second switch block, the first switch block being connected to the first block and the first memory bank set, and the second switch block being connected to the second block and the second memory bank set; the second segmented bus comprises: a third switch block and a fourth switch block, the third switch block being connected to the first block and the first memory bank set, and the fourth switch block being connected to the second block and the second memory bank set; an input of the first switch block is connected to an output of the second switch block; and an output of the third switch block is connected to an input of the fourth switch block.
In some embodiments, the first segmented bus is configured to: in a first bus state, connect the first memory bank set to the first block through the first switch block, and connect the second memory bank set to the second block through the second switch block.
In some embodiments, the first segmented bus is further configured to: in a second bus state, connect the second memory bank set to the first block through the first switch block and through the second switch block, and connect the second memory bank set to the second block through the second switch block.
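A routing-table sketch of these two bus states follows; it is an illustrative assumption about how the switch blocks resolve, not the patent's switch-block design.

```python
# Illustrative routing table (not the patent's switch-block design) for the
# first segmented bus in the two bus states described above.

def first_segmented_bus_route(bus_state):
    """Return a mapping {block: memory bank set} for the given bus state."""
    if bus_state == "first":
        # Each switch block passes its local memory bank set to its local block.
        return {"first block": "first memory bank set",
                "second block": "second memory bank set"}
    if bus_state == "second":
        # The second memory bank set is forwarded through the second switch
        # block and the first switch block, so it reaches the first block too.
        return {"first block": "second memory bank set",
                "second block": "second memory bank set"}
    raise ValueError(f"unknown bus state: {bus_state}")


if __name__ == "__main__":
    print(first_segmented_bus_route("first"))
    print(first_segmented_bus_route("second"))
```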
In some embodiments, the activation buffer is configured to include: a first queue connected to the first multiplier and a second queue connected to the second multiplier, the first queue including a first register and a second register adjacent to the first register, the first register being an output register of the first queue, the first block being further configured to: in a first state, multiply a first weight, in the first multiplier, by an activation from the output register of the first queue, and, in a second state, multiply the first weight, in the first multiplier, by an activation from the second register of the first queue.
In some embodiments, in the second state, the output register of the first queue contains a zero.
In some embodiments, the processor further comprises: a first adder configured to be connected to an output of the first multiplier and an output of the second multiplier in a first state, and to add the product received from the output of the first multiplier to the product received from the output of the second multiplier.
In some embodiments, the processor further comprises: a second adder configured to be connected to the output of the first multiplier in a second state.
In some embodiments, the processor further comprises: a first accumulator connected to the first adder; and a second accumulator connected to the second adder, the first accumulator including a register and configured to add the sum received from the first adder to a value in the register of the first accumulator to form an accumulated value of the first accumulator in the first state, and store the accumulated value of the first accumulator in the register of the first accumulator.
In some embodiments, the second accumulator includes a register and is configured to, in the second state, add the sum received from the second adder to a value in the register of the second accumulator to form an accumulated value of the second accumulator and store the accumulated value of the second accumulator in the register of the second accumulator.
In some embodiments, the processor further comprises: an activation zero-skip control circuit configured to: determine whether the output register of the first queue contains a zero, and cause the first block to operate in the second state in response to determining that the output register of the first queue contains a zero.
In some embodiments, the processor further comprises a multiplexer having: an input connected to the first multiplier at the single port side of the multiplexer; a first output connected to a first adder on the multi-port side of the multiplexer; and a second output connected to the second adder on the multi-port side of the multiplexer.
According to some embodiments of the present disclosure, there is provided a method for performing a calculation using a processing circuit, the processing circuit comprising: a first block, a second block, a memory, an input bus, and an output bus, wherein the input bus is connected to the memory, the first block, and the second block, and the first block comprises: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the first block being configured to perform a first convolution of an array of activations with a kernel of weights; the memory comprises: a first memory bank set and a second memory bank set; the input bus comprises: a first segmented bus for broadcasting data in a first direction, and a second segmented bus for broadcasting data in a second direction opposite to the first direction; the first segmented bus comprises: a first switch block and a second switch block, the first switch block being connected to the first block and the first memory bank set, and the second switch block being connected to the second block and the second memory bank set; the second segmented bus comprises: a third switch block and a fourth switch block, the third switch block being connected to the first block and the first memory bank set, and the fourth switch block being connected to the second block and the second memory bank set; an input of the first switch block is connected to an output of the second switch block; and an output of the third switch block is connected to an input of the fourth switch block, the method comprising: in a first bus state, connecting the first memory bank set to the first block through the first switch block, and connecting the second memory bank set to the second block through the second switch block.
In some embodiments, the method further comprises: in a second bus state, connecting the second memory bank set to the first block through the first switch block and the second switch block, and connecting the second memory bank set to the second block through the second switch block.
In some embodiments, the activation buffer is configured to include: a first queue connected to the first multiplier and a second queue connected to the second multiplier, the first queue including a first register and a second register adjacent to the first register, the first register being an output register of the first queue, the first block being further configured to: in a first state, multiply a first weight, in the first multiplier, by an activation from the output register of the first queue, and, in a second state, multiply the first weight, in the first multiplier, by an activation from the second register of the first queue.
In some embodiments, in the second state, the output register of the first queue contains a zero.
In some embodiments, the processing circuit further comprises a first adder, the method further comprising: connecting the first adder to an output of the first multiplier and an output of the second multiplier in a first state; and adding, by the first adder, a product received from the output of the first multiplier and a product received from the output of the second multiplier.
In some embodiments, the processing circuit further comprises a second adder, the method further comprising: the second adder is connected to the output of the first multiplier in the second state.
In some embodiments, the processing circuit further comprises: a first accumulator connected to the first adder; and a second accumulator coupled to the second adder, the first accumulator including a register, the method further comprising: in a first state, the sum received from the first adder is added to the value in the register of the first accumulator by the first accumulator to form an accumulated value of the first accumulator, and the accumulated value of the first accumulator is stored in the register of the first accumulator by the first accumulator.
In some embodiments, the second accumulator comprises a register, the method further comprising: in the second state, the sum received from the second adder is added to the value in the register of the second accumulator by the second accumulator to form an accumulated value of the second accumulator, and the accumulated value of the second accumulator is stored in the register of the second accumulator by the second accumulator.
According to some embodiments of the present disclosure, there is provided a method of computing using an apparatus for processing, the apparatus for processing comprising: a first block, a second block, a memory, an input bus, and an output bus, wherein the input bus is connected to the memory, the first block, and the second block, and the first block comprises: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the first block being configured to perform a first convolution of an array of activations with a kernel of weights; the memory comprises: a first memory bank set and a second memory bank set; the input bus comprises: a first segmented bus for broadcasting data in a first direction, and a second segmented bus for broadcasting data in a second direction opposite to the first direction; the first segmented bus comprises: a first switch block and a second switch block, the first switch block being connected to the first block and the first memory bank set, and the second switch block being connected to the second block and the second memory bank set; the second segmented bus comprises: a third switch block and a fourth switch block, the third switch block being connected to the first block and the first memory bank set, and the fourth switch block being connected to the second block and the second memory bank set; an input of the first switch block is connected to an output of the second switch block; and an output of the third switch block is connected to an input of the fourth switch block, the method comprising: in a first bus state, connecting the first memory bank set to the first block through the first switch block, and connecting the second memory bank set to the second block through the second switch block.
According to some embodiments of the present disclosure, there is provided a processor comprising: a first block, a second block, a memory, and a bus, wherein the bus is connected to the memory, the first block, and the second block, and the first block comprises: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the activation buffer being configured to include: a first queue connected to the first multiplier and a second queue connected to the second multiplier, the first queue including a first register and a second register adjacent to the first register, the first register being an output register of the first queue, the first block being configured to: in a first state, multiply a first weight, in the first multiplier, by an activation from the output register of the first queue, and, in a second state, multiply the first weight, in the first multiplier, by an activation from the second register of the first queue.
In some embodiments, in the second state, the output register of the first queue contains a zero.
In some embodiments, the processor further comprises: a first adder configured to be connected to an output of the first multiplier and an output of the second multiplier in a first state, and to add a product received from the output of the first multiplier and a product received from the output of the second multiplier.
In some embodiments, the processor further comprises: a second adder configured to be connected to the output of the first multiplier in a second state.
In some embodiments, the processor further comprises: a first accumulator connected to the first adder; and a second accumulator connected to the second adder, the first accumulator including a register and configured to add the sum received from the first adder to a value in the register of the first accumulator to form an accumulated value of the first accumulator and store the accumulated value of the first accumulator in the register of the first accumulator in the first state.
In some embodiments, the second accumulator includes a register and is configured to, in the second state, add the sum received from the second adder to a value in the register of the second accumulator to form an accumulated value of the second accumulator and store the accumulated value of the second accumulator in the register of the second accumulator.
In some embodiments, the processor further comprises: an activation zero-skip control circuit configured to: determine whether the output register of the first queue contains a zero, and cause the first block to operate in the second state in response to determining that the output register of the first queue contains a zero.
In some embodiments, the processor further comprises a multiplexer having: an input connected to the first multiplier at the single port side of the multiplexer; a first output connected to a first adder on the multi-port side of the multiplexer; and a second output connected to the second adder on the multi-port side of the multiplexer.
In some embodiments, the activation zero-skip control circuit is configured to control the multiplexer to connect the input to the first output in the first state, and to connect the input to the second output in the second state.
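A behavioral sketch of this multiplexer control is given below; the function name and the boolean control signal are assumptions for illustration, not the patent's logic design.

```python
# Illustrative sketch (not the patent's logic design): the multiplexer steers
# the first multiplier's product to the first adder in the first state and to
# the second adder in the second state, under control of the activation
# zero-skip control circuit.

def route_product(product, output_register_is_zero, adder_inputs):
    """adder_inputs maps each adder name to the list of values presented to it."""
    target = "second adder" if output_register_is_zero else "first adder"
    adder_inputs[target].append(product)
    return target


if __name__ == "__main__":
    adders = {"first adder": [], "second adder": []}
    route_product(21, output_register_is_zero=False, adder_inputs=adders)   # first state
    route_product(14, output_register_is_zero=True, adder_inputs=adders)    # second state
    print(adders)   # -> {'first adder': [21], 'second adder': [14]}
```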
In some embodiments, the second queue includes a first register and a second register adjacent to the first register, the first register being an output register of the second queue; and the first block is further configured to: in a third state, multiply the first weight, in the first multiplier, by an activation from the second register of the second queue.
According to some embodiments of the present disclosure, there is provided a method for performing a calculation using a processing circuit, the processing circuit comprising: a first block, a second block, a memory, and a bus connected to the memory, the first block, and the second block, the first block comprising: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the activation buffer being configured to include: a first queue connected to the first multiplier and a second queue connected to the second multiplier, the first queue including a first register and a second register adjacent to the first register, the first register being an output register of the first queue, the method comprising: in a first state, multiplying, by the first multiplier, a first weight by an activation from the output register of the first queue, and, in a second state, multiplying, by the first multiplier, the first weight by an activation from the second register of the first queue.
In some embodiments, in the second state, the output register of the first queue contains a zero.
In some embodiments, the processing circuit further comprises a first adder, and the method further comprises: connecting the first adder to an output of the first multiplier and an output of the second multiplier in the first state, and adding, by the first adder, a product received from the output of the first multiplier and a product received from the output of the second multiplier.
In some embodiments, the processing circuit further comprises a second adder, the method further comprising connecting the second adder to the output of the first multiplier in the second state.
In some embodiments, the processing circuit further comprises: a first accumulator connected to the first adder; and a second accumulator connected to the second adder, the first accumulator including a register, the method further including adding, by the first accumulator, the sum received from the first adder to a value in the register of the first accumulator to form an accumulated value of the first accumulator, and storing, by the first accumulator, the accumulated value of the first accumulator in the register of the first accumulator.
In some embodiments, the second accumulator includes a register, the method further includes in the second state adding, by the second accumulator, the sum received from the second adder to a value in the register of the second accumulator to form an accumulated value of the second accumulator, and storing, by the second accumulator, the accumulated value of the second accumulator in the register of the second accumulator.
In some embodiments, the processing circuit further comprises an activation zero-skip control circuit, and the method further comprises: determining, by the activation zero-skip control circuit, whether the output register of the first queue contains a zero, and causing the first block to operate in the second state in response to determining that the output register of the first queue contains a zero.
In some embodiments, the processing circuit further comprises a multiplexer having: an input connected to a first multiplier on a single port side of the multiplexer; a first output connected to a first adder on the multi-port side of the multiplexer; and a second output connected to the second adder on the multi-port side of the multiplexer.
In some embodiments, the method further comprises: controlling the multiplexer, by the activation zero-skip control circuit, to connect the input to the first output in the first state and to connect the input to the second output in the second state.
According to some embodiments of the present disclosure, there is provided a method for computing using an apparatus for processing, the apparatus for processing comprising: a first block, a second block, a memory, and a bus connected to the memory, the first block, and the second block, the first block comprising: a first weight register, a second weight register, an activation buffer, a first multiplier, and a second multiplier, the activation buffer being configured to include: a first queue connected to the first multiplier and a second queue connected to the second multiplier, the first queue including a first register and a second register adjacent to the first register, the first register being an output register of the first queue, the method comprising: in a first state, multiplying, in the first multiplier, a first weight by an activation from the output register of the first queue, and, in a second state, multiplying, in the first multiplier, the first weight by an activation from the second register of the first queue.
Drawings
These and other features and advantages of the present disclosure will be appreciated and understood with reference to the specification, claims, and drawings, in which:
FIG. 1A is a block diagram depicting a neural processor in accordance with the subject matter disclosed herein;
FIG. 1B is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1C depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1D depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1E depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1F depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1G depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1H depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1I is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1J is a block diagram depicting a portion of a neural processor for three cases, in accordance with the subject matter disclosed herein;
FIG. 1K is a schematic diagram of a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1L is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1MA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1MB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1N is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1O is a block diagram depicting a neural processor in accordance with the subject matter disclosed herein;
FIG. 1P is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1Q is a table of sizes according to the subject matter disclosed herein;
FIG. 1R is a tensor diagram in accordance with the subject matter disclosed herein;
FIG. 1S is a tensor diagram in accordance with the subject matter disclosed herein;
FIG. 1T depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1U depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1V is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1WA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1WB depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1WC depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1WD depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1WE depicts data flow in a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 1X is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 2AA is a convolution graph according to the subject matter disclosed herein;
FIG. 2AB is a graph of convolution according to the subject matter disclosed herein;
FIG. 2AC is a graph of convolution according to the subject matter disclosed herein;
FIG. 2AD is a graph of convolution according to the subject matter disclosed herein;
FIG. 2BA is a convolution graph according to the subject matter disclosed herein;
FIG. 2BB is a convolution graph according to the subject matter disclosed herein;
FIG. 2BC is a convolution graph according to the subject matter disclosed herein;
FIG. 2BD is a convolution graph according to the subject matter disclosed herein;
FIG. 2BE is a convolution graph according to the subject matter disclosed herein;
FIG. 2BF is a graph of convolution according to the subject matter disclosed herein;
FIG. 2BG is a convolution graph according to the subject matter disclosed herein;
FIG. 2BH is a convolution graph in accordance with the subject matter disclosed herein;
FIG. 2BI is a convolution graph according to the subject matter disclosed herein;
FIG. 2BJ is a graph of convolution according to the subject matter disclosed herein;
FIG. 2BK is a convolution graph according to the subject matter disclosed herein;
FIG. 2BL is a convolution graph according to the subject matter disclosed herein;
FIG. 2BM is a convolution map according to the subject matter disclosed herein;
FIG. 2C is a convolution graph according to the subject matter disclosed herein;
FIG. 2DA is a convolution graph according to the subject matter disclosed herein;
FIG. 2DB is a convolution graph according to the subject matter disclosed herein;
FIG. 2DC is a convolution graph according to the subject matter disclosed herein;
FIG. 2DD is a convolution map according to the subject matter disclosed herein;
FIG. 2DE is a convolution graph according to the subject matter disclosed herein;
FIG. 2DF is a convolution graph according to the subject matter disclosed herein;
FIG. 2DG is a graph of convolution according to the subject matter disclosed herein;
FIG. 2DH is a graph of convolution according to the subject matter disclosed herein;
FIG. 2DI is a convolution graph according to the subject matter disclosed herein;
FIG. 2DJ is a convolution graph according to the subject matter disclosed herein;
FIG. 2DK is a graph of convolution according to the subject matter disclosed herein;
FIG. 2DL is a convolution graph according to the subject matter disclosed herein;
FIG. 2DM is a graph of convolution according to the subject matter disclosed herein;
FIG. 2DN is a convolution graph according to the subject matter disclosed herein;
FIG. 2DO is a graph of convolution according to the subject matter disclosed herein;
FIG. 2DP is a graph of convolution according to the subject matter disclosed herein;
FIG. 2DQ is a graph of convolution according to the subject matter disclosed herein;
FIG. 2DR is a convolution graph according to the subject matter disclosed herein;
FIG. 2DS is a convolution graph according to the subject matter disclosed herein;
FIG. 2DT is a graph of convolution according to the subject matter disclosed herein;
FIG. 2DV is a convolution graph according to the subject matter disclosed herein;
FIG. 2DW is a graph of convolution according to the subject matter disclosed herein;
FIG. 2DX is a graph of convolution according to the subject matter disclosed herein;
FIG. 2E is a read table according to the subject matter disclosed herein;
FIG. 2F is a read table according to the subject matter disclosed herein;
FIG. 2GA is a convolution graph according to the subject matter disclosed herein;
FIG. 2GB is a convolution graph according to the subject matter disclosed herein;
FIG. 2HA is a convolution graph according to the subject matter disclosed herein;
FIG. 2HB is a convolution graph according to the subject matter disclosed herein;
FIG. 2HC is a graph of convolution according to the subject matter disclosed herein;
FIG. 2HD is a convolution graph according to the subject matter disclosed herein;
FIG. 3AA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3AB depicts a data flow in accordance with the subject matter disclosed herein;
FIG. 3AC depicts a data flow in accordance with the subject matter disclosed herein;
FIG. 3AD depicts a data flow in accordance with the subject matter disclosed herein;
FIG. 3AE depicts a data flow in accordance with the subject matter disclosed herein;
FIG. 3AF depicts a data flow in accordance with the subject matter disclosed herein;
FIG. 3AG depicts a data flow in accordance with the subject matter disclosed herein;
FIG. 3AH depicts data flow in accordance with the subject matter disclosed herein;
FIG. 3AI depicts a data flow in accordance with the subject matter disclosed herein;
FIG. 3AJ depicts data flow in accordance with the subject matter disclosed herein;
FIG. 3AK depicts a data flow in accordance with the subject matter disclosed herein;
FIG. 3BA depicts a block diagram of a portion of a neural processor, in accordance with the subject matter disclosed herein;
FIG. 3BB is a data diagram according to the subject matter disclosed herein;
FIG. 3BC is a data diagram according to the subject matter disclosed herein;
FIG. 3CA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3CB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3DA is a data diagram according to the subject matter disclosed herein;
FIG. 3EA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3EB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3FA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3FB is a data diagram according to the subject matter disclosed herein;
FIG. 3FC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3GA is a data graph according to the subject matter disclosed herein;
FIG. 3GB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3GC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3GD is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3HA is a data graph according to the subject matter disclosed herein;
FIG. 3HB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3HC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3HD is a data diagram according to the subject matter disclosed herein;
FIG. 3IA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3IB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3IC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3ID is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3IE is a data graph according to the subject matter disclosed herein;
FIG. 3IF is a data diagram according to the subject matter disclosed herein;
FIG. 3JA depicts a data flow according to the subject matter disclosed herein;
FIG. 3JB depicts a data flow in accordance with the subject matter disclosed herein;
FIG. 3JC depicts a data flow according to the subject matter disclosed herein;
FIG. 3JD depicts a data flow in accordance with the subject matter disclosed herein;
FIG. 3KA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3KB is a data diagram according to the subject matter disclosed herein;
FIG. 3LA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3LB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3LC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3LD is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3MA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3MB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3NA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3OA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3OB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3OC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3PA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3PB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 3PC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AD is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AE is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AF is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AG is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AH is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AJ is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AK is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AL is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AM is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4AN is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4BA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4BB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4BC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4BD is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4CA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4CB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4CC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4DA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4DB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4DC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4EA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4EB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4EC is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4FA is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4FB is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4G is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 4H is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 5A is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 5B is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 5C is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 5D is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 5E is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 5F is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 5G is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 6 is a block diagram depicting a portion of a neural processor in accordance with the subject matter disclosed herein;
FIG. 7A depicts an example of IFM data having a relatively uniform distribution of zero values distributed among IFM slices and in the passages within the IFM slices;
FIG. 7B depicts another example of IFM data in which zero values are aggregated in the same IFM lane (lane) of adjacent IFM slices;
FIG. 7C depicts a block diagram of an example embodiment of a system that uses an IFM shuffler (shuffler) to pseudo-randomly permute (permute) values within each IFM slice to scatter clusters (clusters) of non-zero values within the IFM slices, according to the subject matter disclosed herein;
FIG. 7D depicts a block diagram of an example embodiment of a 16-channel butterfly shuffler (16-channel butterfly shuffler) in accordance with the subject matter disclosed herein;
FIG. 7E depicts a block diagram of an example embodiment of a pseudo-random generator connected to a butterfly shuffler, according to the subject matter disclosed herein;
FIG. 8A depicts a block diagram of an example embodiment of a baseline multiplier unit according to the subject matter disclosed herein;
FIG. 8B depicts a block diagram of an example embodiment of a multiplier unit supporting dual sparsity for both zero-valued activation skip and zero-valued weight skip in accordance with the subject matter disclosed herein; and
FIG. 8C depicts a block diagram of an example embodiment of a system that uses an IFM shuffler to pseudo-randomly permute values within each IFM slice to homogenize the distribution of zero value activations and zero value weights in accordance with the subject matter disclosed herein.
Detailed Description
The detailed description set forth below in connection with the appended drawings is intended as a description of exemplary embodiments of neural processors provided in accordance with the present disclosure and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth features of the subject matter disclosed herein in connection with the embodiments depicted. It is to be understood, however, that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the subject matter disclosed herein. As shown elsewhere herein, like element numbers are intended to indicate like elements or features. Moreover, as used herein, the word "exemplary" means "serving as an example, instance, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
As used herein, the term "module" means any combination of software, firmware, and/or hardware configured to provide the functionality described herein in connection with the module. Software may be implemented as a software package, code and/or instruction set or instructions, and the term "hardware" as used in any implementation described herein may include, for example, singly or in any combination, hardwired circuitry, programmable circuitry, state machine circuitry, and/or firmware that stores instructions executed by programmable circuitry. Modules may be implemented collectively or individually as circuitry that forms part of a larger system, such as, but not limited to, an Integrated Circuit (IC), a system on a chip (SoC), etc. The various components and/or functional blocks disclosed herein may be implemented as modules that may include software, firmware, and/or hardware to provide the functionality described herein in connection with the various components and/or functional blocks.
Fig. 1A depicts a high-level block diagram of a neural processor 100, according to the subject matter disclosed herein. The neural processor 100 may be configured to efficiently determine or calculate the convolution and/or tensor product of an input feature map (IFM) (or a tensor of "activations") with a multidimensional array (or tensor) of weights to form an output feature map (OFM). The neural processor 100 may also be configured to determine or calculate feature map pooling and/or activation functions; however, for the sake of clarity and brevity, pooling and activation functions are largely not covered here.
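For orientation only, the computation that such a neural processor accelerates is an ordinary convolution of an IFM with a weight tensor to produce an OFM. The reference sketch below (plain Python with NumPy; names, shapes, stride 1, and lack of padding are illustrative assumptions) shows the arithmetic being accelerated, not the disclosed hardware.

```python
import numpy as np

def conv2d_ofm(ifm, weights):
    """Reference 2D convolution: ifm has shape (H, W, Cin); weights has
    shape (Kh, Kw, Cin, Cout).  Returns an OFM of shape
    (H - Kh + 1, W - Kw + 1, Cout).  Stride 1, no padding, and no
    activation function or pooling are assumed."""
    H, W, Cin = ifm.shape
    Kh, Kw, _, Cout = weights.shape
    ofm = np.zeros((H - Kh + 1, W - Kw + 1, Cout))
    for y in range(H - Kh + 1):
        for x in range(W - Kw + 1):
            window = ifm[y:y + Kh, x:x + Kw, :]        # (Kh, Kw, Cin) patch
            for co in range(Cout):
                # Dot product of the patch with one output channel's kernel
                ofm[y, x, co] = np.sum(window * weights[:, :, :, co])
    return ofm
```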
Multiple memory bank groups 109, each including several memory banks (e.g., four banks 108 in FIGS. 4AB and 4AC), may be connected to Multiply-and-Reduce (MR) blocks (tiles) 102 (described in further detail below) by an IFM transfer structure (delivery fabric) 104, which brings input activation maps stored in the memory bank groups 109 to the blocks 102 for subsequent computation. As will be discussed in further detail below, each block 102 includes an array of Multiplier Units (MUs) 103. The blocks 102 are also connected to the memory bank groups 109 via an OFM transfer structure 106, which sends the results of the computations from the blocks 102 to the memory bank groups 109 for storage. In one embodiment, the set of memory banks 109 may be a set of Static Random Access Memory (SRAM) banks. Accordingly, the memory bank group 109 may be referred to herein as an SRAM bank group 109, or simply as SRAM 109. In another embodiment, the set of memory banks 109 may include volatile and/or non-volatile memory banks.
IFM transfer structure 104 may be a segmented bus (as discussed below), and thus each of the SRAM bank groups 109 may be associated with one of the blocks 102. A central controller 110 may supply control words for controlling registers in the system via a common bus 112. Data may be transferred to the neural processor via an AXI (Advanced eXtensible Interconnect, by ARM) interconnect 114, and the results of processing operations performed by the neural processor 100 may similarly be retrieved via the AXI interconnect 114. An MCU (microcontroller) 116 is operable to schedule computations in time by appropriately configuring the central controller 110, and to coordinate and execute data transfers between the neural processor 100 and the external memory 120 using a DMA controller 118. Each of the different components and/or functional blocks of the neural processor described herein may be implemented as separate components and/or modules.
Each block 102 may include a Multiply-and-Reduce (MR) array 122 having Multiply-and-Reduce (MR) columns 133. Fig. 1B depicts an MR array 122 as may be configured in some embodiments. Each MR array 122 can include eight MR columns 133, of which only two are depicted for clarity. Each MR column 133 may include sixteen MUs 103, of which only four are depicted for clarity, and two adder trees 128A and 128B.
Each MU 103 may include a plurality of registers (e.g., a register file 127 containing 18 9-bit registers, which may be referred to as "weight registers") and a multiplier 126. Multiplier 126 multiplies an input activation by a weight from register file 127. Subsequently, the adder trees 128A and 128B in each MR column 133 sum (i.e., reduce) the resulting products from the sixteen MUs 103 in the column to form a dot product. As described below, the summation may be performed in a particular manner.
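A minimal sketch of what one MR column does in one cycle, offered only to illustrate the data flow just described and not as the hardware itself (the function name is illustrative):

```python
def mr_column_cycle(activations, weights, accumulator=0):
    """One MR column: sixteen multipliers form products of the broadcast
    activations with the column's weights, the adder tree reduces the
    products to a single sum, and the accumulator adds that sum to its
    current contents and keeps the result."""
    assert len(activations) == len(weights) == 16
    products = [a * w for a, w in zip(activations, weights)]  # 16 multipliers
    reduced = sum(products)                                   # adder tree
    return accumulator + reduced                              # accumulator
```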
Each block 102 may also include an IFM cache 139 and an Activation Broadcast Unit (ABU) 141. IFM cache 139 may reduce SRAM reads for the input feature map by caching IFM values received from SRAM 109. Just as each MR column 133 may contain sixteen MUs 103, IFM cache 139 may contain sixteen parallel "activation lanes," each activation lane 137 effectively corresponding to a "row" of MUs 103 in MR array 122.
The Activation Broadcast Unit 141 may be responsible for preparing the input activations. The first step in the preparation process may include fetching input activations from the IFM cache 139 into the IFM activation buffer 124 according to the convolution sequence, while also omitting zero-valued activations when possible to implement a sparse activation computation function. The sparse activation computation function may optionally be disabled, resulting in a "dense" tensor computation mode. A second step in the preparation process may include converting the numeric type of the activations to a sign-and-8-bit-magnitude format, which may include splitting a data type having a bit width exceeding 8 bits into a series of sign-and-8-bit-magnitude values using a type converter 135. When the activations have been encoded using "zero point" encoding, as supported by, for example, Google TensorFlow, a zero-point constant value Z may be added to the activations before converting the values into the sign-and-8-bit-magnitude format.
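A minimal sketch of the kind of conversion described above, offered as an illustration rather than as the converter 135 itself; the function name and the byte-splitting order are assumptions.

```python
def to_sign_magnitude(activation, zero_point=0):
    """Convert an activation to sign-and-8-bit-magnitude form.  A zero-point
    constant Z is added first when zero-point encoding is used.  Values whose
    magnitude exceeds 8 bits are split into a list of (sign, 8-bit magnitude)
    pairs, least-significant byte first (the exact splitting order used by
    the hardware is an assumption here)."""
    value = activation + zero_point
    sign = 1 if value < 0 else 0
    magnitude = abs(value)
    parts = []
    while True:
        parts.append((sign, magnitude & 0xFF))
        magnitude >>= 8
        if magnitude == 0:
            break
    return parts   # e.g. [(0, 0x34), (0, 0x12)] for the value 0x1234
```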
Just as each MR column 133 may contain sixteen MUs 103, the ABU 141, IFM buffer 124, and type converter 135 may also each contain sixteen lanes. The resulting sixteen converted activation values are broadcast to MR array 122 in parallel, such that each activation lane brings an input activation value to the corresponding row of eight MUs 103.
Each MR column 133 can also contain accumulators 130A and 130B, one for each of the adder trees 128A and 128B. As used herein, an "accumulator" is a combination of an adder and a register that may be configured to add an input value to the contents of the register and to overwrite the contents of the register with the resulting sum.
As previously described, the MUs 103 in the MR array 122 may be arranged as a plurality of rows (e.g., 16 rows) and columns (or "OFM channels") (e.g., eight columns), wherein, for clarity, only four of the 16 rows and only the two columns labeled O0 and O7 are depicted in FIG. 1B.
An IFM vector having a length of sixteen values may be referred to herein as an "IFM slice". An IFM slice may have an associated planar coordinate (x, y) and an associated depth channel index d as an index into the associated IFM tensor (e.g., IFM[x, y, d:d+15]). In general, block 102 receives one IFM slice at a time from an on-chip memory or SRAM containing a 3D IFM tensor, in which each input IFM slice contains the values of sixteen depth channels, from index d to d+15 (inclusive), at planar position (x, y) in the input layer.
Similarly, an OFM vector having a length of eight values may be referred to herein as an "OFM slice". An OFM slice may have an associated planar coordinate (x, y) and an associated depth channel index d as an index into the associated OFM tensor (e.g., OFM[x, y, d:d+7]). In general, block 102 produces OFM slices as output. In some embodiments, when the block is not stalled, as will be seen below, the output rate may vary from one OFM slice per clock up to, for example, a maximum of two OFM slices per clock. Note that the OFM output vectors (OFM slices) output from the blocks 102 may need to be further reduced by a Reduction Fabric 111 to complete the OFM vector calculation before the final OFM vector results are sent through the OFM transfer structure 106 to be stored in SRAM 109.
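Purely as an indexing illustration (assuming the tensors are stored as NumPy arrays indexed [x, y, d]; the function names are illustrative), an IFM slice and an OFM slice correspond to the following views:

```python
def ifm_slice(ifm, x, y, d):
    """An IFM slice: sixteen consecutive depth-channel values at planar
    position (x, y), i.e. IFM[x, y, d : d + 16]."""
    return ifm[x, y, d:d + 16]

def ofm_slice(ofm, x, y, d):
    """An OFM slice: eight consecutive depth-channel values at planar
    position (x, y), i.e. OFM[x, y, d : d + 8]."""
    return ofm[x, y, d:d + 8]
```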
Note that both the IFM tensor and the OFM tensor can also have a fourth "batch" dimension; however, in contrast to neural network model training, the primary purpose of the neural processor 100 is to accelerate neural network model real-time inference, and real-time inference is typically performed based on batch size 1. For simplicity of illustration, the batch dimensions will be omitted in much of the discussion below, and the batch dimension details will be described separately later.
The neural processor 100 may be implemented in synchronous logic and each MR column 133 may be entirely within one clock domain. In some embodiments, during each cycle of operation (e.g., during each clock cycle), each of the sixteen multipliers 126 may form a corresponding product from the two multiplicands (or operands) at its inputs. Each of the adders 128A and 128B may form a sum of some or all of the sixteen products at the inputs of the adders 128A and 128B (as depicted in fig. 1B for the four paths depicted), and the adder of each accumulator 130A and 130B may form a sum of (i) the current value of the register of the respective one of the accumulators 130A and 130B plus (ii) the output of the respective one of the adders 128A and 128B. At the beginning of the next clock cycle, the output of each adder of each accumulator 130A and 130B may be written to a register of the accumulators 130A and 130B.
In some embodiments, the computations provided by block 102 may be pipelined, and additional registers (i.e., flip-flop arrays) may be present between the elements depicted in fig. 1B to provide, for example, sufficient timing margin at the clock speed of circuit operation. In such embodiments, the throughput may be the same (i.e., the same as if there were no additional registers (e.g., one multiply and add per clock cycle)), but the delay between (i) the input data being input to the multiplier 126 and (ii) the final result of the multiply and add being written to the registers of the accumulators 130A and 130B may be greater (e.g., some clock cycles).
Figs. 1C to 1H depict an example of such operation, in which the neural processor 100 exploits sparsity in the IFM data: whenever an element of the IFM data equals zero, the multiplication by zero that a multiplier 126 would otherwise perform is skipped, and certain other multiply-and-add operations are advanced out of order, making better use of the multipliers 126 and speeding up completion of the computation. The IFM data may be stored in the SRAM bank group 109, and retrieval of the IFM data from the SRAM bank group 109 may be scheduled such that the activation buffer 124 operates as a plurality of queues. Each queue formed by the activation buffer 124 corresponds to a row of data as depicted in Fig. 1B, and each queue outputs IFM data to a respective lane of the MR array 122.
For clarity of illustration, assume that IFM cache 139 between SRAM bank set 109 and activation buffer 124 has been disabled and bypassed. Also assume that the data type of activation is uint8 (8-bit unsigned integer) and the data type of weight is int8 (8-bit signed integer), in which case the type converter 135 is used to pass the activation value unchanged and the multiplication in the MU 103 takes one clock cycle. Another assumption is that: the SRAM bank set 109 contains some sample IFM values as depicted in fig. 1B at the beginning of the example operation, and only one block is being used.
Another assumption is that the weight tensor W[0..15, 0..7, a..j], corresponding to 16 IFM lanes, 8 OFM columns, and the 10 IFM input vectors a through j, has been preloaded into the corresponding MU register files (i.e., register files 127).
Once the example operation begins, it can be seen from Fig. 1C that, in the depicted example, the two IFM vectors a[] and b[] in the two rightmost columns of the SRAM bank group 109 have been fetched into the activation buffer 124, such that the first column of the activation buffer 124 (i.e., the right column, a[]) contains the first IFM vector (i.e., elements a0 through a3) and the second column of the activation buffer 124 (i.e., the left column, b[]) contains the second IFM vector (i.e., elements b0 through b3, with b1 = 0). In Fig. 1C, the second queue contains a1 as its first element (closest to the MR array 122) and zero (0) as its second element (i.e., b1 = 0).
From the front of the activation buffer 124, the IFM vector a[0..3] is broadcast to the MR array 122 (i.e., IFM value a0 is broadcast through the topmost activation lane 137 as an input to each of the eight multipliers 126 in the top row). Meanwhile, the top-row multipliers 126 in columns 0 through 7 each receive a weight W[0, 0..7, a] from their respective local register files 127 as the second input to each multiplier 126.
Similarly, the value a1 is broadcast through the second activation lane 137 from the top as an input to the second-row-from-the-top multipliers 126. At the same time, the second-row-from-the-top multipliers 126 in columns 0 through 7 each receive a weight W[1, 0..7, a] from their respective local register files 127 as the second input to each multiplier 126.
In operation, the products of the first vector of IFMs (i.e., elements a0 through a3) with the corresponding weights may be formed in each of the multipliers 126 of the 16 × 8 array, and the sums of the products, corresponding to the desired dot products, may be formed in the first adder trees 128A and stored in the first accumulators 130A. That is, the contents of the first accumulators 130A include:
ΣA,0 = a0*w[0,0,a] + a1*w[1,0,a] + a2*w[2,0,a] + a3*w[3,0,a]
...
ΣA,7 = a0*w[0,7,a] + a1*w[1,7,a] + a2*w[2,7,a] + a3*w[3,7,a]
At this point, the determination or calculation of the OFM output vector corresponding to IFM a[] is complete, with the results available in the accumulators 130A (depicted in Fig. 1C as ΣA,0...7) and ready for output to the OFM transfer structure 106. The accumulators 130A of each column may then be cleared.
In Fig. 1D, after the first vector of IFMs has been processed, the third vector of IFMs (i.e., elements c0 through c3, with c2 = 0) may be read into the activation buffer 124. Instead of forming the products of the weights with all elements of the second vector of IFMs (i.e., elements b0 through b3, with b1 = 0), which would require each multiplier 126 of the second lane to form a product of a zero value with the corresponding weight, the second element of the third vector of IFMs (i.e., element c1) is advanced out of order and multiplied with the corresponding weight in each multiplier 126 of the second lane.
Meanwhile, the multipliers 126 in lanes 0, 2, and 3 receive the weights W[0, 0..7, b], W[2, 0..7, b], and W[3, 0..7, b], respectively, from their respective local register files. Because lane 1 operates out of order, the zero-valued activation b1 having been skipped, the multipliers in lane 1 receive the weight W[1, 0..7, c] associated with IFM vector ("pixel") c rather than the weight associated with IFM vector (pixel) b.
Since block 102 is now processing two pixels simultaneously (a portion of pixel c together with pixel b), summing all of the products in a column would produce an incorrect result. To obtain the correct results, one of the two adder trees 128A and 128B is used to calculate the dot product of pixel b, while the other is used to begin calculating the dot product of pixel c.
The products formed by the multipliers 126 of the second lane are input to the second adder tree 128B (indicated as ΣB,0...7 in Fig. 1D), and the products formed by the multipliers 126 of the other lanes are input to the first adder tree 128A. The out-of-order advancement of element c1 creates a "hole" in the activation buffer 124 that may be exploited in a subsequent clock cycle by advancing another element out of order (as depicted in Fig. 1E, when element d1 is advanced out of order).
Once the products of the non-zero elements of the second vector of IFM data with the corresponding weights have been determined or calculated and their sum formed in the first accumulator 130A of each column, the first accumulator 130A of each column contains the dot product of the second vector of IFMs (b[]) and the weight vector of that column, and that dot product can be output to the OFM transfer structure 106. The first accumulator 130A of each column may then be cleared. That is, the contents of the first accumulator 130A of each column, prior to clearing, comprise:
ΣA,0 = b0*w[0,0,b] + b2*w[2,0,b] + b3*w[3,0,b]
...
ΣA,7 = b0*w[0,7,b] + b2*w[2,7,b] + b3*w[3,7,b]
At this time, the second accumulator 130B of each column contains only one term of the dot product of the third vector of IFMs (the term involving c1) and the corresponding weight vector. That is, the contents of the second accumulator 130B include:
ΣB,0 = c1*w[1,0,c]
...
ΣB,7 = c1*w[1,7,c]
referring to fig. 1E, in a subsequent operation (e.g., during a next clock cycle), the product of the elements (c0, c3) of the third vector of IFMs and the corresponding weight vector may be formed by the first multiplier 126 and the fourth multiplier 126 of each column of the MR array 122. Each product may be added to one of the products already stored in the second accumulator 130B to complete the dot product of the third vector (c [ ]) of the IFM and the corresponding weight vector in the second accumulator 130B. That is, the contents of the second accumulator 130B include:
ΣB,0 = c0*w[0,0,c] + c1*w[1,0,c] + c3*w[3,0,c]
...
ΣB,7 = c0*w[0,7,c] + c1*w[1,7,c] + c3*w[3,7,c]
The dot product of the fourth vector of IFMs (i.e., elements d0 through d3, with d0 = d3 = 0) and the weight vector may be determined or calculated at the same time by advancing out of order both element d1 (exploiting the "hole" left in the activation buffer 124 because the product involving c1 was formed in the previous cycle) and element d2 (because c2 = 0). The contents of the first accumulator 130A include:
ΣA,0 = d1*w[1,0,d] + d2*w[2,0,d]
...
ΣA,7 = d1*w[1,7,d] + d2*w[2,7,d]
at this point, the calculation of the OFM data of the IFM vector c [ ] and the IFM vector d [ ] is completed.
In a similar manner, when the activation buffer contains two vectors e[] and f[] with complementary sparsity, as depicted in Fig. 1F, each of the MR columns 122 can form two dot products simultaneously. In the example depicted in Fig. 1F, the dot product of the fifth vector of IFM data (i.e., elements e0 through e3, with e0 = e1 = 0) with the corresponding weight vector is formed simultaneously with the dot product of the sixth vector of IFM data (i.e., elements f0 through f3, with f2 = f3 = 0) with the corresponding weight vector, the two non-zero elements of the sixth vector being advanced out of order.
Fig. 1G depicts the state of the seventh vector G [ ] of IFM data (i.e., elements G0-G3, and G1-G2-G3-0) in the first column of the activation buffer 124 and the eighth vector of IFM data (i.e., elements h 0-h 3, and h 2-h 3-0) in the second column of the activation buffer 124. Fig. 1G depicts how the "dot product of the eighth vector h [ ] of IFM data with each corresponding weight" is formed simultaneously with the "dot product of the seventh vector of IFM data with each corresponding weight" by advancing the (non-zero) elements of the eighth vector of IFM data out of order such that the (non-zero) elements of the eighth vector are processed simultaneously with the (non-zero) elements of the seventh vector of IFM data. Because one of the (non-zero) elements of the eighth vector of IFMs (h0) is in the same pass as the (non-zero) element of the seventh vector of IFMs (g0), each of the (non-zero) elements of the eighth vector of IFM data is shifted to an adjacent pass of MR column 122 so that these elements may not advance in order.
"an eighth vector h [ of IFM data ]]Is input to the second multiplier 126 from the top of each column (which is not used for the seventh vector g [ 2 ] of IFM data because it has a zero element at that position)]) And an eighth vector h [ 2 ] of the IFM data]Is input to the third multiplier 126 (which is also not used for the seventh vector g [ 2 ] of the IFM data) of each column (h1)]) "allow the (non-zero) elements of the eighth vector of IFM data to be processed simultaneously with the (non-zero) elements of the seventh vector of IFM data. The eighth vector h [ 2 ]]The corresponding elements of the weight vector of (a) are also shifted. More specifically, each MU 103 associated with the topmost lane obtains two weights, one associated with G0 (labeled w in fig. 1G)0,0..7,gWhere 0..7 indicates a corresponding column), and another weight is associated with h0 (labeled w in fig. 1G)0,0..7,h). Each weight w0,0..7,gCorresponding input into the topmost lane that is receiving g0In the multiplier 126. However, each weight w0,0..7,hShifted one way down and input into the multiplier 126 of the second way from the top in the same column that is receiving h 0. Finally, the MUs 103 in the second pass from the top each obtain a weight w 1,0..7,h(associated with h 1) and shifts these weights down one lane, to the third lane from the top in the same column that is receiving h 1.
In the state depicted in FIG. 1G, each multiplier 126 in the bottom lane of each MR column 122 is unused for one cycle. In some embodiments, the likelihood of such an event may be reduced, so as to more fully utilize all of the multipliers 126, by configuring the MR block 102 with a deeper (e.g., 3-deep) activation buffer 124, so that each activation lane has more (e.g., three) values from the same lane to select from. Taking (shifting) a non-zero activation from a lane more than one lane away also provides greater flexibility for replacing zero-valued activations with non-zero activations. Having more than two sets of adder trees and associated accumulators may also improve multiplier utilization.
FIG. 1H depicts the subsequent cycle, after the cycle depicted in FIG. 1G, in which the first column of the activation buffer 124 contains a ninth vector of IFM data (containing all zeros) and the second column of the activation buffer 124 contains a tenth vector of IFM data (i.e., elements j0 through j3). In the state depicted in FIG. 1H, all elements of the tenth vector of IFM data may be advanced out of order, and the dot product of the tenth vector j[] of IFM data with each weight vector may be calculated without incurring a one-cycle delay to process the all-zero ninth vector of IFM data.
As depicted in the above example, the output of a multiplier 126 may be input to adder tree 128A during some clock cycles and to adder tree 128B during other clock cycles. When the output of a multiplier 126 is not input to adder tree 128A or 128B, the corresponding input of that adder tree is set to zero. Fig. 1I depicts an example configuration using a multiplexer 132 to direct the output of any multiplier 126 to either the first adder tree 128A or the second adder tree 128B, to support the operations depicted in Figs. 1D to 1H, for example. Here, the multiplexer control signals sel_addr_tree[0..15] come from block control logic 144 (Fig. 1O) to coordinate the computations within block 102, including fetching IFM vectors from the cache, selecting and multiplexing non-zero activations from the activation buffer onto the activation lanes, selecting the adder tree used with each IFM vector, multiplexing the multiplier unit outputs to the correct adder tree, and clearing the column accumulators.
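To make the schedule of Figs. 1C to 1H concrete, the following cycle-approximate sketch models a simplified version of the zero-skipping policy: a look-ahead of two IFM vectors in the same lane, no look-aside lane shifting, and at most two pixels in flight per column (two adder trees). It is an illustration of the idea, not the block control logic itself, and the names are assumptions.

```python
def simulate_zero_skip(ifm_vectors):
    """Count the cycles needed to issue all non-zero multiplications for a
    stream of IFM vectors ("pixels").  Each cycle a lane issues at most one
    multiplication; a lane whose current pixel holds a zero may instead
    issue the same lane's element of the next pixel (out-of-order advance),
    so at most two pixels are in flight, matching the two adder trees."""
    num_lanes = len(ifm_vectors[0])
    remaining = [[v != 0 for v in vec] for vec in ifm_vectors]
    front, cycles = 0, 0
    while front < len(ifm_vectors):
        if not any(remaining[front]):
            front += 1                 # all-zero (or finished) pixel retires free
            continue
        cycles += 1
        nxt = front + 1 if front + 1 < len(ifm_vectors) else None
        for lane in range(num_lanes):
            if remaining[front][lane]:
                remaining[front][lane] = False       # in-order issue
            elif nxt is not None and remaining[nxt][lane]:
                remaining[nxt][lane] = False         # out-of-order advance
    return cycles
```

For the ten example vectors a through j above, this simplified model uses seven cycles instead of the ten that a dense schedule would need; the lane shifting of Fig. 1G, which this sketch omits, can reduce the count further.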
Since the output of a multiplier 126 is always input to either adder tree 128A or adder tree 128B, but never to both at the same time, the two adder trees 128A and 128B can be implemented using less logic. Fig. 1J depicts how the first adder 128A and the second adder 128B may both be logical constructs implemented using a single physical adder tree and appropriate multiplexers (not shown). For clarity, consider configuring two adder trees, each having four inputs. A four-input adder tree may be implemented using three adders. In a naive approach, three adder elements would be used per adder tree, so configuring two four-input adder trees would use six adder elements. With the help of some additional multiplexers, however, two four-input adder trees can be constructed using only three adder elements. There are three cases to consider. (i) In the first case, all four inputs are summed by the first logical adder 128A (and the output of the second logical adder 128B is zero). (ii) In the second case, three of the inputs are summed by the first logical adder 128A (and the output of the second logical adder 128B is equal to the remaining input). (iii) In the third case, two of the inputs are summed by the first logical adder 128A and two of the inputs are summed by the second logical adder 128B. In the other two cases (not depicted in Fig. 1J), the second logical adder 128B sums three or all four of the inputs, respectively, and the output of the first logical adder 128A is equal to the remaining input or to zero. As used herein, an "adder" is either a physical circuit for adding at least two numbers to form a sum or, as in the example of Fig. 1J, one of a plurality of logical adders formed from a combination of physical adders and multiplexers. As can be seen from Fig. 1J, only three adder elements (and some additional multiplexers, not shown), rather than six adder elements, are sufficient to implement all possible cases.
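Functionally, routing each product to exactly one of the two logical adder trees behaves as sketched below (an illustration of the behavior only, not of the three-adder circuit); because the two sets of inputs never overlap, the hardware can realize both sums with three physical adders and multiplexers as described above.

```python
def two_logical_adder_trees(products, tree_select):
    """Two logical 4-input adder trees: each product is routed by its
    select bit to logical tree A (0) or B (1), and the tree that does not
    receive a given product effectively sees a zero at that input."""
    assert len(products) == len(tree_select) == 4
    sum_a = sum(p for p, s in zip(products, tree_select) if s == 0)
    sum_b = sum(p for p, s in zip(products, tree_select) if s == 1)
    return sum_a, sum_b
```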
Fig. 1K depicts an internal circuit diagram of multiplier cell 103 according to the subject matter disclosed herein. Multiplier unit 103 may include an unsigned 8-bit by unsigned 8-bit multiplier 126, a register file 127 that may hold local weights, logic 143 that may select input weights for multiplier 126, logic 149 and 151 that may shift local weights to adjacent lanes, logic 145, 136, 157, 155, and 159 that may detect a zero multiply condition and idle multiplier 126 to reduce dynamic power consumption, and weight loading logic 157.
The register file 127 holds the weights. One register corresponds to a single int8 or uint8 weight. Weights with larger bit widths occupy more than one register, e.g., int16 (16-bit signed integer) or uint16 (16-bit unsigned integer) weights may occupy two registers. The register file 127 may hold 18 int8 or uint8 weights or 9 int16 or uint16 weights, respectively. As will be described later, the number of registers may be selected to enable the calculation of a 3 by 3 convolution using 16-bit weights without resorting to the generation of partial results.
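As a worked example of that sizing (an observation, not an additional requirement from the disclosure): a 3 by 3 convolution kernel uses nine weights per multiplier unit, and at 16 bits per weight (two registers each) those weights occupy 9 × 2 = 18 registers, which matches the capacity of register file 127.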
The register file 127 includes a single input port for loading the weights {swt_in[C], wt_abs_ld_in[7:0][C]} over the vertical weight load bus 101 (Fig. 1N). Each MR column 133 (indexed by C, where C is in the range of 0 to 7) receives its own weight load bus. The weights are loaded from the weight decompression unit 138 (Fig. 1N) one entire lane at a time (i.e., into all eight columns of a single lane simultaneously) by placing the weight values {swt_in[C], wt_abs_ld_in[7:0][C]} on the vertical weight load bus 101, specifying the index of the target register (from zero to seventeen) on the weight register index bus wt_ld_idx[4:0], and asserting the lane weight load enable wt_ld_en_lane[L] to load the weights into lane L.
As can be seen from Fig. 1K, it takes eighteen cycles to load all of the weights in a single lane, so a total of 18 × 16 = 288 clock cycles is required to load all of the weights of the entire MU array 122. In some cases, this weight loading speed may be insufficient, particularly when computing a Fully Connected (FC) layer. Unlike convolution layer computation, each weight is used only once during FC layer computation and is discarded thereafter. Therefore, in order to maintain the maximum utilization of the multipliers 126 when computing an FC layer, it is necessary to load one weight per multiplier unit 103 per clock, which is 16 times faster than the basic circuit depicted in Fig. 1K allows. In this case, the embodiment may be modified, for example, to include additional weight load buses 101, {swt_in[C0], wt_abs_ld_in[7:0][C0]}, {swt_in[C1], wt_abs_ld_in[7:0][C1]}, and so on, to accelerate weight loading.
In fig. 1K, the weight register file 127 comprises three output ports to enable three weights to be taken simultaneously if one of the weights is to be shifted up one lane while a second weight is shifted down one lane and a third weight is being consumed locally.
A multiplexer 147 is used to enable a weight to be retrieved from the local register file for local consumption. For example, in Fig. 1C, multiplexer 147 selects the locally stored weight w[0,0,a] to be multiplied by IFM value a0. As another example, in Fig. 1D, multiplexer 147 selects the locally stored weight w[1,0,c] to be multiplied by IFM value c1.
The path that takes a weight from the local register file 134 and shifts it down to the lane below is implemented using a multiplexer 149. For example, in Fig. 1G, the locally stored weight w[0,0,h] is shifted one lane down to be multiplied with IFM value h0.
Finally, the path that retrieves a weight from the local register file 127 and shifts it up to the lane above is implemented using a multiplexer 151.
Because the Activation Broadcast Unit (ABU) 141 has complete information about the shift applied to each activation lane and about the offset in the activation buffer associated with each IFM value being broadcast (to the activation lanes), ABU 141 controls all three register file fetch multiplexers 147, 149, and 151 using the signals sel_wt_self[4:0], sel_wt_dn1[4:0], and sel_wt_up1[4:0], respectively.
To reduce the area of the MR column 133, the number of output ports in the register file 127 can be reduced from three to two, for example, by disabling the shifting of weights up and down simultaneously from the same register file. The number of output ports in register file 127 can be further reduced to one, for example, by disabling all weight shifts or allowing one shift weight or local consumption weight. Limiting the shift and the maximum shift distance, however, may reduce the multiplier utilization to some extent. Multiple variations and combinations of shift target path selection and activation buffer depth may be designed to optimize multiplier utilization while reducing MR column 133 and activation broadcast unit 141 complexity, area, and power. As described in the related disclosure (attorney docket No. 1535-467CON2), a particularly efficient method and apparatus for achieving optimized multiplier utilization involves shuffling (permuting) the activation paths in a pseudo-random manner while loading the associated weights accordingly.
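The later figures (e.g., Fig. 7D) name a 16-channel butterfly shuffler for such permutations. Purely as a rough illustration, and not the circuit of this disclosure, a generic 16-lane butterfly permutation network can be sketched as follows; the function name and the control-bit layout are assumptions.

```python
def butterfly_shuffle(values, stage_controls):
    """Generic 16-lane butterfly permutation: values is a list of 16 items
    (one per lane); stage_controls[s][p] is the swap bit for pair p of
    stage s (4 stages of 8 pairwise conditional swaps each).  This is a
    standard butterfly network, offered only as an illustration; the
    butterfly shuffler of Fig. 7D may differ in detail."""
    assert len(values) == 16
    out = list(values)
    for s in range(4):                      # log2(16) stages
        stride = 1 << s
        pair = 0
        for i in range(16):
            partner = i ^ stride
            if i < partner:                 # visit each pair once per stage
                if stage_controls[s][pair]:
                    out[i], out[partner] = out[partner], out[i]
                pair += 1
    return out
```

The stage controls could, for example, be driven by a pseudo-random generator (compare Fig. 7E) so that the permutation varies over time.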
In Fig. 1K, a multiplexer 143 selects the input weight to be used in the multiplication by the multiplier 126. As previously discussed, the input weight may come from the local weight register file 127, or may be "shifted down" from the weight register file in the adjacent upper lane (and, in some embodiments, the same column), or "shifted up" from the weight register file in the adjacent lower lane (and, in some embodiments, the same column), as represented by the signals {swt_self, wt_abs_self[7:0]}, {swt_dn1, wt_abs_dn1[7:0]}, and {swt_up1, wt_abs_up1[7:0]}, respectively. Since the activation broadcast unit 141 has complete information about the shift applied to each activation lane and about the activation buffer offset associated with each IFM value being broadcast (to the activation lanes), ABU 141 controls multiplexer 143 using the signal sel_mult_wt[1:0].
For example, in Fig. 1C, multiplexer 143 selects {swt_self, wt_abs_self[7:0]}, carrying the weight w[0,0,a] to be multiplied by IFM value a0. In Fig. 1D, multiplexer 143 selects {swt_self, wt_abs_self[7:0]}, carrying the weight w[1,0,c] to be multiplied by IFM value c1. In the embodiment shown in Fig. 1G, the multiplexer 143 of the second-from-the-top multiplier 126 in column zero selects {swt_dn1, wt_abs_dn1[7:0]}, carrying the weight w[0,0,h] to be multiplied with the IFM value h0.
Note that, as shown in Fig. 1K, each register file 127 has a bit width of nine, with eight bits holding the weight magnitude and one bit holding the weight sign, the weights being stored in the sign-and-8-bit-magnitude format (with the zero-point constant Z pre-added when applicable). The bit width of the register file 127 may be reduced to eight bits by adding logic that converts the signed int8 type to a sign-and-8-bit-magnitude representation (including the zero-point addition, where applicable) on the fly as the weights are retrieved from the register file 127. Such an on-the-fly conversion may be of interest when the size of the register file 127 has been selected to be large enough that the resulting area savings are worthwhile.
The activation broadcast unit 141 broadcasts the activation {sact, act_abs[7:0]} to be used as an input to the multiplier 126. The logic gates 145 and 159 use the signals wt_zero and act_zero (auxiliary signals from the ABU) to check for a zero-multiplication situation, in which the weight (to be multiplied) is equal to zero, or the activation (to be multiplied) is equal to zero, or both. If a zero-multiplication situation occurs, the resulting signal mult_by_zero is asserted, causing the clock of the weight and activation multiplier input registers to be gated by the mult_in_ce signal. Gating the clock of the multiplier input registers holds (freezes) their previous state, so that the multiplier inputs and the multiplier internal signals do not toggle, preventing switching activity and thereby reducing dynamic power. In parallel with this activity, flip-flop 157 delays the mult_in_ce signal corresponding to the multiplication by zero by one cycle to generate the mult_out_zero signal, which causes logic gate 155 to zero the multiplier output mult_result[15:0]. As discussed later, the ABU 141 also sends a signal en_mult to idle all multipliers 126 whenever the computation in the entire block is to be stopped.
The signal names in Fig. 1K follow the convention in which "act" represents activation, "wt" represents weight, the "s" in "sact", "swt", "mult_out_s", "s_in_a", etc. represents "sign", and "abs" in "wt_abs", "act_abs", etc. represents the absolute value (magnitude).
ABU 141 broadcasts the activations {sact, act_abs[7:0]} in sign-and-8-bit-magnitude format. Similarly, the selected weight {mult_swt, mult_wt_abs[7:0]} (to be used in the multiplication) is also supplied in sign-and-8-bit-magnitude format. Registers 136a and 136b latch, respectively, the activation and the weight to be multiplied, creating the input signals {s_in_a, mult_in_a[7:0]} and {s_in_b, mult_in_b[7:0]} for multiplier 126. In some embodiments, multiplier 126 computes the product by multiplying the two absolute 8-bit values and XORing the two sign bits, producing a sign-and-16-bit-magnitude output {mult_out_s, mult_out_abs[15:0]}. Logic 153 converts the sign-and-16-bit-magnitude result into the 16-bit signed output mult_out[15:0] to be input to the adder tree, by negating the product magnitude mult_out_abs[15:0] when the product sign is asserted (i.e., when the product is negative). Finally, as previously described, logic 155 zeroes mult_out[15:0] in the case of a multiplication by zero.
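A behavioral sketch of the multiply path just described (sign-and-magnitude operands, XOR of the signs, and forcing the output to zero for a zero multiplication), offered as an illustration with assumed names rather than as the multiplier 126 itself:

```python
def mu_multiply(s_a, abs_a, s_b, abs_b):
    """Multiplier-unit behavior: operands arrive as (sign, 8-bit magnitude);
    the product magnitude is abs_a * abs_b, the product sign is the XOR of
    the operand signs, and the result is returned as a signed integer
    (the hardware converts it to 16-bit two's complement for the adder
    tree).  A zero operand forces the output to zero, mirroring the
    zero-multiplication gating described above."""
    if abs_a == 0 or abs_b == 0:
        return 0                       # zero-multiply gating
    mult_out_abs = (abs_a & 0xFF) * (abs_b & 0xFF)
    mult_out_s = s_a ^ s_b             # XOR of the two sign bits
    return -mult_out_abs if mult_out_s else mult_out_abs
```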
To summarize the role of the ABU 141 in multiplication control, the ABU 141 provides: the input IFM data in sign-and-8-bit-magnitude format, the weight select controls (including the shift-up and shift-down paths), and the auxiliary signal act_zero indicating that the activation currently being broadcast is equal to zero. When the act_zero signal is asserted, the actual value of {sact, act_abs[7:0]} may remain unchanged to reduce switching activity on the activation lanes. Although cases in which a zero-valued activation is broadcast may occur, some embodiments may minimize such occurrences.
Fig. 1B-1H depict calculations that support sparse activation by taking non-zero value activations from IFM buffer 124 within ABU 141 whenever possible and multiplexing the associated weights to multiplier 126 to obtain the correct dot product. IFM buffer 124 retrieves IFM values from cache 139 and stages the retrieved IFM values in an active staging FIFO (or IFM staging FIFO)165 (see FIG. 1L and FIG. 1 MA). Subsequently, the plurality of activation multiplexers 166 are used to fetch non-zero activations (when possible) from the IFM staging FIFO 165 so that activations can be "shifted" up or down from adjacent lanes, as well as out of order fetches.
In Figs. 1MA and 1MB (discussed below), the "look-ahead" distance (h) is the search distance along the same lane, the "look-aside" distance (d) is the sideways search distance into neighboring lanes, and the FIFO depth (F) represents the depth of the activation FIFO 165. For clarity of terminology, the plurality of activation multiplexers 166 accepts the IFM lanes from the IFM staging FIFO 165 as inputs, applies look-ahead and look-aside to obtain activations, and outputs the resulting values onto the activation "lanes" (not channels). The use of the term "lane" helps to distinguish the logical indexing of the depth "channels" within a tensor from the activations flowing along the physical hardware "lanes".
The registers 161 within the IFM staging FIFO 165 may be optional and are shown for clarity of explanation. In some cases, area and power may be reduced by eliminating the staging FIFO registers 161, connecting the IFM multiplexers 163 directly to the multi-port cache output, and modifying the IFM cache read logic to fetch IFM values directly from the cache 139 to the multiplexers 163 in the correct order.
Fig. 1MA depicts a configuration of the multiplexers 163, which may be used to select an activation from the activation staging FIFO registers 161, from among any of several possible values stored in the activation FIFO 165 (including values in the same lane and values in other lanes), to be broadcast (via type converter 135) to the MR array 122 and input to the multipliers 126 in any of the plurality of lanes of a tile (e.g., 16 lanes in total in a tile). For the more general case, each cell may feed 2 × d multiplexers, and each target may have an equal number of sources (2 × h × d), except that lane 1 and lane 16, being at the ends, have h × (d + 1) sources.
Let the output cache size (C) be defined as the size of the output cache residing in the accumulate-and-return unit (ARU) 167 (Fig. 1N) of each MR column. Let the input bandwidth (I) be defined as the IFM streaming bandwidth (the number of 16-byte-long IFM vectors per clock cycle), and let the output bandwidth (O) be defined as the OFM transfer structure bandwidth (the number of 8-byte-long OFM vector results per clock cycle). Furthermore, the original sparsity (sr%) can be defined as the sparsity observed by counting the zero elements in the activation tensor (as a proportion of the total number of activations in the activation tensor). The actual sparsity (sa%) can be defined as the actual proportion of zero elements (relative to the total number of activations in the activation tensor) applied during the two-dimensional convolution (conv2d) over the activation tensor, which takes into account the convolution stride (e.g., the stride may cause a particular zero-valued activation not to be used, or to be included multiple times) and the convolution padding. The multiplier utilization (UM) can be defined as the percentage of cycles during which the multipliers perform an effective multiplication (a multiplication by a non-zero activation). For a 1 × 1 convolution, for example, if the activation tensor has original sparsity sr%, then with a simple, naive approach (i.e., a "dense" computation mode without zero skipping) the multiplier utilization will be (1 - sr%); for a non-1 × 1 convolution computed in the simple, naive (dense) manner, the multiplier utilization is (1 - sa%).
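As an illustration of the definitions above only (not code from the disclosure; function names are illustrative), the original sparsity and the dense-mode multiplier utilization can be computed as follows.

```python
import numpy as np

def original_sparsity(activations):
    """Original sparsity sr: the fraction of zero elements in the
    activation tensor."""
    a = np.asarray(activations)
    return np.count_nonzero(a == 0) / a.size

def dense_utilization_1x1(activations):
    """Multiplier utilization UM of the naive ('dense') mode for a 1 x 1
    convolution: every zero activation still occupies a multiplier cycle,
    so UM = 1 - sr.  For other kernel sizes the actual sparsity sa, which
    accounts for stride and padding, would be used instead."""
    return 1.0 - original_sparsity(activations)
```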
Fig. 1MB depicts: (i) on the left of Fig. 1MB, an enlarged view of four rows of the circuit of Fig. 1MA in a first configuration (the configuration shown in Fig. 1MA); (ii) in the center of Fig. 1MB, an enlarged view of four rows of the circuit of Fig. 1MA in a second configuration; and (iii) on the right of Fig. 1MB, an enlarged view of four rows of the circuit of Fig. 1MA in a third configuration. In the first configuration, the look-aside multiplexer inputs come from the rows above and below, and no look-ahead multiplexer input comes from the same row. The first configuration typically has fewer wires than the other two configurations and extends the search for non-zero activation values to other channels (i.e., to neighboring rows), which may be advantageous when one channel tends to carry consecutive zeros. Furthermore, if an acceleration by a factor of two is targeted, two candidate positions may be sufficient, and channels 1 and 16 have the same number of candidates in an h = 2, d = 1 configuration. The second configuration may be referred to as a "full multiplexing scheme". In this configuration, the look-aside multiplexer inputs come from the channels above and below, and the look-ahead inputs come from the same channel at the next depth. In the third configuration, the look-aside multiplexer inputs are not used, and the look-ahead multiplexer inputs come only from the same channel (i.e., look-aside d = 0). The third configuration has relatively low complexity (i.e., fewer than half of the multiplexers and wires are required) and may allow weight skipping to be supported more simply, at the expense of somewhat reduced multiplier utilization.
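As an illustration of how the look-ahead distance h and look-aside distance d determine the number of candidate source positions per lane, one simple rule can be sketched as follows; the rule itself is an assumption for illustration, and the actual set of wired candidates depends on which configuration of Fig. 1MB is chosen.

```python
def candidate_sources(lane, h, d, num_lanes=16):
    """Enumerate (depth, source_lane) positions in the IFM staging FIFO from
    which the activation multiplexers for `lane` could pull a value, under
    the simple rule 'any lane within look-aside distance d, at any of the
    first h (look-ahead) FIFO depths'."""
    return [(depth, src)
            for depth in range(h)
            for src in range(max(0, lane - d), min(num_lanes, lane + d + 1))]
```

Under this illustrative rule, a lane at either end has h × (d + 1) candidate positions, matching the count given above for lanes 1 and 16, while an interior lane has more.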
Fig. 1N depicts a top-level diagram of the block 102, including the MR array 122, which comprises a grid of MUs 103 organized into eight MR columns 133 and 16 rows. Each MU element is labeled with a subscript (MUrow,col) corresponding to the row and column coordinates of that MU within the MR array 122. The weight decompression unit 138 may receive compressed weights from the SRAM bank group 109 local to the block and decompress the weights while writing them into the weight registers 127. The weights may be compressed to take advantage of sparsity in the weights, thereby reducing the memory used to store the weights and the bus bandwidth used to send the weights to the multiplier units. Alternatively, the weights may be stored uncompressed in the SRAM bank group 109. IFM cache 139 may be used to cache IFM data to reduce the bottleneck effect of the IFM transfer structure 104, as described in the context of Figs. 1D to 1H, and ABU 141 may be used to implement skipping of zero-valued activations (or "activation skipping").
FIG. 1O depicts the hierarchy of neural processor control. The neural processor 100 may have a state machine, or "control finite state machine" (control FSM), or "control logic", that may control the various elements depicted in Fig. 1A. The control hierarchy may have two levels, a "global" level and a "local" level. In operation, the global control (GC) FSM 140 schedules the operation of the local control state machines 142 and 144, including starting the weight loading phase and starting and controlling the computation phase. Since the blocks 102 support skipping of zero-valued activations, the output rate of the blocks 102 may vary slightly depending on the actual sparsity of the IFM slices received by each block 102. Thus, the computations in the blocks 102 may run a few clocks ahead of or behind one another. Accordingly, global control logic 140 coordinates the operation of the local block control logic 144 to bring the outputs from the multiple blocks 102 back into synchronization, so that the reduction can be completed using the reduction structure 111 and the final OFM results can be sent to the SRAM bank groups 109 via the OFM transfer structure 106. Synchronization of the outputs of the multiple blocks 102 may be achieved, for example, using a small output FIFO 198 (also 179) (Fig. 1X) within the ARU 167; in the extreme case in which a block's output FIFO 198 becomes full, synchronization is achieved by throttling (stalling) the block 102 whose output FIFO is full, to allow the other blocks to catch up.
Each of the plurality of SRAM control (SC) FSMs 142 may generate SRAM addresses and read/write signals for each SRAM bank within its SRAM bank group 109. Each of the plurality of tile control (TC) FSMs 144 may skip an activation when the activation has a value of zero. To prepare for an operation, a host CPU (not shown) loads the start address and size (height, width, depth, batch size) of each IFM and OFM tensor into the SRAM control FSMs 142; loads the operation type (i.e., Fully Connected (FC) or convolution) and the IFM, OFM, and weight data types into the global control FSM 140; loads the IFM and OFM weight loop configuration, the order of IFM traversal, the number of IFM passes (explained later), and other choices of compute map settings, activation function, and pooling (if any); enables or disables partial result generation; loads the weight tensor size (height, width, number of input and output depth channels); loads the zigzag Z height (discussed below); and loads the options for convolution padding and convolution stride into the configuration registers of the FSMs. The host CPU also writes registers associated with the IFM transfer structure 104, the OFM transfer structure 106, and the reduction structure (RF) 111 to configure their connectivity according to the operating parameters, including the addresses of the IFM and OFM tensors within each SRAM bank group 109. To begin the operation, the host CPU writes to registers in the global control FSM 140. The global control FSM 140 then signals the SRAM control FSMs 142 and the tile control FSMs 144 to start.
In some embodiments, global control FSM 140 controls the scan within the convolution window, translates the convolution window, and traverses the IFM tensor to generate a stream of IFM slices. Global control FSM 140 sends to the SRAM control FSMs 142 the planar pixel (x, y) coordinates, the depth channel index d of the IFM slice, and a read signal. Each of the SRAM control FSMs 142 adds the start address, fetches the appropriate IFM data, and outputs the data to the IFM transfer structure 104. Typically, the IFM (and OFM) tensor size is too large to fit into a single SRAM bank group 109, so the IFM (and OFM) tensors are subdivided into portions stored across multiple SRAM bank groups 109. During computation, global control FSM 140 orchestrates the traversal of the IFM and, correspondingly, the OFM tensors (fetching or storing them in a particular sequence), while also generating, on the fly, the reconfiguration of the IFM transfer structure 104 and the OFM transfer structure 106 needed to fetch IFM data from, and write OFM data to, the correct SRAM bank groups 109.
All of the block caches 139 may receive data substantially simultaneously. Global control FSM 140 calculates and provides to all tile control FSMs 144: (i) the address of the IFM cache 139 register file that will hold each incoming data element, and (ii) a write enable signal for writing data from IFM transfer structure 104 into cache 139. The write enable signal is active when an IFM slice is arriving from the SRAM bank group 109 through the IFM transfer structure 104, and is inactive when the IFM slice has already been cached. As global control FSM 140 traverses the IFM layer (tensor) in a particular order, it also keeps track of which IFM slices needed for the computation have been cached, and signals SRAM control FSMs 142 when to read data not already present in IFM cache 139. If the data is already cached in the block cache 139, global control FSM 140 keeps the read signal inactive so that SRAM control FSM 142 skips the SRAM read. To simplify management of the IFM cache, each IFM slice from the IFM transfer structure is written into the respective IFM caches 139 of all associated target blocks (specified by the mapping, as discussed later) at the same cache address, regardless of which target block it is. Since block computations run at slightly different rates due to uneven activation sparsity, the control logic of each block manages its IFM cache 139 reads locally, independently of other blocks.
In some embodiments, the process of writing the OFM results is similar to that of reading the IFM values. However, due to activation skipping, the computation delay may vary. Each tile control FSM 144 has information indicating when all columns in its tile have completed their calculations. The per-tile control FSM 144 sends an OFM_ready signal to global control FSM 140, which instructs SRAM control FSM 142 to write the resulting OFM slice arriving from OFM transfer structure 106 to the SRAM banks at the appropriate (x, y, d) index in the OFM tensor. During the OFM tensor traversal, global control FSM 140 generates the OFM (x, y, d) slice coordinates in a manner similar to how it generates the IFM (x, y, d) slice coordinates during the IFM tensor traversal. Once the computation is complete, global control FSM 140 sends an interrupt to the host CPU.
As previously described, a block 102 may produce, for example, up to two output results per clock due to activation skipping. Therefore, the IFM transfer structure 104 should be able to supply up to two IFM slices per clock to avoid a reduction in multiplier utilization. Accordingly, the local tile control FSM 144 may inform global control FSM 140 of the amount of data remaining pending in the cache, so that global control FSM 140 can direct SRAM control logic 142 to resume fetching IFM data in time to avoid IFM cache underflow. When any of the block IFM caches 139 becomes full, global control FSM 140 instructs SRAM control FSM 142 to halt the IFM tensor traversal, including the reading of IFM slices from SRAM 109 and the writing of IFM slices to block caches 139.
Referring to FIG. 1P, in some embodiments, IFM cache 139 includes sixteen ways 170. Each way contains a register file 169 having dual input ports and dual output ports. Dual ports may be used because, thanks to activation skipping (and the two adder trees per MU column), block 102 is able to process up to two activations per clock when there are enough zero-valued activations. For faster processing of activations (e.g., three IFM slices per clock), three input ports, three output ports, triple the IFM transfer structure bandwidth, triple the OFM transfer structure bandwidth, and three adder trees per MU column may be used.
Activations are input from SRAM 109, at up to double rate, through IFM transfer structure 104. Block control FSM 144 tracks the amount of IFM data remaining to be processed in each cache way 146. When any of the cache ways is about to become full, tile control FSM 144 may notify global control FSM 140 that at least one cache way is about to become full, and global control FSM 140 may throttle (stop) the IFM reads controlled by SRAM control FSM 142 to avoid overflowing one or more block cache ways until cache space is freed.
Global control FSM 140 may also inform block control FSM 144 when the convolution window scan is complete (and the window is translated to the next location) and when the IFM loop is complete, so that the blocks may properly reset the column accumulators and not mix the convolution at one location with the convolution at the next location. The concept of an IFM loop is defined and discussed later.
Block control FSM 144 generates the signals required to read IFM data from each cache way register file 169, including the read address and read enable for each register file output port. At each clock cycle, unless block 102 has finished processing and is waiting for other blocks to finish, block control FSM 144 reads one or two data values (from one or two cache ports, respectively) so that the results are available for reduction by reduction structure 111. Whether one or two bytes are read per clock depends on the activation sparsity: IFM buffer 124 within ABU 141 checks whether the activations are sparse and informs block control FSM 144 accordingly, so that block control FSM 144 loads one byte if ABU IFM staging FIFO 165 frees one slot and two bytes if ABU IFM staging FIFO 165 frees two slots.
The table in fig. 1Q depicts the cache size sufficient to hold all IFM slices, so as to avoid repeated reads from SRAM 109, while performing convolution operations with convolution window sizes 1 x 1, 2 x 2, 3 x 3, and 4 x 4 as the convolution window slides from one (x, y) location to the next on a planar basis. The data in the table assumes that the register file 134 of the multiplier unit 103 contains 18 weight registers and that the convolution window scans the input tensor in a "zigzag" sequence (as discussed below): since a single read from SRAM 109 typically consumes significantly more power than a single read from the local register file 169, the "zigzag" scan sequence may be used to maximize the use of IFM cache 139, thereby minimizing reads from SRAM 109 and the associated power consumption.
For example, where the zigzag scan value or parameter Z (discussed further below) is set to 2 and the MU 103 holds 18 weights (sufficient to hold two 3 x 3 8-bit convolution kernels or one 3 x 3 16-bit convolution kernel), the register file 169 should have a size of 20 bytes.
In a neural network, between 50% and 90% of the multiplications may have at least one multiplicand (activation and/or weight) equal to zero. This may be the case, for example, for the Inception v3 neural network after weight pruning has been applied. If the MR block 102 can efficiently skip occurrences of multiplication by zero, the MR block 102 may be able to process the data about five times faster, i.e., in 100% - 80% = 20% of the time it would take without zero skipping. As previously described, in some embodiments, an MR implementation may be configured with more than two multiplicand inputs so that the cache can transfer data (to be multiplied or skipped) fast enough. In some of the block diagrams herein, only double the input bandwidth (and an activation buffer 124 depth of only two) is depicted for simplicity and clarity of explanation. However, it will be appreciated that the depth of the IFM activation buffer 124 may be greater than 2, and for sufficiently sparse data the corresponding speed increase (relative to a configuration that does not skip multiplications by zero) may be greater than a factor of 2.
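As an illustration of the relationship described above, the following Python sketch estimates the idealized throughput gain from zero skipping as a function of the fraction of zero multiplicands, capped by the activation buffer depth (and the number of adder trees per MU column); it is a simplified model that ignores lane imbalance and other practical effects, and the function name is hypothetical.

```python
def zero_skip_speedup_bound(zero_fraction: float, buffer_depth: int) -> float:
    """Illustrative upper bound on the throughput gain from skipping zero
    multiplicands.

    zero_fraction: fraction of multiplications with at least one zero operand.
    buffer_depth:  depth of the IFM activation buffer (and number of adder
                   trees per MU column); a depth-2 design can retire at most
                   two activations per lane per clock.
    """
    if not 0.0 <= zero_fraction < 1.0:
        raise ValueError("zero_fraction must be in [0, 1)")
    unlimited = 1.0 / (1.0 - zero_fraction)   # e.g. 80% zeros -> about 5x
    return min(unlimited, float(buffer_depth))

# 80% zeros: about 5x in the limit, but capped at 2x by a depth-2 activation buffer.
print(zero_skip_speedup_bound(0.80, 2))   # -> 2.0
print(zero_skip_speedup_bound(0.80, 8))   # -> about 5.0 (ignoring lane imbalance)
```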
As described in the context of fig. 1B-1H and in the following paragraphs, data sparsity may be used to achieve significant improvements in processing throughput through appropriate operation of IFM cache 139 and ABU 141. Fig. 1R depicts a 3 x 3 convolution window at a starting position within the IFM tensor (stored in SRAM 109) to initiate the input layer convolution. To begin the layer convolution operation, nine IFM slices a0[0..15] through i0[0..15] are read from SRAM 109, transmitted to the target blocks 102 through the IFM transfer structure 104, and written into the IFM cache 139 of each target block 102. Fig. 1S depicts another example of such data, where several elements are zero.
Fig. 1T depicts how the data may be logically stored in IFM cache 139 just before the start of the layer convolution operation, where the values are ordered in their arrival sequence (from SRAM) and are not necessarily shown arranged according to their actual memory addresses. Although the cache may store more activation values to accommodate the motion of the convolution window, in this example a 3 x 3 convolution is performed and the figure depicts nine (3 x 3 = 9) 8-bit activation values for clarity. Similarly, fig. 1U depicts the present example from fig. 1T explicitly with some activations having zero values.
FIG. 1V depicts a single lane 171 of activation broadcast unit 141 according to some embodiments. Each ABU lane 171 includes an IFM lane staging FIFO 173 (which may be implemented using a register file), a lane multiplexer 163, a lane control logic block 146, and an activation lane numeric type conversion circuit 148. Each ABU lane 171, together with the tile control FSM 144 and the other ABU lanes, can control activation skipping in that lane (i.e., skipping activation elements having a zero value).
The activation lane numeric type conversion circuit 148 may also convert activations from signed two's complement encoding into a sign-and-8-bit-magnitude format, to simplify the multiplier circuits that handle the various bit widths of signed and unsigned data, including uint8, int8, uint16, int16, uint24 (24-bit unsigned integer), int24 (24-bit signed integer), uint32 (32-bit unsigned integer), int32 (32-bit signed integer), and the like. Each ABU lane 171 may also broadcast an activation to the multipliers 126 of the associated row within the MR columns 133, as part of the set of signals of activation lane 137.
The lane IFM staging FIFO 173 has two input ports and two output ports and may be two values deep. The two input ports are used to bring in activations from IFM cache 139 at a rate of up to two activations (bytes) per clock cycle. In this way, when there are enough zero-valued activations, up to two activations may be processed per clock cycle as a result of having two adder trees per MU column, a way cache with two input ports and two output ports, and a staging buffer 173 of depth two. In some embodiments, if the IFM data is expected to be sparse enough to justify processing a larger number of activations per clock (e.g., three activations per clock), the activations may be handled by circuitry having three adder trees per MU column, three way-cache input/output ports, three staging FIFO input ports, and a staging FIFO depth of three (where "staging FIFO" in this context means IFM lane staging FIFO 173).
Lane control logic 146 may broadcast a set of control signals, as part of the set of signals of activation lane 137, to the associated row of multipliers 126 to inform the multipliers 126 whether the activation is zero. If the activation is zero, the control signals indicate which non-zero activation is being multiplexed in to replace the zero (including from which lane it comes and how deep it sits in the staging FIFO, i.e., its offset in the staging FIFO), so that each multiplier 126 can select the correct weight and adder tree for the multiplication. Similarly, lane control logic 146 also controls multiplexer 163 to multiplex the activation at the correct depth offset, from the staging FIFO 173 of the correct (possibly adjacent) IFM lane, onto activation lane 137.
FIG. 1V depicts an IFM lane staging FIFO 173 having four output logical connections, sufficient to provide either of the two buffered activations to the upper adjacent lane, either of the two buffered activations to the lower adjacent lane, and both buffered activations to the lane activation multiplexer 163. Although fig. 1V depicts the staging FIFO 173 as having four output logical connections, the FIFO 173 has only two physical output ports, since the FIFO 173 is only two values deep in the depicted embodiment and therefore holds only two values available for simultaneous output.
Fig. 1WA depicts the contents of the IFM staging FIFO 165, shown with four separate IFM lane staging FIFOs 173 (rather than 16, for clarity of illustration), after the first two vectors of the IFM have been read in (as also depicted in fig. 1C). In this state, the FIFO can check which activation values are zero and which are not. In some embodiments, each FIFO register has a zero detector (e.g., 8-input NOR logic). Each lane staging FIFO 173 reports which activations are zero to the corresponding lane control logic 146, and the lane control logic 146 tracks which activations in the lane have already been used (e.g., borrowed, which results in the creation of a "hole" as depicted in fig. 1D). The control logic 146 of each lane sends this information about lane staging FIFO occupancy (including which activations are zero) to the tile control FSM 144. Activations a0, a1, a2, and a3 undergo numeric format conversion (if the activations are signed, such as int8 or int16), are subdivided into 8-bit values (if the activation bit width exceeds 8, e.g., uint16, int16, uint24, int24, uint32, int32, etc.), and are broadcast to the corresponding rows of multiplier units 126.
In the next clock cycle, the IFM staging FIFO 165 may contain the values indicated in FIG. 1WB (and FIG. 1D). At this point, activations a0 through a3 have been processed, and b0, b2, and b3 are being broadcast to the respective rows of multiplier units 126. Since b1 is 0, the lane of b1 is not used. The control logic 146 of each lane sends this information (which activations are zeros or "holes") to the block control FSM 144. The block control FSM 144 then makes decisions regarding: (i) which data to multiplex out (in fig. 1WB and 1D, b0 to lane 0, c1 to lane 1, b2 to lane 2, b3 to lane 3, etc.) and (ii) whether, using the inputs from the control logic 146 of each lane, an entire FIFO column contains only holes and/or zeros and can therefore be skipped. When the latter occurs, block control FSM 144 causes (i) the cache to fetch two values (instead of one) and (ii) the FIFO to accept the two values (instead of one), skipping the all-hole and/or all-zero FIFO column entirely. In addition, if the values in the IFM lane staging FIFO 173 associated with a particular lane (as opposed to the entire column) include zeros and/or holes, the lane control logic also causes the cache to fetch two values.
For example, lane 1 (output c1) has 6 output options: c0, c1, c2 (zero), and b0, b1 (also zero), and b2. The multiplexer 163 outputs one of these 6 selections; which selection is output is determined by block control FSM 144. To enable this, the multiplexer 163 may be configured to be able to retrieve data from both FIFO columns of the lane above, from both FIFO columns of the lane below, and from both FIFO columns in the same lane as the multiplexer 163. Such capability may be implemented using, for example, circuitry similar to that depicted in fig. 1MA and 1MB. As previously mentioned in the description of those figures, the ability to retrieve (and multiplex) data from one lane above and below may be referred to as a "look-aside of 1", and the ability to retrieve (and multiplex) data from up to the second FIFO column from the right may be referred to as a "look-ahead of 2". Each IFM staging FIFO 165 column and lane combination may have a separate look-ahead value and/or look-aside value associated with it; however, for clarity and simplicity, it may be assumed that all columns and lanes in the IFM staging FIFO 165 have the same associated look-aside value and the same look-ahead value. Further, other variations not covered by the look-ahead and look-aside concepts may be employed, depending on how many inputs each multiplexer 163 has and where those inputs are connected, including, for example, disabling the sending of inputs from the staging FIFO onto the same activation lane, and connecting lane 0 and lane 15 in a more flexible manner to compensate for lane 0 and lane 15 each lacking one of the two adjacent lanes.
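The look-aside and look-ahead connectivity described above can be illustrated with a small Python sketch that enumerates which (lane, staging-FIFO column) positions a lane multiplexer may draw from. It is a simplified model (the function name is hypothetical); it reproduces the 6-input example above and the 10-input configuration of fig. 1WC, while exact input counts in other configurations may also depend on additional connectivity choices such as disabling same-lane inputs, as noted in the text.

```python
def candidate_sources(lane: int, num_lanes: int, look_aside: int, look_ahead: int):
    """Enumerate (source_lane, fifo_column) pairs a lane's multiplexer 163 may
    select from, for a given look-aside (lanes above/below) and look-ahead
    (staging FIFO columns, 1 = rightmost/oldest). Edge lanes simply have fewer
    neighbors; more flexible edge-lane wiring is not modeled here."""
    sources = []
    for offset in range(-look_aside, look_aside + 1):
        src_lane = lane + offset
        if 0 <= src_lane < num_lanes:
            for col in range(1, look_ahead + 1):
                sources.append((src_lane, col))
    return sources

# Look-aside 1, look-ahead 2: an interior lane can pull from 3 lanes x 2 columns = 6
# places, matching the 6 multiplexer options described for lane 1 above.
print(len(candidate_sources(lane=1, num_lanes=16, look_aside=1, look_ahead=2)))  # -> 6
# Look-aside 2, look-ahead 2: 5 lanes x 2 columns = 10 inputs (cf. FIG. 1WC).
print(len(candidate_sources(lane=5, num_lanes=16, look_aside=2, look_ahead=2)))  # -> 10
```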
The look-aside and/or look-ahead values may be greater than 2. Larger values may lead to better performance by skipping zero activations more optimally, resulting in a further reduction in block computation time. This benefit arises because, when the look-aside and/or look-ahead values are large, each lane has more options as to where to retrieve non-zero activations. More choices of non-zero activations help spread the non-zero activations more evenly across all lanes, so that each lane ends up with approximately the same number of non-zero activations; otherwise, with some lanes carrying more activations and some fewer, block processing would have to wait until the lane with the most activations completes its computation. As previously described, the spreading of non-zero activations may be further improved by pseudo-randomly shuffling the activation lanes and associated weights, as described in the separate related publication (attorney docket No. 1535-467CON2).
Fig. 1WC depicts a configuration with a look-ahead of 2, a look-aside of 2, and a multiplexer 163 having 10 inputs for each FIFO column. In such an embodiment, the FIFO may be two values deep and accordingly may have two output ports.
Fig. 1WD depicts a configuration with a look-ahead of 3, a look-aside of 1, and a multiplexer 163 having 9 inputs. In such an embodiment, the FIFO may be three values deep and may have three output ports.
Fig. 1WE depicts a configuration in which both the look-ahead and the look-aside are 3 and multiplexer 163 has 15 inputs. In such an embodiment, the FIFO may be three values deep and may have three output ports.
Activation broadcast unit 141 and block control FSM 144 may similarly be involved in the operations depicted in fig. 1E-1G. For example, FIG. 1E depicts the creation of a "hole" (in the lane in which c1 was originally located), tracked by lane control logic 146, after c1 has been borrowed (multiplexed from the second-from-rightmost column) in a previous clock cycle. Each lane's control logic 146 informs the block control FSM 144 which data elements in the IFM staging FIFO 165 are zero or empty, so that the block control FSM 144 can control the activation multiplexers 163 appropriately. Block control FSM 144 determines the multiplexer controls so as to distribute activations in a way that increases or optimizes throughput. The best throughput is achieved when all lanes have the same number of non-zero activations. In the opposite, unbalanced case, in which some lanes have many non-zero activations while other lanes (in the same block) have mostly zeros, the zero-rich lanes finish their computations (i.e., output all of their non-zero activations) sooner than the lanes with many non-zero activations, which may delay the end of the block's computation and reduce multiplier utilization in the zero-rich lanes.
As another example, in the state depicted in fig. 1G, lane control logic 146 also receives the multiplexer select signals from block control FSM 144 in order to track (i) where holes are created and (ii) from where activations are multiplexed. The lane control logic 146 then broadcasts this information to the multiplier units 126 of the associated row so that, when an activation has been multiplexed out of order (where "in order" in FIG. 1G means, for example, that g0 from the activation buffer is output onto the activation lane labeled g0), each multiplier unit 126 in that row can multiply the out-of-order activation by its corresponding weight.
For example, as depicted, if an activation is multiplexed from one lane up and from the second-from-rightmost staging FIFO column, the corresponding weight by which to multiply that activation is the one held (in each column) in the multiplier unit of the lane above.
When the look-ahead is greater than 2 (e.g., 3) and an activation is retrieved from the third column from the right, the corresponding weight index is advanced by 3 - 1 = 2, which means that if the in-order activation would have been multiplied by the weight w[row, col, i], the appropriate weight to multiply becomes w[row, col, i+2].
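The weight selection rule described in the preceding paragraphs may be illustrated by the following Python sketch; the helper name is hypothetical, and the sketch models only the index arithmetic, not the actual multiplexing hardware.

```python
def weight_index_for_lookahead(in_order_index: int, fifo_column: int) -> int:
    """Hypothetical helper: weight index to use when the broadcast activation
    was borrowed from staging-FIFO column 'fifo_column' (column 1 being the
    rightmost, in-order column). Borrowing from column k advances the weight
    index by k - 1; the weight itself belongs to the activation's source lane
    (row), as described for the look-aside case above."""
    return in_order_index + (fifo_column - 1)

# Look-ahead of 3: an activation pulled from the third column means that the
# weight w[row, col, i] is replaced by w[row, col, i + 2].
assert weight_index_for_lookahead(in_order_index=0, fifo_column=3) == 2
```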
Fig. 1H depicts a situation (advantageous from a throughput perspective) in which the activations are multiplexed out of order such that an entire FIFO column (all 16 lanes) becomes free (contains only zeros or holes). The block control FSM 144 detects this condition and instructs the IFM cache 139 to load two values into the FIFO, because two FIFO columns are consumed simultaneously: the rightmost all-zero column is skipped (discarded), while the second-from-rightmost column is broadcast and used up for computation. This reduces the computation delay in the block by one clock cycle.
Fig. 1X depicts an Accumulation and Return Unit (ARU) 167. The role of the ARU 167 is to complete the dot product computation and apply the activation function (when applicable) to produce a completed output feature map (OFM) that is ready to be transferred back to SRAM via the OFM transfer structure for storage. As shown in FIG. 1N, each MR column 133 contains two ARUs 167, one for each of adder trees 128A and 128B.
ARU 167 has two inputs, one from the local adder tree 128A (or 128B) and one from reduction structure 111. As explained later, the core of each ARU 167 is an adder 181 and an accumulator register 130A, which can complete the dot product calculation by accumulation over time. To complete the OFM computation, the fully reduced dot product may (optionally) be truncated (via rounding) using unit 187, scaled by factor 191 using multiplier 189, summed with the OFM bias term 195 using adder 193, and passed through activation function 197. The activation function 197 may be a module that supports one or more activation functions, such as the rectified linear unit (ReLU), the sigmoid function, the hyperbolic tangent, and so on. If (for reasons explained later) the dot product reduction cannot be completed, the partial dot product (a "partial result") from accumulator 130A (130B) can bypass the scaling, bias, and activation functions on its way to OFM transfer structure 106 via multiplexer 199 and output FIFO 198. Multiplexer 183, which bypasses adder 181, allows adder tree values to be loaded directly into accumulator 130A, e.g., to initiate an accumulation.
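The return path described above may be summarized, purely for illustration, by the following Python sketch of the post-processing applied to a fully reduced dot product (rounding, scaling, bias, activation) and of the bypass taken by partial results; the function and parameter names are hypothetical.

```python
import math

def aru_return_path(accumulated: float, scale: float, bias: float,
                    activation: str = "relu", partial: bool = False) -> float:
    """Illustrative model of the ARU 167 return path: a fully reduced dot
    product is rounded, scaled, biased, and passed through the activation
    function, while a partial result bypasses these steps and is returned
    as-is (to be written back to SRAM for later completion)."""
    if partial:
        return accumulated                    # bypass via multiplexer 199
    value = float(round(accumulated))         # truncation/rounding unit 187
    value = value * scale                     # multiplier 189 with scale factor 191
    value = value + bias                      # adder 193 with OFM bias term 195
    if activation == "relu":                  # activation module 197
        value = max(0.0, value)
    elif activation == "sigmoid":
        value = 1.0 / (1.0 + math.exp(-value))
    elif activation == "tanh":
        value = math.tanh(value)
    return value

print(aru_return_path(accumulated=-123.4, scale=0.05, bias=1.0))   # ReLU clamps to 0.0
```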
Multiplexer 174 may select the input source of ARU 167 for the "return" processing (scaling, bias, and activation function application, or the partial-result path, as applicable) between (i) the adder trees within the same (local) block in which the ARU 167 is located and (ii) reduction structure 111, which includes a configurable adder tree that combines the local ("intra-block") adder trees 128A and 128B into an even larger ("inter-block") adder tree capable of reducing the multiplier unit products from multiple blocks (e.g., from 32 or 64 or 128 or 256 multiplier units).
The block ARU 167 is controlled by the block control FSM 144, since the block control FSM keeps track of which lane and which adder tree in each MR column 133 is used to obtain each partial IFM reduction. The ARU 167 has two outputs, one connected to the OFM transfer structure 106 via FIFO 198 and on-the-fly pooling logic 196, and one connected to the reduction structure 111 via FIFO 179. The block control FSM 144 also tracks the state of output FIFOs 198 and 179. Because each block 102 performs its computations at a slightly different speed, due to the unpredictability of zero activation skipping, each output FIFO 198 and 179 serves to restore synchronization of the block outputs by delaying the outputs from blocks that end up running earlier (faster) than other blocks. Synchronizing the block outputs with FIFO 179 may be needed because the block outputs may undergo further reduction by the reduction structure 111, which may be regarded as an additional set of adder tree stages and thus may require its inputs (from the blocks) to arrive in parallel and synchronously. Similarly, synchronizing the block outputs with FIFO 198 may be needed in order to output all lanes of an OFM slice to the OFM transfer structure simultaneously. An output FIFO 198 and 179 size of four entries or fewer may be sufficient in many cases. In the event that an output FIFO 198 or 179 is about to overflow in one or more blocks, the block control FSM 144 may stall the computation until the output FIFO 198 or 179 empties. The output FIFO 198 or 179 may have two input ports in order to merge the results from the two adder tree (A and B) paths.
Finally, block control FSM 144 and SRAM control FSM 142 work together to read data from output FIFO 198, perform the reduction structure processing, transfer the results through OFM transfer structure 106, and store them in SRAM 109.
The activation numeric type converter 135 works with the accumulate and return unit 167 to support various bit widths of signed and unsigned input and output data types, including the ability to use one data type for activations and another data type for weights arbitrarily (hereinafter referred to as "mixed data types").
In some embodiments, the following data types may be used for the IFM data, the OFM data, and the weight data: int8, uint8, int16, uint16, int24, uint24, int32, and uint32. As described below, the IFM and weight data types can be freely mixed. For example, convolution or fully connected layer computations may be performed using uint8 activations and int8 weights, or int8 activations and int8 weights, or int16 activations and int8 weights, or int16 activations and int16 weights, and so on. The OFM data type, including uint8, int8, uint16, int16, uint24, int24, uint32, int32, etc., can also be selected at will by choosing a suitable combination of scaling, rounding, and activation function.
Activations may be prepared for operation as follows. The activations may be stored in SRAM 109 as int8 or uint8 or int16 or uint16, for example, as specified by the user. The IFM data may be fetched into the cache (i.e., into IFM cache 139) and then passed through activation broadcast unit 141, as shown in fig. 1L, which includes the activation numeric type converter 135. As a first step, if the activations are quantized using "zero-point" offset quantization, as used in Google TensorFlow, type converter 135 adds the "zero-point" offset to the activation. The numeric type converter 135 then prepares the activation by applying a suitable transformation (or "conversion"), which enables multiplications involving data types wider than 8 bits (e.g., 16-bit weights and/or 16-bit activations, signed or unsigned) to be performed using the 8-bit unsigned multipliers 126. As shown in fig. 1K, for each lane, the activation broadcast unit 141 broadcasts an 8-bit activation absolute value act_abs[7:0]. The transformation applied by the activation numeric type converter 135 converts int8/uint8 into "sign and 8-bit absolute value". If the input activation is uint8, type converter 135 sets the output broadcast 8-bit absolute value equal to the input uint8 value (i.e., applies no transformation) and sets the broadcast sign to zero (which indicates a non-negative value).
If the input activation data type is int8, the activation numeric type converter 135 sets the output absolute value to the absolute value of the activation, and sets the output sign to 1 if the activation is negative and to 0 otherwise.
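The int8/uint8 preparation described in the two preceding paragraphs may be illustrated by the following Python sketch; it is a minimal model of the sign-and-8-bit-absolute-value transformation, not a hardware description, and the function name is hypothetical.

```python
def to_sign_magnitude(value: int, dtype: str):
    """Convert an int8/uint8 activation (or weight) into the broadcast format:
    an 8-bit absolute value plus a 1-bit sign (0 = non-negative, 1 = negative)."""
    if dtype == "uint8":
        assert 0 <= value <= 255
        return value, 0                                 # no transform; sign is always 0
    elif dtype == "int8":
        assert -128 <= value <= 127
        return abs(value), 1 if value < 0 else 0        # absolute value plus sign bit
    raise ValueError("unsupported dtype")

print(to_sign_magnitude(-37, "int8"))    # -> (37, 1)
print(to_sign_magnitude(200, "uint8"))   # -> (200, 0)
```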
The weights may be prepared for operation as follows. The weights may be stored in SRAM 109 as int8 or uint8 or int16 or uint16, as specified by the user. When the weights are loaded into the MU registers, they are transformed in weight decompression unit 138 (using the same transformation that the activation numeric type converter applies to activations). The weights are stored as an 8-bit absolute value and a 1-bit sign. Referring to fig. 1K and 1N, when weights are loaded from SRAM 109 into the MU registers and input into multiplier unit 103 through the vertical weight load bus 101, values represented as int8 and uint8 are converted into an 8-bit absolute value wt_abs_ld_in[7:0] and a 1-bit sign swt_in.
The 8-bit multiplication may be performed as follows. The multiplier 126 may be an unsigned 8-bit by unsigned 8-bit multiplier. The multiplication operation takes as inputs the activation and the weight, both expressed as an 8-bit absolute value and a 1-bit sign. The multiplier 126 then multiplies the two 8-bit absolute values and XORs the two signs. If the product of the two 8-bit absolute values is zero, the output sign is set to zero. The output of multiplier 126 (a 16-bit absolute value together with its sign) is then converted to int17 and passed to adder tree 128A (or 128B). Adder tree 128A (or 128B) then reduces the signed int17 values received from the column's multiplier units and passes the signed sum to the ARU 167 associated with that adder tree.
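The multiplication rule described above may be illustrated by the following Python sketch, which models the unsigned 8x8 multiplication, the XOR of the signs, and the zero-sign rule; the function name is hypothetical.

```python
def mu_multiply(act_abs: int, act_sign: int, wt_abs: int, wt_sign: int) -> int:
    """Unsigned 8x8 multiply with sign handling as described above: multiply
    the two 8-bit absolute values, XOR the two signs, force the sign to 0 when
    the product is zero, and return the result as a signed integer (which
    always fits in int17, since 255 * 255 < 2**16)."""
    assert 0 <= act_abs <= 255 and 0 <= wt_abs <= 255
    product_abs = act_abs * wt_abs                        # 16-bit absolute value
    sign = (act_sign ^ wt_sign) if product_abs != 0 else 0
    return -product_abs if sign else product_abs

# Example: activation -37 (abs 37, sign 1) times weight +10 (abs 10, sign 0) = -370.
print(mu_multiply(37, 1, 10, 0))   # -> -370
```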
In some embodiments, 16-bit and 8-bit input data types may be mixed as follows. An 8-bit weight and an 8-bit activation may be multiplied in one cycle. In some embodiments, all possible combinations of 8-bit numeric data types are supported (i.e., uint8 activation with uint8 weight, uint8 activation with int8 weight, int8 activation with uint8 weight, and int8 activation with int8 weight). Two cycles may be used to determine or calculate (i) the product of a 16-bit weight and an 8-bit activation or (ii) the product of a 16-bit activation and an 8-bit weight. The product of a 16-bit activation and a 16-bit weight may be determined or calculated using four cycles. All possible combinations of 8-bit and 16-bit numeric data types may be supported (e.g., uint16 activation with int8 weight, int16 activation with int8 weight, uint16 activation with int16 weight, uint8 activation with int16 weight, int16 activation with int16 weight, etc.).
In some embodiments, 16-bit activations may be handled as follows. When the activation is uint16 or int16, type converter 135 may prepare the data by applying a transformation similar to the 8-bit transformation described above: values in uint16 or int16 format are transformed into a 16-bit absolute value and sign format. If 8-bit (uint8 or int8) weights are used, the first-cycle output of the activation broadcast unit 141 may be the least significant byte (LSB) of the 16-bit absolute value and the sign resulting from the transformation (for multiplication with the 8-bit weight), and the second-cycle output of the activation broadcast unit 141 may be the most significant byte (MSB) of the 16-bit absolute value and the sign resulting from the transformation (also for multiplication with the 8-bit weight). The two partial product results (both converted to signed int17) may then be sent, as usual via the column adder tree 128A or 128B, to the column's accumulate and return unit 167 and added together by accumulator 130A (or 130B), except that the most-significant-byte product is shifted up by 8 bits, using the sign extension up-shifter 175 (and multiplexer 177), before being added.
If the weights are 16-bit (uint16 or int16), the multiplication of a (16-bit) activation and a weight may be performed using four clock cycles. In the first cycle, the output of the activation broadcast unit 141 may be the least significant byte of the 16-bit absolute value and the sign resulting from the activation transform; the multiplier 126 may simultaneously be fed the least significant byte of the 16-bit absolute value of the weight, and the first multiplication may be performed. During the second cycle, the same portion of the activation (i.e., the sign and the least significant byte of the 16-bit absolute value resulting from the activation transform) may be input to the multiplier again, this time together with the most significant byte of the 16-bit absolute value of the weight, and the second multiplication may be performed.
In the third cycle, the output of the activation broadcast unit 141 may be the most significant byte of the 16-bit absolute value and the sign resulting from the activation transform; the multiplier may simultaneously be fed the least significant byte of the 16-bit absolute value of the weight, and the third multiplication may be performed. During the fourth cycle, the same portion of the activation (i.e., the sign and the most significant byte of the 16-bit absolute value resulting from the activation transform) may be input to the multiplier 126 again, this time together with the most significant byte of the 16-bit absolute value of the weight, and the fourth multiplication may be performed. All four partial product results may be output to the column accumulator 130A (or 130B) (as usual, via the column's associated adder tree 128A or 128B to the column's accumulate and return unit) and added together, except that the second and third partial products are each pre-shifted up by 8 bits, and the fourth partial product by 16 bits, using the sign extension up-shifter 175 and multiplexer 177, before the addition.
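The four-cycle decomposition described in the two preceding paragraphs may be checked numerically with the following Python sketch, which multiplies two 16-bit values as four 8x8 partial products with the byte shifts applied before accumulation. It is an arithmetic model only, under the assumption that the sign is applied to the final accumulated absolute value.

```python
def multiply_16x16(act: int, wt: int) -> int:
    """Four-cycle 16-bit x 16-bit multiplication modeled as four 8x8 partial
    products: the LSB/MSB pairs of the two absolute values are multiplied,
    the cross products are pre-shifted up by 8 bits and the MSB x MSB product
    by 16 bits before accumulation, and the sign is applied at the end."""
    sign = 1 if (act < 0) != (wt < 0) else 0
    a_abs, w_abs = abs(act), abs(wt)
    a_lsb, a_msb = a_abs & 0xFF, a_abs >> 8
    w_lsb, w_msb = w_abs & 0xFF, w_abs >> 8

    acc = 0
    acc += a_lsb * w_lsb              # cycle 1: activation LSB x weight LSB
    acc += (a_lsb * w_msb) << 8       # cycle 2: activation LSB x weight MSB, pre-shifted 8
    acc += (a_msb * w_lsb) << 8       # cycle 3: activation MSB x weight LSB, pre-shifted 8
    acc += (a_msb * w_msb) << 16      # cycle 4: activation MSB x weight MSB, pre-shifted 16
    return -acc if sign else acc

assert multiply_16x16(-12345, 321) == -12345 * 321
assert multiply_16x16(65535, 2) == 65535 * 2   # uint16 activation with a small weight
```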
Performing a convolution operation involves traversing the IFM tensor stored in SRAM 109 and streaming the contents of the IFM tensor to one or more blocks 102 as a series of IFM slices transferred through the IFM transfer structure 104. The IFM tensor has three dimensions, with coordinates denoted (x, y, d) (plus a batch index, which is omitted here for clarity of explanation), where the x and y indices correspond to the planar coordinates of an activation and the index d corresponds to the depth channel. Neural processor 100 traverses the IFM tensor by cycling through the (x, y, d) index values in a particular order. As used herein, cycling over the (x, y) coordinates is referred to as a "planar" traversal, and cycling over the d coordinate is referred to as a "depth-wise" traversal.
Some of the following paragraphs describe a plane traversal that includes the use of IFM cache 139. Referring to FIG. 1N, IFM transport structure 104 may be coupled to IFM block 102 via IFM cache 139. Each block 102 may have one IFM cache 139, with each IFM cache 139 being locally located to the associated block. Utilizing the IFM cache 139 (per block) helps to reduce the number of reads from the SRAM 109. Reducing the number of reads from the SRAM 109 may be beneficial in three respects, including: (i) reduce the contribution of SRAM 109 to the overall power consumption of the neural processor, (ii) reduce the chance of SRAM read or write stalls, and (iii) reduce the traffic flowing via the IFM transfer structure 104.
The SRAM power consumption reduction aspect may be of interest when a read from SRAM 109 consumes considerably more power than a read from a flip-flop register, which may be the case in practice. The SRAM stall aspect may be particularly important when the number of SRAM banks in each SRAM unit 109 is low compared to the number of input/output (I/O, read or write) operations to be performed. For example, as will be described later, each SRAM bank group unit 109 may contain four SRAM banks, and may thus be capable of performing up to four I/O operations simultaneously (per clock cycle). These I/O operations may be IFM slice reads, writes of one or two OFM slices, partial result reads or writes, and slice reads or writes requested by AXI interconnect 114.
When more than four such I/O operations must be performed at the same time, or when two or more I/O operations must access data residing in the same SRAM bank, a bank access conflict may occur, causing the SRAM bank arbitration logic to stall AXI accesses, IFM data fetches, OFM data writes, or partial result I/O, potentially also causing the computation to stall. Accordingly, IFM cache 139 reduces IFM reads from SRAM units 109, thereby serving to reduce the likelihood of these types of stalls.
As will be discussed in more detail later, in the case where the weight kernel size is particularly large, the computation may be divided into multiple portions, and the results of the partially completed computation ("partial results" or "portions") may be stored in the SRAM 109. To maintain acceptable computational accuracy, the partial results are typically quite long in bit width (e.g., 4 bytes or 6 bytes) compared to the IFM data and OFM data. Writing partial results with long bit widths to SRAM and reading partial results with long bit widths from SRAM correspondingly consumes higher SRAM bandwidth, which may increase the likelihood of SRAM bank access conflicts, thus increasing the likelihood of AXI and/or computation stalls. Thus, the IFM cache 139 can help alleviate SRAM I/O bottlenecks (particularly for computations that use partial results).
Reducing IFM transfer structure traffic may be of interest when communication bus area is at a premium. Recall that the IFM transfer structure 104, as depicted in fig. 1P, may transfer up to two IFM slices per clock to IFM cache 139. When the IFM transfer structure can transfer N slices to IFM cache 139 simultaneously (i.e., in a single clock), IFM transfer structure 104 may be said to have a "width of N slices". By caching IFM slices locally at each block, the IFM transfer structure 104 may remain idle whenever the IFM slices needed for computation are already cached locally by the block and readily available for processing. An IFM transfer structure 104 with idle cycles (i.e., with less than 100% utilization) can use those idle cycles to transfer additional IFM slices, yielding an overall "effective" IFM transfer bandwidth of more than 2x. Thus, when the area of the IFM transfer structure 104 is at a premium, its width may be reduced from, for example, two slices to one slice, while still keeping the overall effective IFM transfer bandwidth at 1x or more, and sometimes up to 2x or more.
As will be seen below, IFM cache 139 provides the greatest benefit for convolution operations with a kernel planar width and/or height greater than 1. Convolutions with kernel planar width and height both equal to 1 (i.e., 1 x 1 convolutions) and fully connected computations can also benefit from IFM caching, but typically only in rare cases.
To understand the scheme provided by one embodiment (referred to herein as a "zigzag" planar traversal, designed to increase the IFM cache hit rate), consider first traversing the IFM tensor planar-wise in a simple, naive manner using a 2 x 2 x 16 x 1 weight kernel, as depicted in FIGS. 2AA through 2AD. Here, 2 x 2 denotes the planar height and width of the weight kernel, 16 denotes the IFM depth (i.e., one slice), and 1 denotes the OFM depth. However, for clarity of explanation, the convolution may be thought of as purely planar (i.e., 2 x 2 x 1 x 1). Figure 2AA depicts the convolution operation starting with the convolution (kernel weight) window placed at the upper left corner of the IFM tensor. After the 2 x 2 convolution at that location has been computed, the window is slid one pixel to the right. This compute-and-slide process is repeated until the window reaches the upper right corner of the IFM tensor. As depicted in fig. 2AB, once in the upper right corner, the convolution is computed and the convolution window then slides down one row rather than to the right. Subsequently, the same compute-and-slide steps are repeated, as depicted in fig. 2AC, except that the convolution window now keeps sliding to the left until it reaches the left edge of the IFM tensor, where it again slides down one row, as depicted in fig. 2AD. Repeating these steps ultimately results in a complete planar scan (traversal) of the IFM tensor. Because the window slides mainly horizontally (i.e., the inner loop iterates over the x coordinate), such a scan may be referred to as a horizontal scan (as opposed to a vertical scan).
Consider the use of IFM cache 139 in conjunction with a simple, naive "horizontal" scan as depicted in FIGS. 2 BA-2 BL. At the beginning of the convolution operation, IFM cache 139 is cleared, the 2 × 2 convolution window is placed in the upper left corner of the IFM tensor, and the four IFM values needed to compute the convolution at this starting position are then retrieved. As depicted in fig. 2BA, the first of the four IFM values is retrieved from the upper left-most position in the IFM tensor. This position may be referred to as being in row 0, column 0. Because the cache has been flushed, the IFM value at row 0, column 0, must be retrieved from SRAM 109 instead of IFM cache 139, resulting in a cache miss labeled "M" in FIG. 2 BA. Once retrieved, the IFM value is cached. Fig. 2BB depicts the second IFM value (of the four values) retrieved at row 0, column 1. The cache does not contain the value associated with this location (row 0, column 1), resulting in another cache miss marked by "M". The light shading of the 0 th row, column 0 position indicates that the IFM value retrieved in the previous step has been cached. Fig. 2BC and 2BD depict the retrieval of the remaining two IFM values, each resulting in a cache miss. At this point all four IFM values have been retrieved, the convolution calculation at the current location may be complete, all four IFM values have also been cached and the convolution window may be slid one column to the right.
Fig. 2BE through 2BH depict retrieving four additional IFM values to calculate the convolution at the new location. In FIG. 2BE, retrieving the IFM value at row 0, column 1 results in a cache hit (H), thus avoiding an SRAM read. Similarly, FIG. 2BG depicts another cache hit at row 1, column 1, while retrieving both of the other two IFM values causes a cache miss.
As the convolution window continues to slide, as indicated by the dark shading in fig. 2BI through 2BL (and fig. 2BE through 2BH), the leftmost previously cached IFM value will not participate in the computation for an extended period of time, or at all, until the convolution window slides all the way to the rightmost edge of the IFM tensor, one row down and all the way back to the cached value. Thus, once the convolution window slides, these values can be flushed from the cache to keep the cache size small.
Fig. 2BI through 2BL depict retrieving the next four IFM values to compute the convolution at the next location (one step to the right), resulting in two cache hits and two cache misses. As shown in fig. 2BM, IFM values cached horizontally during 2x2 convolution result in a cache hit probability (rate) of about 50% because two quarters of the IFM values (marked with light shading) are reused once each time the convolution window slides one step to the right. More generally, the hxw planar core size is used in conjunction with horizontal caches and it is assumed that the convolution of the caches with sufficient size results in H x (W-1)/(H x W) cache hit rate. A cache size sufficient for such convolution may be bytes per tile (W-1) per lane. However, as will be explained later, neural processor 100 may also use "IFM weight cycling" to accumulate multiple IFM channels into dot products by sequentially cycling the weights of the multiplier units during dot product computation. Thus, as will become apparent later, in the most general case, the maximum cache size is equal to the number of weights stored per lane per block in the MU weight register file 127 (which is equal to 18 for an 8-bit weight data type).
In fig. 2 BA-2 BM, keeping the cache size relatively small requires aggressive flushing of cache values. Referring to FIG. 2BM, when the convolution window slides over row R (row 2), the IFM values from the previous row R-1 (row 1) have been flushed from the cache (indicated as cache miss "M" at row 1, column 2) for a long time. To increase the cache hit rate above H x (W-1)/(H x W), it may be considered to cache values of e.g. one or more lines of the IFM tensor. However, caching the entire IFM tensor line would require the cache size to be increased, so that the cache size typically becomes a function of the IFM tensor width. Since the IFM tensor width is typically unknown at ASIC design time, and since the IFM tensor width may be relatively large, it appears expensive to cache IFM rows in silicon area, and thus undesirable. The same reasoning applies to the symmetric case when the convolution window is predominantly scanned vertically (i.e., the loop in plane coordinates iterates over the line numbers) rather than horizontally.
In contrast to a simple, naive planar scan, some embodiments perform a planar traversal of the IFM tensor in a "zig-zag" shape during the convolution operation. Zig-zag plane traversal may help increase cache hit probability while still keeping the cache size small. Fig. 2C depicts a bottom-right-top-right zigzag path along which the convolution window may shift (slide) in such an embodiment. Unlike a simple, naive horizontal traversal, the convolution window in FIG. 2C slides to the right after two convolutions (in vertically adjacent locations) have been computed instead of one convolution. Thus, a single complete left-to-right, edge-to-edge sweep (sweep) of the IFM tensor by the convolution window produces two rows of convolution results, as opposed to a single row of results for a simple, naive, horizontal traversal.
In a more general case, the zigzag traversal may be parameterized using a "number of Z" corresponding to the number of output lines processed in a single horizontal IFM tensor scan. For example, in fig. 2C, the Z number is equal to 2. As will be seen later, a higher Z number results in a higher cache hit rate.
In fig. 2C, a zigzag traversal that produces two rows of results per single horizontal scan can be thought of as performing a naive horizontal traversal over an IFM tensor that is twice as wide but half as high. More generally, a zigzag traversal path can be viewed as a single (horizontal) scan "unrolled" to a length of W x Z columns, with a total of H/Z scans needed to complete the IFM tensor convolution, where H and W are the IFM tensor height and width, respectively. For example, in fig. 2C, Z = 2, so instead of traversing the H x W IFM layer with a simple, naive scan, the length of each arrow path is approximately W x Z = 2W, the logical IFM layer width becomes W x Z = 2W, and the logical IFM layer height becomes H/Z = H/2. A simple, naive horizontal scan is equivalent to a zigzag traversal with Z = 1.
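For illustration, the following Python sketch generates the sequence of convolution window positions for a zigzag planar traversal with parameter Z; it is a simplified model (the function name is hypothetical) that ignores convolution padding, stride, and edge effects.

```python
def zigzag_positions(height: int, width: int, z: int):
    """Generate the (row, col) positions of the convolution window for a
    zigzag planar traversal with parameter Z: each left-to-right sweep
    produces Z rows of output by alternately sliding down (Z - 1 rows) and
    right (one column). Z = 1 degenerates to a plain horizontal scan.
    Handling of heights not divisible by Z is simplified here."""
    for band in range(0, height, z):
        rows = range(band, min(band + z, height))
        for col in range(width):
            # alternate the vertical direction so the window "zigzags"
            ordered = rows if col % 2 == 0 else reversed(rows)
            for row in ordered:
                yield (row, col)

# First few window positions for Z = 2 (compare FIG. 2C): down, right, up, right, ...
print(list(zigzag_positions(height=4, width=3, z=2))[:6])
# -> [(0, 0), (1, 0), (1, 1), (0, 1), (0, 2), (1, 2)]
```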
Fig. 2DA through 2DD depict a zigzag traversal with Z = 2 for the first position of the convolution window. Retrieving all four IFM values results in cache misses, causing four SRAM reads to occur. As depicted in fig. 2DE through 2DH, fetching two more IFM values for the next position of the convolution window results in cache misses, while the other two IFM fetch positions overlap with the previous position of the convolution window, thus resulting in two cache hits.
As depicted in fig. 2DI through 2DL, for the next location of the convolution window, two IFM values are cache misses, and two overlap with the previous location of the convolution window, both resulting in a cache hit. As depicted in fig. 2 DM-2 DP, for the next position of the convolution window, one IFM value is a cache miss, and three overlap with the previous position of the convolution window and are cache hits. In this way and with further reference to fig. 2DQ through 2DX, the use of the zigzag path significantly improves the ratio of cache hits to cache misses.
Fig. 2E is a table showing the actual number of SRAM reads associated with a zigzag traversal relative to the number of SRAM reads with an ideal cache (i.e., a cache that has infinite capacity and never evicts any values). The table in FIG. 2E is thus a measure of the efficiency of the zigzag traversal. The table assumes that the cache size is sufficient for the given Z when a single sweep is performed (i.e., values from previous sweeps are evicted). Lower numbers in the table correspond to higher efficiency, with 1.0 being ideal. The convolution size (Conv Size) denotes the planar dimension of the square weight kernel. For example, a 3x3 convolution with a zigzag traversal of Z = 2 results in 2x more SRAM reads compared to a 3x3 convolution using an ideal cache, whereas a 3x3 convolution with a zigzag traversal of Z = 1 (i.e., a simple, naive (e.g., horizontal) traversal) results in 3x more SRAM reads compared to an ideal cache. Thus, as calculated by the formula described below, a zigzag traversal with Z = 2 reduces the SRAM read count in this case by 3/2 = 1.5x compared to the simple, naive traversal, while the cache size required for Z = 2 remains only slightly larger than for Z = 1. Note that a larger Z number results in greater SRAM read count savings. For example, for a 3x3 convolution, increasing the cache Z to 4 yields a 3/1.5 = 2x SRAM read savings.
Fig. 2F depicts a table of the average expected IFM SRAM reads per clock for supplying the IFM cache, and assumes that one IFM slice is processed per clock. For example, a 5x5 convolution with cache Z = 4 performs SRAM reads on average only 8% of the time, compared to 100% of the time (i.e., every clock) without a cache, and compared to 20% of the time with cache Z = 1 (i.e., the simple, naive traversal scheme).
Fig. 2GA to 2GB depict the cache hit/miss counts and the cache size derivation. A zigzag traversal involves the repetition of a two-step sequence in which the convolution window is slid vertically by Z - 1 rows and then laterally by one column. For simplicity, ignoring the special cases at the edges of the IFM tensor, sliding a convolution window of planar size W x H sideways (to the right in fig. 2GA) by one column results in H cache misses (labeled "m") and H x (W - 1) hits. The next step of sliding vertically by Z - 1 rows (downward in fig. 2GB) results in (Z - 1) cache misses and (Z - 1) x (H x W - 1) cache hits.
Accordingly, once the convolution window has been slid horizontally by one column, it may use the previously cached values within the kernel window (cached during the previous vertical translation, labeled "c" in fig. 2GA) for the current computation. Because the window will next start sliding vertically (downward in fig. 2GA), the previously cached values labeled "c" outside the kernel window (below it in fig. 2GA) should also be kept in the cache for later use. Furthermore, after the convolution window slides down by Z - 1 rows, slides one column to the right, and returns back up, the values obtained from SRAM (labeled "m") should also be added to the cache for use in the computation at the current location. Next, each time the convolution window slides down one row, one cached value (top left) can be evicted and one value from SRAM added (labeled "m"). Thus, counting the number of "c" marks in FIG. 2GB, the required cache size is (H + (H + Z - 1) x (W - 1)).
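Using the miss counts just derived (and ignoring IFM tensor edges), the following Python sketch estimates the average number of SRAM reads per window position and per clock. Under the simplifying assumption that an ideal cache performs roughly one SRAM read per window position, it reproduces example entries of the tables of figs. 2E and 2F; the function name is hypothetical.

```python
def zigzag_sram_stats(kernel_h: int, kernel_w: int, z: int):
    """Per the derivation above (ignoring IFM tensor edges): each sideways
    step costs kernel_h cache misses and each of the following Z - 1 vertical
    steps costs one miss, i.e. (kernel_h + z - 1) SRAM reads per Z window
    positions. With one IFM slice processed per clock, each window position
    takes kernel_h * kernel_w clocks."""
    misses_per_position = (kernel_h + z - 1) / z
    reads_vs_ideal_cache = misses_per_position / 1.0   # ideal cache: ~1 read per position
    reads_per_clock = misses_per_position / (kernel_h * kernel_w)
    return reads_vs_ideal_cache, reads_per_clock

# 3x3 convolution: Z = 1 -> 3x the ideal-cache reads, Z = 2 -> 2x (cf. FIG. 2E).
print(zigzag_sram_stats(3, 3, 1)[0], zigzag_sram_stats(3, 3, 2)[0])   # -> 3.0 2.0
# 5x5 convolution with Z = 4: SRAM is read only about 8% of clocks (cf. FIG. 2F).
print(zigzag_sram_stats(5, 5, 4)[1])                                  # -> 0.08
```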
As explained later, if weight cycling (a weight loop) is used, the cache size may be increased by the same factor as the number of kernels stored concurrently in any block. As mentioned above, when the convolution kernel is small, the system may store multiple planar kernels in each MU 103. For example, if the MU 103 has 18 weight registers and the convolution is 2x2, then four 2x2 kernels may be stored in the MU weight registers 127. For example, a dot product over IFM data having 64 channels 0..63 may be computed into OFM channels 0..7 by cycling over the four stored kernels over time. The system may take the IFM slice holding channels 0..15, multiply it by the first kernel (of four), and hold the result in the accumulators of the block; take the IFM slice holding channels 16..31, multiply it by the second 2x2 kernel (of four), and add the result to the accumulator values already stored; and repeat this a third and a fourth time. These IFM slices may also be cached, resulting in a corresponding increase in cache size. Regardless of the choice of planar traversal method (naive or zigzag or some other method), the IFM cache size has an upper bound, which is a function of the size of the multiplier unit weight register file 127. This is because each cached IFM slice must have a corresponding weight in the weight register file to be multiplied with, and the weight register file itself is limited (e.g., to 18 weights). Note that the same reasoning also gives the IFM cache size a lower bound equal to the weight register file size.
Thus, the IFM cache size should be set to the maximum, over all supported H and W combinations, of (H + (H + Z - 1) x (W - 1) - 1) and MU_WEIGHTS, where MU_WEIGHTS equals the multiplier unit weight register file 127 size (e.g., 18). For example, if the neural processor 100 has 18 weights per multiplier unit 103 and supports zigzag traversal with Z = 2 for all natural H and W kernel planar sizes such that H x W <= 18 (e.g., 1x1, 1x2, 2x1, ..., 4x4, 9x2, 2x9), the IFM cache size is the maximum of (1 + (1 + 2 - 1) x (1 - 1) - 1) = 0, (1 + (1 + 2 - 1) x (2 - 1) - 1) = 2, (2 + (2 + 2 - 1) x (1 - 1) - 1) = 1, ..., (4 + (4 + 2 - 1) x (4 - 1) - 1) = 18, (9 + (9 + 2 - 1) x (2 - 1) - 1) = 18, (2 + (2 + 2 - 1) x (9 - 1) - 1) = 25, and MU_WEIGHTS = 18, i.e., 25.
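The cache sizing expression above may be evaluated with the following Python sketch, which takes the kernel planar sizes listed in the example and reproduces the 25-byte result; the function name is hypothetical, and only the example kernel sizes explicitly listed in the text are used.

```python
def ifm_cache_size(kernel_sizes, z: int, mu_weights: int = 18) -> int:
    """Per-lane IFM cache size (in bytes, for 8-bit IFM data) following the
    expression above: the maximum of (H + (H + Z - 1) * (W - 1) - 1) over the
    supported kernel planar sizes, and of the MU weight register file size."""
    sizes = [h + (h + z - 1) * (w - 1) - 1 for (h, w) in kernel_sizes]
    return max(sizes + [mu_weights])

# Kernel planar sizes (H, W) from the example above, with Z = 2: the 2x9 kernel
# dominates, giving a cache size of 25 bytes per lane.
example_sizes = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3), (4, 4), (9, 2), (2, 9)]
print(ifm_cache_size(example_sizes, z=2))   # -> 25
```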
In some embodiments, the MU weight register file capacity is equal to 18 8-bit weights (uint8 or int8) or, equivalently, 9 16-bit weights (uint16 or int16). When the IFM data is 16-bit (uint16 or int16), the IFM cache may store the 16-bit IFM data by allocating two bytes per 16-bit IFM value. Thus, just as the MU weight registers 127 can store 9 16-bit weights, IFM cache 139 can store 9 16-bit IFM values. A zigzag (or simple, naive) planar traversal may be applied to 16-bit IFM values in a manner similar to how it is applied to 8-bit values. In this case, the cache size calculation described above should also include, in the maximum function, terms of the form (H + (H + Z - 1) x (W - 1) - 1) x size_of(IFM_DATA_TYPE), where size_of(IFM_DATA_TYPE) denotes the byte size of the IFM value data type (e.g., 3 bytes for a 24-bit IFM value and 4 bytes for a 32-bit IFM value). For IFM data types of 24 bits, 32 bits, or more, caching may likewise be done using a zigzag (or simple, naive) traversal; however, it is recommended to increase the size of the MU weight register file 127 (and the size of the IFM cache 139) to 3 x 3 x size_of(IFM_DATA_TYPE). As explained later, this ensures that the popular 3x3 planar-size weight kernels can be convolved without resorting to the use of partial results, which may be undesirable.
As previously described, the global, SRAM, block, and lane control logic units 140, 142, 144, and 146 work together to perform the appropriate control of SRAM IFM fetches, the transfer of IFM slices over the IFM transfer structure 104, the caching of IFM values in the local blocks 102, the retrieval of cached IFM values (typically at a slightly different rate for each activation lane), and the resynchronization of OFM results between blocks 102. To configure the IFM and OFM planar traversals, the host CPU loads the computation parameters, including the zigzag height Z, into global control FSM 140 and SRAM control logic 142. Global control FSM 140 then coordinates SRAM control FSMs 142 and tile control FSMs 144 to start and perform the computation.
Each accumulate and return unit 167 can receive the OFM values and advantageously compute pooling on the fly as the convolution window traverses the input and output layers in a zigzag planar fashion, without having to save pre-pooling results to SRAM and later read the values back to apply pooling. As shown in fig. 2HA through 2HD, in the case where the pooling windows do not overlap, the ARU 167 may perform pooling by not sending out each convolution OFM result, but instead holding the convolution results in a register of the pooling logic 196 until each pooled output is complete. Only after each pooled output is complete does the ARU 167 write the pooled output to SRAM 109. For max pooling, the output register of the ARU 167 may hold a running maximum that is compared with each convolution output and updated whenever the latest OFM output exceeds the current maximum. Once the pooling window slides, the output register of the ARU 167 is reset to restart the max operation. For average pooling, the accumulator of the ARU 167 keeps adding the OFM outputs until the pooling window is about to slide; the accumulator is then multiplied by 1/(POOLING_WIDTH x POOLING_HEIGHT) to calculate the average, which is rounded and written to SRAM 109. Once the pooling window has slid, the accumulator is reset to restart the averaging.
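The on-the-fly max pooling mechanism described above may be illustrated by the following Python sketch for non-overlapping pooling windows whose OFM values arrive consecutively. It is a behavioral model only (the class name is hypothetical); average pooling would analogously keep a running sum and multiply by 1/(pooling width x pooling height) when the window completes.

```python
class OnTheFlyMaxPool:
    """Illustrative model of pooling logic 196 for non-overlapping max pooling:
    the OFM values belonging to one pooling window arrive consecutively (thanks
    to the zigzag traversal), so only a running maximum and a counter are kept."""
    def __init__(self, window_size: int):
        self.window_size = window_size   # e.g. 2 * 2 = 4 values per pooling window
        self.count = 0
        self.current_max = None

    def push(self, ofm_value: float):
        """Returns the pooled value when the window completes, else None."""
        if self.current_max is None or ofm_value > self.current_max:
            self.current_max = ofm_value
        self.count += 1
        if self.count == self.window_size:
            pooled, self.current_max, self.count = self.current_max, None, 0
            return pooled                 # would be written to SRAM 109
        return None

pool = OnTheFlyMaxPool(window_size=4)     # 2x2 max pooling
for value in [1.0, 7.0, 3.0, 2.0]:
    result = pool.push(value)
print(result)                             # -> 7.0
```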
For example, fig. 2HA depicts a zigzag plane traversal with Z = 2 performed in conjunction with 2x2 planar pooling, where the IFM layer is traversed in such a way that the OFM values in each pooling window are computed sequentially. Because the ARU 167 generates the four OFM values of each pooling window one after another, the ARU pooling logic 196 may take the maximum of four consecutive results in order to compute the max pooling. Fig. 2HB depicts a zigzag plane traversal with Z = 3 and 3x3 planar pooling. Since the Z value equals the pooling kernel height, traversing the IFM layer in a zigzag manner naturally causes the OFM data within each pooling window to be generated in an order suitable for max pooling and average pooling. Fig. 2HC provides an additional illustration for Z = H = 4, where H represents the height of the pooling kernel.
Fig. 2HD depicts the case where the Z value does not match the pooling kernel height, here Z = 4 with a pooling kernel height of 2. In this case, the pooling logic 196 may subdivide the pooling into two regions (the upper 2x2 and the lower 2x2, as depicted) and use an additional register to temporarily store the incomplete result from one of the two pooling regions (the lower 2x2 in fig. 2HD). More generally, the zigzag traversal height and the pooling window height may be related by a natural multiple, with reasonable pooling window heights including 2, 3, and 4. As mentioned previously, the vertical pooling stride should equal the zigzag traversal height, which limits on-the-fly pooling to such cases. The pooling windows may overlap horizontally as long as the pooling logic 196 has enough copies of the pooling logic, with each copy processing a respective one of the horizontally overlapping pooling windows in parallel. The zigzag pooling window width and stride may generally be arbitrary, with reasonable pooling window widths including, for example, 2, 3, and 4.
In cases where the pooling windows overlap vertically, making on-the-fly pooling problematic, and/or where custom pooling (beyond max pooling and average pooling) is desired, pooling may be accomplished by: (i) placing read-modify-write logic near the SRAM banks 109 (not depicted) and/or (ii) reading the SRAM out to a CPU, GPU, DSP, or other type of computing core over an AXI interface, performing the pooling there, and writing the results back to NPU SRAM over the AXI interface. Custom read-modify-write logic near the SRAM banks 109 may also be reused to efficiently add partial results without having to send the partial results back to the blocks 102.
To configure neural processor 100 to perform a particular operation (e.g., convolution or full connectivity layer computation), the IFM and OFM tensor sizes should be considered and the computation "mapped" onto the available hardware in conjunction with the parameters of the operation (e.g., operation type, stride, etc.). Each single block 102 may have only a fixed number of 16 IFM depth channel inputs and 8 OFM depth channel outputs, while the number of depth channels in the deep learning neural network model layer varies and typically far exceeds 16 and 8. The mapping algorithm may run offline (during compile time as opposed to run time) to subdivide the large IFM and OFM tensors into portions (sub-tensors), assign these portions to available blocks for computation, and produce a description (configuration) of how the outputs from the available blocks may be reassembled (re-assembled) to complete the computation. As will be explained in more detail below, the mapping algorithm may also determine the order of traversal of the IFM (and accordingly the OFM) tensors for both the planar direction and, in particular, the depth direction. Because there may be multiple solutions for a particular mapping problem (i.e., for a given IFM, OFM, and weight tensor size and operating parameters), the mapping algorithm may also accept parameters that indicate whether the solution is optimized for the lowest power, the lowest SRAM size, the lowest computational delay (achieved by maximizing multiplier utilization), and/or a combination of these (e.g., the lowest power given the available fixed SRAM size).
Aspects of the mapping operation of some embodiments may be understood from a set of examples that progress from simple to increasingly advanced. Because activation skipping does not affect the mapping to a large extent, the features associated with zero-activation skipping are ignored in the following for clarity of explanation, and it is assumed that each OFM column has only one adder tree and accumulator (i.e., the computation is "dense"). Since caching likewise does not affect the mapping to a large extent, the cache and the zigzag plane traversal method are also ignored, and the convolution window is assumed to slide in the plane in a raster manner. In the first example, depicted in figs. 3AA to 3AK, a single tile 102 is used to compute a 3x3x16x8 convolution. Fig. 3AA depicts block 102 accepting as input an IFM slice having 16 depth channels and producing an OFM slice having 8 depth channels. For this example, as shown in fig. 3AB, the size of the IFM tensor 304 is 64x64x16, the size of the OFM tensor 303 is 64x64x8, and the size of the weight tensor 302 is 3x3x16x8.
Initially, as depicted in fig. 3AC, weights are preloaded from SRAM 109 into the MU weight register files 127. The size of the weight kernel 302 is 3x3x16x8. The weight kernel 302, having a planar size of 3x3, has 3x3 = 9 planar "positions," indicated as A through I in fig. 3AC. Each planar position is associated with a 16-element weight vector used to compute the dot product with a 16-element IFM value vector for one OFM channel. As depicted in fig. 3AC, because there are 8 OFM channels, the weight kernel 302 can be considered to contain one 3D tensor per OFM channel.
Specifically, the weights may be loaded into the MU weight register files 127 as follows. The collection of MU weight register files across the entire MR array 122 may be considered a tensor of dimension 18x16x8 (18 weights per MU, 16 MU rows, and 8 MU columns), more than enough to hold the entire weight kernel of size 3x3x16x8. The weight register file tensor size of 18x16x8 may also be rewritten as (3x3)x16x8, where the MU weight register file at row R, column C stores the 3x3 = 9 weights, one for each planar position (x, y), of the WxHx16x8 weight kernel at IFM channel R and OFM channel C, where W and H are the weight kernel planar width and height (i.e., W = 3, H = 3). For example, referring to fig. 3AC, the weight register file in row 0, column 0 stores the weights { A0[0], B0[0], C0[0], D0[0], E0[0], F0[0], G0[0], H0[0], I0[0] }, where the notation is planar position "A…I", followed by OFM column "0…7", and IFM row "[0…15]". Correspondingly, the weight register file in row 15, column 0 stores the weights { A0[15], B0[15], C0[15], D0[15], E0[15], F0[15], G0[15], H0[15], I0[15] }, the weight register file in row 15, column 7 stores the weights { A7[15], B7[15], C7[15], D7[15], E7[15], F7[15], G7[15], H7[15], I7[15] }, and so on. Since block 102 computes dot products "vertically" using the column-wise adder trees, it can be seen that the described order of loading weights allows the dot product of the IFM input to be computed at each planar position A…I.
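A minimal sketch of the weight placement just described follows (hedged: the array names are illustrative; it simply rearranges a 3x3x16x8 weight kernel into 16x8 register files of 9 weights each):

    import numpy as np

    # Sketch: arrange an H x W x 16 x 8 weight kernel into per-MU register files.
    # mu_rf[r][c] holds the 9 weights for planar positions A..I of
    # IFM channel r and OFM channel c.
    H, W, IFM_CH, OFM_CH = 3, 3, 16, 8
    kernel = np.random.randint(-128, 128, size=(H, W, IFM_CH, OFM_CH), dtype=np.int8)

    mu_rf = np.zeros((IFM_CH, OFM_CH, H * W), dtype=np.int8)
    for r in range(IFM_CH):
        for c in range(OFM_CH):
            # planar positions A..I in row-major order
            mu_rf[r, c, :] = kernel[:, :, r, c].reshape(-1)

    # e.g., mu_rf[15, 7] corresponds to { A7[15], B7[15], ..., I7[15] }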
Referring to fig. 3AD, the convolution window may then be positioned at the start position and the eight accumulators 130 (one accumulator for each of the 8 OFM channels for clarity of mapping explanation, as previously described) may be cleared.
Referring to fig. 3AE, the block 102 may then read IFM a[0..15] (where a..z denotes the planar position of the IFM and 0..15 denotes the IFM depth channel) from SRAM 109 and broadcast the values to the 8 columns of the block 102. The first column may multiply a[0..15] element by element with the preloaded weights A0[0]..A0[15], the second column may multiply a[0..15] element by element with the preloaded weights A1[0]..A1[15], and so on. The resulting products may be summed (reduced) vertically using the adder tree of each column and added to the corresponding accumulator 130. Since there are 8 additional planar positions (out of 3x3 = 9) still to be processed to complete the 3x3 convolution at a single location, the resulting dot product is not yet the final result.
Referring to fig. 3AF, the block 102 may then read IFM b[0..15] from SRAM 109 and broadcast the values to the 8 columns of the block 102. The first column may multiply b[0..15] element by element with the preloaded weights B0[0]..B0[15], the second column may multiply b[0..15] element by element with the preloaded weights B1[0]..B1[15], and so on. The resulting products may be summed vertically and added to the corresponding accumulator 130. Referring to fig. 3AG, the block 102 may then read IFM c[0..15] from SRAM 109 and broadcast the values to the 8 columns of the block 102. The first column may multiply c[0..15] element by element with the preloaded weights C0[0]..C0[15], the second column may multiply c[0..15] element by element with the preloaded weights C1[0]..C1[15], and so on. The resulting products may be summed vertically and added to the corresponding accumulator 130.
Referring to fig. 3AH, the block 102 may then read IFM g[0..15] from SRAM and broadcast the values to the 8 columns of the block 102. The first column may multiply g[0..15] element by element with the preloaded weights D0[0]..D0[15], the second column may multiply g[0..15] element by element with the preloaded weights D1[0]..D1[15], and so on. The resulting products may be summed vertically and added to the corresponding accumulator 130. Referring to fig. 3AI, the block 102 may then read IFM h[0..15] from SRAM and broadcast the values to the 8 columns of the block 102. The first column may multiply h[0..15] element by element with the preloaded weights E0[0]..E0[15], the second column may multiply h[0..15] element by element with the preloaded weights E1[0]..E1[15], and so on. The resulting products may be summed vertically and added to the corresponding accumulator 130.
Referring to fig. 3AJ, similar operations may be performed for the remaining ones of the nine IFM positions covered by the kernel (labeled a through o in the figure). The values stored in the accumulators 130 may then be rounded to form 8-bit OFM results, and all 8 OFM results may be written to SRAM 109. This completes the calculation of one convolution. The convolution window may then be shifted by one column in the plane, and the operation repeated, as depicted in fig. 3AK.
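The first example can be summarized by a small reference model (hedged: a plain software sketch of what one block computes at a single output position, not a description of the hardware pipeline; names are illustrative):

    import numpy as np

    # Sketch: one block computing a 3x3x16x8 convolution at one output (x, y).
    def conv_window(ifm_patch, kernel):
        # ifm_patch: 3x3x16 IFM values under the window; kernel: 3x3x16x8 weights.
        acc = np.zeros(8, dtype=np.int64)           # eight accumulators 130
        for y in range(3):
            for x in range(3):                      # planar positions A..I
                ifm_vec = ifm_patch[y, x, :]        # broadcast to 8 columns
                for c in range(8):                  # column-wise adder trees
                    acc[c] += np.dot(ifm_vec, kernel[y, x, :, c])
        return acc                                  # rounded before the SRAM write

    ifm_patch = np.random.randint(0, 256, (3, 3, 16)).astype(np.int64)
    kernel = np.random.randint(-128, 128, (3, 3, 16, 8)).astype(np.int64)
    print(conv_window(ifm_patch, kernel))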
In the second example depicted in fig. 3 BA-3 BC, a single block is used to determine or calculate a 3x3x16x128 convolution. As previously described, for convenience, as depicted in fig. 3BA, the term "IFM slice" may be defined to represent 16 IFM depth channels (i.e., units of IFM read and block input), and the term "OFM slice" may be defined to represent 8 OFM depth channels (i.e., units of OFM block output). As depicted in fig. 3BB, it may be convenient to depict the operational mapping in a rectangle, where the height of the rectangle corresponds to the number of IFM channels and the width of the rectangle represents the number of OFM channels. The 3x3x16x128 convolution may be accomplished by splitting the convolution into 16 3x3x16x8 convolutions such that the previous example of performing a 3x3x16x8 convolution may be repeated 16 times. In a first step, the 3x3x16x8 convolution of the OFM [0..7] can be calculated. In the second step, the 3x3x16x8 convolution of the OFM [8..15] can be calculated, and so on until in the sixteenth step, the 3x3x16x8 convolution of the OFM [120..127] can be calculated. The processing of the next subset of OFM channels may be referred to herein as "stepping (step) the OFM". Sixteen steps may correspond to sixteen rectangles, the first, second, and sixteenth of which are depicted in fig. 3BC, and as can be seen from fig. 3BB and 3BC, when the sixteen steps are completed, a 3x3x16x128 convolution has been calculated.
An arbitrarily large number of OFM channels can be handled in this way by simply splitting the OFM into sufficiently small pieces. However, every time the system "steps the OFM," the IFM is completely re-read (sixteen times in this example). Each reading of the (entire) IFM may be referred to herein as an "IFM pass," and each such IFM pass may consume a significant amount of energy (or power) when the operation is performed repeatedly. Reducing power consumption is often highly desirable, particularly for battery-powered devices (such as mobile smartphones). The next example depicts a method for avoiding some of this energy cost, and a small traffic sketch is given below.
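The sketch below estimates the SRAM read traffic implied by OFM stepping (hedged: it counts only ideal IFM bytes read, ignoring caching and weight traffic; the function name is illustrative):

    # Sketch: IFM bytes read from SRAM as a function of the number of IFM passes.
    def ifm_read_bytes(ifm_h, ifm_w, ifm_channels, ifm_passes, bytes_per_value=1):
        return ifm_h * ifm_w * ifm_channels * ifm_passes * bytes_per_value

    # Second example: one block, 3x3x16x128 => 16 OFM steps = 16 IFM passes.
    print(ifm_read_bytes(64, 64, 16, 16))   # 1,048,576 bytes
    # Third example (below): 16 blocks with IFM broadcast => a single IFM pass.
    print(ifm_read_bytes(64, 64, 16, 1))    # 65,536 bytes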
In the third example, depicted in figs. 3CA and 3CB, sixteen blocks are used (as opposed to one block) to determine or calculate the 3x3x16x128 convolution. Each block 102 has 16 × 8 = 128 multipliers 126, so the 16 blocks have 128 × 16 = 2048 multipliers in total. The IFM [0..15] may be broadcast to all 16 blocks 102, such that block 1 calculates OFM [0..7], block 2 calculates OFM [8..15], and so on, with block 16 calculating OFM [120..127]. As used herein, the term IFM "broadcast" means that an IFM is input to multiple MR blocks 102 simultaneously, as opposed to the earlier description of a single block 102, where broadcast meant that the ABU output is input to all MU columns within that block.
The neural processor 100 has a plurality of SRAM bank groups 109 (figs. 1A and 3AC). Thus, referring to fig. 3CB, the input IFM [0..15] may be supplied from SRAM bank group 0. The output of block 1 (OFM [0..7]) may be concatenated with the output of block 2 (OFM [8..15]) into a 16-channel OFM [0..15] and saved into SRAM bank group 1. Similarly, the output of block 3 may be concatenated with the output of block 4 and saved to SRAM bank group 2, and so on, with the output of block 15 being concatenated with the output of block 16 and saved to SRAM bank group 8. It can be seen that in this third example, as a result of using the IFM broadcast, all OFMs are calculated in a single "pass" (i.e., the entire IFM data is read once), and because the IFM data is read only once, most of the energy consumption incurred by performing multiple IFM passes in the second example above is avoided.
In the fourth example, depicted in fig. 3DA, sixteen blocks are used to determine or calculate a 3x3x16x256 convolution. The 16 blocks can generate at most 16 × 8 = 128 OFM channels in a single pass, while in this example 256 OFM channels are to be generated. Thus, two OFM steps may be run, the first step calculating OFM [0..127] and the second step calculating OFM [128..255]. Two IFM passes are used, so the IFM is read in its entirety twice. The OFM is formed in the two steps depicted in fig. 3DA.
In the fifth example, depicted in figs. 3EA and 3EB, sixteen blocks are used to determine or calculate a 3x3x32x64 convolution. Unlike the previous examples with 16 IFM channels, this example involves 32 IFM channels. All 32 IFM channels (2 slices) can be read from SRAM 109 simultaneously. The neural processor 100 may have multiple SRAM bank groups, and each bank group (in this mapping example) can stream 1 slice per clock cycle. Thus, to read (stream) 2 slices (32 IFM channels) simultaneously, two bank groups may be used: the first of the two bank groups may stream IFM [0..15], and the second may stream IFM [16..31].
Referring to fig. 3EB, the calculation of OFM [0..7] may be split across block 1 and block 9. Block 1 may reduce (add) IFM [0..15] to an incomplete OFM [0..7], and block 9 may reduce IFM [16..31] to an incomplete OFM [0..7]. The calculation of OFM [0..7] can then be completed by adding the outputs of block 1 and block 9 (and applying the bias, activation function, etc.). To perform this addition, the adder trees of block 1 and block 9 may be "joined" using one or more additional hardware adder stages; the reduction fabric 111 provides these additional hardware adder stages. Similar operations may be used for OFM [8..15] (adding block 2 and block 10), …, and OFM [56..63] (adding block 8 and block 16). Referring to fig. 3EB, in this example there is no output from blocks 1..8 to SRAM 109; as will be explained later, only blocks 9..16 save OFM to SRAM 109.
In the sixth example, depicted in figs. 3FA through 3FC, sixteen blocks are used to determine or calculate a 3x3x32x512 convolution. Referring to fig. 3FA, as in the fifth example, two IFM slices (IFM [0..31]) may be read from two SRAM bank groups, and each of the two IFM slices may be broadcast to 8 blocks. Two such groups of 8 blocks together can compute OFM [0..63], and the results can be saved to 4 SRAM bank groups. Referring to fig. 3FB, 64 OFMs may be calculated per IFM pass (i.e., the entire IFM is read to calculate 64 OFMs). In this way, and in a manner similar to the fourth example, 512 OFMs may be calculated in 8 IFM passes (equivalently, 8 OFM "steps"). OFM [0..63] may be calculated during the first IFM pass, OFM [64..127] during the second IFM pass, and so on, with OFM [448..511] calculated during the eighth IFM pass. In this example, the "2 IFM slices by 64 OFM slices" operation has been split into 8 OFM steps, each OFM step convolving "2 IFM slices by 8 OFM slices." Referring to fig. 3FC, in some embodiments a dummy (virtual) SRAM bank group may be used to handle the case where an SRAM bank group (which may have a capacity of about 32 kB) becomes filled with IFM data or OFM data.
In such a case, the IFM (and OFM) transfer structures of the neural processor 100 may be transparently switched (transparently from the perspective of the blocks receiving the IFM stream) to connect to another SRAM bank group. As previously described, the IFM and OFM tensors may be too large to be stored in a single SRAM bank group 109, and may thus need to be split into multiple sub-tensors, each small enough to fit into an SRAM bank group 109 for storage. Global control logic 140 contains configuration registers that describe how the IFM and OFM tensors are split and stored in the SRAM bank groups, including the IFM and OFM sub-tensor indices, their sizes, the indices of the SRAM bank groups storing each sub-tensor, and the addresses at which each sub-tensor is stored within the associated SRAM bank group.
As the computation proceeds and the IFM (OFM) traversal moves from the sub-tensor stored in one SRAM bank group 109 to another sub-tensor stored in another SRAM bank group 109, global control FSM 140 orchestrates the operational reconfiguration of the IFM and OFM transfer structures, switching the IFM source (and OFM destination) SRAM bank group from the current SRAM bank group to the next SRAM bank group. In some embodiments, reconfiguration is done in a manner that is transparent to the blocks consuming the IFM (and the blocks generating the output) and does not stop or slow down the computation during bus switching.
As previously described, a piece of software referred to herein as the "mapper" may decide statically (at compile time) how to partition the storage of the entire IFM and OFM, as well as of the weight kernels and partial results, across the SRAM bank groups and the physical SRAM banks. As depicted in fig. 3FC, for clarity of the mapping explanation, the details of the physical IFM and OFM storage across multiple SRAM bank groups may be ignored, and the SRAM bank groups may be treated as a "virtual" or "logical" view 306 of the IFM and OFM.
In the seventh example depicted in fig. 3 GA-3 GD, sixteen tiles are used to determine or calculate a 3x3x32x512 convolution. In this example, the same convolution as in the sixth example is calculated using fewer IFM passes to save energy. Referring to fig. 3GA, each multiplier cell weight register file 127 may have 18 weights, only 9 of which 18 weights are used for the 3x3 convolution in the sixth example. In this way, two sets of 3x3 weights may be stored (as opposed to one set), and two sets of 3x3 weights are "cycled" through time. In particular, a 3x3x32x512 convolution may be split into two 3x3x16x512 convolutions interleaved in time. Referring to fig. 3GB, in a manner similar to that of the third example, a 3x3x16x512 convolution may be mapped to 16 physical blocks. For each IFM pass, one IFM slice may be read from the SRAM bank group and broadcast to 16 physical blocks, which output 128 OFM channels to 8 SRAM bank groups. In this example, four IFM passes (and four OFM steps) are taken to complete the OFM calculation.
Referring to fig. 3GC, in some embodiments, in a first step, IFM [0..15] may be input to calculate the convolution of OFM [0..127] at OFM location (x, y), but rather than writing the result to SRAM, the OFM result may be held in an accumulator. Referring to fig. 3GD, in a second step, each multiplier unit weight register file 127 may then switch to a second set of 3x3 weights and input IFMs [16..31] to complete the calculation OFM [0..127 ]. This process may be referred to herein as an "IFM weight loop". Then, in a third step, OFM [0..127] can be saved to SRAM and the accumulator cleared. These three steps may be repeated until the calculation is complete.
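A minimal sketch of the IFM weight cycling just described follows (hedged: a software analogy in which each MU holds two weight sets and the accumulator is written out only after both IFM slices have been reduced; names are illustrative):

    import numpy as np

    # Sketch: IFM weight cycling for a 3x3x32xN convolution at one output position.
    # Two 3x3x16xN weight sets live in the MU register files; the accumulator is
    # cleared once, reduced over IFM[0..15] with set 0, then over IFM[16..31]
    # with set 1, and only then written to SRAM.
    def ifm_weight_cycle(ifm_patch, weights):        # ifm_patch: 3x3x32, weights: 3x3x32xN
        n_ofm = weights.shape[-1]
        acc = np.zeros(n_ofm, dtype=np.int64)        # step 1: clear accumulators
        for ch in (slice(0, 16), slice(16, 32)):     # weight set 0, then weight set 1
            for y in range(3):
                for x in range(3):
                    acc += ifm_patch[y, x, ch] @ weights[y, x, ch, :]
        return acc                                   # step 3: apply activation, save, clear

    ifm_patch = np.random.randint(0, 256, (3, 3, 32)).astype(np.int64)
    weights = np.random.randint(-128, 128, (3, 3, 32, 8)).astype(np.int64)
    print(ifm_weight_cycle(ifm_patch, weights))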
Referring to fig. 3GA, in some embodiments a physical block that stores multiple sets of weights may be regarded as implementing multiple "logical" blocks, one per weight set, interleaved in time. It can be seen that in this seventh example, storing two sets of 3x3 weights forms two sets of 16 such logical blocks (i.e., 32 logical blocks) interleaved in time. With 32 logical blocks, more (i.e., a wider set of) OFM channels can be computed in each IFM pass, so that the number of IFM passes (and the SRAM IFM read energy) is reduced by a factor of two compared to the sixth example.
In the eighth example, depicted in figs. 3HA through 3HC, a 3x3x512x256 convolution is determined or calculated using sixteen physical blocks. Note that both the number of IFM channels and the number of OFM channels (512 and 256, respectively) are quite large in this example. As discussed in further detail below, partial results, or "partials," may be used when the convolution kernel is too large to compute otherwise. However, this example shows how a convolution with a large weight kernel can still be performed without using partials. The 3x3x512x256 convolution may be calculated as shown in fig. 3HB. For a 3x3 8-bit convolution, 2 sets of 3x3 8-bit weights may be stored in each multiplier unit, so that there are (2 sets of weights) × (16 physical blocks) = 32 logical blocks. The 32 logical blocks can reduce 32 IFM slices, so that the maximum number of IFM channels that can be processed without using partials is (32 slices) × (16 IFM channels per slice) = 512 IFM channels. Thus, a 3x3x512xN convolution can be computed without using partials, where N is any positive integer.
Referring to figs. 3HB and 3HC, 256 IFM channels per clock may be reduced using the in-block adder trees combined with the reduction structure 111. To reduce all 512 IFM channels (and generate 8 OFM channels), two weight cycles are performed. In weight cycle 1, as depicted in fig. 3HB, IFM [0..15] may be input to block 1, IFM [16..31] to block 2, and so on, with IFM [240..255] input to block 16. The adder trees may be joined across all 16 blocks (per column) using the hardware adder stages provided by the reduction structure 111. The adder tree root may end at block 16 (in the context of the reduction structure 111, the OFM transfer structure, and the adder trees, as discussed later), so that only block 16 generates results, while the accumulators of blocks 1..15 are not used in this configuration. In weight cycle 2, depicted in fig. 3HC, IFM [256..271] may be input to block 1, IFM [272..287] to block 2, and so on, with IFM [496..511] input to block 16. Block 16 may then write the completed OFM [0..7] (x, y) result to SRAM bank group 16. Finally, 32 IFM passes (32 OFM steps) may be performed to compute OFM [0..7], then OFM [8..15], and so on up to OFM [248..255]. Note that although the number of IFM passes and the number of OFM steps are the same in this specific example, the difference between IFM passes and OFM steps will become clearer in the following examples.
Fig. 3HD additionally depicts how the 3x3x512x256 convolution depicted in fig. 3 HA-3 HC can be changed to a 3x3x512x512 convolution simply by performing 64 IFM passes (64 OFM steps) instead of 32 IFM passes (32 OFM steps).
In the ninth example, depicted in figs. 3IA through 3IF, a 3x3x512x256 convolution is determined or calculated using 16 blocks and partial results. In some cases, using partials can save energy by reducing the number of SRAM reads (compared with, for example, the eighth example). When partials are used, the mapping algorithm may divide the weight tensor into multiple portions, specifically along the IFM depth channel direction, converting a single convolution operation (including loading the weight tensor, traversing the IFM, and writing the OFM) into two or more convolution operations. The outputs of the two or more resulting convolutions are then combined to produce the final result.
First, recall that figs. 3HB through 3HC depict a 3x3x512x256 convolution computed without partials. Figs. 3IA-3IB and 3IC-3ID depict the associated hardware resource mapping after the 512 IFM channels of the weight tensor (and of the corresponding IFM and OFM) are partitioned into 256 and 256, corresponding to two separate convolutions, each of size 3x3x256x256.
Figs. 3IA to 3IB depict the first of the two 3x3x256x256 convolutions. Because the weight kernel plane size is 3x3 = 9, each MU weight register file, capable of holding 18 8-bit weights, has sufficient capacity to store two sets of 3x3 weights, thus making 32 logical blocks available for the computation.
Eight IFM slices may then be loaded, and each IFM slice may be broadcast to 2 physical blocks. 16 OFM steps (16 IFM passes) may be performed. During the first weight cycle, as depicted in fig. 3IA, the 3x3 IFM [0..127] may be input, convolved with the first set of 3x3 weights, reduced using the adder trees, and accumulated in the accumulator registers of blocks 8 and 16. Referring to fig. 3IB, during the second weight cycle, the 3x3 IFM [128..255] may be input, convolved with the second set of 3x3 weights, reduced using the adder trees, and further accumulated in the accumulator registers of blocks 8 and 16. At this point, the convolution of the 3x3 IFM [0..255] with the corresponding 3x3x256x16 weight kernel is complete for OFM channels 0..15 and may be written, as a partial result, to virtual SRAM bank groups 8 and 9. Since this is a partial result, the value of the accumulator (Acc) 130 bypasses the activation function module 197 on the way to the SRAM, as opposed to a complete result. Optionally, to reduce the SRAM size requirement and power consumption, the bit range selection module 187 may reduce the bit width of the partial results by rounding, e.g., down to 4 bytes when 8-bit activations and weights are used, or down to 6 bytes when 16-bit activations and weights are used.
The above steps are repeated until the entire IFM [0..255] (i.e., all desired planar (x, y) positions) has been processed in one pass over IFM [0..255], resulting in a corresponding set of partial results calculated for OFM [0..15]. Partial results for the remaining OFM channels [16..255] are calculated by performing another 15 passes (corresponding to another 15 OFM steps) over IFM [0..255].
Note that in this mapping example, the OFMs that are physically and simultaneously generated in one pass are widened (spread) by two times using two partial passes (from one OFM slice to two OFM slices). Further, the size of the IFM tensor processed during each partial pass is reduced by a factor of two from HxWx512 to HxWx 256.
As depicted in fig. 3IC and 3ID, respectively, the second portion of IFM passes may be the same as the first portion, except that IFM [256..383] may be input during a first weight cycle, and IFM [384..511] may be input during a second weight cycle.
As in the ARU 167, completing the original 3x3x512x256 convolution includes adding the partial results (from the two 3x3x256x256 convolutions) element by element and applying the scaling, bias, and activation functions. There are a number of ways to accomplish this final step, including: (i) reading the partial results generated by the first partial convolution and sending them through the IFM transport structure 104 to the block ARUs 167 to be summed element-wise with the partial results of the second group, so that the ARUs 167 generate the final results during the second partial convolution; (ii) having the ARUs 167 output partials during both partial convolutions, while additional logic in the SRAM bank groups 109 performs read-modify-write operations to add the partials and apply the activation function; more specifically, the additional logic for finishing the partials receives the partial results during the second partial convolution, reads the results of the first partial convolution from the SRAM, sums the two on the fly, applies the activation function, and writes the final results back to the SRAM; (iii) having additional logic in the SRAM bank groups 109 that supports read-add-write operations on partials, which keeps adding the partial results of two or more partial operations element by element without applying the activation function, the computation then being finished during the final partial round by reading the accumulated partials and sending them to the block ARUs 167.
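A minimal sketch of option (ii), the read-modify-write style of finishing partials, follows (hedged: the scale, bias, and choice of activation are placeholders; real hardware would operate on streaming 4- or 6-byte partials rather than arrays):

    import numpy as np

    # Sketch: combine two partial convolutions element-wise and finish the result.
    def finish_partials(partial_from_sram, partial_incoming, scale=1.0, bias=0.0):
        acc = partial_from_sram.astype(np.int64) + partial_incoming.astype(np.int64)
        result = acc * scale + bias          # scaling and bias
        result = np.maximum(result, 0)       # activation (ReLU assumed here)
        return np.clip(np.rint(result), 0, 255).astype(np.uint8)   # 8-bit OFM

    p1 = np.random.randint(-2**20, 2**20, size=(10, 10, 16), dtype=np.int32)  # stored partials
    p2 = np.random.randint(-2**20, 2**20, size=(10, 10, 16), dtype=np.int32)  # arriving partials
    ofm = finish_partials(p1, p2, scale=2**-12)
    print(ofm.shape, ofm.dtype)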
Unlike the case where no partials are used, when partials are used the OFM height and width should be considered when planning the convolution operation. Referring to fig. 3IE, each partial result may be stored using four bytes (assuming both the IFM and OFM are 8-bit). In this case, the SRAM storage size for the partial results equals (OFM height) × (OFM width) × (OFM depth) × (4 bytes). As depicted, if the SRAM (on-chip) storage capacity is insufficient for the partial results, the OFM data can be divided into planar sub-windows and processed one at a time. However, each time a sub-window is processed, the entire set of kernel weights may need to be loaded (or reloaded), which may increase power consumption. For example, assume the OFM planar size is 10x10 and the IFM planar size equals the OFM planar size. In this case, the IFM and OFM are relatively small, while the weight kernel is relatively large: 3 × 3 × 512 × 256 ≈ 1.2 megabytes. The SRAM size needed to store the entire partial result for the whole OFM planar size, without subdividing it into planar sub-windows, is 10 × 10 × 256 × 4 = 102,400 bytes. For simplicity, it is further assumed that the SRAM has sufficient capacity, so that the use of sub-windows is not required.
Fig. 3IF summarizes the process of calculating the convolution in this example: a first set of partials is determined or calculated over IFM [0..255] for all OFM [0..255] and saved; a second set of partials is then determined or calculated over IFM [256..511] for all OFM [0..255] (but not written to SRAM, because this is the last partial round); and as the second partial convolution is computed, the partials are added element by element, the activation function is applied on the fly, and the results are written to SRAM.
As previously described, using the MR tiles 102 to add the partials element by element and apply the activation function is optional. Instead, Auxiliary Planar and Activation Processing (APAP) units dedicated to element-wise and planar operations (with no reduction across channels) may be used. These units may be located inside the SRAM bank groups 109 and may access both the partials stored locally in the SRAM and the partials arriving at the SRAM bank group. The APAP units then write the completed results to SRAM 109.
The determination or calculation performed according to the ninth example can save a considerable amount of energy by using the two partial passes. Since the number of IFM passes is reduced from 32 to 16, the amount of IFM data read is reduced by (IFM height) × (IFM width) × (IFM channels) × (IFM passes saved) = 10 × 10 × 512 × (32 - 16) = 819,200 bytes (ignoring caching). The amount of partial data written to SRAM is (OFM height) × (OFM width) × (OFM channels) × (number of partial passes - 1) × (4 bytes) = 10 × 10 × 256 × (2 - 1) × 4 = 102,400 bytes; in other words, if the second partial pass saved its results to SRAM 109 instead of feeding them directly to the planar/activation units, twice that amount would result. Likewise, the amount of partial data read from SRAM 109 is (OFM height) × (OFM width) × (OFM channels) × (number of partial passes - 1) × (4 bytes) = 10 × 10 × 256 × (2 - 1) × 4 = 102,400 bytes; again, twice that amount would be incurred if the second partial pass saved its results to SRAM 109 instead of feeding them directly to the planar/activation units. Thus, performing the 3x3x512x256 (8-bit) convolution with partials rather than without them results in 819,200 fewer IFM bytes read from SRAM, at the cost of an additional 102,400 bytes written as partials to SRAM and another 102,400 bytes of partials read from SRAM.
Assuming that the energy of one SRAM write is about twice the energy of one SRAM read, the total SRAM energy saved is approximately 819,200 - 2 × 102,400 - 102,400 = 512,000 (in units of the energy of one SRAM read).
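The sketch below reproduces this traffic and energy estimate (hedged: the 2x write-to-read energy ratio and the byte counts follow the example above; the function and parameter names are illustrative):

    # Sketch: SRAM traffic / energy estimate for using partials (ninth example).
    def partials_energy_saving(ofm_h, ofm_w, ifm_ch, ofm_ch,
                               passes_without, passes_with, partial_rounds,
                               write_cost=2.0, partial_bytes=4):
        ifm_bytes_saved = ofm_h * ofm_w * ifm_ch * (passes_without - passes_with)
        partial_written = ofm_h * ofm_w * ofm_ch * (partial_rounds - 1) * partial_bytes
        partial_read = partial_written
        # expressed in units of "energy of one SRAM read"
        return ifm_bytes_saved - write_cost * partial_written - partial_read

    print(partials_energy_saving(10, 10, 512, 256,
                                 passes_without=32, passes_with=16, partial_rounds=2))
    # 512000.0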
In the tenth example, depicted in figs. 3JA through 3JD, four blocks are used to determine or calculate an 8x8x16x64 convolution. An 8x8 convolution has 8x8 = 64 planar weights, which may not fit into a single multiplier unit: a single multiplier unit 103 may store, for example, only 18 weights. Thus, as depicted in fig. 3JA, the 64 planar weight positions may be divided among the four blocks 102, such that block 1 stores W[0..1, 0..7, *, *], block 2 stores W[2..3, 0..7, *, *], block 3 stores W[4..5, 0..7, *, *], and block 4 stores W[6..7, 0..7, *, *], where the weight kernel notation is W[row, column, IFM channel, OFM channel] and "*" denotes the entire applicable range. The system may then add (reduce) across the blocks to compute OFM [0..7], so that each block effectively performs a 2x8x16x64 convolution, and the four 2x8x16x64 convolutions performed simultaneously by the four blocks aggregate into one 8x8x16x64 convolution. Each 2x8x16x64 convolution in turn consists of two 1x8x16x64 convolutions combined together using IFM weight cycling.
Fig. 3JB depicts the first step of the IFM weight cycling, in which the even (not yet the odd) rows within the convolution window are convolved. Here, block 1 convolves row 0, W[0, *, *, *], of the convolution window with the IFM values "a0, b0, c0, d0, e0, f0, g0, h0", while block 2 convolves row 2, W[2, *, *, *], of the convolution window with the IFM values "a2, b2, c2, d2, e2, f2, g2, h2". Block 3 convolves row 4, W[4, *, *, *], of the convolution window with the IFM values "a4, b4, c4, d4, e4, f4, g4, h4", and block 4 convolves row 6, W[6, *, *, *], of the convolution window with the IFM values "a6, b6, c6, d6, e6, f6, g6, h6". The products of the multiplier units 103 are reduced using the in-block adder trees and the additional adder tree stages provided by the reduction structure 111, and are accumulated in the accumulator register 130 of block 4 (as the IFM values "a, b, …" stream in).
Fig. 3JC depicts the second step of the IFM weight cycling, in which the odd rows within the convolution window are convolved. Here, block 1 convolves row 1, W[1, *, *, *], of the convolution window with the IFM values "a1, b1, c1, d1, e1, f1, g1, h1", while block 2 convolves row 3, W[3, *, *, *], of the convolution window with the IFM values "a3, b3, c3, d3, e3, f3, g3, h3". Block 3 convolves row 5, W[5, *, *, *], of the convolution window with the IFM values "a5, b5, c5, d5, e5, f5, g5, h5", and block 4 convolves row 7, W[7, *, *, *], of the convolution window with the IFM values "a7, b7, c7, d7, e7, f7, g7, h7". As in the first IFM weight cycling step, the products of the multiplier units 103 are reduced using the in-block adder trees and the additional adder tree stages provided by the reduction structure 111, and are accumulated in the accumulator register 130 of block 4. However, unlike in the first IFM weight cycling step, the accumulator register 130 is not cleared at the beginning of the second step, so that once both IFM weight cycling steps are complete, the accumulator register 130 contains the dot product of both the even and the odd rows.
The resulting OFM [0.. 7] can then be written to SRAM 109, completing the convolution of the 8x8x16x8 window for one OFM location. As depicted in fig. 3JD, to continue the computation, the convolution window may then be translated to compute the next 8x8 convolution. The process can be repeated until the entire OFM is completed.
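A minimal numerical check of this decomposition follows (hedged: it only verifies that splitting the 8x8 window by rows across four blocks and cycling even/odd rows reproduces the direct dot product; the variable names are illustrative):

    import numpy as np

    # Sketch: 8x8x16 window split by rows over 4 blocks, two IFM weight cycle steps.
    ifm_win = np.random.randint(0, 256, (8, 8, 16)).astype(np.int64)     # a0..h7
    weights = np.random.randint(-128, 128, (8, 8, 16)).astype(np.int64)  # one OFM channel

    acc = 0
    for step_rows in ((0, 2, 4, 6), (1, 3, 5, 7)):       # even rows, then odd rows
        for row in step_rows:                            # blocks 1..4 handle 2 rows each
            acc += np.sum(ifm_win[row] * weights[row])   # in-block + reduction-fabric adds
    direct = np.sum(ifm_win * weights)
    assert acc == direct
    print(acc)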
In the eleventh example, depicted in figs. 3KA and 3KB, sixteen blocks are used to determine or calculate an 8x8x64x64 convolution. The 8x8 convolution can be applied across 16 blocks, allowing more IFM and OFM channels to be used. As depicted in fig. 3KA, the 8x8 convolution is split over four physical blocks, so that the number of "logical" blocks is reduced by a factor of four, i.e., (16 physical blocks)/(4 physical blocks per group) = 4 logical blocks. As used herein, "physically grouping" physical blocks means joining their per-column adder trees into a single adder tree per column, in order to perform an operation that is too large for a single physical block 102.
Referring to fig. 3KA, the 8x8 convolution may be split across four blocks because the 8x8 convolution may be too large to fit into a single block 102. By joining the adder trees of four blocks into a single adder tree, the four blocks can be physically grouped into one logical block. Referring to fig. 3KB, mapping 8x8x64x64 onto 16 physical blocks is thereby logically transformed into mapping 8x8x64x64 onto 4 logical blocks, where each logical block has a weight capacity of 18 × 4 = 72 weights (per MU position), sufficient to hold the 8x8 = 64 convolution weights.
Fig. 3KB depicts the mapping of the 8x8x64x64 convolution operation onto the 4 logical blocks (and thus the 16 physical blocks). The transformed operation may be performed as follows. First, four IFM slices are read; all IFM channels are read at once to avoid using partials. Second, each IFM slice is "broadcast" to one logical block. Third, 8 OFMs (one OFM slice) are calculated in one IFM pass. This is repeated, so that (64 OFMs)/(8 OFMs per pass) = 8 IFM passes (8 OFM steps) are performed to calculate all OFM channels.
In some cases, more OFM channels may be needed, for example to determine or calculate an 8x8x64x1024 convolution. This is possible without using partials by adding more OFM steps, i.e., performing more IFM passes to re-read the IFM. In some cases, more IFM channels may be needed, for example to determine or calculate an 8x8x128x64 convolution. In such a case, it may be necessary to use partials unless (i) the number of physical blocks is increased or (ii) the number of weights per multiplier is increased. However, in some applications, large planar convolutions like 8x8 may be applied only to RGB images or images with a small number of IFM channels. An MU weight register file 127 holding N weights may hold convolution kernels with planar sizes up to H × W ≤ N, where H and W are the planar height and width of the weight kernel. For example, an MU 103 with a capacity of 18 8-bit weights may hold convolution kernels including 4x4, 5x3, 3x5, 6x2, 2x6, 7x2, 2x7, 8x2, 2x8, 9x2, 2x9, 18x1, and 1x18. In practice, the need to compute an 8x8x128x64 convolution may be minimal, and such a computation may therefore be performed by the CPU rather than by the neural processor 100, making the associated additional hardware logic in the neural processor optional. For clarity, the IFM, OFM, and reduction structure descriptions omit the connectivity required for cases where H × W > N (such as the case described in this example).
In the twelfth example, depicted in figs. 3LA through 3LD, sixteen blocks are used to determine or calculate a 1x1x1024x64 convolution. Each MU may hold 18 weights. Since a 1x1 convolution requires only 1x1 = 1 weight per window, (18 weights per multiplier)/(1 weight per convolution window) = 18 sets of 1x1 convolution weights can fit into each block. The number of logical blocks can be calculated as (16 physical blocks) × (18 sets of convolution weights per multiplier) = 288 logical blocks. The calculation of the 1x1x1024x64 convolution using 16 physical blocks may thus be transformed into a calculation using up to 288 logical blocks. All 1,024 IFM channels may be read in one IFM pass to avoid partials. With 288 logical blocks, IFMs of up to (16 IFM channels per IFM slice) × (288 logical blocks) = 4,608 channels can be accepted. The 1x1x1024x64 convolution requires only 1,024 of the available 4,608 IFM channels, without the use of partials. Thus, the number of OFM slices that can be calculated per IFM pass is floor((4,608 maximum IFM channels)/(1,024 IFM channels)) = 4 OFM slices.
The determination or calculation may be performed as follows. First, 16 sets of 1x1 weights may be stored in each MU. During each OFM step (IFM pass), all 64 IFM slices (all 1,024 IFM channels) are read. Physically, this corresponds to reading (64 IFM slices)/(16 sets of 1x1 weights per MU) = 4 IFM slices at a time. Each of the four IFM slices may be broadcast to (16 physical blocks)/(4 IFM slices) = 4 blocks, so as to compute 4 OFM slices in one OFM step (and one IFM pass). The OFM can then be calculated using (8 OFM slices)/(4 OFM slices per pass) = 2 OFM steps (and 2 IFM passes). The IFM weights are cycled 16 times.
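A small sketch of this mapping arithmetic follows (hedged: the formulas mirror the example above; the function and parameter names are illustrative and partials are ignored):

    # Sketch: mapping arithmetic for a 1x1 convolution over 16 physical blocks.
    def map_1x1(ifm_channels, ofm_channels, phys_blocks=16, mu_weights=18,
                ifm_slice=16, ofm_slice=8):
        weight_sets = mu_weights // 1                 # 1x1 kernel: 1 weight per set
        logical_blocks = phys_blocks * weight_sets    # 16 * 18 = 288
        max_ifm_channels = logical_blocks * ifm_slice # 4608
        assert ifm_channels <= max_ifm_channels       # no partials needed
        ofm_slices_per_pass = max_ifm_channels // ifm_channels              # 4
        ofm_steps = -(-ofm_channels // (ofm_slices_per_pass * ofm_slice))   # ceil
        return logical_blocks, ofm_slices_per_pass, ofm_steps

    print(map_1x1(1024, 64))   # (288, 4, 2)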
Specifically, referring to fig. 3LA, the calculation of the convolution may proceed through the following steps. In a first step, the accumulators are cleared. In a second step, IFM [0..15], IFM [16..31], IFM [32..47], and IFM [48..63] are fetched and broadcast to blocks 1, 5, 9, and 13, blocks 2, 6, 10, and 14, blocks 3, 7, 11, and 15, and blocks 4, 8, 12, and 16, respectively. In a third step, the system accumulates the dot products calculated by blocks 1..4 (for OFM [0..7]), blocks 5..8 (for OFM [8..15]), blocks 9..12 (for OFM [16..23]), and blocks 13..16 (for OFM [24..31]) as intermediate (unfinished) results into the accumulator registers of blocks 4, 8, 12, and 16, respectively.
Referring to fig. 3LB, in a fourth step, the accumulators are not cleared and the MUs 103 switch to the next set of 1x1 weights, corresponding to a step of the IFM weight cycling. In a fifth step, IFM [64..79], IFM [80..95], IFM [96..111], and IFM [112..127] are fetched and broadcast to blocks 1, 5, 9, and 13, blocks 2, 6, 10, and 14, blocks 3, 7, 11, and 15, and blocks 4, 8, 12, and 16, respectively. In a sixth step, the system accumulates the dot products calculated by blocks 1..4 (for OFM [0..7]), blocks 5..8 (for OFM [8..15]), blocks 9..12 (for OFM [16..23]), and blocks 13..16 (for OFM [24..31]) as intermediate (unfinished) results into the accumulator registers of blocks 4, 8, 12, and 16, respectively.
Referring to fig. 3LC, the calculation may proceed by continuing to cycle through the IFM weights (16 IFM weight cycling steps in total), fetching and broadcasting IFMs, and calculating and accumulating dot products until the last IFM slices are reached (channels 960 through 1023). At that step, the accumulators are not cleared and the MUs 103 switch to the next (sixteenth and last) set of 1x1 weights, corresponding to the last step of the IFM weight cycling. In the next step, IFM [960..975], IFM [976..991], IFM [992..1007], and IFM [1008..1023] are fetched and broadcast to blocks 1, 5, 9, and 13, blocks 2, 6, 10, and 14, blocks 3, 7, 11, and 15, and blocks 4, 8, 12, and 16, respectively. Next, the system adds the dot products calculated by blocks 1..4 (for OFM [0..7]), blocks 5..8 (for OFM [8..15]), blocks 9..12 (for OFM [16..23]), and blocks 13..16 (for OFM [24..31]) into the accumulator registers of blocks 4, 8, 12, and 16 to obtain the final dot product results. In the next step, the activation function is applied to the dot product results accumulated in the accumulator registers of blocks 4, 8, 12, and 16, and the four resulting OFM slices are written to SRAM. This completes the calculation of OFM [0..31].
Referring to fig. 3LD, the system then proceeds to the next OFM step (by performing another IFM pass) and repeats the calculation, this time for OFM [32..63]. The system loads the weights for the next OFM step: W[0, 0, 0..1023, 32..63]. As depicted in figs. 1K and 1N, weight loading may occur concurrently with computation using the vertical weight loading bus 101, in which case the weight loading process causes no additional delay. The system may clear the accumulators and switch the MUs 103 to the first set of 1x1 weights. The system may then repeat the operations described in the context of figs. 3LA through 3LC to calculate OFM [32..63].
As depicted in fig. 3LD (similar to the case of fig. 3LC), once the system has completed 15 of the 16 IFM weight cycles, fetched the corresponding IFM slices, and calculated and accumulated the intermediate dot product results, it reaches the last (sixteenth) round of the IFM weight cycling. In this round, the accumulators are not cleared and the MUs 103 switch to the last (sixteenth) set of 1x1 weights. The system fetches IFM [960..975], IFM [976..991], IFM [992..1007], and IFM [1008..1023] and broadcasts them to blocks 1, 5, 9, and 13, blocks 2, 6, 10, and 14, blocks 3, 7, 11, and 15, and blocks 4, 8, 12, and 16, respectively. Next, the system accumulates the dot products calculated by blocks 1..4 (for OFM [32..39]), blocks 5..8 (for OFM [40..47]), blocks 9..12 (for OFM [48..55]), and blocks 13..16 (for OFM [56..63]). At the end of this process, the system applies the activation function 197 (in blocks 4, 8, 12, and 16) to the completed dot products stored in the accumulators 130 (of blocks 4, 8, 12, and 16) and writes the final OFM [32..63] results to SRAM, completing the convolution operation.
Consider now fully connected (FC) layer computation, as opposed to a convolution operation. Consider first the simple case of a 16x8 FC computation using a single block and a single IFM sample. Note that an FC layer computation is similar to a 1x1 convolution (described in the previous example), except that the weights are discarded after being multiplied with the IFM. A single 16x8 FC computation may be accomplished by loading 1 weight into each MU, fetching a single IFM [0..15] slice, calculating the dot products using the block's adder trees, applying the activation function to the resulting dot products, and writing the completed OFM [0..7] result to SRAM 109.
Consider next the case where a 16x16 FC is determined or calculated by a single block 102 with a single IFM sample. A single 16x16 FC calculation may be accomplished by loading 2 weights into each MU 103, fetching a single IFM [0..15], and having the MUs 103 select the first of the two preloaded weights for multiplication. OFM [0..7] may then be calculated as described above. Next, the MUs 103 select the second of the two preloaded weights for multiplication to calculate OFM [8..15]. The process of cycling through MU weights to compute multiple OFMs from the same IFM is referred to herein as "OFM weight cycling."
Note that the 16x16 FC calculation is thus done in one IFM pass but two OFM steps (corresponding to the two OFM weight cycles). Hence, as observed in most of the other examples, the number of OFM steps is generally equal to the number of IFM passes unless OFM weight cycling is used.
Consider another simple case: using a single block and a single IFM sample to determine or compute a 16x128 FC. This may be achieved by loading 16 weights into each MU 103 and fetching a single IFM slice. The 16 OFM steps may then be performed by OFM weight cycling (i.e., by cycling through the MU weights to calculate OFM [0..7], OFM [8..15], …, OFM [120..127] one after another).
Consider the simple case of using a single block for a batch of 18 IFM samples to determine or calculate a 16x8 FC (i.e., the IFM tensor shape can be expressed as 1x16x18). As a side note, because the neural processor 100 performs inference (rather than training), the mapping examples have implicitly assumed an IFM batch size of 1, as is typical in inference applications. Computations with an IFM batch size greater than 1 may also be mapped onto the hardware; for example, the calculation may simply be repeated, as already described, for each sample in the IFM batch. However, a single-block 16x8 FC calculation for a batch of 18 IFM samples may utilize the MU weight register file capacity by preloading 18 weights into each MU 103, one weight per IFM sample. The calculation may then proceed by fetching the first IFM sample IFM [0..15][0] (of the 18 in the batch), computing the dot product of the fetched IFM sample with the first of the 18 weights in each MU, applying the activation function, and writing the resulting OFM [0..7][0] to SRAM. Next, the IFM [0..15][1] sample is fetched and multiplied by the second of the 18 weights in each MU 103 to obtain OFM [0..7][1] after the activation function is applied. This sequence continues until the entire batch of IFM [0..15][0..17] samples (18 in total) has been processed, resulting in a batch of OFM [0..7][0..17] samples. Cycling through MU weights to process a batch of IFM samples may be referred to herein as "IFM batch cycling." Note that IFM weight cycling, OFM weight cycling, and IFM batch cycling may be combined to perform a calculation, as long as the MU weight register file capacity is sufficient.
In the thirteenth example, depicted in figs. 3MA and 3MB, a single block is used to perform a 288x8 fully connected determination or calculation. Referring to fig. 3MA, as previously described, a fully connected calculation is similar to a 1x1 convolution in which the convolution window is not shifted and the weights are not reused, being discarded after a single use. One block 102 can compute 8 OFM channels (i.e., 1 OFM slice) in parallel. The 288 IFM channels correspond to 288/(16 rows per MR block) = 18 slices. The system may use the 18 weights in each MU 103 to store all 18 slices' worth of FC weights.
To perform the fully connected computation, the system may perform the following steps (which may overlap in time to some extent). In a first step, the weights are loaded from SRAM 109; as depicted in figs. 1K and 1N, the weights may be loaded concurrently with computation using, for example, the vertical weight loading bus 101, assuming the FC weights have been placed into SRAM 109. In a second step, the accumulators for OFM [0..7] are cleared. In a third step, one IFM [0..15] sample is input into the block, and the resulting dot product is added to the OFM [0..7] accumulators 130 to form an intermediate (unfinished) result.
In a fourth step, the OFM [0..7] accumulators are left uncleared and the system switches to the next set of FC weights (IFM weight cycling). In a fifth step, IFM [16..31] is input into the block and the resulting dot product is added to the OFM [0..7] accumulators. Referring to fig. 3MB, these steps are repeated until all IFM channels (and the associated weights) have been cycled through, IFM [280..287] being the last slice. Finally, the activation function is applied to the accumulated dot products, and the final OFM [0..7] result is written to SRAM. This completes the fully connected computation.
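A minimal software sketch of this 288x8 fully connected computation follows (hedged: it models one block cycling through 18 slices of FC weights stored per MU; the names, data types, and ReLU activation are illustrative assumptions):

    import numpy as np

    # Sketch: 288x8 FC on one block via IFM weight cycling (18 slices of 16 channels).
    ifm = np.random.randint(0, 256, 288).astype(np.int64)          # one IFM sample
    weights = np.random.randint(-128, 128, (288, 8)).astype(np.int64)

    acc = np.zeros(8, dtype=np.int64)                     # clear OFM[0..7] accumulators
    for s in range(18):                                   # 18 IFM slices / weight sets
        ifm_slice = ifm[16 * s: 16 * (s + 1)]             # IFM[16s .. 16s+15]
        acc += ifm_slice @ weights[16 * s: 16 * (s + 1)]  # column adder trees
    ofm = np.maximum(acc, 0)                              # activation (ReLU assumed)
    assert np.array_equal(np.maximum(ifm @ weights, 0), ofm)
    print(ofm)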
In the fourteenth example depicted in fig. 3NA, a 288x64 full connectivity determination or calculation is performed. In this example, the OFM channel count is increased from 8 (in the thirteenth example) to 64. This is equivalent to the thirteenth example if the system splits the FC 288x64 calculation into 8 smaller FC calculations of size 288x8 and performs the calculations one by one (e.g., in 8 OFM steps). This results in 8 IFM passes.
In the fifteenth example, depicted in figs. 3OA through 3OC, a 1024x32 fully connected determination or calculation is performed on a single IFM sample (i.e., with a batch size of 1). Referring to fig. 3OA, because an FC computation is similar to a 1x1 convolution, there may be up to (18 weights per MU) × (16 physical blocks) = 288 logical blocks, each logical block performing a 1x1 convolution. In this way, the system can read all 1,024 IFM channels (1024/16 = 64 IFM slices) in a single pass to avoid partials.
To read all 64 IFM slices, 64 logical blocks per OFM slice may be used. The calculation involves computing 32 OFMs (4 OFM slices). In order to calculate the 32 OFMs in one pass (all OFMs computed simultaneously), (64 IFM slices) × (4 OFM slices) = 256 logical blocks may be used. Thus, the available number of logical blocks (288) is sufficient. The number of logical blocks may be reduced to 256, as needed, by storing 16 weights in each MU 103 (instead of storing up to 18 weights per MU 103).
The calculation may be performed as follows. The system may store 16 sets of FC weights in each MU 103 and use 256 logical blocks (as described above). The entire calculation can then be done in a single IFM pass, computing four OFM slices. Each of the four IFM slices fetched at a time may be broadcast to four blocks. Since 16 sets of FC weights are stored in each MU, the weights are cycled sixteen times. The sequence may include the following steps. In a first step, the OFM accumulators are cleared. In a second step, IFM [0..63] (4 IFM slices) is fetched and each slice is broadcast to four blocks. In a third step, the not-yet-complete OFM [0..31] (4 OFM slices) is calculated and added to the OFM accumulators.
Referring to fig. 3OB, in a fourth step, the OFM accumulators are left uncleared and the next set of weights is selected. In a fifth step, IFM [64..127] (4 IFM slices) is fetched. In a sixth step, the system continues calculating the (not yet complete) OFM [0..31] (4 OFM slices) by adding the sums of products to the OFM accumulators. Referring to fig. 3OC, the system may continue cycling through the weights and accumulating the OFM results until all IFM channels have been processed. As a final step, the system fetches IFM [960..1023] and accumulates into OFM [0..31], then applies the activation function to the accumulated OFM [0..31] and writes the result to SRAM 109.
In the sixteenth example, depicted in figs. 3PA through 3PC, a 4096x1024 fully connected determination or calculation is performed using sixteen blocks and a batch size of 1. This calculation uses (4096)/(16 IFM channels per slice) = 256 IFM slices and (1024)/(8 OFM channels per slice) = 128 OFM slices. As in some of the other examples described above, it may be advantageous to read the entire IFM in one pass to avoid partials. Up to (18 weights per MU) × (16 physical blocks) = 288 logical blocks may be used to perform the calculation, and to read the entire IFM, 256 logical blocks are needed; thus, the available number of logical blocks (288) is sufficient. The system may be configured to use 256 logical blocks by loading 16 sets of weights into each MU 103. To read the 256 IFM slices (without partials) in one pass, all 256 logical blocks are used. Thus, (256 logical blocks)/(256 IFM slices) = 1 OFM slice is generated per IFM pass, and to complete the computation, (128 OFM slices)/(1 OFM slice per IFM pass) = 128 OFM steps (and hence 128 IFM passes) are performed.
The physical configuration is depicted in fig. 3PA. The reduction structure 111 is configured to reduce the outputs of all 16 blocks into a single OFM slice. 16 IFM slices (from 16 virtual SRAM bank groups) are fetched, and each IFM slice is "broadcast" to only one block 102.
The calculation may be performed in several steps as follows. In a first step, the OFM [0..7] accumulator is cleared. In a second step, 16 IFM slices (IFM [0..255]) are taken and reduced to the OFM [0..7] accumulator as intermediate (incomplete) results.
In a third step, the OFM [0..7] accumulators are left uncleared and the system switches to the next set of FC weights in the MUs 103. In a fourth step, the next 16 IFM slices (IFM [256..511]) are fetched, reduced, and added to the OFM [0..7] accumulators. As depicted in fig. 3PB, these steps continue until all IFM slices (up to and including IFM [4080..4095]) have been processed. The activation function is then applied to the accumulated dot products (in block 16) and the final result is written to SRAM 109. This completes the calculation of OFM [0..7]. Referring to fig. 3PC, to perform the next OFM step, the system may repeat the previous calculation for OFM [8..15], loading the weights W[0..4095, 8..15], and continue stepping the OFM until all OFMs have been calculated (up to OFM [1016..1023]), completing the entire FC calculation.
FC computation cases may exist in which the IFM has more than (18 weights) x (16 IFM channels per IFM slice) x (16 physical blocks) = 4608 channels. In such a case, partial results may be used: the IFM channels are split into parts (each of a size small enough to map onto the existing physical hardware), FC is computed for each part separately, the partial results (stored in SRAM) are added element by element as previously described, and the computation is finished by applying the activation function.
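A functional sketch of this partial-result strategy is given below (ours; plain NumPy, with the SRAM modeled as an array and ReLU assumed as the activation function):

    import numpy as np

    # Sketch (ours) of FC computation with partials when the IFM exceeds the
    # 4608-channel limit: split the channels, accumulate per-part results in SRAM
    # (modeled here as a NumPy array), then apply the activation function once.
    MAX_CH = 18 * 16 * 16                      # 4608 channels mappable at once
    ifm = np.random.randn(6144)                # example IFM larger than MAX_CH
    weights = np.random.randn(6144, 64)

    partial = np.zeros(64)                     # partial results held in SRAM
    for start in range(0, ifm.size, MAX_CH):   # one mapping per IFM part
        part = slice(start, min(start + MAX_CH, ifm.size))
        partial += ifm[part] @ weights[part]   # element-wise add of partial results

    ofm = np.maximum(partial, 0)               # activation applied only at the end
    assert np.allclose(ofm, np.maximum(ifm @ weights, 0))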
In the case where the weights are 16 bits, as described earlier, the MU weight register file capacity becomes 9 (16-bit) weights instead of 18 (8-bit) weights, and the calculation can be performed using multi-cycle operation. Similar reasoning applies to larger weight bit lengths (e.g., 24-bit or 32-bit, where, for example, the MU weight register file 127 has sufficient capacity to hold 6 24-bit weights or 4 32-bit weights).
Alternatively, rather than mapping an operation to all available physical blocks, the neural processor may be logically subdivided into several neural processors, each with a smaller number of blocks. For example, a neural processor with 16 physical blocks may be logically viewed as two neural processors, each having half the original number of blocks (e.g., 8 blocks per neural processor), or as four neural processors, each having one-fourth the original number of blocks (e.g., 4 blocks per neural processor), and so on. Each neural processor resulting from such a subdivision follows substantially the same mapping principles as described above, given the number of physical blocks remaining after the subdivision. Subdividing the neural processor into multiple smaller neural processors may be desirable for operations that require a relatively small IFM reduction and generate relatively few OFM channels (more specifically, a small product of the two). For example, a 1x1x32x32 convolution mapping requires only 4 blocks. If it is mapped onto 16 blocks, a 1x1x32x32 convolution leaves 12 of the 16 blocks unused, thus significantly reducing multiplier utilization. In a case like this, a neural processor with 16 physical blocks can instead be subdivided into four neural processors, each with 4 blocks; the 1x1x32x32 convolution is mapped onto each of the four resulting neural processors; an IFM tensor, e.g., of size HxWx32, is subdivided into four non-overlapping IFM tensors of size H/2xW/2x32; one such quarter-sized IFM tensor is assigned to each of the four smaller neural processors; and the convolution is thus computed for all four IFM sub-tensors in parallel, as sketched below. Note that such small weight tensor sizes may be relatively uncommon, and operating modes like this need to be properly supported by the IFM, OFM, and reduction structures.
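A rough sketch of the quartering scheme described above follows (ours; the sub-processor names are hypothetical, and only the partitioning of the IFM tensor is shown, not the convolution itself):

    import numpy as np

    # Rough sketch (ours) of logically subdividing a 16-block neural processor into
    # four 4-block processors, each convolving one spatial quadrant of an HxWx32 IFM.
    H, W, C = 8, 8, 32
    ifm = np.random.randn(H, W, C)

    quadrants = {
        "np0 (blocks 0-3)":   ifm[:H // 2, :W // 2, :],
        "np1 (blocks 4-7)":   ifm[:H // 2, W // 2:, :],
        "np2 (blocks 8-11)":  ifm[H // 2:, :W // 2, :],
        "np3 (blocks 12-15)": ifm[H // 2:, W // 2:, :],
    }
    for name, sub in quadrants.items():
        # Each sub-processor runs the same 1x1x32x32 convolution mapping in parallel.
        print(name, "handles an IFM sub-tensor of shape", sub.shape)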
Various mappings of neural network layer operations onto the available hardware require support from the IFM transport structure 104, the OFM transport structure 106, and the reduction structure 111. FIG. 4AA depicts a physical layout sketch of a neural processor having 16 hardware blocks 102 and 16 SRAM bank groups 109. In one embodiment, the SRAM bank group 109 memories may be placed in a distributed manner, with each SRAM bank group 109 immediately adjacent (local) to one of the blocks 102, forming a block and SRAM bank group unit 401. This allows IFM and OFM data to be streamed between each block 102 and its local SRAM 109 in a highly parallel manner, i.e., running up to 16 IFM and/or OFM streams in parallel, in order to avoid bandwidth bottlenecks between the SRAM and the compute blocks, which might exist if the SRAM were aggregated into a larger storage array and placed further away from the blocks (i.e., if the memory were not distributed).
Fig. 4AB and 4AC depict the connections between a block 102 and its local SRAM bank group 109, as well as the contents of the SRAM bank group 109. Each SRAM bank group 109 may have four SRAM banks B0, B1, B2, B3 to provide sufficient bandwidth for the concurrent read and write operations that service the IFM and OFM transfer structures, CPU access through the AXI port (not shown), reading and writing of partial results, and weight loading. FIG. 4AB depicts the path from banks B0, B1, B2, B3 through the multiplexer 403 to the IFM transport structure 104. This path may transfer up to two IFM slices per computational clock in order to supply enough IFM data to the blocks when activation zero skipping is active. The IFM transfer structure 104 is coupled to the blocks 102 to bring in IFM data from the local SRAM bank group as well as from the other 15 SRAM bank groups. Each SRAM bank group 109 also supplies weights directly to its local block 102, specifically to the weight decompression unit 138 within the local block 102. To make weight loading fast, all four SRAM banks B0-B3 may fetch weights in parallel and input the weights to the WDU 138. Unlike in convolution, it is particularly important to load weights into the blocks as quickly as possible during fully-connected layer computation, because FC weights are not reused and are discarded after each multiplication.
The plurality of MU weight register files 127 in each MR block 102 may together accept a weight kernel of size 18 x 16 x 8 = 2304 bytes = 144 words, where each word has 128 bits. For example, if the total SRAM capacity available to the neural processor 100 is 2M (megabytes), each SRAM bank group has (2M bytes)/(16 SRAM bank groups) = 128K (kilobytes). Furthermore, if each SRAM bank group contains 4 SRAM banks, each SRAM bank size is (SRAM bank group size)/(number of SRAM banks per SRAM bank group) = 128K/4 = 32 kbytes. Thus, each of the four local SRAM banks may store 144/4 = 36 words (of the 2048 available words) of the weight kernel.
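The storage arithmetic above can be summarized in a few lines (illustrative only; the variable names are ours and the 2-megabyte total SRAM capacity is the example figure used above):

    # Illustrative storage arithmetic (ours) for the per-block weight kernel above.
    weights_per_mu, rows, cols, bytes_per_weight = 18, 16, 8, 1
    kernel_bytes = weights_per_mu * rows * cols * bytes_per_weight    # 2304 bytes
    word_bits = 128
    kernel_words = kernel_bytes * 8 // word_bits                      # 144 words

    sram_total = 2 * 1024 * 1024                                      # 2 MB example
    bank_groups, banks_per_group = 16, 4
    bank_bytes = sram_total // bank_groups // banks_per_group         # 32 KB per bank
    bank_words = bank_bytes * 8 // word_bits                          # 2048 words
    words_per_bank_for_kernel = kernel_words // banks_per_group       # 36 of 2048 words
    print(kernel_bytes, kernel_words, bank_bytes, bank_words, words_per_bank_for_kernel)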
FIG. 4AC depicts the local OFM connections between a block and its local set of SRAM banks. Block 102 outputs the completed or partial results to the OFM transfer structure, which transfers the data to the local SRAM bank set as well as other SRAM bank sets elsewhere, and makes the data available to SRAM banks B0-B3 via demultiplexer 405.
The following paragraphs discuss the IFM data transfer structure 104 and the OFM data transfer structure 106. The IFM transfer structures 104 form connections and transfer data from the SRAM bank group 109 to the block 102, while the OFM transfer structures 106 form connections and transfer data from the block 102 back to the SRAM bank group 109.
Given the task of bringing IFM data from the SRAM banks to the blocks and OFM data from the blocks back to the SRAM, it may appear that the connections between the SRAM bank groups and the blocks must be all-to-all, and that the connections between the blocks and the SRAM bank groups must also be all-to-all. Having all-to-all connections may require the use of crossbar switches (e.g., 16-to-16), which in such cases may consume excessive silicon area and is thus highly undesirable. More specifically, the area of a full crossbar is proportional to O(N x M), where N is the number of switch inputs and M is the number of switch outputs. In the case where N = M = T = 16, where T is the number of physical blocks, O(N x M) becomes O(T^2), i.e., proportional to the square of the number of blocks, making it particularly expensive in silicon area to scale up the number of blocks (e.g., from 16 to 32 or 64).
However, as discussed in detail below, many-to-many connections between the blocks and the SRAM bank groups are not necessary. To reduce the size and complexity of the communication structure, some embodiments aim to store the OFM locally where it will be generated (by each of the physical blocks) by splitting the SRAM into non-overlapping stores. The IFM data is still transferred from the respective SRAM bank groups 109 to each of the blocks 102, however, the IFM transfer structure configuration can be reduced to 5 necessary modes corresponding to the 5 main modes of reduction between blocks. Note that, instead of storing the OFMs locally and acquiring the IFMs in a distributed (global) manner, the IFM transfer structure 104 and the OFM transfer structure 106 may be configured to extract the IFMs locally while writing the OFM results in a distributed (global) manner.
In general, a convolution or fully-connected layer computation can be decomposed into one of the following five configurations of inter-block reduction: (1) as depicted in fig. 4AD, one IFM slice is input by broadcasting it to all 16 blocks 102, resulting in 16 OFM slices in total; (2) as depicted in fig. 4AE, two IFM slices are input in parallel by broadcasting each of the two IFM slices to 8 tiles; (3) as depicted in fig. 4AG, 4 IFM slices are input in parallel by broadcasting each of the four IFM slices to 4 blocks; (4) as depicted in fig. 4AJ, 8 IFM slices are input in parallel by broadcasting each of the eight IFM slices to 2 blocks; (5) as depicted in fig. 4AL, 16 IFM slices are input in parallel by broadcasting each of the 16 IFM slices to 1 tile.
Case (2) may be referred to as the "broadcast 8 reduce 2" case because each IFM slice is broadcast to 8 tiles and the output of 2 tiles is reduced by reduction structure 111 to obtain a complete (or partial) result. Similarly, case (3) may be referred to as the "broadcast 4 reduce 4" case because each IFM slice is broadcast to 4 tiles 102 and the output of the 4 tiles 102 is reduced. Since each IFM slice is broadcast to 2 tiles 102 and the output of 8 tiles 102 is reduced, case (4) may be referred to as the "broadcast 2 reduction 8" case. Since each IFM slice is broadcast to only one tile 102 (i.e., no broadcast) and the output of all 16 tiles 102 is reduced, case (5) may be referred to as the "broadcast 1 reduction 16" case. Because the IFM slices are broadcast to 16 blocks 102 and the output of 1 block 102 is reduced (i.e., no reduction), case (1) may be referred to as the "broadcast 16 reduction 1" case.
The five inter-block reduction configurations may be considered in more detail with respect to the connection patterns that the IFM transfer structure 104 and the OFM transfer structure 106 must support in each of the five reduction configuration cases. For additional clarity, the term "inter-block reduction" is used herein to refer to reducing the block outputs using the reconfigurable adder tree provided by the reduction structure 111, as opposed to "intra-block reduction", which refers to reducing the multiplier unit products using the adder trees 128A, 128B inside a block 102.
The following notation may be used to identify the circumstances under which the interconnect structures are put to use. The symbol Bm-Rn denotes the case where each IFM slice is broadcast to m tiles and the outputs of n tiles are reduced by the inter-tile reduction structure 111 in order to obtain a result. With 16 physical blocks available, the five inter-block reduction cases include B16-R1 depicted in FIG. 4AD, B8-R2 depicted in FIG. 4AF, B4-R4 depicted in FIG. 4AH, B2-R8 depicted in FIG. 4AK, and B1-R16 depicted in FIG. 4AM.
The number of available inter-block reduction cases is equal to log2(N)+1, where N is the number of physical blocks in the neural processor 100. The inter-tile reduction configurations available in a neural processor with N tiles are constructed starting from configuration BN-R1 (m = N and n = 1) and then, for each next configuration, dividing m by two and multiplying n by two until m reaches 1. For example, if the neural processor 100 has only 8 blocks, there are four available inter-block configurations, including B8-R1, B4-R2, B2-R4, and B1-R8. A neural processor 100 having 32 blocks can provide up to six inter-block configurations, including B32-R1, B16-R2, B8-R4, B4-R8, B2-R16, and B1-R32.
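The construction described above can be written down directly; the sketch below (ours) enumerates the available Bm-Rn configurations for a given number of physical blocks:

    # Enumerate the available inter-block reduction configurations Bm-Rn for a
    # neural processor with n_tiles physical blocks (sketch; identifiers are ours).
    def reduction_configs(n_tiles):
        configs, m, n = [], n_tiles, 1
        while m >= 1:
            configs.append(f"B{m}-R{n}")   # broadcast each IFM slice to m blocks,
            m //= 2                        # reduce the outputs of n blocks
            n *= 2
        return configs

    print(reduction_configs(16))  # ['B16-R1', 'B8-R2', 'B4-R4', 'B2-R8', 'B1-R16']
    print(reduction_configs(8))   # ['B8-R1', 'B4-R2', 'B2-R4', 'B1-R8']
    print(reduction_configs(32))  # six configurations, B32-R1 through B1-R32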
Since a calculation may yield a final result (e.g., after applying the activation function) or a partial result, each inter-block configuration has two cases to consider with respect to the OFM transfer path: the case where the result is final, denoted Bm-Rn-F, and the case where the result is partial, denoted Bm-Rn-P.
Fig. 4AE, 4AG, 4AJ, 4AL, and 4AN additionally depict the block outputs being added together by the reduction structure 111 in each of the five reduction configurations. For example, fig. 4AL depicts the B2-R8 configuration, in which the outputs of the 8 tiles T0, T8, T4, T12, T10, T2, T14, and T6 are summed by one adder tree (the left adder tree in fig. 4AK), while the outputs of the 8 tiles T7, T15, T3, T11, T13, T5, T9, and T1 are summed by another adder tree (the right adder tree in fig. 4AK).
Note that the configurable adder tree of the reduction structure 111 is designed to add the outputs of the blocks 102 that are adjacent to each other, rather than adding the outputs of the blocks 102 that are spread apart from each other, thus making the configurable adder tree of the reduction structure compact in wiring and "distributed" in the tree itself. Note also that unlike in the previous example, the 16 tiles here are identified as T0-T15, and the ordering of tile identification numbers has been changed (compared to the notation used in the mapping example) in order to simplify the notation in the following example.
The reduction configuration between each block may be examined in detail one by one. A first example scenario includes B16-R1 operations. Following the principle of storing the OFM as locally as possible while obtaining the IFM globally (from any SRAM bank group), in this configuration the input IFM can be streamed from any SRAM bank group S0..S15. As shown in fig. 4BA, the SRAM bank group S10 provides a flow of IFM slices to all 16 blocks T0-T15 through the IFM transfer structure 104 (one IFM slice is broadcast to all 16 blocks as shown in fig. 4AD). When one SRAM bank group (e.g., S10) runs out of IFM data, for example, another SRAM bank group (e.g., S11) may become the data source and continue to stream IFM data to the blocks. These steps may continue until the entire IFM tensor has been streamed in. In the case where multiple IFM passes are required, the IFM tensor streaming sequence may be repeated as needed.
In the B16-R1 configuration there is no inter-block reduction, so the adder units of each block 102 accumulate only that block's results, and the completed or partial OFM results are written to the adjacent SRAM bank group 109 as described below. Each of the 16 blocks 102 in the B16-R1 configuration therefore generates a stream of OFM slices, whether the results are final or partial. Specifically, in the partial case, each value may be as wide as 32 bits when operating with 8-bit IFM and OFM data, or as wide as 48 bits when assuming 16-bit IFM and OFM data, and each partial result may be stored locally as indicated by arrow 106 in fig. 4BB. In this case, each SRAM bank group 109 serves as an end point for storing partial results, and each SRAM bank group 109 receives data from its local block, e.g., SRAM bank group S8 receives data from block T8, S0 receives data from T0, and so on. Since each SRAM bank group 109 has 4 SRAM banks 108, each SRAM bank group 109 can typically store 16 4-byte partial results per clock. However, the SRAM bank group 109 currently acting as the IFM source must simultaneously fetch IFM data while also writing partial results, which in some cases may exceed the total bandwidth available to that SRAM bank group. When the convolution planar kernel size is greater than 1x1, the IFM cache 139 may help reduce IFM reads from the source SRAM bank group 109. Furthermore, operations using IFM weight cycling and/or a convolution planar kernel size greater than 1x1 generate an output once every several clocks (as opposed to one result per clock), thus reducing the demands on OFM bandwidth and avoiding SRAM access bottlenecks.
When final results are generated, each final value may be quantized to 8 bits (or 16 bits, etc.) and may be written to SRAM bank groups [S0..S7] or [S8..S15]. Fig. 4BC and 4BD depict the OFM transfer structure connections and configuration options. Since the OFM slice width is half of the IFM slice width (8 depth channels versus 16 depth channels), the outputs of two vertically adjacent blocks (a "block column") can be sent to either the upper SRAM bank group or the lower SRAM bank group over short local connections. Each SRAM bank group is capable of handling a slice having 16 lanes (since IFM slices have 16 lanes), so each SRAM bank group 109 can also accept two OFM slices. For example, the outputs of tiles T0 and T8, which together make up one block column, may be grouped together and sent over the short local connections 106 to SRAM bank group S8 located immediately below T8, as shown in fig. 4BC, or to SRAM bank group S0 located immediately above T0, as shown in fig. 4BD. Similarly, the outputs of block column T4/T12 may be combined and sent locally to S4 or S12, block column T10/T2 to S10 or S2, block column T14/T6 to S14 or S6, block column T7/T15 to S7 or S15, block column T3/T11 to S3 or S11, block column T13/T5 to S13 or S5, and block column T9/T1 to S9 or S1.
The second example case depicts B8-R2 operation. As shown in fig. 4CA, one IFM slice may be supplied from the upper SRAM bank group 109, wherein the term "upper" is defined to include S0, S4, S10, S14, S7, S3, S13, and S9, and one IFM slice may be supplied from the lower SRAM bank group 109, wherein the term "lower" is defined to include S8, S12, S2, S6, S15, S11, S5, and S1. More specifically, any one of the upper SRAM bank groups 109 may serve as a source for sending (broadcasting) IFM slices to all of the upper banks T0, T4, T10, T14, T7, T3, T13, and T9. For example, the IFM transport structure 104 may be configured to read an IFM slice from S10 and broadcast the IFM slice to T0, T4, T10, T14, T7, T3, T13, and T9. Alternatively, for example, the IFM transport structure 104 may be configured to read an IFM slice from S3 and broadcast the IFM slice to T0, T4, T10, T14, T7, T3, T13, and T9.
Similarly, any of the lower set of SRAM banks 109 may serve as a source for sending (broadcasting) IFM slices to all of the lower tiles T8, T12, T2, T6, T15, T11, T5, and T1. For example, the IFM transport structure 104 may be configured to read an IFM slice from S11 and broadcast the IFM slice to T8, T12, T2, T6, T15, T11, T5, and T1. Alternatively, for example, the IFM transport structure 104 may be configured to read an IFM slice from S8 and broadcast the IFM slice to T8, T12, T2, T6, T15, T11, T5, and T1.
Additionally, referring to fig. 4CA, the SRAM bank groups 109 may be paired to send IFM slices such that data is received in one (clock) cycle from one of the following pairs: [ S0, S1], [ S2, S3], [ S4, S5], [ S6, S7], [ S8, S9], [ S10, S11], [ S12, S13] and [ S14, S15 ]. For example, in FIG. 4CA, the IFM slices originate from the [ S10, S11] pair of SRAM bank groups 109.
Fig. 4CB depicts inputting two IFM slices, where each IFM slice is broadcast to 8 blocks and the outputs of two blocks are reduced in a column-wise manner. For example, following fig. 4AF, the output of T0 and the output of T8 are reduced to generate one result; the T4 and T12 outputs are reduced to generate another result; the T10 and T2 outputs are reduced to generate yet another result; the T14 and T6 outputs are reduced to generate yet another result; the T7 and T15 outputs are reduced to generate yet another result; the T3 and T11 outputs are reduced to generate yet another result; the T13 and T5 outputs are reduced to generate yet another result; and the T9 and T1 outputs are reduced to generate yet another result.
In the case of partial results, the eight reduction results may be stored in one of the two sets of SRAM bank groups [S0..S7] and [S8..S15]. For example, FIG. 4CB depicts eight partial results stored in the SRAM bank groups [S0..S7]. In the case of final results, the OFM transfer structure 106 may merge and store the results of two adjacent block columns in one of the groups of four SRAM bank groups including [S0..S3], [S4..S7], [S8..S11], and [S12..S15]. For example, fig. 4CC depicts eight final results stored in SRAM bank groups [S4..S7].
The third example case depicts B4-R4 operation. As shown in fig. 4DA, one IFM slice may be supplied from each quarter of the floor plan. Referring to fig. 4DB, this operation may involve broadcasting four IFM slices and generating four results after reduction. The IFM transfer structure 104 and the OFM transfer structure 106 can accept the inputs and deliver the outputs in one clock cycle as long as the IFM slices come from one of the four groups [S0..S3], [S4..S7], [S8..S11], and [S12..S15], and, in the case of partial results (as depicted in fig. 4DB), the outputs are written to one of the same four groups [S0..S3], [S4..S7], [S8..S11], and [S12..S15], or, in the case of final results (as depicted in fig. 4DC), to one of the eight groups [S0 S1], [S2 S3], [S4 S5], [S6 S7], [S8 S9], [S10 S11], [S12 S13], and [S14 S15].
Referring to fig. 4AJ, note that each reduction group 407 generates one output result. Two results may be stored in the top portion and two results may be stored in the bottom portion of the floor plan. Because an OFM slice containing final results has a size of 8 bytes, the OFM transfer structure 106 can merge the results of two adjacent columns. Fig. 4AH also depicts that four IFM slices are broadcast to form four output results after reduction.
The fourth example case depicts B2-R8 operation. As depicted in fig. 4EA, one IFM slice may be provided from each eighth of the floor plan. Referring to fig. 4EB, the operation may involve broadcasting eight IFM slices (each to two blocks) and generating two results after reduction.
The IFM transfer structure 104 and the OFM transfer structure 106 can accept the inputs and deliver the outputs in one (clock) cycle as long as the inputs come from one of the two groups [S0..S7] and [S8..S15], and, in the case of partial results, the outputs are written to one of the eight groups [S0 S1], [S2 S3], [S4 S5], [S6 S7], [S8 S9], [S10 S11], [S12 S13], and [S14 S15], or, in the case of final results, to any SRAM bank group 109.
Fig. 4EA depicts the broadcasting of the source data for the fourth example case. Fig. 4EB depicts the forming of partial results for the fourth example case, and fig. 4EC depicts the forming of final results for the fourth example case. Referring to fig. 4AJ, each reduction group 407 generates one result after reduction. One of the two results may be stored at the top and the other at the bottom. Because an OFM slice containing final results has a size of 8 bytes, the OFM transfer structure 106 can merge the results of two adjacent columns. Fig. 4AK also depicts that eight IFM slices are broadcast to form two output results after reduction.
The fifth example case depicts B1-R16 operations. As depicted in fig. 4FA, one IFM slice may be supplied from each SRAM bank group 109 corresponding to one broadcast. Referring to fig. 4FB, the operation may involve reducing the output of all 16 blocks 102 to generate one result that may be stored in any SRAM bank group 109 when the result is partial and when the result is final.
Because the OFM slice containing the final result has a size of 8 bytes, the OFM transfer structure 106 can merge the results of two adjacent columns. Fig. 4AM also depicts 16 IFM slice inputs to form a single output result after reduction.
The IFM transfer structure 104 and the OFM transfer structure 106 may be designed, in a manner consistent with the examples above, so that the results of any operation can be computed and stored to SRAM 109 in such a way that a subsequent operation consuming those results can retrieve them, for every permutation of the reduction configurations of the current operation and the subsequent operation. For example, the current operation may use the B4-R4 reduction configuration and store its results to the SRAM bank groups 109 according to the OFM transfer structure 106 connectivity choices associated with the B4-R4 reduction configuration. The next operation may then use the B2-R8 reduction configuration with its associated IFM transfer structure 104 connectivity choices and still successfully retrieve the data computed and stored by the previous B4-R4 operation.
Fig. 4G depicts one possible implementation of the IFM transport structure 104 that supports all of the IFM transport structure connectivity options of the previously described all reduction configuration. The structure includes four bidirectional multi-drop buses, where two of the bidirectional buses are placed between an upper SRAM bank group and an upper tile, and the other two bidirectional buses are placed between a lower SRAM bank group and a lower tile. The buses may be connected in a round robin fashion through registers 411 so that data from the upper bus may flow to the lower bus and back. Note that additional pipeline registers that may be present in IFM transfer structure 104 have been omitted from fig. 4G for clarity of explanation.
Fig. 4H depicts one possible implementation of the OFM transfer structure 106 that supports all OFM transfer structure connectivity options for the previously described all reduction configuration. The fabric consists of four bi-directional 16 byte wide multi-drop buses to support reduction configurations B2-R8 and B1-R16. Note that pipeline registers that may be present in the OFM transfer structure 106 have been omitted in fig. 4H for clarity of explanation.
Reduction structure 111 may perform "inter-block" reduction (as opposed to the intra-block reduction accomplished by adder trees 128A and 128B) for all reduction configurations (e.g., the B8-R2, B4-R4, B2-R8, and B1-R16 configurations) except configuration R1, in which there is no inter-block reduction. Reduction structure 111 includes a reconfigurable adder tree made up of the reduce-and-accumulate (RAA) nodes 520 depicted in fig. 5A. Each RAA node 520 operates on partially reduced results (i.e., linear results before the activation function is applied). An RAA node 520 receives inputs from the ARU 167 of the block row in which it is located or from other RAA nodes, and sends its output either to an RAA node further up the adder tree or back to an ARU 167. Subsequently, if the result is final, the ARU 167 applies the activation function and sends the final result to the OFM transfer structure 106; if the result is partial, the ARU 167 sends the partial result to the OFM transfer structure 106 while bypassing the activation function.
Fig. 5B depicts the reduction structure 111 configured for R16. Here, the ARU modules 167 generate partially reduced results (from the intra-block adder trees 128A and 128B) and stream the partially reduced results to the first-level RAA nodes 502 via the "to reduction structure" output shown in fig. 1X. The first-level RAA nodes 502 reduce the 16 streams of partially reduced ARU data pairwise down to 8 streams of partially reduced data. The second-level RAA nodes 504 further reduce the 8 streams generated by the first-level RAA nodes 502 pairwise down to 4 streams of partially reduced data. The third-level RAA node 506 and the fourth-level RAA node 508 complete the reduction process to produce a stream of fully reduced data that is sent to the ARU 167 of block T14 for application of the activation function (when final results are generated) and output to the OFM transport structure 106. Note that tile T14 is physically located near the root RAA node 508 and corresponds to the ARU 167 of tile T14 in fig. 4FB.
Fig. 5C depicts the reduction structure 111 configured for R8. Unlike the R16 configuration, the R8 configuration includes two adder trees (instead of one), where each adder tree has three levels instead of four. The first adder tree reduces the partially reduced data from the ARUs of blocks T0, T8, T4, T12, T10, T2, T14, and T6, and sends the fully reduced result to the ARU 167 of block T12 to complete the data return. The second adder tree reduces the partially reduced data from the ARUs 167 of blocks T7, T15, T3, T11, T13, T5, T9, and T1, and sends the fully reduced result to the ARU 167 of block T13 to complete the data return. Note that in fig. 4FB, tiles T12 and T13 are each physically located near the respective root RAA node 506 and correspond to the ARUs 167 of tiles T12 and T13, respectively.
Fig. 5D depicts a configuration R4 with four adder trees, where each adder tree reduces the partially reduced output from four tiles. FIG. 4DB depicts the physical location of ARUs 167 associated with four tree root nodes.
Fig. 5E depicts a configuration R2 with eight adder trees, where each adder tree reduces the partially reduced output from two tiles 102. Fig. 4CB depicts the physical location of the ARU relative to eight tree root nodes.
Finally, fig. 5F depicts configuration R1, which uses no adder tree; the block ARUs 167 output their results directly to the OFM transfer structure 106 without involving the reduction structure 111. Fig. 4BB depicts the physical locations of the ARUs 167 in this case. Note that the numbers inside the ARUs 167 in figs. 4BB, 4BC, 4BD, 4CB, 4CC, 4DB, 4DC, 4EB, 4EC and 4FB indicate the RAA tree node level as indicated in figs. 5B to 5F, where level 0 corresponds to configuration R1 (no reduction structure used). Configuration R1 is implemented by the ARU multiplexer 174, which sends data directly from the accumulator 130A (or 130B) to the activation function and partial-result paths (starting from the bit range selection unit 187), bypassing the reduction structure 111 as shown in fig. 1X. Note that, for clarity of the general explanation, some auxiliary logic that may be needed to properly bypass the reduction structure 111 with sparse activation support is not shown.
Fig. 5G depicts the reduction structure 111 formed by the RAA nodes 502, 504, 506, 508. Again, note that each RAA node is physically located in the immediate vicinity of one block 102. Each RAA node 502 receives inputs from two blocks in the block column in which the node 502 is located. There is exactly one RAA node 502 per block column. The RAA node 508 receives its input from the node 506, which node 506 in turn receives its input from the node 504, which node 504 in turn receives the input from the node 502. Note that tile T12 has no RAA node 502 associated with it because there are 15 tree nodes and the number of physical tiles is 16.
As shown in fig. 5A, each RAA node 520 has two functions: reducing its two inputs A and B using adder 512, and accumulating the reduced results using accumulator 518 and adder 514. Multiplexer 516 allows the reduction result from adder 512 to be loaded directly into accumulator 518 at the beginning of an accumulation, e.g., to begin an IFM weight cycle. Multiplexer 516 also allows the reduction results to be accumulated as, for example, the IFM weight cycling progresses over time.
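A behavioral model of the RAA node follows (ours; it mirrors the reduce-then-load-or-accumulate behavior of adder 512, multiplexer 516, adder 514, and accumulator 518 described above, without modeling timing):

    # Behavioral model (ours) of a reduce-and-accumulate (RAA) node 520:
    # reduce inputs A and B, then either load or accumulate the result.
    class RAANode:
        def __init__(self):
            self.accumulator = 0                    # accumulator 518

        def step(self, a, b, load):
            reduced = a + b                         # adder 512: reduce the two inputs
            if load:                                # multiplexer 516 selects the source
                self.accumulator = reduced          # start of an IFM weight cycle
            else:
                self.accumulator += reduced         # adder 514: accumulate over cycles
            return self.accumulator

    node = RAANode()
    node.step(3, 4, load=True)           # first weight cycle: accumulator = 7
    print(node.step(1, 2, load=False))   # subsequent cycle: accumulator = 10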
Storing the weights in a compressed format may be beneficial for reducing the amount of SRAM (and off-chip DDR) storage needed to hold the weights, for reducing the SRAM (and off-chip DDR) power associated with fetching the weights, and for speeding up weight loading, particularly during fully-connected layer computations. In some embodiments, idle periods may be used to load the multiplier unit weights. Furthermore, in some embodiments, multiple vertical weight load buses 101 may be used to accelerate weight loading, as opposed to the single weight load bus per MR column depicted in FIG. 1K.
More specifically, as previously depicted in fig. 4AB, weights are stored in the four SRAM banks 108 local to each block 102, and each block 102 is able to read all 4 banks in parallel. Each SRAM bank 108 fetches 16 8-bit weights per clock. Because each block 102 has 8 MR columns, loading one uncompressed 8-bit weight per active lane takes (8 MR columns per block)/(4 local SRAM banks per block) = 2 clocks. Each block 102 also contains a weight decompression unit 138, which may be used to decompress FC and convolution weights. For example, each multiplier unit 103 may hold 18 weights, so loading all MU weights may take (18 weights per MU) x (2 clocks per weight) = 36 clock cycles. Smaller kernels that do not use all 18 weights may load faster.
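The weight-loading timing above can be reproduced with the following illustrative arithmetic (ours; uncompressed 8-bit weights assumed):

    # Illustrative weight-load timing (ours) for one block with uncompressed weights.
    mr_columns, local_banks, weights_per_mu = 8, 4, 18
    clocks_per_weight = mr_columns // local_banks             # 2 clocks per 8-bit weight
    clocks_full_kernel = weights_per_mu * clocks_per_weight   # 36 clocks for 18 weights
    clocks_3x3_kernel = 9 * clocks_per_weight                 # smaller kernels load faster
    print(clocks_per_weight, clocks_full_kernel, clocks_3x3_kernel)  # 2 36 18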
Weight streaming concurrent with FC computation may be used to improve throughput in fully-connected computations, i.e., to keep multiplier utilization high during large FC computations. As previously described, FC calculations do not reuse weights. Therefore, it may be desirable to stream the weights quickly during FC computation. Specifically, an FC calculation with an IFM weight cycling of 1 requires one weight to be supplied to each MU every clock to keep all multipliers 126 fully utilized. An IFM weight cycling of 2 requires a weight to be supplied to each MU 103 every two clocks to keep all multipliers fully utilized. More generally, an IFM weight cycling of N requires a weight to be supplied to each MU 103 every N clocks to keep all multipliers 126 fully utilized.
According to various deep learning research publications, fully-connected layer weights can be compressed, sometimes by a factor of 2 or more. In that case, instead of loading one weight into each MU 103 every two clocks (as with uncompressed weights), one weight may be loaded into each MU 103 per clock.
However, in addition, the IFM data must also be retrieved from the SRAM 109 along with the weights, thus reducing the SRAM bandwidth available for retrieving the weights. The amount of IFM data retrieved from the SRAM 109 is in turn dependent on the mapreduce configuration. A large reduction number (e.g., R16) requires more pathways to be used to acquire IFM data than a smaller reduction configuration (e.g., R1).
Since all 64 SRAM banks may be busy fetching FC weights, it may not be possible to read IFM data from the SRAM 109 at the same time. To increase multiplier utilization, the IFM data may be stored interleaved across all 64 banks. In some embodiments, to fetch IFM data, weight reading is paused for one clock cycle, and all 64 banks each read one IFM value into a 1-deep cache register located next to the SRAM 109 output. The IFM data is then streamed out of the 64 cached 16-byte lines. More specifically, fetching one IFM value from all 64 banks in parallel fetches enough data at a time to cover R = (64 SRAM banks) x (broadcast configuration number B)/(number of physical blocks) clocks of IFM reads. Thus, as shown in fig. 6 for some embodiments, the maximum multiplier utilization for fully-connected layer computation may be calculated as R/(1+R), as a function of the broadcast configuration number B.
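Under the assumptions stated above (64 SRAM banks, 16 physical blocks, and one stalled clock of parallel IFM reads feeding R clocks of weight streaming), the utilization bound R/(1+R) can be tabulated as follows (sketch, ours):

    # Utilization estimate (ours) for fully-connected computation when one clock of
    # parallel IFM reads from all 64 banks feeds R clocks of weight streaming.
    SRAM_BANKS, PHYSICAL_TILES = 64, 16
    for B in (1, 2, 4, 8, 16):                      # broadcast configuration number
        R = SRAM_BANKS * B // PHYSICAL_TILES        # IFM reads covered per stall clock
        utilization = R / (1 + R)                   # weight-streaming clocks / total
        print(f"B={B:2d}  R={R:3d}  max multiplier utilization ~ {utilization:.3f}")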
As previously described, the global control 140 and the local control units 142, 144 may have various configuration registers. In some embodiments, the contents of some of these configuration registers can be switched on the fly to change the configuration of the neural processor 100, for example, when the neural processor 100 transitions from one operation to another, or when one SRAM bank group 109 runs out of data and the IFM transfer structure 104 must switch (without delay) to streaming IFM data from another SRAM bank group 109. Following generally known design practices, such on-the-fly reconfiguration may be achieved by double-buffering the configuration registers and bringing a new configuration into effect by switching between the two buffers. As depicted in fig. 1A, the central controller 110 may receive configuration data from the CPU over the AXI bus and pass the configuration data over the common bus 112, which in turn may deliver and load the configuration values from the CPU into the configuration registers of the control logic (such as 140, 142, and 144) and various other registers, including the ARU offset registers 195, the scaling registers 191, the activation function 197 configuration registers, and so on. To coordinate configuration changes in operations involving a large number of double-buffered registers that switch at various times, the common bus 112 may load not only the configuration register values but also the time (clock count) at which each double-buffered register must switch to make its new configuration take effect.
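A minimal sketch of such a double-buffered configuration register is shown below (ours; it only illustrates the load-then-switch-at-a-scheduled-clock-count behavior, not the actual register map):

    # Sketch (ours) of a double-buffered configuration register that switches to a
    # newly loaded value at a scheduled clock count, as described above.
    class DoubleBufferedReg:
        def __init__(self, value=0):
            self.active = value          # configuration currently in effect
            self.shadow = value          # configuration being loaded in the background
            self.switch_at = None        # clock count at which to swap buffers

        def load(self, value, switch_at):
            self.shadow = value          # loading does not disturb the active config
            self.switch_at = switch_at

        def tick(self, clock):
            if self.switch_at is not None and clock == self.switch_at:
                self.active, self.switch_at = self.shadow, None   # on-the-fly switch
            return self.active

    reg = DoubleBufferedReg(value=0xA)
    reg.load(0xB, switch_at=100)
    print(reg.tick(99), reg.tick(100))   # 10 11 -> new config takes effect at clock 100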
Fig. 1A also depicts SRAM bank groups 109, each SRAM bank group 109 having an AXI slave interface enabling the CPU to write the IFM and weight tensor, and read back the OFM results. Since the SRAM bank set services I/O requests from the IFM transfer fabric 104 and the OFM transfer fabric 106 and local weight load connections, CPU I/O requests on the AXI interface 114 may be arbitrated and assigned a lower priority in order to allow the neural network computations to continue without delay while the CPU waits for results.
In addition, the subject matter disclosed herein provides a scalable multiplexer circuit or module, referred to herein as a "butterfly shuffler", that efficiently permutes data for purposes including homogenizing sparse data. There may be instances where sparse data, in particular data associated with an input feature map, includes non-zero values grouped together; that is, the data may be non-uniformly sparse. In such a scenario, a system that processes sparse data in parallel, for example by multiplying input feature map (IFM) values in parallel, may leave many multipliers idle (i.e., multipliers with at least one operand equal to 0) while a small group of multipliers has to perform a large number of multiplications, thus resulting in a bottleneck.
For example, referring to FIG. 7A, the IFM data in the memory or SRAM 109 has zero values that are relatively evenly distributed across the IFM slices and across the lanes within each IFM slice. The IFM buffer 141 may receive the stream of IFM slices of fig. 7A and use a look-ahead of 1 in conjunction with a look-aside of 1 to successfully multiplex non-zero activations in an out-of-order manner and thereby achieve activation skipping. For example, a non-zero value 701 may be multiplexed over by one lane and one position ahead to replace the zero value at position 702. Similarly, the IFM buffer 141 may forward the other non-zero values out of order as indicated by the arrows.
The IFM data depicted in FIG. 7B has the same number of zero values as FIG. 7A; however, the zero values in fig. 7B are clustered in the same IFM lanes across adjacent IFM slices. The IFM buffer 141 would have to support a look-aside of 4 to successfully multiplex the non-zero activation 703 in place of the zero value occupying position 704 to achieve activation skipping. Support for a large look-aside range (e.g., greater than 1) may be prohibitively expensive in terms of silicon area, as the multiplexers 163 would need more inputs to bring activation values from more remotely located lanes.
Referring to fig. 7C, an IFM shuffler 720 may be used to pseudo-randomly permute the values within each IFM slice to spread out the clusters of non-zero values within the IFM slices, thus, for example, converting the arrangement of data shown in fig. 7B into the arrangement of data shown in fig. 7A.
Note that pseudo-random scrambling of activations must be accompanied by weight scrambling in the same manner so that the scrambled activations will be multiplied by the correct weights. It should also be noted that since the pseudo-random scrambling sequence may be known prior to computation, the weights may be scrambled off-line, path by path, for each incoming IFM slice and loaded into the MR block 102 before computation begins.
In addition to scrambling the IFM slice values on a per-lane basis, the IFM shuffler 720 may also reorder the time sequence of the IFM slices. Note that the MR tile weights must be rearranged correspondingly offline for the steps in the dot product computation to match the changed order in which the IFM slices will arrive.
IFM shuffler 720 may be efficiently implemented as a butterfly shuffler. Referring to fig. 7D, a 16-lane butterfly shuffler 740 may be composed of 64 2-to-1 multiplexers Mrow,col 730 (each selecting between two inputs a and b), arranged in an array of 16 rows (0..15) and 4 columns (0..3). As shown, the butterfly shuffler 740 can flexibly shuffle or reorder the IFM slice values arriving on the 16 input lanes (d0 to d15) into another IFM slice output on the output lanes (o0 to o15).
Referring to fig. 7D, the multiplexers 730 in each column are paired to create 2 x 2 crossbars. More specifically, in the 16-lane butterfly shuffler 740, the 16 multiplexers 730 in each column are grouped in pairs to form 8 2 x 2 crossbar switches, and the control signals of the two multiplexers belonging to a pair are connected together. The 16 multiplexers 730 in column 0 are paired to form 8 2 x 2 crossbars as follows: {M0,0, M1,0}, {M2,0, M3,0}, {M4,0, M5,0}, {M6,0, M7,0}, {M8,0, M9,0}, {M10,0, M11,0}, {M12,0, M13,0}, and {M14,0, M15,0}. The eight resulting pairs are controlled by the signals sel0,0 through sel7,0, respectively. De-asserting selx,col causes the corresponding crossbar to pass its inputs to its outputs without crossing them. Asserting selx,col causes the corresponding crossbar to cross its inputs over to its outputs (i.e., the input signals are swapped at the output of the crossbar). For example, de-asserting sel0,0 causes the 2 x 2 crossbar formed by multiplexers {M0,0, M1,0} to pass lanes 0 and 1 through unchanged as lanes 0 and 1, while asserting sel0,0 causes the multiplexers {M0,0, M1,0} to output lanes 0 and 1 as lanes 1 and 0 (i.e., swapped).
Note that the multiplexer pairs in column 0 may be written as {M2x,0, M2x+1,0}, controlled by selx,0, where x is an integer from 0 to 7. More generally, in a butterfly shuffler with N lanes and M = log2(N) columns, the multiplexer pair with index x in column c consists of the multiplexers in rows mod(x, k) + 2k*floor(x/k) and mod(x, k) + 2k*floor(x/k) + k, controlled by selx,c, where k = 2^c and x ∈ [0, 2^(M-1) - 1]. Each column therefore has 2^(M-1) control signals, and a total of S = 2^(M-1) * M = N*log2(N)/2 signals control the shuffling, resulting in a total of 2^(N*log2(N)/2) possible permutations.
Butterfly shuffler 740 disclosed herein is not a full crossbar multiplexer configuration. A full crossbar configuration has a large area of O(N^2), where N is the number of data lanes. In contrast, the area of the butterfly shuffler 740 is O(N*log(N)), where N is the number of data lanes. A full crossbar provides N! unique permutations, while a butterfly shuffler with N lanes generates 2^(N*log2(N)/2) permutations. For example, a 16-lane butterfly shuffler provides 2^(16*4/2) = 2^32 permutations of its 16 lanes.
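A behavioral model of the butterfly shuffler is sketched below (ours; the lane pairing follows the column-stride rule described above, and the control bits stand in for the sel signals). It illustrates how N*log2(N)/2 control bits select one of the available permutations:

    # Behavioral model (ours) of an N-lane butterfly shuffler: log2(N) columns of
    # 2x2 crossbars; column c pairs lanes that are 2**c apart, and one control bit
    # per crossbar selects pass-through or swap (N*log2(N)/2 bits in total).
    def butterfly_shuffle(lanes, control_bits):
        n = len(lanes)
        cols = n.bit_length() - 1                       # log2(N) columns
        assert len(control_bits) == cols and all(len(c) == n // 2 for c in control_bits)
        data = list(lanes)
        for c in range(cols):
            k = 1 << c                                  # pairing stride in column c
            for x in range(n // 2):                     # one crossbar per pair
                lo = (x % k) + (x // k) * 2 * k         # first lane of the pair
                hi = lo + k                             # second lane of the pair
                if control_bits[c][x]:                  # sel asserted: swap the pair
                    data[lo], data[hi] = data[hi], data[lo]
        return data

    import random
    random.seed(0)
    ctrl = [[random.randint(0, 1) for _ in range(8)] for _ in range(4)]   # 16*4/2 = 32 bits
    print(butterfly_shuffle(list(range(16)), ctrl))     # a pseudo-random lane permutation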
Figure 7E shows a pseudo-random generator 741 (e.g., a linear feedback shift register) that controls the permutation applied by the butterfly shuffler 740 to its data lanes. Before a computation begins (e.g., before computing a convolution at a particular location), the control logic of the MR blocks may initialize the pseudo-random generator 741 to generate a known pseudo-random control sequence that shuffles the lanes of the incoming IFM slices. As previously described, the weights preloaded into the MR blocks 102 for use in this computation must be pre-shuffled offline such that the post-shuffle order of the lanes in each IFM slice is consistent with the lane indices of the weights.
As described above, zero-activation sparsity may be supported by both the look-aside and look-ahead mechanisms, and further enhanced by an IFM shuffler (such as the butterfly shuffler 740). Zero-activation skipping using two adder trees per MU column may yield a maximum speed-up of about 2x and an average speed-up of about 1.5x. However, the IFM transfer structure and memory (SRAM) bandwidth may be limited. As previously described, the IFM transfer structure bandwidth in example embodiments may be limited to 2x the baseline, to match the maximum 2x speed-up obtained by zero-activation skipping. Thus, a 2x maximum speed-up due to zero-activation skipping results in 2x OFM transfer structure throughput compared to a computation with zero-activation skipping disabled; the OFM transfer structure throughput should also match the computational throughput by providing 2x bandwidth.
If the memory (SRAM) and/or IFM transfer structure bandwidth is limited to 2x, for example due to the SRAM clock frequency or to area or power constraints associated with the IFM transfer structure, further gains from zero-activation skipping are capped, because the SRAM and/or IFM transfer structure becomes the bottleneck in data transfer and the MR block multipliers sit idle waiting for data. More generally, computational acceleration by any mechanism, including zero-activation and zero-weight skipping, can become bounded in this way. As previously mentioned, methods and devices for zero-activation skipping have been proposed. However, convolution and fully-connected layer weights also typically exhibit sparsity (i.e., the weight kernels may have a large number of zero weights). Therefore, while keeping in mind the bandwidth constraints imposed by the IFM transfer structure and/or the memory (SRAM), it may be advantageous to exploit zero-weight multiplication skipping in addition to zero-activation skipping.
For example, consider a method and apparatus that supports weight sparsity that includes combining weight sparsity with activation sparsity. Assuming that the IFM transport fabric bandwidth is capped at 2 times the baseline bandwidth (i.e., with all multiplication skip methods disabled), the overall throughput of the weight sparsity scheme may also be capped at 2 times the baseline throughput. For this reason, for weight sparsity support, especially when combined with activation sparsity support to further increase computation speed by more than a factor of 2, it may be advantageous to utilize another method orthogonal to IFM transmission (i.e., a method that does not require further increase of IFM transmission structure bandwidth).
One such method concerns the output feature map computation. More specifically, an MU column may generate more than one output per OFM period while keeping the IFM transfer structure bandwidth unchanged. Fig. 8A depicts a baseline MU 810, without zero-weight skip logic and with the zero-activation skip logic omitted for clarity. Here, the weight register file 805 holds 18 weights 815. The multiplier 822 receives an activation and a weight selected from the register file 805 by the 18-to-1 multiplexer 820, and computes a product term that is fed into the adder tree to continue the dot-product computation. Fig. 8B depicts an MU 850 supporting dual sparsity (i.e., skipping of zero-valued activations and zero-valued weights). Note that the weight register file 805 has been logically divided into two groups 811 and 812, each containing nine weights. Here, the first group of nine weights belongs to one output channel and the second group of nine weights belongs to a second output channel; in other words, the output cycling is kept at least 2. Mapping experiments performed by the inventors have shown that keeping the output cycling at least 2 may be feasible for most layers of popular neural network models, while for the remaining layers the logical weight register grouping may be disabled.
Zero-value weight skipping may proceed by checking whether the weight value in group 0 that is scheduled for the upcoming multiplication is equal to zero and, if so, using the corresponding weight in group 1 instead. If the weights in both group 0 and group 1 are zero, the MU can move on to the next pixel.
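The selection rule just described can be sketched as follows (ours; it models only the group-0/group-1 choice for a single scheduled multiplication, not the surrounding MU datapath):

    # Sketch (ours) of the zero-value weight skipping rule described above: for each
    # scheduled multiplication, prefer the group-0 weight, fall back to the group-1
    # weight if group 0 holds a zero, and skip the multiplication entirely (advance)
    # when both groups hold zeros.
    def select_weight(group0_w, group1_w):
        if group0_w != 0:
            return 0, group0_w        # use the group-0 weight (first output channel)
        if group1_w != 0:
            return 1, group1_w        # group 0 is zero: use the group-1 weight instead
        return None, None             # both zero: MU is free to move to the next pixel

    for g0, g1 in [(5, 7), (0, 7), (0, 0)]:
        print(select_weight(g0, g1))  # (0, 5), (1, 7), (None, None)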
In another aspect of the subject matter disclosed herein, referring to fig. 8C, the ABU may broadcast an additional activation group 850 corresponding to the next activation (i.e., the activation that follows the currently broadcast activation 750 in the order of activations scheduled by the IFM buffer 124 as a result of applying zero-skip look-ahead and look-aside). Referring to fig. 8B, the MU 850 may accordingly receive two sets of activation broadcast buses. In particular, the additional activation bus allows faster columns (i.e., columns in which all MUs have been able to skip their multiplications due to zero activations and/or zero weights) to proceed to the next pixel. Note also that while some columns may continue on to the next pixel in this out-of-order manner, the number of activation buses per MU row limits how far a column may run ahead out of order (i.e., by only one pixel in the example depicted in fig. 8B).
Note that as previously mentioned, IFM shuffling may be particularly helpful in enabling two sets of activations to be sent in each cycle as the aggregation of non-zero values becomes dispersed (i.e., homogenized).
In summary, the proposed dual-sparsity approach may have the following advantages: it takes advantage of weight sparsity in addition to activation sparsity, and it does not require higher IFM and/or SRAM bandwidth, while lifting the computation speed-up above the 2x upper bound (i.e., above 2x faster computation relative to the baseline with sparsity support disabled) even though IFM data is received no faster than 2x the baseline rate. Another advantage of the proposed dual-sparsity approach may be the reuse of the weight selection multiplexer 820 when the weights are grouped logically rather than physically. One particular embodiment may choose not to use look-aside for zero-activation skipping, thereby eliminating the look-aside logic and the multiplexers needed to borrow weights from neighboring MUs. Note that IFM shuffling would be particularly advantageous for such embodiments without look-aside logic. Finally, for the purposes of computing the mapping, such a computation can logically be viewed as processing 16 output columns per block, using a 16 x 8 array of multipliers, instead of 8 output columns.
As used herein, the terms "multiplexer" and "demultiplexer" are used interchangeably; each term means a switchable device having a plurality of data terminals (e.g., data inputs or data outputs) on one side (the "multi-port" side) and a single data terminal (e.g., data outputs or data inputs) on the other side (the "single-port" side), the device being configured to connect one of the plurality of data terminals on one side, selected according to a control signal received at a control input of the device, to the single data terminal on the other side.
The term "processing unit" is used herein to include any combination of hardware, firmware, and software for processing data or digital signals. The processing unit hardware may include, for example, Application Specific Integrated Circuits (ASICs), general or special purpose Central Processing Units (CPUs), Digital Signal Processors (DSPs), Graphics Processing Units (GPUs), and programmable logic devices such as Field Programmable Gate Arrays (FPGAs). In a processing unit as used herein, each function is performed by hardware configured (i.e., hardwired) to perform the function, or by more general purpose hardware (such as a CPU) configured to execute instructions stored in a non-transitory storage medium. The processing unit may be fabricated on a single Printed Circuit Board (PCB) or distributed over several interconnected PCBs. The processing unit may include other processing units; for example, the processing unit may comprise two processing units FPGA and CPU interconnected on a PCB.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed herein could be termed a second element, component, region, layer or section without departing from the spirit and scope of the present inventive concept.
Spatially relative terms (e.g., "beneath," "below," "lower," "above," "upper," and the like) may be used herein for ease of description to describe the relationship of one element or feature to another element or feature depicted in the figures. It will be understood that such spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as "below" or "beneath" other elements or features would then be oriented "above" the other elements or features. Thus, the example terms "below" and "beneath" may encompass both an orientation of above and below. The device may be otherwise oriented (e.g., rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly. Further, it will also be understood that when a layer is referred to as being "between" two layers, it can be the only layer between the two layers, or one or more intervening layers may also be present.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the subject matter disclosed herein. As used herein, the terms "substantially," "about," and the like are used as approximate terms and not as degree terms, and are intended to account for inherent deviations in measured or calculated values that would be recognized by one of ordinary skill in the art.
As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. Expressions such as "at least one of," when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Furthermore, when describing embodiments of the subject matter disclosed herein, the use of "may" refers to "one or more embodiments of the disclosure." Furthermore, the term "exemplary" is intended to mean an example or illustration. As used herein, the term "using" may be considered synonymous with the term "utilizing".
It will be understood that when an element or layer is referred to as being "on," "connected to," "coupled to" or "adjacent to" another element or layer, it can be directly on, connected to, coupled to or adjacent to the other element or layer, or intervening elements or layers may be present. In contrast, when an element or layer is referred to as being "directly on," "directly connected to," "directly coupled to" or "directly adjacent to" another element or layer, there are no intervening elements or layers present.
Any numerical range recited herein is intended to include all sub-ranges subsumed within the range with the same numerical precision. For example, a range of "1.0 to 10.0" is intended to include all sub-ranges between (and including) the recited minimum value of 1.0 and the recited maximum value of 10.0 (i.e., having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0 (such as 2.4 to 7.6)). Any maximum numerical limit recited herein is intended to include all lower numerical limits included therein, and any minimum numerical limit recited herein is intended to include all higher numerical limits included therein.
Although exemplary embodiments of the neural processor have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Thus, it will be appreciated that a neural processor constructed in accordance with the principles of the present disclosure may be implemented in ways other than those specifically described herein. The invention is also defined by the following claims and their equivalents.

Claims (20)

1. A processor, the processor comprising:
a register storing a first set of weight values and a second set of weight values, each set of weight values comprising at least one weight value, the weight values in the first set of weight values corresponding one-to-one to the weight values in the second set of weight values;
a non-zero weight value selector that selects a non-zero weight value from a weight value in the first set of weight values or a weight value in the second set of weight values corresponding to the weight value in the first set of weight values; and
a multiplier to multiply the selected non-zero weight value and an activation value corresponding to the selected non-zero weight value to form an output product value.
2. The processor of claim 1, wherein a weight value of the first set of weight values and the weight value of the second set of weight values corresponding to the weight value of the first set of weight values both comprise a zero-value weight value, and
wherein the non-zero weight value selector controls the multiplier to prevent the multiplier from forming an output product value.
3. The processor of claim 1, wherein a first weight value of the first set of weight values and a first weight value of the second set of weight values corresponding to the first weight value of the first set of weight values both comprise a zero-value weight value, and
wherein the non-zero weight value selector selects the non-zero weight value from a second weight value of the first set of weight values and a second weight value of the second set of weight values corresponding to the second weight value of the first set of weight values, the second weight value of the first set of weight values being different from the first weight value of the first set of weight values.
4. The processor of claim 1, wherein the first set of weight values includes nine weight values and the second set of weight values includes nine weight values.
5. The processor of claim 1, further comprising: a multiplexer coupled between the register and the multiplier,
wherein the non-zero weight value selector controls the multiplexer to couple the selected non-zero weight value to the multiplier.
6. The processor of claim 1, wherein the processor is part of a neural processor.
7. The processor of claim 1, wherein the selected non-zero weight value comprises a uint8 value.
8. A processor, the processor comprising:
a register receiving N weight values, where N is a positive even number greater than 1, the N weight values being logically arranged as a first group and a second group, the first group and the second group being of equal size, the weight values in the first group corresponding one-to-one to the weight values in the second group;
a multiplexer coupled to the register, the multiplexer selecting and outputting a non-zero weight value from the weight values in the first group or the weight values in the second group corresponding to the weight values in the first group; and
a multiplier that multiplies the non-zero weight value output from the multiplexer and an activation value corresponding to the non-zero weight value output from the multiplexer to form an output product value.
9. The processor of claim 8, further comprising: a weight value selector that controls the multiplexer to output a non-zero weight value based on whether a weight value in the first group is equal to a zero value and whether a weight value in the second group corresponding to the weight value in the first group is equal to the zero value.
10. The processor of claim 9, wherein the weight value in the first group and the weight value in the second group corresponding to the weight value in the first group both comprise zero-value weight values, and
wherein the weight value selector further controls the multiplier to prevent the multiplier from forming an output product value.
11. The processor of claim 9, wherein a first weight value in the first group and a first weight value in the second group corresponding to the first weight value in the first group both comprise zero-value weight values, and
wherein the weight value selector selects a non-zero weight value from a second weight value in the first group and a second weight value in the second group corresponding to the second weight value in the first group, the second weight value in the first group being different from the first weight value in the first group.
12. The processor of claim 8, wherein the first group comprises nine weight values and the second group comprises nine weight values.
13. The processor of claim 8, wherein the processor is part of a neural processor.
14. The processor of claim 8, wherein the non-zero weight value output from the multiplexer comprises a uint8 value.
15. A processor, the processor comprising:
a first register receiving N weight values, where N is a positive even number greater than 1, the N weight values being logically arranged as a first group and a second group, the first group and the second group being of equal size, the weight values in the first group corresponding one-to-one to the weight values in the second group;
a multiplexer coupled to the first register, the multiplexer selecting and outputting a non-zero weight value from a weight value in the first group or a weight value in the second group corresponding to the weight value in the first group;
a second register to receive a plurality of activation values; and
a multiplier coupled to the multiplexer and the second register, the multiplier multiplying the non-zero weight value output from the multiplexer and the activation value received from the second register corresponding to the non-zero weight value output from the multiplexer to form an output product value.
16. The processor of claim 15, further comprising: a weight value selector that controls the multiplexer to output a non-zero weight value based on whether a weight value in the first group is equal to a zero value and whether a weight value in the second group corresponding to the weight value in the first group is equal to the zero value.
17. The processor of claim 16, wherein the weight value in the first group and the weight value in the second group corresponding to the weight value in the first group both comprise zero-value weight values, and
wherein the weight value selector further controls the multiplier to prevent the multiplier from forming an output product value.
18. The processor of claim 16, wherein a first weight value in the first group and a first weight value in the second group corresponding to the first weight value in the first group both comprise zero-value weight values, and
wherein the weight value selector selects a non-zero weight value from a second weight value in the first group and a second weight value in the second group corresponding to the second weight value in the first group, the second weight value in the first group being different from the first weight value in the first group.
19. The processor of claim 15, wherein the first group comprises nine weight values and the second group comprises nine weight values.
20. The processor of claim 15, wherein the processor is part of a neural processor.
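
For readers following the claims, the selection behavior recited in independent claims 1, 8 and 15 can be summarized in a minimal behavioral sketch. The Python below is an illustration of the claim language only, not the disclosed hardware: the function names (select_nonzero, multiply_selected), the pairing of each weight group with its own activation list, and the example values are all assumptions introduced for this sketch.

```python
# Behavioral sketch (assumed reading of claims 1, 8 and 15), not the claimed hardware.
from typing import List, Optional, Tuple


def select_nonzero(a: int, b: int) -> Optional[Tuple[str, int]]:
    """Pick a non-zero weight from a pair of corresponding weights, one from
    each group. Returns ('a', a) or ('b', b); returns None when both weights
    are zero, modeling suppression of the multiplier so that no output
    product value is formed (claims 2, 10 and 17)."""
    if a != 0:
        return "a", a
    if b != 0:
        return "b", b
    return None


def multiply_selected(group_a: List[int], group_b: List[int],
                      acts_a: List[int], acts_b: List[int]) -> List[int]:
    """For each one-to-one pair of weights, multiply the selected non-zero
    weight by the activation value assumed to correspond to it."""
    products = []
    for wa, wb, xa, xb in zip(group_a, group_b, acts_a, acts_b):
        sel = select_nonzero(wa, wb)
        if sel is None:
            continue  # both corresponding weights are zero: multiplier stays idle
        src, w = sel
        products.append(w * (xa if src == "a" else xb))
    return products


# Two nine-element groups of uint8-range weights (claims 4, 12 and 19).
w_a = [0, 3, 0, 0, 7, 0, 0, 0, 1]
w_b = [5, 0, 0, 2, 0, 0, 0, 4, 0]
x_a = list(range(1, 10))    # activations assumed to pair with group A weights
x_b = list(range(11, 20))   # activations assumed to pair with group B weights
print(multiply_selected(w_a, w_b, x_a, x_b))
```

The sketch omits the further behavior of claims 3, 11 and 18, under which, when both weights of a pair are zero, a non-zero weight may instead be selected from a different pair of corresponding weights.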
CN202010306599.7A 2019-04-17 2020-04-17 Processor with a memory having a plurality of memory cells Pending CN111832716A (en)

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
US201962835496P 2019-04-17 2019-04-17
US62/835,496 2019-04-17
US201962841606P 2019-05-01 2019-05-01
US201962841819P 2019-05-01 2019-05-01
US201962841590P 2019-05-01 2019-05-01
US62/841,819 2019-05-01
US62/841,606 2019-05-01
US62/841,590 2019-05-01
US16/446,610 US20190392287A1 (en) 2018-06-22 2019-06-19 Neural processor
US16/446,610 2019-06-19
US16/842,700 2020-04-07
US16/842,700 US11620491B2 (en) 2018-06-22 2020-04-07 Neural processor

Publications (1)

Publication Number Publication Date
CN111832716A (en) 2020-10-27

Family

ID=72913628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010306599.7A Pending CN111832716A (en) 2019-04-17 2020-04-17 Processor with a memory having a plurality of memory cells

Country Status (2)

Country Link
KR (1) KR20200122256A (en)
CN (1) CN111832716A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102643431B1 (en) * 2021-08-31 2024-03-05 한국과학기술원 Apparatus and method for accelerating deep neural network learning for deep reinforcement learning

Also Published As

Publication number Publication date
KR20200122256A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
US11620491B2 (en) Neural processor
US11847550B2 (en) Sparse convolutional neural network accelerator
TWI662485B (en) An appratus, a method for operating an appratus and a computer program product
JP7349438B2 (en) neural network accelerator
KR102655386B1 (en) Method and apparatus for distributed and cooperative computation in artificial neural networks
Rixner Stream processor architecture
TW201826115A (en) Neural network unit with segmentable array width rotator
US11783170B2 (en) Spatially sparse neural network accelerator for multi-dimension visual analytics
US20220067513A1 (en) Efficient softmax computation
CN1272705C (en) Single instruction multiple data processor including scalar arithmetic lotgic unit
CN111461311B (en) Convolutional neural network operation acceleration method and device based on many-core processor
US20220179823A1 (en) Reconfigurable reduced instruction set computer processor architecture with fractured cores
CN110443360A (en) Method for operation processing device
US20240078206A1 (en) Superimposing butterfly network controls for pattern combinations
Huang et al. IECA: An in-execution configuration CNN accelerator with 30.55 GOPS/mm² area efficiency
US10983919B2 (en) Addressing cache slices in a last level cache
CN111832716A (en) Processor with a memory having a plurality of memory cells
Yuan et al. CORAL: coarse-grained reconfigurable architecture for convolutional neural networks
CN113159302B (en) Routing structure for reconfigurable neural network processor
US20230023859A1 (en) Methods and Apparatus for Accessing External Memory in a Neural Network Processing System
Hafdi Mixed-precision architecture for flexible neural network accelerators
CN111797585A (en) Telescopic parallel data loading device in deep convolutional neural network hardware accelerator and design method thereof
CN117597691A (en) Sparse sensory data store for inference processing in deep neural network architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination