WO2023140778A1

WO2023140778A1 - Convolution engine and methods of operating and forming thereof

Info

Publication number: WO2023140778A1
Application number: PCT/SG2022/050017
Authority: WO
Inventors: Bin Zhao; Anmin Kong; King Jien Chui; Anh Tuan Do; Tshun Chuan Kevin CHAI; Mohamed Mostafa Sabry ALY
Original assignee: Agency For Science, Technology And Research; Nanyang Technological University
Priority date: 2022-01-18
Filing date: 2022-01-18
Publication date: 2023-07-27

Abstract

There is provided a convolution engine configured to perform neural network computations. The convolution engine including: a plurality of convolution processing blocks, each convolution processing block being configured to perform convolution based on a 2D array of first feature map data stored therein to produce a set of first convolution outputs, each set of first convolution outputs including a plurality of channels; and a plurality of weight memory blocks communicatively coupled to the convolution processing blocks. Each weight memory block is configured to store and supply weight parameters to the corresponding convolution processing blocks to perform convolution. Each convolution processing block further comprises a plurality of processing element blocks and each processing element block can further comprise a plurality of sub-blocks and corresponding weight memory sub-blocks.

Description

CONVOLUTION ENGINE AND METHODS OF OPERATING AND FORMING

THEREOF

TECHNICAL FIELD

[0001] The present invention generally relates to a convolution engine configured to perform neural network computations for a neural network, a method of operating the convolution engine, and a method of forming the convolution engine, and more particularly, in relation to a neural network processor system, such as a neural network accelerator (NNA).

BACKGROUND

[0002] Deep neural network (DNN) are widely used in modem artificial intelligence (Al) systems. Convolutional neural networks (CNNs), the most popular DNN architecture, have superior performance in image recognition, speech recognition and computer vision. Due to its deep (i.e., multi-layer) architecture, state-of-the-art CNNs may have hundreds of megabytes of weights and may require billions of operations in an inference flow. However, the massive number of data movement within CNN inference flow may cause significant delay and power consumption in CNN hardware. In order to process CNNs in real-time, especially in the edge computing applications, high efficient data movement may thus be desired.

[0003] A need therefore exists to provide a convolution engine configured to perform neural network computations for a neural network and related methods (e.g., method of operating the convolution engine) that seek to overcome, or at least ameliorate, one or more of the deficiencies of conventional convolution engines, such as but not limited to, improving efficiency and/or effectiveness in performing neural network computations. It is against this background that the present invention has been developed.

SUMMARY

[0004] According to a first aspect of the present invention, there is provided a convolution engine configured to perform neural network computations for a neural network, the convolution engine comprising: a plurality of convolution processing blocks, each convolution processing block being configured to perform convolution based on a two-dimensional (2D) array of first feature map data stored therein to produce a set of first convolution outputs, each set of first convolution outputs comprising a plurality of channels of first convolution outputs; and a plurality of weight memory blocks communicatively coupled to the plurality of convolution processing blocks, respectively, each weight memory block configured to store a plurality of weight parameters and supply the plurality of weight parameters to the corresponding convolution processing block communicatively coupled thereto for the corresponding convolution processing block to perform convolution based on the plurality of weight parameters, wherein each convolution processing block of the plurality of convolution processing blocks comprises: a data buffer block configured to store the corresponding 2D array of first feature map data; an input data processing block configured to process input data for storing in the data buffer block so as to form the 2D array of first feature map stored in a data buffer block; a plurality of feature map memory blocks, including a first feature map memory block, each feature map memory block configured to store a 2D array of second feature map data in relation to the convolution performed by the convolution processing block based on the 2D array of first feature map data stored in the data buffer block; a plurality of processing element blocks communicatively coupled to the plurality of feature map memory blocks, respectively, each processing element block being configured to perform convolution based on a 2D array of current feature map data, the 2D array of current feature map data being the 2D array of first feature map data stored in the data buffer block, the 2D array of second feature map data stored in the first feature map memory block or the 2D array of second feature map data stored in the corresponding feature map memory block to produce a set of second convolution outputs, the set of second convolution outputs comprising a plurality of channels of second convolution outputs; and a first convolution output combining block configured to channel-wise combine the plurality of sets of second convolution outputs produced by the plurality of processing element blocks to produce the corresponding set of first convolution outputs. [0005] According to a second aspect of the present invention, there is provided a method of operating a convolution engine configured to perform neural network computations for a neural network, the convolution engine comprising: a plurality of convolution processing blocks, each convolution processing block being configured to perform convolution based on a two-dimensional (2D) array of first feature map data stored therein to produce a set of first convolution outputs, each set of first convolution outputs comprising a plurality of channels of first convolution outputs; and a plurality of weight memory blocks communicatively coupled to the plurality of convolution processing blocks, respectively, each weight memory block configured to store a plurality of weight parameters and supply the plurality of weight parameters to the corresponding convolution processing block communicatively coupled thereto for the corresponding convolution processing block to perform convolution based on the plurality of weight parameters, wherein each convolution processing block of the plurality of convolution processing blocks comprises: a data buffer block configured to store the corresponding 2D array of first feature map data; an input data processing block configured to process input data for storing in the data buffer block so as to form the 2D array of first feature map stored in a data buffer block a plurality of feature map memory blocks, including a first feature map memory block, each feature map memory block configured to store a 2D array of second feature map data in relation to the convolution performed by the convolution processing block based on the 2D array of first feature map data stored in the data buffer block; a plurality of processing element blocks communicatively coupled to the plurality of feature map memory blocks, respectively, each processing element block being configured to perform convolution based on a 2D array of current feature map data, the 2D array of current feature map data being the 2D array of first feature map data stored in the data buffer block, the 2D array of second feature map data stored in the first feature map memory block or the 2D array of second feature map data stored in the corresponding feature map memory block to produce a set of second convolution outputs, the set of second convolution outputs comprising a plurality of channels of second convolution outputs; and a first convolution output combining block configured to channel-wise combine the plurality of sets of second convolution outputs produced by the plurality of processing element blocks to produce the corresponding set of first convolution outputs, and the method comprising: determining, for each of a plurality of time instances, whether to store the input data received at the time instance in the data buffer block based on whether the input data received at the time instance belongs to the convolution processing block; storing, for each time instance of the plurality of time instances that the input data received at the time instance is determined to belong to the convolution processing block, the input data received at the time instance in the data buffer block according to a data storage pattern so as to form the 2D array of first feature map stored in the data buffer block; and performing, at each of the plurality of convolution processing blocks, convolution based on the corresponding 2D array of first feature map data stored therein to produce the corresponding set of first convolution outputs.

[0006] According to a third aspect of the present invention, there is provided a method of forming a convolution engine configured to perform neural network computations for a neural network, the method comprising: forming a plurality of convolution processing blocks, each convolution processing block being configured to perform convolution based on a two-dimensional (2D) array of first feature map data stored therein to produce a set of first convolution outputs, each set of first convolution outputs comprising a plurality of channels of first convolution outputs; and forming a plurality of weight memory blocks communicatively coupled to the plurality of convolution processing blocks, respectively, each weight memory block configured to store a plurality of weight parameters and supply the plurality of weight parameters to the corresponding convolution processing block communicatively coupled thereto for the corresponding convolution processing block to perform convolution based on the plurality of weight parameters, wherein each convolution processing block of the plurality of convolution processing blocks comprises: a data buffer block configured to store the corresponding 2D array of first feature map data; an input data processing block configured to process input data for storing in the data buffer block so as to form the 2D array of first feature map stored in a data buffer block; a plurality of feature map memory blocks, including a first feature map memory block, each feature map memory block configured to store a 2D array of second feature map data in relation to the convolution performed by the convolution processing block based on the 2D array of first feature map data stored in the data buffer block; a plurality of processing element blocks communicatively coupled to the plurality of feature map memory blocks, respectively, each processing element block being configured to perform convolution based on a 2D array of current feature map data, the 2D array of current feature map data being the 2D array of first feature map data stored in the data buffer block, the 2D array of second feature map data stored in the first feature map memory block or the 2D array of second feature map data stored in the corresponding feature map memory block to produce a set of second convolution outputs, the set of second convolution outputs comprising a plurality of channels of second convolution outputs; and a first convolution output combining block configured to channel-wise combine the plurality of sets of second convolution outputs produced by the plurality of processing element blocks to produce the corresponding set of first convolution outputs.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] Embodiments of the present invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:

FIG. 1A depicts a schematic drawing of a convolution engine configured to perform neural network computations for a neural network, according to various embodiments of the present invention;

FIG. IB depicts a schematic drawing of a convolution processing block of the plurality of convolution processing blocks shown in FIG. 1A, according to various embodiments of the present invention;

FIG. 2 depicts a schematic flow diagram of a method of operating a convolution engine configured to perform neural network computations for a neural network, according to various embodiments of the present invention;

FIG. 3 depicts a schematic flow diagram of a method of forming a convolution engine configured to perform neural network computations for a neural network, according to various embodiments of the present invention; FIG. 4A depicts a schematic drawing of a neural network processor system according to various embodiments of the present invention;

FIG. 4B depicts a schematic drawing of a computing system comprising the convolution engine according to various embodiments of the present invention;

FIG. 5 depicts a schematic drawing of an example computing system, along with its example system architecture, according to various example embodiments of the present invention;

FIG. 6 depicts a schematic drawing of a convolution engine configured to perform neural network computations, according to various example embodiments of the present invention;

FIG. 7 depicts a schematic drawing showing various or main functional blocks of the convolution engine, along with various or main data paths and interfaces shown, according to various example embodiments of the present invention;

FIG. 8A depicts a schematic drawing of the PESB with image buffer, along with its example architecture, according to various example embodiments of the present invention;

FIG. 8B depicts a schematic drawing of the PESB without image buffer, along with its example architecture, according to various example embodiments of the present invention;

FIG. 9 depicts a schematic drawing of the PE macro row, along with its example architecture, according to various example embodiments of the present invention;

FIG. 10 depicts a schematic drawing of the PE prime, along with its example architecture, according to various example embodiments of the present invention;

FIG. 11 depicts a schematic drawing of a 3D feature map (FM) data, according to various example embodiments of the present invention;

FIG. 12A depicts a schematic drawing of a channel of FM data of the 3D FM data, according to various example embodiments of the present invention;

FIG. 12B depicts a schematic drawing of a ID array of FM data forming one continuous data (byte) sequence or stream, according to various example embodiments of the present invention;

FIG. 12C depicts a schematic drawing of multiple ID arrays of FM data forming multiple data (byte) sequence or stream, according to various example embodiments of the present invention;

FIG. 12D illustrates a data storage pattern based on the 3D FM data, according to various example embodiments of the present invention; FIG. 13 A illustrates an example 1x1 convolution with storage and data flow with respect to a data memory, according to various example embodiments of the present invention;

FIG. 13B illustrates the example 1x1 convolution shown in FIG. 13 A, together with the PE macro row shown, according to various example embodiments of the present invention;

FIG. 13C illustrates an overall data operation and timing diagram of the example 1x1 convolution shown in FIG. 13 A, according to various example embodiments of the present invention;

FIG. 14A illustrates an example 3x3 convolution with storage with respect to a data memory and data flow, according to various example embodiments of the present invention;

FIG. 14B illustrates the example 3x3 convolution shown in FIG. 14A, together with the PE macro row shown, according to various example embodiments of the present invention;

FIG. 14C illustrates an overall data operation and timing diagram of the example 3x3 convolution shown in FIG. 14A, according to various example embodiments of the present invention;

FIG. 15 depicts a schematic drawing of a short-cut operation in ResNet; and

FIG. 16 illustrates the partition/grouping of the PESBs for implementing the short-cut operation, according to various example embodiments of the present invention.

DETAILED DESCRIPTION

[0008] Various embodiments of the present invention provide a convolution engine configured to perform neural network computations for a neural network, a method of operating the convolution engine, and a method of forming the convolution engine, and more particularly, relating to a neural network processor system, such as a neural network accelerator (NNA).

[0009] For example, as described in the background, convolution engine in conventional neural network processor systems may suffer from various efficiency and/or effectiveness issues when performing neural network computations for convolutional neural networks (CNNs). For example, the massive number of data movement within CNN inference flow may cause significant delay and power consumption in CNN hardware. In order to process CNNs in real-time, especially in the edge computing applications, high efficient data movement may thus be desired. Accordingly, various embodiments of the present invention provide a convolution engine and related methods (e.g., a method of operating the convolution engine and a method of forming the convolution engine, that seek to overcome, or at least ameliorate, one or more of the deficiencies of conventional convolution engines, such as but not limited to, improving efficiency and/or effectiveness in performing neural network computations, and more particularly, improving efficiency in data movement associated with neural network computations.

[0010] FIG. 1A depicts a schematic drawing of a convolution engine 100 configured to perform neural network computations for a neural network, according to various embodiments of the present invention. The convolution engine 100 comprises: a plurality of convolution processing blocks 128, each convolution processing block 128 being configured to perform convolution based on a two-dimensional (2D) array of first feature map data stored therein to produce a set of first convolution outputs 130, each set of first convolution outputs 130 comprising a plurality of channels of first convolution outputs; and a plurality of weight memory blocks 124 communicatively coupled to the plurality of convolution processing blocks 128, respectively, each weight memory block 124 configured to store a plurality of weight parameters and supply the plurality of weight parameters to the corresponding convolution processing block 128 communicatively coupled thereto for the corresponding convolution processing block 128 to perform convolution based on the plurality of weight parameters. FIG. IB depicts a schematic drawing of a convolution processing block 128 of the plurality of convolution processing blocks 128, according to various embodiments of the present invention. In various embodiments, each convolution processing block 128 of the plurality of convolution processing blocks 128 comprises: a data buffer block 152 configured to store the corresponding 2D array of first feature map data; an input data processing block 154 configured to process input data for storing in the data buffer block 152 so as to form the 2D array of first feature map stored in a data buffer block 152; a plurality of feature map memory blocks 160, including a first feature map memory block 160-1, each feature map memory block 160 configured to store a 2D array of second feature map data in relation to the convolution performed by the convolution processing block 128 based on the 2D array of first feature map data stored in the data buffer block 152; a plurality of processing element blocks 164 communicatively coupled to the plurality of feature map memory blocks 160, respectively, and each processing element block 164 being configured to perform convolution based on a 2D array of current feature map data, the 2D array of current feature map data being the 2D array of first feature map data stored in the data buffer block 152, the 2D array of second feature map data stored in the first feature map memory block 160-1 or the 2D array of second feature map data stored in the corresponding feature map memory block 160 to produce a set of second convolution outputs 170 (e.g., corresponding to partial sums). In this regard, the set of second convolution outputs 170 comprises a plurality of channels of second convolution outputs. Each convolution processing block 128 further comprises a first convolution output combining block 180 configured to channel- wise combine the plurality of sets of second convolution outputs 170 produced by the plurality of processing element blocks 164 to produce the corresponding set of first convolution outputs 130. Accordingly, the convolution engine 100 configured with the above-described architecture results in improved efficiency and/or effectiveness in performing neural network computations. These advantages or technical effects will become more apparent to a person skilled in the art as the convolution engine 100 is described in more detail according to various embodiments and example embodiments of the present invention.

[0011] In various embodiments, for the above-mentioned each convolution processing block 128, each processing element block 164 of the plurality of processing element blocks 164 of the convolution processing block 128 comprises a plurality of processing element sub-blocks configured to perform convolution based on the 2D array of current feature map data. In this regard, each processing element sub-block comprises a plurality of processing elements. Furthermore, the weight memory block 124 corresponding to the convolution processing block 128 comprises a plurality of sets of weight memory sub-blocks communicatively coupled to the plurality of processing element blocks 164 of the convolution processing block 128, respectively. In this regard, each set of weight memory sub-blocks is configured to store a set of weight parameters and supply the set of weight parameters to the corresponding processing element block 164 communicatively coupled thereto for the corresponding processing element block 164 to perform convolution based on the 2D array of current feature map data and the set of weight parameters.

[0012] In various embodiments, the above-mentioned each set of weight memory subblocks is communicatively coupled to the plurality of processing element sub-blocks of the corresponding processing element block 164 such that the set of weight memory sub-blocks is configured to supply the set of weight parameters to the plurality of processing element subblocks for the plurality of processing element sub-blocks to perform convolution based on the 2D array of current feature map data and the set of weight parameters.

[0013] In various embodiments, for the above-mentioned each set of weight memory subblocks communicatively coupled to the plurality of processing element sub-blocks of the corresponding processing element block 164, the set of weight memory sub-blocks comprises a plurality of weight memory sub-blocks, each weight memory sub-block of the plurality of weight memory sub-blocks being communicatively coupled to the corresponding processing element of each of the plurality of processing element sub-blocks of the corresponding processing element block 164 for supplying thereto a weight parameter stored therein.

[0014] In various embodiments, the above-mentioned each weight memory sub-block communicatively coupled to the corresponding processing element of each of the plurality of processing element sub-blocks is dedicated to the corresponding processing element of each of the plurality of processing element sub-blocks for supplying thereto the weight parameter stored therein.

[0015] In various embodiments, for the above-mentioned each set of weight memory subblocks, the set of weight parameters stored therein are weights for a plurality of channels of the 2D array of current feature map data.

[0016] In various embodiments, for the above-mentioned each processing element block 164 of the plurality of processing element blocks 164 of the convolution processing block 128, the 2D array of current feature map data comprises a plurality of columns and rows of feature map data; and the plurality of processing element sub-blocks are configured to process the plurality of columns of feature map data respectively and in parallel, row-by-row in performing convolution based on the 2D array of current feature map data. In various embodiments, each processing element sub-block may handle a corresponding one-dimensional (ID) array (e.g., a row or a column) of a 2D feature map data. In various embodiments, each processing element sub-block may be configured such that the plurality of processing elements therein process the same input feature map data to generate different first level of convolution outputs (thereby, generating different output feature map data). These different first level of convolution outputs based on the same input feature map data may then be multiplexed out through a data output channel using a multiplexer (MUX).

[0017] In various embodiments, for the above-mentioned each processing element subblock of the plurality of processing element sub-blocks of the processing element block 164, each processing element of the processing element sub-block comprises: a feature map data input port configured to receive input feature map data from the corresponding column of the plurality of columns of feature map data; a weight parameter input port configured to receive the weight parameter from the corresponding weight memory sub-block; one or more data registers; a multiplier configured to multiply the input feature map data and the weight parameter received to produce a first convolution result; an adder configured to add the first convolution result (e.g., corresponding to a partial sum) and a data output from one of the one or more data registers to produce a second convolution result for storing in one of the one or more data registers (e.g., this same operation may continue (e.g., iteratively) to the end of convolution window, that is, repeated until the end of input data sequence from the 2D array of current feature map data (e.g., which may correspond to an end of a dimension (e.g., the last row) of the 2D array of current feature map data); and a processing element convolution output port configured to output the second convolution result (e.g., at end of convolution window) from one of the one or more data registers as a processing element convolution output of the processing element.

[0018] In various embodiments, for the above-mentioned each convolution processing block 128, the convolution processing block 128 further comprises a plurality of sets of data selectors, each set of data selectors being arranged between a corresponding processing element block 164 of the plurality of processing element blocks 164 and a corresponding feature map memory block 160 of the plurality of feature map memory blocks 160, and is controllable to be in one of a plurality of operation modes for inputting the 2D array of current feature map data to the corresponding processing element block 164 to perform convolution.

[0019] In various embodiments, the plurality of operation modes comprises a first operation mode, a second operation mode and a third operation mode. In this regard, the set of data selectors, when in the first operation mode, is configured to input the 2D array of first feature map data stored in the data buffer block 152 as the 2D array of current feature map data to the corresponding processing element block 164 to perform convolution. In addition, the set of data selectors, when in the second operation mode, is configured to input the 2D array of second feature map data stored in the first feature map memory block 160-1 as the 2D array of current feature map data to the corresponding processing element block 164 to perform convolution. Furthermore, the set of data selectors, when in the third operation mode, is configured to input the 2D array of second feature map data stored in the corresponding feature map memory block 160 as the 2D array of current feature map data to the corresponding processing element block 164 to perform convolution.

[0020] In various embodiments, the above-mentioned process input data, by the corresponding input data processing block 154 of each convolution processing block 128, comprises: determining, for each of a plurality of time instances, whether to store the input data received at the time instance in the data buffer block 152 based on whether the input data received at the time instance belongs to the convolution processing block 128; and storing, for each time instance of the plurality of time instances that the input data received at the time instance is determined to belong to the convolution processing block 128, the input data received at the time instance in the data buffer block 152 according to a data storage pattern so as to form the 2D array of first feature map stored in the data buffer block 152.

[0021] In various embodiments, the data storage pattern comprises: storing, for each time instance of the plurality of time instances that the input data received at the time instance is determined to belong to the convolution processing block, a one-dimensional (ID) array of a channel of feature map data based on the input data received at the time instance along a memory width direction in the data buffer block 152 so as to form the 2D array of first feature map stored in the data buffer block 152, the 2D array of first feature map data comprising a plurality of columns and rows of feature map data.

[0022] In various embodiments, the above-mentioned plurality of rows of feature map data of the 2D array of first feature map data comprises a plurality of groups of rows of feature map data, each group of rows of feature map data comprises consecutive rows of feature map data corresponding to a plurality of ID arrays of different channels of feature map data, respectively, and ordered in a channel order that is the same amongst the plurality of groups of rows of feature map data.

[0023] In various embodiments, the above-mentioned input data received by the input data processing block 154 is an input image data.

[0024] FIG. 2 depicts a schematic flow diagram of a method 200 of operating a convolution engine configured to perform neural network computations for a neural network, according to various embodiments of the present invention, and in particular, the convolution engine 100 as described hereinbefore with reference to FIG. 1 according to various embodiments. The method 200 comprises: determining (at 204), for each of a plurality of time instances, whether to store the input data received at the time instance in the data buffer block 152 based on whether the input data received at the time instance belongs to the convolution processing block 128; storing (at 206), for each time instance of the plurality of time instances that the input data received at the time instance is determined to belong to the convolution processing block 128, the input data received at the time instance in the data buffer block 152 according to a data storage pattern so as to form the 2D array of first feature map stored in the data buffer block 152; and performing (at 208), at each of the plurality of convolution processing blocks 128, convolution based on the corresponding 2D array of first feature map data stored therein to produce the corresponding set of first convolution outputs 130.

[0025] In various embodiments, as described hereinbefore, the data storage pattern comprises: storing, for each time instance of the plurality of time instances that the input data received at the time instance is determined to belong to the convolution processing block 128, a one-dimensional (ID) array of a channel of feature map data based on the input data received at the time instance along a memory width direction in the data buffer block 152 so as to form the 2D array of first feature map stored in the data buffer block 152, the 2D array of first feature map data comprising a plurality of columns and rows of feature map data.

[0026] In various embodiments, as described hereinbefore, the plurality of rows of feature map data of the 2D array of first feature map data comprises a plurality of groups of rows of feature map data, each group of rows of feature map data comprises consecutive rows of feature map data corresponding to a plurality of ID arrays of different channels of feature map data, respectively, and ordered in a channel order that is the same amongst the plurality of groups of rows of feature map data

[0027] FIG. 3 depicts a schematic flow diagram of a method 300 of forming a convolution engine configured to perform neural network computations for a neural network, according to various embodiments of the present invention, and in particular, the convolution engine 100 as described hereinbefore with reference to FIG. 1 according to various embodiments. Accordingly, in various embodiments, the method 300 comprises: forming (at 302) a plurality of convolution processing blocks 128, each convolution processing block 128 being configured to perform convolution based on a 2D array of first feature map data stored therein to produce a set of first convolution outputs 130, each set of first convolution outputs 130 comprising a plurality of channels of first convolution outputs; and forming (at 304) a plurality of weight memory blocks 124 communicatively coupled to the plurality of convolution processing blocks 128, respectively, each weight memory block 124 configured to store a plurality of weight parameters and supply the plurality of weight parameters to the corresponding convolution processing block 128 communicatively coupled thereto for the corresponding convolution processing block 128 to perform convolution based on the plurality of weight parameters. As mentioned above, in particular, the method 300 of forming a convolution engine may form the convolution engine 100 as described hereinbefore with reference to FIG. 1A according to various embodiments. Therefore, in various embodiments, the method 300 comprises steps for forming any one or more of components/blocks or elements of the convolution engine 100 as described hereinbefore with reference to FIG. 1A according to various embodiments, and thus need not be repeated with respect to the method 300 for clarity and conciseness. In other words, various embodiments described herein in context of the convolution engine 100 are analogously valid for the corresponding method 300 of forming the convolution engine 100, and vice versa. [0028] In various embodiments, the convolution engine 100 may be included or implemented in (e.g., a part or a component of) a neural network processor system, such as a neural network accelerator (NNA). FIG. 4A depicts a schematic drawing of a neural network processor system 400 according to various embodiments of the present invention, comprising the convolution engine 100 as described hereinbefore with reference to FIG. 1A according to various embodiments of the present invention. For example, the neural network processor system 400 may be embodied as a device or an apparatus and may be formed as an integrated neural processing circuit, such as but not limited to, a NNA chip.

[0029] FIG. 4B depicts a schematic drawing of a computing system 401 (e.g., which may also be embodied as a device or an apparatus) comprising or communicatively coupled to (e.g., not comprising but communicatively coupled to) the convolution engine 100 (or the neural network processor system 400 comprising the convolution engine 100) as described hereinbefore according to various embodiments. The computing system 401 comprising a memory 402 and at least one processor 404 communicatively coupled to the memory 402 and configured to coordinate with (e.g., instruction or control) the convolution engine 100 (or the neural network processor system 400) to perform neural network computations for a neural network based on input data (e.g., input image data). In various embodiments, the computing system 401 may be configured to transfer or send the input data to the convolution engine 100 (or the neural network processor system 400) and instruct the convolution engine 100 to perform neural network computations for a neural network based on the input data. For example, the computing system 401 may be an image processing system configured to process image data. For example, the image processing system may be configured to obtain sensor data (e.g., raw image data) relating to a scene using an image sensor and then perform neural network computations based on the sensor data obtained, such as to classify the sensor data.

[0030] It will be appreciated to a person skilled in the art that various components of any computing system may communicate via an interconnected bus and in a manner known to the person skilled in the relevant art.

[0031] A computing system, a controller, a microcontroller or any other system providing a processing capability may be presented according to various embodiments in the present disclosure. Such a system may be taken to include one or more processors and one or more computer-readable storage mediums. For example, the computing system 401 described hereinbefore may include a processor (or controller) 404 and a computer-readable storage medium (or memory) 402 which are for example used in various processing carried out therein as described herein. Similarly, the neural network processor system 400 described hereinbefore may include a processor and a computer-readable storage medium (e.g., a memory) communicatively coupled to the convolution engine 100 for performing various processing or operations as desired or as appropriate. A memory or computer-readable storage medium used in various embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).

[0032] In various embodiments, a "circuit" may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof. Thus, in an embodiment, a "circuit" may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g., a microprocessor (e.g. a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor). A "circuit" may also be a processor executing software, e.g., any kind of computer program, e.g., a computer program using a virtual machine code such as, e.g., Java. Any other kind of implementation of the respective functions which will be described in more detail below may also be understood as a "circuit" in accordance with various alternative embodiments. Similarly, a “module” may be a portion of a system according to various embodiments in the present invention and may encompass a “circuit” as above, or may be understood to be any kind of a logic-implementing entity therefrom.

[0033] Some portions of the present disclosure are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.

[0034] Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as “receiving”, “storing”, “processing”, “performing”, “combining”, “producing”, “multiplying”, “partitioning”, “outputting”, “inputting”, “determining” or the like, refer to the actions and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.

[0035] The present specification also discloses a system (e.g., which may also be embodied as a device or an apparatus) for performing the operations/functions of the method(s) described herein. Such a system or apparatus may be specially constructed for the required purposes. The algorithms presented herein are not inherently related to any particular computer or other apparatus.

[0036] In addition, the present specification also at least implicitly discloses a computer program or software/functional module, in that it would be apparent to the person skilled in the art that various individual steps of the methods described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the methods/techniques of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the scope of the invention. It will be appreciated to a person skilled in the art that various modules may be software module(s) realized by computer program(s) or set(s) of instructions executable by a computer processor to perform the required functions, or may be hardware module(s) being functional hardware unit(s) designed to perform the required functions. It will also be appreciated that a combination of hardware and software modules may be implemented.

[0037] Furthermore, one or more of the steps of the computer program/module or method may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a computer. The computer program when loaded and executed on such a computer effectively results in an apparatus that implements various steps of the methods or operations described herein. [0038] In various embodiments, there is provided a computer program product, embodied in one or more computer-readable storage mediums (non-transitory computer-readable storage medium), comprising instructions executable by one or more computer processors (e.g., by the input data processing block 154 and/or a central control block 644 (to be described later below) to perform the method 200 of operating the convolution engine 100 as described hereinbefore with reference to FIG. 2. Accordingly, various computer programs or modules described herein may be stored in a computer program product receivable (e.g., pre-loading thereto) by the convolution engine 100 for execution by at least one processor of the convolution engine 100 to perform the respective functions or operations.

[0039] Software or functional modules described herein may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the software or functional module(s) described herein may also be implemented as a combination of hardware and software modules.

[0040] It will be appreciated to a person skilled in the art that the terminology used herein is for the purpose of describing various embodiments only and is not intended to be limiting of the present invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising" when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

[0041] It will also be appreciated to a person skilled in the art any reference to an element or a feature herein using a designation such as “first”, “second” and so forth does not limit the quantity or order of such elements or features, unless stated or the context requires otherwise. For example, such designations may be used herein as a convenient method of distinguishing between two or more elements or instances of an element. Thus, a reference to first and second elements does not necessarily mean that only two elements can be employed, or that the first element must precede the second element, unless stated or the context requires otherwise. In addition, a phrase referring to “at least one of’ a list of items refers to any single item therein or any combination of two or more items therein.

[0042] In order that the present invention may be readily understood and put into practical effect, various example embodiments of the present invention will be described hereinafter by way of examples only and not limitations. It will be appreciated by a person skilled in the art that the present invention may, however, be embodied in various different forms or configurations and should not be construed as limited to the example embodiments set forth hereinafter. Rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the present invention to those skilled in the art. In particular, for better understanding of the present invention and without limitation or loss of generality, various example embodiments of the present invention will now be described with respect to the convolution engine 100 being included or implemented in a neural network processor system 400 and the input data being an input image data, whereby the neural network processor system 400 is a neural network accelerator (NNA). However, it will be appreciated by a person skilled in the art that the present invention is not limited to the input data being an input image data. For example, and without limitation, the input data may also be acquisition data from bio-sensor array or any other type of sensor array, data from radar (array), sound, and so on, as long as the input data may be represented as feature map data so as to be capable of being processed by the convolution engine 100.

[0043] Various example embodiments provide an artificial intelligence (Al) accelerator with efficient data storage and movement techniques and high-performance processing element (PE) array architecture, as well as methods to operate the same. In various example embodiments, a neural network accelerator (NNA) chip (e.g., corresponding to the neural network processor system 400 as described hereinbefore according to various embodiments) is provided with (1) a unique PE architecture, (2) a unique PE array architecture and (3) a unique data storage architecture and operation method (data storage pattern). Advantageously, the unique data storage architecture and operation method allow the reading/writing data to be performed in parallel without requiring any input/output buffer between data memories and PE arrays for data reordering, thereby minimizing execution time and control complexity, and thus improving efficiency and/or effectiveness in performing neural network computations. In various example embodiments, the NNA architecture may be a stand-alone accelerator without any dedicate microcontroller (MCU) control. [0044] FIG. 5 depicts a schematic drawing of an example computing system 500 (e.g., corresponding to the computing system 401 as described hereinbefore according to various embodiments), along with its example system architecture, according to various example embodiments of the present invention. As shown in FIG. 5, the computing system 500 comprises a host system 504 and a neural network accelerator (NNA) chip 508 (e.g., corresponding to the neural network processor system 400 as described hereinbefore according to various embodiments) communicatively coupled to the host system 504. In various example embodiments, the NNA chip 508 (more particularly, the convolution engine therein) is the key or core design or component, alongside with a host system 504 comprising a host controller (which may also be referred to as a MCU, e.g., realized or implemented by a commercially available field programmable gate arrays (FPGA), such as ZC706, or any other CPU-based controller). In various example embodiments, the host system 504 may obtain raw (or original) input image data (e.g., which may also be referred to as sensor data) from a camera module (or any other type of sensor). The input image data may then be sent to a DDR memory and subsequently to the NNA 508 via an image data bus for performing neural network computations therein, such as for classification or recognition processing. Subsequently, the outputs (e.g., detection or classification results) of the NNA 508 may be combined with the corresponding input image data (e.g., to plot or annotate the detection results in the original image data) to produce annotated image data, which may then be sent to a display or video interface (e.g., a HDMI connector or any other type of display or video interface) for display on a display screen. In various example embodiments, weight data and bias data for the neural network associated with the NNA 508, as well as instruction data, may be pre-loaded to the NNA 508 (more particularly, the convolution engine therein) by the host controller through an interconnected bus (e.g., a serial interface (e.g., AXI4 bus) or any other suitable type of networking protocol). In various example embodiments, the NNA 508 may be configured to accelerate the neural network processing (neural network computations) such that its outputs (e.g., detection or classification results) can meet the real-time processing requirements for various applications, such as live object or face detection in imaging devices.

[0045] FIG. 6 depicts a schematic drawing of a convolution engine 620 (which may also be referred to as a convolution engine block or a PE array block) of the NNA 508 configured to perform neural network computations for a neural network based on input image data, along with its example architecture, according to various example embodiments of the present invention. As shown in FIG. 6, the convolution engine 620 comprises a plurality (e.g., a group (N)) of PE sub-blocks (PESB) 628 (e.g., corresponding to the plurality of convolution processing blocks 128 as described hereinbefore according to various embodiments) and a plurality (e.g., a group (N)) of weight memory (WM) 624 (e.g., corresponding to the plurality of weight memory blocks 624 as described hereinbefore according to various embodiments). In various example embodiments, the convolution engine 620 may further comprise a plurality (e.g., a group (L)) of convolution output combining blocks (e.g., adder tree blocks) 632 (e.g., multiple (L) sets of parallel adder trees (PAT)), a central processing block (or a max-pooling and activation block) 636, a bias memory block 640 and a central control block (“Ctrl_top”) 644.

[0046] For illustration purpose, FIG. 7 depicts a schematic drawing showing various or main functional blocks of the convolution engine 620, along with various or main data paths and interfaces shown, according to various example embodiments of the present invention. As shown in FIG. 7, a first interface may be provided as an image data bus 702 configured for loading the raw input image data into a number of PESBs 628 (e.g., those PESBs with image buffer (“PESB w/ ImageBuf”) shown in FIG. 7) (e.g., corresponding to the plurality of convolution processing blocks 128 as described hereinbefore according to various embodiments), a second interface may be provided as NNA data bus 704 for the NNA outputs which may be used to output the inference result to the host computer 504 from the central process unit 636 (which may also be referred to as the max-pooling and activation block shown in FIG. 6), and a third interface may be provided as a serial interface 706 configured for connecting to various or all main functional blocks (e.g., PESBs 628, WMs 624, BMs 640, Ctrl_top 644 and so on) to access the distributed memory within the functional blocks. In various example embodiments, the serial interface 706 may be used in a test-mode or to preload the weights, bias and instructions through the serial interface 706 to various functional blocks in the convolution engine 620. In various example embodiments, the serial interface 706 may not operate when the convolution engine 620 is in normal working or operating mode (i.e., inference mode).

[0047] In various example embodiments, not all PESBs 628 comprise an image buffer (accordingly, the plurality of convolution processing blocks 628 described hereinbefore according to various embodiments may correspond to those PEBSs 628 comprising an image buffer). In this regard, for example, a typical input image data may contain only 3 channels (e.g., RGB channels). In various example embodiments, more PESBs 628 with image buffer may be provided within the convolution engine 620. In various example embodiments, the number of such PESBs 628 (including an image buffer) provided in the convolution engine 620 may be determined based on (e.g., directly linked to or correspond to) the stride of first convolution of the inference process (or inference flow). In various example embodiments, during the inference process, a PESB 628 with image buffer may perform the first convolution based on a first input image data received (more particularly, a 2D array of feature map data stored according to a data storage pattern based on the first input image data) for the first convolution in the image buffer, and subsequent feature map data associated with subsequent convolutions performed by the PESB 628 in relation to the first input image data may be stored in activation memory blocks (AMEMs) in the PESB. Accordingly, new input image data (e.g., the next input image data) for convolution to be performed may be uploaded to the image buffer of the PESB 628 immediately after the first convolution performed in relation to the first input image data since subsequent convolutions performed by the PESB 628 in relation to the first input image data are based on subsequent feature map data stored in the AMEMs. As a result, the convolution engine 628 is able to process subsequent convolutions of the inference in relation to the first input image data while loading the next input image data (which is at relative slow rate) for convolution to be performed, thereby improving system performance.

[0048] It will be appreciated that various components/blocks or elements described herein may be physically or virtually interconnected using on-chip infrastructures, such as wires, buffers or network-on-chip (NOC) such that data can flow from one block or element to another block or element such as in accordance with the arrows shown in FIG. 7.

[0049] In various example embodiments, each WM 624 may be connected to the corresponding PESB 628 (only one corresponding PESB 628) and data flowing from the weight memory 624 to the PESB 628 may be synchronized across the entire convolution engine 620. In various example embodiments, each pair of WM 624 and PESB 628 is configured to have a one-to-one relationship, which not only reduces the number of interconnections between many WMs 624 and PESBs 628, but also to removes the long distance data movement requirement, which is a technical problem faced by conventional convolution engine architecture, especially for a very large PE array.

[0050] In various example embodiments, each WM 624 may include single or multiple identical memory macros, which are used to store the weights of the neural network. By way of examples only and without limitation, the memory macro may be implemented using conventional SRAM or emerging non-volatile memory, such as RRAM, MRAM and so on. Accordingly, in various example embodiments, all the WMs 624 may be folded into another or multiple memory chips while the interconnection between WMs and PESBs can be formed through the through-silicon-via (TSV) technology for a much deeper neural network.

[0051] In various example embodiments, for a convolution layer, each adder tree (AT) block 632 (which may also be referred to as parallel adder tree (PAT) block) may be configured to accumulate the convolution outputs from the PESBs 628 connected thereto, and the output of each AT block 632 may then be sent to the max-pooling and activation block 636 to perform a max-pooling function and one or more activation functions, such as ReLU, Bach- normalization, and so on. Subsequently, data outputs of the max-pooling and activation block 636 may be sent to the activation memory (AMEM) blocks 816 (to be described later below) of the PESBs 628 for the PESBs 628 to perform the next convolution layer. For example, for a convolution layer, max-pooling and activation operations are only applied to the final convolution outputs which are also in turn the input feature map data for the next convolution layer, and are thus sent back to the AMEM blocks 624 of the PESBs 628 for the PESBs 628 to perform the next convolution layer based on a broadcasting method. In this regard, each AMEM 624 may be configured to include two input/output feature map banks (which may be referred to as two ping-pong feature map buffers), one for feature map data output (to PE array) and the other one for feature map data input (from the max-pooling and activation block 636. For example, based on the data pattern of input feature map data and the PE array architecture, multiple output feature map data are generated at the same time and serial shifted out from the PESBs 628, and after the AT blocks 632 and the max -pooling and activation operations, the output feature map data sequence are still the same as the input feature map data sequence, and thus, the output feature map data sequence are advantageously in the data pattern required by the PESBs 628 for the next convolution layer. In various example embodiments, because the number channels of output feature map are limited, only one PESB may be activated at a time to receive the corresponding output feature map data at any iteration of convolution, so as to advantageously avoid the need to synchronize the timing from the max -pooling and activation block 636 to different PESBs 628. As a result, the output feature map data may be transmitted from the max-pooling and activation block 636 using a broadcasting method is enough, which simplifies the design implementation for top integration, especial for a very large design.

[0052] In various example embodiments, the AT blocks 632 may also be used as a pipeline for a large system integration. In various example embodiments, as shown in FIG. 6, a plurality of adder trees at the same level or channel (e.g., for each of the L channels) may be grouped together as a group or set of adder trees to form the corresponding AT block 632 (which may be referred to as parallel adder tree (PAT) block or simply as PAT), for example, for efficient timing and routing congestion control, thereby obtaining a plurality of AT blocks 632 with respect to the plurality of channels (e.g., 1 to L), respectively. In various example embodiments, each AT block 632 may be considered or referred to as a second level AT block (at the convolution engine or system level). In this regard, the PESB 628 comprises a plurality of AT blocks, which may be considered or referred to as first level AT blocks (at the PESB level).

[0053] In various example embodiments, the convolution engine 620 may further comprise a bias memory block 640 configured to store all bias data. For example, the bias memory block 640 may include a plurality of memory macro, and all the outputs from the bias memory block 740 may be attached as one input of the AT blocks 632, that is, from convolution point of view, after adding all the partial sums (all accumulated data before the final convolution output are considered as partial sums), one bias-adding operation may be required for the final convolution output data, and such a bias-adding operation may be applied to each convolution output data, by adding through the corresponding AT block 632. For example, for multiple output feature map data, the same number of bias data may also appear at the corresponding AT blocks 632 in sequence repeatedly.

[0054] In various example embodiments, the convolution engine 620 may further comprise a central controller (e.g., a top-control block (“Ctrl top”)) 644 configured to generate various control signals for controlling various components or functional blocks in the convolution engine 620, such as the PESBs 628. For example, the central controller 644 may include a program memory and an instruction decoder communicatively coupled to the program memory and configured to generate various (e.g., all) control signals for controlling operations of the functional blocks in the convolution engine 620. For example, the control signals may be slow signals so they may be broadcasted to functional blocks via wires and buffers. The PESB 628 will now be described in further details below according to various example embodiments of the present invention.

[0055] FIG. 8A depicts a schematic drawing of the PESB 628-1 (i.e., with image buffer), along with its example architecture, according to various example embodiments of the present invention. As shown in FIG. 8 A, the PESB 628-1 may comprise a plurality of activation memory (AMEM) blocks 812 (e.g., corresponding to the plurality of feature map memory blocks 160 as described hereinbefore according to various embodiments) and a plurality of PE macro rows 816 (e.g., corresponding to the plurality of processing element blocks 164 as described hereinbefore according to various embodiments) communicatively coupled to the plurality of activation memory blocks 812, respectively. In various example embodiments, each PESB 628-1 may include M number of PE macro rows 816 and M number of AMEM blocks 812. In various example embodiments, each PE macro row 816 may comprise a plurality (e.g., L number) of PE macros (e.g., corresponding to the plurality of processing element sub-blocks as described hereinbefore according to various embodiments). The PESB 628-1 may further comprise a plurality (e.g., L number) of convolution output combining blocks (e.g., M-to-1 adder tree blocks) 820 configured to add the corresponding convolution outputs 828 from each PE macro row 816 to generate L outputs 630 (e.g., corresponding to the set of first convolution outputs 130 as described hereinbefore according to various embodiments) for one PESB 628- 1. In various example embodiments, L may denote the number of parallel data of an adder-tree 820 (e.g., the parallel data width) within the PESB 628-1 and the number of PE macros within each PE macro row 816.

[0056] In various example embodiments, the PESB 628-1 may further comprise an image buffer 852 (e.g., corresponding to the data buffer block 152 as described hereinbefore according to various embodiments) configured to store a corresponding 2D array of feature map data and an image data processor 854 (e.g., corresponding to the input data processing block 154 as described hereinbefore according to various embodiments) configured to process input image data for storing in the image buffer 852 so as to form (e.g., result in or realise) the 2D array of feature map stored in the image buffer 852.

[0057] In various example embodiments, the AMEM blocks 812 may be configured to store the input/output feature map (FM) of the neural network obtained when performing convolutions. In various example embodiments, based on the convolution size, one partitioned feature map data may be assigned to each PESB 628-1 for the reason that each of L number of input feature map data may have to be used multiple times and one convolution involves multiple data from L number of feature map data, which may be shifted to the left by one position per PE macro row 816 (i.e., shifted to the left by one position from one PE macro row to the next PE macro row). For example, one dimension of convolution may be handled by M number of PE macro rows 816 and another dimension of convolution may be handled within the PEs. For example, for a 1x1 convolution, the channels may be further partitioned into M sections following the same rule since typically, a 1x1 convolution has more channels than a larger convolution. Therefore, M may normally be a power of 2. In various example embodiments, each AMEM block 812 may be configured to comprise two memory banks, a first memory bank may be configured to read (input feature map data) and a second memory bank may be configured to write (output feature map data), or vice versa. For example, during convolution processing, almost all the AMEM blocks (e.g., memory bank A thereof) may be needed for read-out, while the output data may be ready after certain clock cycles, so another memory bank (e.g., memory bank B of the corresponding AMEM) may be utilized for writing data. After one layer of convolutions, the input feature map data may be stored in the corresponding Bank B, so that for the next convolution layer, Bank B may in turn be control to perform read-out and Bank A may then in turn become the memory for receiving the output feature map. Such a switching of operations performed between the two memory banks (between reading input feature map data and writing output feature map) may be repeated until the end of convolution. By way of examples only and without limitation, the AMEM block 812 may be realized or implemented based on conventional SRAM or emerging memory technologies, such as RRAM or MRAM.

[0058] In various example embodiments, the PESB 628-1 may further comprise a plurality of multiplexer (MUX) arrays 824 configured to coordinate or handle data sharing within the PESB 628-1. Each MUX array 824 may comprise a plurality of MUXs (e.g., corresponding to the plurality of data selectors as described hereinbefore according to various embodiments). In various example embodiments, the PESB 628 may be configured to have at least three modes for the PESB operations based on the plurality of MUX arrays 824. In a first operation mode, each MUX array 824 may be configured to select input feature map data stored in the image buffer 852 as the 2D array of current feature map data for the corresponding PE macro row 816 to perform convolution.. In a second operation mode, each MUX array 824 may be configured to select input feature map data stored in the first AMEM block 812-1 as the 2D array of current feature map data for the corresponding PE macro row 816 to perform convolution and in a pattern that input feature map data are shifted one position (to left in FIG. 8A) for every next (immediate subsequent) PE macro row 812. An example data path to illustrate this pattern is shown in FIG. 8A. In a third operation mode, each MUX array 824 may be configured to select input feature map data stored in the corresponding AMEM block 812 as the 2D array of current feature map data for the corresponding PE macro row 816 to perform convolution.

[0059] In various example embodiments, each PE macro row may have L number of feature map data inputs (which may simply referred to herein as FM inputs) and L number of feature map data outputs (which may simply be referred to herein as FM outputs). In various example embodiments, as mentioned above, L sets of adder trees 820 may be configured to add the corresponding convolution outputs 828 from each PE macro row 816 to generate L channels of convolution outputs 630 of the PESB 628-1. For example, the first level adder trees may be L sets of M-to-1 adder trees 820 so as to add all M number of PE macro row data from the same (corresponding) position or channel (i.e., one of L). For example, since the same (corresponding) position convolution outputs 828 (or the same (corresponding) channel of convolution outputs) from different PE macro rows 816 are added together through the L sets of adder tree blocks 820 (which may be referred to as the first level PAT), the above-mentioned third operation mode may be used for large-size convolution, such as 3x3, 5x5, 7x7, and so on. As an example, for a 3x3 convolution, according to various example embodiments, the first PE macro row 816 (PE macro row 1) may obtain a plurality of input feature map data corresponding to original input data (either one row or one column thereof) as dl, d2, d3, ..., du, the second PE macro row (PE macro row 2) may obtain a plurality of input feature map data as d2, d3,..,di., d +i, and the third PE macro row (PE macro row 3) may obtain a plurality of input feature map data as d3, . . . , di., di.+ i . di.₊2 , respectively. Subsequently, L outputs 828 after adder trees are the corresponding convolution outputs (e.g., partial sums) of the output feature map data which are produced from each PE macro row 816. In this manner, one dimension of 2D convolution may be processed. Another dimension of convolution may also be processed within a PE which will be discussed later below.

[0060] In various example embodiments, in relation to the above-mentioned second operation, the first AMEM block 812-1 (i.e., AMEM 1 shown in FIG. 8A) may have a wider memory width than the other AMEM blocks 812 within the PESB 628-1, and the plurality of MUX arrays 824 may be used to share the feature map data from the first AMEM block 812-1 among the plurality of PE macro rows 816. In various example embodiments, in relation to the above-mentioned first operation, the image buffer 852 may also similarly have a wider memory width than the other AMEM blocks 812 within the PESB 628-1, and the plurality of MUX arrays 824 may be used to share the feature map data from the image buffer 852 among the plurality of PE macro rows 816. In various other example embodiments, the required feature map data may be directly re-written or stored in the plurality of AMEM blocks 812 accordingly for large size convolution, but at cost of data written control and power for duplicated memory writing of the same feature map data. For example, certain row/column of input feature map data must be reused for different convolution windows, and thus, there is no overhead for data reading.

[0061] FIG. 8B depicts a schematic drawing of the PESB 628-2 without image buffer, along with its example architecture, according to various example embodiments of the present invention. In various example embodiments, as shown in FIG. 8B, the PESB 628-2 without image buffer may have the same or similar components and configuration (or architecture) (e.g., denoted by the same reference numerals) as the PESB 628-1 with image buffer, but without the image data processor 854 and the image buffer 852, as well as the MUX array coupled to the image buffer 852. Therefore, it is not necessary to describe the components and configuration (or architecture) of the PESB 628-2 without image buffer for clarity and conciseness.

[0062] FIG. 9 depicts a schematic drawing of the PE macro row 816, along with its example architecture, according to various example embodiments of the present invention. The PE macro row 816 may comprise a group (e.g., L) of PE macros 910 (e.g., corresponding to the plurality of processing element sub-blocks described hereinbefore according to various embodiments). Each PE macro 910 may comprise a group (P) of PE primes 920 (e.g., corresponding to the plurality of processing elements described hereinbefore according to various embodiments). In various example embodiments, the number (P) of PE primes 920 in a PE macro 910 may be 2^m because the number of input feature map data for most convolution input/output are power of 2, and thus may result in the most efficient usage of the example architecture. In various example embodiments, one PE prime 920 is the smallest processing unit in the PESB 628 and its example structure is shown in FIG. 10. In various example embodiments, all the PE primes 920 within a PE macro 910 share the same input feature map data while different PE primes within the PE macro 910 have different weight inputs which are transmitted from corresponding dedicated weight memory sub-blocks 914 pre-loaded with weights for the current convolution (e.g., corresponding to the set of weight memory sub-blocks as described hereinbefore according to various embodiments). In various example embodiments, the weight memory sub-blocks 914 may be a small group of data registers for storing the weights only required for one convolution and provide weight inputs for the corresponding PE primes 920, while a weight memory block, including the weight memory sub-blocks 914, may be provided for storing all the weights required for all convolutions, whereby each convolution may only require a small group of weights to complete the convolution. The convolution output from each PE prime 920 within one PE macro 910 may be selected out using a MUX or other data selectors (selection techniques) as appropriate so as to share the same feature map output. In various example embodiments, within a PE macro 910, there is provided a P-to- 1 MUX configured to output only one convolution output from a PE prime 920 amongst P number of PE primes 920 within the PE macro 910 to an output port of the PE macro 910 per clock cycle. Accordingly, L number of parallel output feature map data may be obtained from each PE macro row 816.

[0063] Accordingly, each PE macro row 816 may comprise a group (L) of PE macro 910 (e.g., corresponding to the plurality of processing element sub-blocks as described hereinbefore according to various embodiments), whereby each PE macro 910 may handle one line of parallel feature map inputs at a time (e.g., per clock) accordingly, while the multiple weight inputs are advantageously shared across the PE macro row 816 as shown in FIG. 9. Accordingly, in various example embodiments, weight sub-blocks 914 communicatively coupled to the PE macros 910 of the corresponding PE macro row 816 advantageously have stored therein the required weights for the convolution to be performed by the corresponding PE macro row 816 and can be reused until the end of feature map row(s) and/or to the end of feature map column(s) of the array of feature map data stored in the corresponding AMEM block 812. In various example embodiments, the group of PE macro 910 may obtain the required weights from the plurality of weight sub-blocks 914 based on a unidirectional bus interface therebetween. Furthermore, the plurality of weight sub-blocks 914 may rotate out the pre-fetched weights per clock cycle that correspond to the input feature map data such that the PE macro row 816 is able to process an entire dimension of the input feature map data.

[0064] In various example embodiments, each PE prime 920 may be configured to handle only one of the partitioned channels only, for example, a first PE prime may only handle a first channel of feature map data, a second PE prime may only handle a second channel of feature map data, and so on. In this manner, the above-described architecture of the PE macro 910 may also be used to process the depth-wise convolution.

[0065] FIG. 10 depicts a schematic drawing of the PE prime 920, along with its example architecture, according to various example embodiments of the present invention. The PE prime 920 may comprise a multiplier 1004, an adder 1006 with a plurality of data registers 1008 (e.g., four data registers QI, Q2, Q3 and Q4 are illustrated as shown in FIG. 10) and two data selection blocks or modules (i.e., MUX block 1012 and output selection block 1014 shown in FIG. 10). In particular, a selection module is the MUX block 1012 which is used to select the feedback path to an input of the adder 1006 from the data registers 1008 (QI, Q2, Q3 and Q4 as shown in FIG. 10); another selection module is the output selection module 1014 which is used to control the output from the internal data registers 1008 (QI to Q4). In various example embodiments, the selection control signals (ACC_CTRL 1022 and Output_CTRL 1024) shown in FIG. 10 may be shared across the PE macro 910, across the PE macro row 816 and/or even across the PESB 628. In various example embodiments, the selection control signals may be generated based on a multi-bit counter. The selection control signals may be duplicated to address drivability issue across a large silicon area. According to various example embodiments, experimental results obtained show that four data registers are sufficient for almost all sizes of convolution based on the manner in which large stride convolution is handled according to various example embodiments of the present invention.

Data Storage Pattern

[0066] Various example embodiments provide a unique or predetermined data storage pattern or technique which is able to achieve fully utilization of the system architecture of the convolution engine 620 and remove the need for input/output buffers between data memories and PE array in conventional systems.

[0067] By way of an example only and without limitation, FIG. 11 depicts a schematic drawing of a 3D feature map (FM) data 1100 for illustration purpose only, according to various example embodiments of the present invention. For example, each channel of FM data may have a dimension NxK, and the 3D FM data 1100 may have a dimension MxNxK. According to various example embodiments, the whole 3D FM data 1100 may be partitioned into multiple sections across the channels of feature map data, such as partitioned FM data Pl 1104-1 and P2 1108-2 as illustrated in FIG. 11. It will be appreciated to a person skilled in the art that the 3D FM map data 1100 may be partitioned into other number of partitioned FM data. In various example embodiments, different partitioned FM data may be allocated into the AMEM blocks 812 of different PESBs 628, respectively, that is, one partitioned FM data may be allocated into a corresponding AMEM block 812. According to the weight memory and PESB pair in architecture, only the weights related to the corresponding channels are stored in the corresponding weight memory 624, so the weights for a PESB 628 are not shared with other PESBs 628. For example, there may be M number of PE macro rows 816 within a PESB 628, so there may also be provided M number of corresponding AMEM blocks 824. For example, for a 1x1 convolution case, channels of feature map data of a partitioned FM data may be further partitioned into M number of AMEM blocks 812 within the PESB 628, while the first AMEM block 812-1 may be utilized for other sizes of convolution (e.g., 3x3 convolution and so on).

[0068] By way of an example only and without limitation, FIG. 12A depicts a schematic drawing of a channel of FM data 1202 of the 3D FM data 1100, according to various example embodiments. For illustration purpose, FIG. 12B depicts a ID array of FM data 1206 forming one continuous data (byte) sequence or stream. For example, the ID array of FM data 1206 may be stored according to the order of channels, and then followed by the next position (i.e., the next row of the ID array of FM data 1206), and repeating this pattern until the end of the feature map rows. FIG. 12C illustrates that each row of the ID array of FM data 1206 (e.g., FM1 to FM8) form one continuous data sequence according to the arrow lines, while L rows form a parallel L data sequence inputted from the AMEM block 812 and outputted to the PE macro row(s) 816, thereby forming multiple data streams.

[0069] Accordingly, as shown in FIG. 12D, a column or part of a column of the partitioned FM data (e.g., 1108-1) may be stored as multiple words across the memory width in the AMEM block 812 or the image buffer 852 (e.g., the column of the partitioned FM data may meet or correspond to the number of PE macros in the corresponding PE macro row 816 for better hardware utilization, but the present invention is not limited as such), followed by the same column (or part of the column) of next channel of the partitioned FM data, repeated till the end or last channel of the partitioned FM data, and then followed by the next column (or part of the column) of the partitioned FM data. This data storage pattern may continue iteratively in the manner or technique as described above until the entire row of the partitioned FM data is completed. In the case of a part of the column as mentioned above, the remaining part(s) of the column of the partitioned FM data may be stored in the same manner as described above, relating to multiple iteration of convolutions. It will be appreciated by a persons skilled in the art that the above-mentioned predetermined data storage pattern has been described based on a column-first case, but the present invention is not limited to column-first only. For example, the predetermined data storage pattern may instead be based on a row-first case. In this regard, there is no difference in data processing for the system architecture for either the column-first case or the row-first case, especially when the feature map width and height are identical. In various example embodiments, for the case of different feature map width and height, the shorter side may be applied first to reduce number of iterations in data processing.

[0070] In various example embodiments, for the case of higher stride convolution, the 3D FM data 1100 may be further partitioned by partitioning the different feature map rows into different PESBs 816 according to the number of stride. For example, as the higher stride convolution are normally linked to the first convolution, the higher stride convolution may thus relate to the image buffer 852 (which relates to the first convolution). By way of an example, for the case of a 7x7 convolution with stride 2, the input 3D FM data may be further partitioned into a group of even rows and a group of odd rows of FM data. In general, the input 3D FM data may be partitioned into a number of groups of rows of FM data according to the number of strides (e.g., stride 3 results in 3 groups of rows of FM data, stride 4 results in 4 groups of rows of FM data, and so on). Each partitioned FM data may then be allocated (or stored) to a corresponding PESB 628 according to the data storage pattern as described above. For better understanding, example data operations will now be described below according to various example embodiments of the present invention.

Data operation for convolution

[0071] For illustration purpose only, using a 1x1 convolution as an example, one FM input is multiplied with one weight in one PE prime 920 for every clock cycle, and the result is accumulated (or added) with the previous stored result which is stored in the data registers 1008 of the PE prime 1008 (e.g., only one data register may be used for the example case of 1x1 convolution). Then, after K clocks (e.g., K is also the number of channels of input feature map data stored in an AMEM 812 or the image buffer 852 as shown in FIG. 12D), one partial sum of one output feature map data (which is correlated to K number of input feature map data) is generated in one PE prime 920. For example, for every first input feature map data, the previous stored data may be cleared to zero in order to prepare the starting point of the new partial sum. Considering the same input data are processed at same time within a PE macro 910, different channels of output feature map data (K channels are preferred) are generated at the same time and are shifted out one by one through a selection block of PE macro 910. Thereafter, the PE macro output data (partial sum) has the same pattern as shown in FIG. 12D. In various example embodiments, as long as K > 2^m, whereby 2^m is the number of PE primes 920 within a PE Macro 910, one FM output channel is sufficient for the PE Macro 910.

[0072] In various example embodiments, different PE macro rows 816 may process different rows of the partitioned FM data as shown in FIG. 12C at the same time, while different PESBs 628 may process different partitioned FM data accordingly. In this manner, the final output feature data sequence is just the same as the PE macro’s after first level AT within a PESB 628 and second level AT 632 among the PESBs 628. Accordingly, input/output buffer between data memories (e.g., AMEM blocks 812) and PE arrays (e.g., PESBs, PE Macro Row, and PE Macro, PE Prime thereof) can advantageously be omitted.

[0073] For illustration purpose, FIG. 13 A depicts an example 1x1 convolution with storage and data flow with respect to a data memory (an AMEM block 812 or the corresponding image buffer 582), according to various example embodiments of the present invention. In FIG. 13 A, FMM_RL_CN denotes the data location in the feature map data, whereby M denotes the channel number of the feature map data (or feature map channel), L denotes the row number of the feature map data (or feature map row number), and N denotes the column number of the feature map data (or feature map column number). W_OFp_IFq denotes a weight parameter, whereby p denotes the number of output FM channels and q denotes the number of input FM channels. In the example shown in FIG. 13 A, it is assume that the original partitioned 3D feature map data has a dimension of 40 (column) x 32 (row) x 64 (channels). Accordingly, in the example, the number of PE macros 910 provided in a PE macro row 816 may be eight (i.e., 8 PE macros 910 per PE macro row 816), the number of PE prime 920 within the PE macro row 816 may be four, and the feature map may be stored with four channels (i.e., the partitioned 3D feature map data has four channels of feature map data). Accordingly, there may be provided four PESBs 628 in the convolution engine 620. Furthermore, every four channels of FM data are stored in the sample pattern in other AMEMs 812, for example, AMEM 1 may store channels 1-4 of FM data, AMEM 2 may store channel 5-8 of FM data, and so on in the same data store pattern as described hereinbefore. In particular, assuming that the number of PE macro 910 within a PE macro row 816 is 8, 8 rows of feature map data may be stored in the corresponding AMEMs 812 and/or the corresponding image buffer 852 in the data order as shown in FIG. 13 A. For better understanding, FIG. 13B illustrates the example 1x1 convolution with storage and data flow together with the PE macro row 816 shown.

[0074] In particular, FIG. 13B illustrates the data storage pattern and the example connections for the example 1x1 convolution. As shown in FIG. 13B, L (8) columns of feature map data are parallel read out from the data memory (i.e., an AMEM 812 or the image buffer 852), and feed into the L number of feature map inputs of the corresponding PE macro row 816. The feature map data may be read-out continuously in the same or similar manner until the end of a section of the feature map data. For example, remaining sections of the feature map data may be processed in the same or similar manner in different iterations.

[0075] For illustration purpose, FIG. 13C depicts an overall data operation and timing diagram of the above-mentioned example 1x1 convolution with the storage and data flow, according to various example embodiments of the present invention. For example, it can be observed that the output data sequence has the same data pattern as described hereinbefore. Therefore, no global input buffer and output buffer are required, in contrast with conventional convolution engines. The input feature map data are being received continuously and the output feature map data are also shifted out continuously. For this example 1x1 convolution, only one data register may be required for any number of input feature maps. As a result, the size of the PE prime 920 can be minimized, which allows more PE primes 920 to be formed within a given silicon area. This enables the freedom of partitioning higher number of input feature maps and reduce the number of input feature map data access accordingly.

[0076] In various example embodiments, for large-size convolutions, such as a 3x3 convolution for example, three kernel (weight) rows of convolution may be processed by three PE macro rows 816, respectively, within one PESB 628, and three convolutions (e.g., multiplications) within a kernel (weight) row of convolution may be handled or processed by a PE Prime 920. Furthermore, different PE macros 816 may parallel process different rows of the input feature map data, while the input feature map data are shifted left using the MUX arrays 824 as shown in FIG. 8A or 8B. Accordingly, for example, the feature map data inputs at the second and third PE macro rows 816 may be utilized to complete 3x3 convolution, since the parallel adder-trees are configured to add the corresponding position of feature map outputs from the PE macros 816.

[0077] In the PE prime 920, every FM input may last for 3 clock cycle at input of PE prime and may convolute (multiply) with three weights (weights from a kernel (weight) row of 3x3 convolution) and then stored into different data registers, respectively. The new or next feature map input data may then convolute with corresponding weights and accumulated into the data registers 1008 in a rotated mode, for example, data that convolute with a second weight may be accumulated in a first data register, data that convolute with a third weight may be accumulated in a second data register, and data that convolute with a first weight may be accumulated in a third data register. These operations may continue until the end of the feature map row. In this example, three data registers are sufficient to handle all the rows of multiple input feature map data. One partial sum is ready within one PE prime 920 for every 3K clock cycles for the case of the 3x3 convolution and one partial sum is ready for every K clock cycle for the case of the 1x1 convolution case. Accordingly, more PE primes 920 of the PE macro 816 can be enabled to process multiple different output feature maps at the same time and only one data output channel is sufficient to shift out from multiple PE primes 920 within one PE macro 910, which greatly reduce interconnect complexity of the whole convolution engine 620.

[0078] For illustration purpose, FIG. 14A depicts an example 3x3 convolution with storage with respect to a data memory (an AMEM block 812 or the corresponding image buffer 582) and data flow, according to various example embodiments of the present invention. For example, it can be seen that for the first column of data at first section, all zero data are inputted. This is for zero padding for the 3x3 convolution. In various example embodiments, the zero data in the data memory may not be saved, but the first column data at first section (iteration) may be cleared to be zero by logic operation (e.g., an AND gate). Compared to the above- mentioned 1x1 convolution, it can be observed that there are addition columns for 3x3 convolution. For example, to generate 8 parallel output data, 8+2 data is required for 3x3 convolution. Therefore, in various example embodiments, the first AMEM block 812-1 within a PESB 628 has a wider data width than the other AMEM blocks 812 in the PESB 628. Similarly, the image buffer 852 may also have a wider data width than the other AMEM blocks 812. For better understanding, FIG. 14B illustrates the example 3x3 convolution with storage and data flow together with the PE macro row shown. For illustration purpose, FIG. 14C depicts an overall timing diagram of the example 3x3 convolution with the storage and data flow within a PE prime 920, according to various example embodiments of the present invention. For example, in FIG. 14C, it can be observed that only 3 data registers within a PE prime 920 is sufficient to handle multiple channels of input feature maps. This enables a higher density of PEs (or MACs), which is advantageous for a neural-network accelerator.

[0079] For large convolution (e.g., 3x3, 7x7, 11x11 convolutions, and so on) input feature map data may be read from the first AMEM block 812- 1 or the image buffer 852 within a PESB 628, and the same read-out data may be shared by different PE macro rows 816, whereby the read-out data is shifted one position (e.g., to the right in the example) per PE macro row 816. This operation is performed by the MUX arrays 824. Accordingly, each PE macro row 816 may handle one row of convolution, and multiple PE macro rows 816 may be used to handle one dimension of convolution, while another dimension of convolution may be handled by the PE prime 920.

[0080] For some even large-size convolutions such as 7x7, 11x11, and so on, typically, such convolutions may have a higher number of stride. In this regard, various example embodiments partition the 3D feature map data 1100 further according to the number of stride. As an example, for an example of 7x7 convolution with stride 2, the predetermined data storage pattern as described above may still applied. The only difference is that, for example in the case of column-first, the odd rows (with respect to channels) of the 3D feature map data 1100 may form a first group and even rows (with respect to channels) of the 3D feature map data 1100 feature map may form a second group, whereby the first and second groups of feature map data are stored in different PESBs 628. In this regard, the first group (odd number group) may only convolute (multiply) with odd number of kernel row only and the second group (even number group) may only convolute (multiply) with even number of kernel row as the stride is 2. That is, for the example case of stride 2, the 3D feature map data 1100 may be further partitioned according to row number, such as odd and even number of rows, into two different PESBs 628 accordingly, so redundant convolution due to the stride can be avoided. As a result, 4 clock cycle per feature map data may be required as odd and even rows of feature map data can be processed at the same time. For example, seven PE macro rows 816 from two PESBs 628 may be used for seven rows of convolution, respectively. Another dimension of the kernel may be handled within the PE prime 920. In this manner, not only the speed can be almost double but all the redundant convolutions (multiplications) due to higher stride number can be avoided directly without any additional hardware overhead. At the same time, the number of data registers in the PE prime 920 can be limited to a smaller number (e.g., 4 for most size of convolution) which minimizes the footprint of the PE array.

[0081] By way of an example, one dimension of 7x7 convolution is illustrated in following equations, whereby W1 to W7 are 7 weights of one row of kernels, while DI, D2, D3 and so on are one row of one feature map data, Doutl, Dout2, and so on are the convolution output (partial sum) based on one input feature map for the case of stride 2.

Doutl = WlxDl + W2xD2 + W3xD3 + W4xD4 + W5xD5 + W6xD6 + W7xD7 Dout2 = WlxD3 + W2xD4 + W3xD5 + W4xD6 + W5xD7 + W6xD8 + W7xD9 Dout3 = WlxD5 + W2xD6 + W3xD7 + W4xD8 + W5xD9 + W6xD10 + W7xDll Dout4 = WlxD7 + W2xD8 + W3xD9 + W4xD10 + W5xDll + W6xD12 + W7xD13 Dout5 = W 1 xD9 + W2xD 10 + W3 xD 11 + W4xD 12 + W5 xD 13 + W6xD 14 + W7xD 15 [0082] The above equations may be rewritten as follows:

Doutl = (WlxDl + W3xD3 + W5xD5 + W7xD7) + (W2xD2 + W4xD4 + W6xD6)

Dout2 = (WlxD3 + W3xD5 + W5xD7 + W7xD9) + (W2xD4 + W4xD6 + W6xD8)

Dout3 = (WlxD5 + W3xD7 + W5xD9 + W7xDll) + W2xD6 + W4xD8 + W6xD10)

Dout4 = (WlxD7 + W3xD9 + W5xDll + W7xD13) + (W2xD8 + W4xD10 + W6xD12)

Dout5 = (WlxD9 + W3xDll + W5xD13 + W7xD15) +

(W2xD10 + W4xD12 + W6xD14) [0083] From the above equations, it can be observed that the odd row number of feature map data are only convoluted with corresponding odd number of kernels (weights) and even row number of feature map data are only convoluted with corresponding even number of kernels (weights). Accordingly, the above-described partitioning of the 3D feature map data 1100 according to the stride number helps to remove or avoid redundant convolution processing due to the high stride number. The plurality of MUX arrays 824 shown in FIGs. 8A and 8B may also be used for shifting the feature map data for different PE macro rows 816, in the same or similar manner as described hereinbefore for the case of 3x3 convolution. Accordingly, one dimension of 7x7 convolution is handled or performed efficiently and/or effectively according to various example embodiments of the present invention.

[0084] Furthermore, another dimension of 7 x7 convolution is handled by the PE prime 920. In the same or similar manner as described hereinbefore for the case of 3^X3 convolution within a PE prime 920, 4 clock cycle per feature map data may be required for odd/even position data following the equations above. In various example embodiments, one dummy zero weight may be added for the even row number feature map data to facilitate control because there are 4 odd position weights and 3 even position weights initially. As the stride is 2, odd position feature map data only convolute with odd position weight and even position feature map data convolute with even position weights, and thus, only 4 data registers 1008 may be required within one PE prime 920. For example, by partitioning the input 3D feature map data into two groups of channels for odd and even channels of feature map data, then the convolution-of-7 can be treated as 2 convolution-of-4. In this manner, the redundant convolution within the feature map row can be removed or avoided. In various example embodiments, the data store register sequence may be rotated instead of rotating the feature map or weight, so for the case of 4 data registers, only 2 bits of selection control signal (ACC_CTRL 1022) may be needed, such as shown in FIG. which is shown as in FIG. 10. In various example embodiments, the control of even/odd row of input 3D feature map data may be implemented using a row counter and a column counter for the image data processor 854.

[0085] In various example embodiments, the image data processor 854 (e.g., dedicated image interface) may be configured to process the input image data and process (e.g., partition) the input image data for storing the image buffer 852 according to the data storage pattern as described hereinbefore according to various example embodiments. As shown in FIG. 7, it can be seen that only a number of PESBs comprises an image buffer 852. In various example embodiments, for every clock (i.e., image clock which is different from convolution engine clock), one pixel of input image data may be sent to the convolution engine in parallel according to RGB data of the input image data. The input image data may be inputted row by row until the end of the input image data. The image buffer 852 may be a memory configured with bytewriting. For example, the plurality of colors (e.g., red, green and blue colour channels) of the input image data may be associated with a plurality of PESBs 628-1, respectively, according to the stride as described hereinbefore. In the PESB 628-1, the image data processor 854 of the PESB 628-1 may be configured to count the number of the input image data to determine whether to store the input image data into the image buffer 852 of the PESB 628- 1. For example, this may be achieved by a row counter and a column counter. For example, each corresponding PESB 628-1 may be assigned a ID in order to differentiate it, and the same ID may also be assigned to the corresponding weight memory (WM) 624 as they are configured to operate in pair. For example, for stride 2, the lowest ID bit may be used to differentiate the even/odd row of the input image data, so that the partition of even/odd number of rows of the input image data can be achieved. As shown in FIG. 8A, one more array of MUXs may be provided for the PESB 628-1 with an image buffer 852 compared to the PESB 628-2 without an image buffer. As shown in FIG. 8A, the additional array of MUXs may be provided between the image buffer 852 and the first AMEM 812-1.

[0086] For example, when image buffer 852 has stored therein a 2D array of feature map data corresponding to a full or complete image data, the PESB 628-1 may inform the central control block 644 to start the convolution. When the first convolution is completed, the image buffer 852 is able to store a next or new 2D array of feature map data corresponding to a next or new image data. Accordingly, then image buffer 852 is able to receive the new image data while the convolution engine 620 is processing other layers of convolutions.

[0087] As mentioned above, corresponding weight memory 624 and PESB 628 operate in pair, and the same ID may be assigned to both. Accordingly, the order to fetch the weights from weight memory 624 may determine the order of output feature map data, which in turn is the input feature map data for the next convolution. It will be appreciated by a person skilled in the art that the present invention is not limited to any particular order of feature map channel partition, and the above-described order of feature map channel partition according to the ID sequence is an example to show that a simple order counter can be used to control the order of the output feature maps. Data operation for the Reset Bottleneck structure

[0088] FIG. 15 illustrates a data operation including a short-cut operation in ResNet according to various example embodiments. The data operation may comprise three normal convolutions, one short-cut operation and one element- add (bit- wise add) operation. Convl and Conv3 are 1x1 convolution, conv2 is the 3x3 convolution, while the short-cut operation can be either a 1x1 convolution or direct mapping of the feature maps. In various example embodiments, the direct mapping is treated as one special 1x1 convolution with special weight equal to 1. As a result, the same solution can be used for both cases.

[0089] From FIG. 15, it can be observed that the input feature maps are the exact same input feature maps of Convl, and Conv3 and Short-Cut operation are both 1x1 convolution. Various example embodiments provide the hardware acceleration solution of Short-Cut Operation using the system architecture according to various example embodiments by merging Conv3, Short- Cut and bit- wise add operations into a new Conv3 so as to speed up system performance by remove additional time for short-cut and element-wise adding operations because they are executing at the same time of Conv3. In addition, the element-wise add operation are absorbed in the new Conv3 operation, and thus, there is no hardware overhead for the element-wise add operation.

[0090] FIG. 16 depicts the partition/grouping of PESBs for short-cut operation, according to various example embodiments of the present invention. FIG. 16 illustrates how the input feature maps data are allocated for short-cut related data processing. First, all the PESBs may be partitioned into upper deck and lower deck, each including multiple PESBs. The original input feature map data for Convl and Short-cut may be stored in the first memory bank (memory bank A) of the AMEM blocks of the lower-deck of PESBs. The output of Convl may be stored in the second memory bank (memory bank B) of the AMEM blocks (not limited to only the lower-deck of PESBs). The output of Conv2 may be stored into the first memory bank (memory bank A) of the AMEM blocks of the upper-deck of PESBs, while the original input feature map data are still available in lower-deck of PESBs. Accordingly, the input feature maps data are ready for the new Conv3 (including original Conv3 and Short-Cut). When the new Conv3 is executed, the element- wise adding is absorbed by the second level ATs. The key difference is that each PE prime of a PE macro in the lower-deck of PESBs may handle one feature map in short-cut mode to meet the requirement of element-wise adding. In this mode, the PE primes in the lower deck of PESBs may handle one of the stacked input feature map data only, which may be performed by only writing the selected data from the multiplier into the data register of the PE prime. This approach may be extended to other similar cases such as Inception of GoogleNet, and so on.

[0091] Accordingly, based on the system architecture and data storage pattern according to various example embodiments of the present invention, the following features may be provided. [0092] Partition the 3D feature map data into the data storage pattern can achieve high- utilization of convolution engine architecture, balanced the memory read/write speed from/to the memories and remove the requirement of input/output buffer in system operation. In addition, memory access can be continuously for the whole row or column of FMs.

[0093] Each PE prime may process one complete row/column from one or many input feature maps, one dimension of the convolutions may be handled within a PE prime. Limited number of data registers are required for any size of convolution with the data storage pattern according to various example embodiments. Accordingly, the highest area efficiency of PE array can be achieved.

[0094] Different PE prime from one PE Macro may share the same input feature map data and generate the partial-sum of different output feature maps at the same time. All PE primes of a PE macro may share the output port, which will help to keep data paths minimum and clean, reuse the first level adder-tree for different output feature maps, and maintain the output pattern to be the same as input feature map pattern. This advantageously enables the removal the requirement of input buffer and output buffer in convolution engine architecture.

[0095] Different PE macro within a PE macro Row may handle another dimension (column/row) of one or many input feature maps. Using the MUX array within the PESB enables large-size convolution to be performed with one parallel input feature maps access. The data may be shared among different PE macro rows.

[0096] Different input partitioned feature maps may be allocated into the AMEM blocks of different PESBs, so that only one weight memory is interconnected with one PESB. This results in better system performance, such as easy to achieve high-speed and low-power by avoiding long distance data movement with very limited channels between corresponding pairs of weight memory and PESB.

[0097] The PE macro row may be used to handle different output feature maps in normal convolution, short-cut operation and depth- wise convolution by allowing different PE prime to handle different input feature maps respectively. [0098] Partitioning the PESBs into upper deck and lower deck enables to merge the shortcut and bit- wise adding with one normal convolution to speed-up the system without additional hardware cost.

[0099] The second level adder-tree array enables the complete convolution with the bias (as one input of the adder- tree).

[00100] Centralized max -pooling and activation block can reduce the size of all levels of the PE array (e.g., the PE prime, the PE macro, the PE macro row and the PESB) and final output data (output feature maps) can be broadcasted within the whole convolution engine because only one PESB is enabled for writing at a time according to various example embodiments. [00101] While embodiments of the invention have been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.

Claims

CLAIMS What is claimed is:

1. A convolution engine configured to perform neural network computations for a neural network, the convolution engine comprising: a plurality of convolution processing blocks, each convolution processing block being configured to perform convolution based on a two-dimensional (2D) array of first feature map data stored therein to produce a set of first convolution outputs, each set of first convolution outputs comprising a plurality of channels of first convolution outputs; and a plurality of weight memory blocks communicatively coupled to the plurality of convolution processing blocks, respectively, each weight memory block configured to store a plurality of weight parameters and supply the plurality of weight parameters to the corresponding convolution processing block communicatively coupled thereto for the corresponding convolution processing block to perform convolution based on the plurality of weight parameters, wherein each convolution processing block of the plurality of convolution processing blocks comprises: a data buffer block configured to store the corresponding 2D array of first feature map data; an input data processing block configured to process input data for storing in the data buffer block so as to form the 2D array of first feature map stored in a data buffer block; a plurality of feature map memory blocks, including a first feature map memory block, each feature map memory block configured to store a 2D array of second feature map data in relation to the convolution performed by the convolution processing block based on the 2D array of first feature map data stored in the data buffer block; a plurality of processing element blocks communicatively coupled to the plurality of feature map memory blocks, respectively, each processing element block being configured to perform convolution based on a 2D array of current feature map data, the 2D array of current feature map data being the 2D array of first feature map data stored in the data buffer block, the 2D array of second feature map data stored in the first feature map memory block or the 2D array of second feature map data stored in the corresponding feature map memory block to produce a set of second convolution outputs, the set of second convolution outputs comprising a plurality of channels of second convolution outputs; and a first convolution output combining block configured to channel-wise combine the plurality of sets of second convolution outputs produced by the plurality of processing element blocks to produce the corresponding set of first convolution outputs.

2. The convolution engine according to claim 1, wherein for said each convolution processing block, each processing element block of the plurality of processing element blocks of the convolution processing block comprises a plurality of processing element sub-blocks configured to perform convolution based on the 2D array of current feature map data, each processing element sub-block comprising a plurality of processing elements, and the weight memory block corresponding to the convolution processing block comprises a plurality of sets of weight memory sub-blocks communicatively coupled to the plurality of processing element blocks of the convolution processing block, respectively, each set of weight memory sub-blocks configured to store a set of weight parameters and supply the set of weight parameters to the corresponding processing element block communicatively coupled thereto for the corresponding processing element block to perform convolution based on the 2D array of current feature map data and the set of weight parameters.

3. The convolution engine according to claim 2, wherein each set of weight memory subblocks is communicatively coupled to the plurality of processing element sub-blocks of the corresponding processing element block such that the set of weight memory sub-blocks is configured to supply the set of weight parameters to the plurality of processing element subblocks for the plurality of processing element sub-blocks to perform convolution based on the 2D array of current feature map data and the set of weight parameters.

4. The convolution engine according to claim 3, wherein for said each set of weight memory sub-blocks communicatively coupled to the plurality of processing element sub-blocks of the corresponding processing element block, the set of weight memory sub-blocks comprises a plurality of weight memory sub-blocks, each weight memory sub-block of the plurality of weight memory sub-blocks being communicatively coupled to the corresponding processing element of each of the plurality of processing element sub-blocks of the corresponding processing element block for supplying thereto a weight parameter stored therein.

5. The convolution engine according to claim 4, wherein said each weight memory subblock communicatively coupled to the corresponding processing element of each of the plurality of processing element sub-blocks is dedicated to the corresponding processing element of each of the plurality of processing element sub-blocks for supplying thereto the weight parameter stored therein.

6. The convolution engine according to claim 4, wherein for said each set of weight memory sub -blocks, the set of weight parameters stored therein are weights for a plurality of channels of the 2D array of current feature map data.

7. The convolution engine according to claim 4, wherein for said each processing element block of the plurality of processing element blocks of the convolution processing block, the 2D array of current feature map data comprises a plurality of columns and rows of feature map data; and the plurality of processing element sub-blocks are configured to process the plurality of columns of feature map data respectively and in parallel, row-by-row in performing convolution based on the 2D array of current feature map data.

8. The convolution engine according to claim 7, wherein for each processing element subblock of the plurality of processing element sub-blocks of the processing element block, each processing element of the processing element sub-block comprises: a feature map data input port configured to receive input feature map data from the corresponding column of the plurality of columns of feature map data; a weight parameter input port configured to receive the weight parameter from the corresponding weight memory sub-block; one or more data registers; a multiplier configured to multiply the input feature map data and the weight parameter received to produce a first convolution result; an adder configured to add the first convolution result and a data output from one of the one or more data registers to produce a second convolution result for storing in one of the one or more data registers; and a processing element convolution output port configured to output the second convolution result from one of the one or more data registers as a processing element convolution output of the processing element.

9. The convolution engine according to claim 4, wherein for said each convolution processing block, the convolution processing block further comprises a plurality of sets of data selectors, each set of data selectors being arranged between a corresponding processing element block of the plurality of processing element blocks and a corresponding feature map memory block of the plurality of feature map memory blocks, and is controllable to be in one of a plurality of operation modes for inputting the 2D array of current feature map data to the corresponding processing element block to perform convolution.

10. The convolution engine according to claim 9, wherein the plurality of operation modes comprises a first operation mode, a second operation mode and a third operation mode, wherein the set of data selectors, when in the first operation mode, is configured to input the 2D array of first feature map data stored in the data buffer block as the 2D array of current feature map data to the corresponding processing element block to perform convolution, the set of data selectors, when in the second operation mode, is configured to input the 2D array of second feature map data stored in the first feature map memory block as the 2D array of current feature map data to the corresponding processing element block to perform convolution, and the set of data selectors, when in the third operation mode, is configured to input the 2D array of second feature map data stored in the corresponding feature map memory block as the 2D array of current feature map data to the corresponding processing element block to perform convolution.

11. The convolution engine according to claim 1, wherein said process input data comprises: determining, for each of a plurality of time instances, whether to store the input data received at the time instance in the data buffer block based on whether the input data received at the time instance belongs to the convolution processing block; and storing, for each time instance of the plurality of time instances that the input data received at the time instance is determined to belong to the convolution processing block, the input data received at the time instance in the data buffer block according to a data storage pattern so as to form the 2D array of first feature map stored in the data buffer block.

12. The convolution engine according to claim 11, wherein the data storage pattern comprises: storing, for each time instance of the plurality of time instances that the input data received at the time instance is determined to belong to the convolution processing block, a one-dimensional (ID) array of a channel of feature map data based on the input data received at the time instance along a memory width direction in the data buffer block so as to form the 2D array of first feature map stored in the data buffer block, the 2D array of first feature map data comprising a plurality of columns and rows of feature map data.

13. The convolution engine according to claim 12, wherein the plurality of rows of feature map data of the 2D array of first feature map data comprises a plurality of groups of rows of feature map data, each group of rows of feature map data comprises consecutive rows of feature map data corresponding to a plurality of ID arrays of different channels of feature map data, respectively, and ordered in a channel order that is the same amongst the plurality of groups of rows of feature map data.

14. The convolution engine according to claim 1 , wherein the input data is input image data.

15. A method of operating a convolution engine configured to perform neural network computations for a neural network, the convolution engine comprising: a plurality of convolution processing blocks, each convolution processing block being configured to perform convolution based on a two-dimensional (2D) array of first feature map data stored therein to produce a set of first convolution outputs, each set of first convolution outputs comprising a plurality of channels of first convolution outputs; and a plurality of weight memory blocks communicatively coupled to the plurality of convolution processing blocks, respectively, each weight memory block configured to store a plurality of weight parameters and supply the plurality of weight parameters to the corresponding convolution processing block communicatively coupled thereto for the corresponding convolution processing block to perform convolution based on the plurality of weight parameters, wherein each convolution processing block of the plurality of convolution processing blocks comprises: a data buffer block configured to store the corresponding 2D array of first feature map data; an input data processing block configured to process input data for storing in the data buffer block so as to form the 2D array of first feature map stored in a data buffer block a plurality of feature map memory blocks, including a first feature map memory block, each feature map memory block configured to store a 2D array of second feature map data in relation to the convolution performed by the convolution processing block based on the 2D array of first feature map data stored in the data buffer block; a plurality of processing element blocks communicatively coupled to the plurality of feature map memory blocks, respectively, each processing element block being configured to perform convolution based on a 2D array of current feature map data, the 2D array of current feature map data being the 2D array of first feature map data stored in the data buffer block, the 2D array of second feature map data stored in the first feature map memory block or the 2D array of second feature map data stored in the corresponding feature map memory block to produce a set of second convolution outputs, the set of second convolution outputs comprising a plurality of channels of second convolution outputs; and a first convolution output combining block configured to channel-wise combine the plurality of sets of second convolution outputs produced by the plurality of processing element blocks to produce the corresponding set of first convolution outputs, and the method comprising: determining, for each of a plurality of time instances, whether to store the input data received at the time instance in the data buffer block based on whether the input data received at the time instance belongs to the convolution processing block; storing, for each time instance of the plurality of time instances that the input data received at the time instance is determined to belong to the convolution processing block, the input data received at the time instance in the data buffer block according to a data storage pattern so as to form the 2D array of first feature map stored in the data buffer block; and performing, at each of the plurality of convolution processing blocks, convolution based on the corresponding 2D array of first feature map data stored therein to produce the corresponding set of first convolution outputs.

16. The method according to claim 15, wherein the data storage pattern comprises: storing, for each time instance of the plurality of time instances that the input data received at the time instance is determined to belong to the convolution processing block, a one-dimensional (ID) array of a channel of feature map data based on the input data received at the time instance along a memory width direction in the data buffer block so as to form the 2D array of first feature map stored in the data buffer block, the 2D array of first feature map data comprising a plurality of columns and rows of feature map data.

17. The method according to claim 16, wherein the plurality of rows of feature map data of the 2D array of first feature map data comprises a plurality of groups of rows of feature map data, each group of rows of feature map data comprises consecutive rows of feature map data corresponding to a plurality of ID arrays of different channels of feature map data, respectively, and ordered in a channel order that is the same amongst the plurality of groups of rows of feature map data.

18. The method according to claim 17, wherein the input data is input image data.

19. The method according to claim 15, wherein for said each convolution processing block, each processing element block of the plurality of processing element blocks of the convolution processing block comprises a plurality of processing element sub-blocks configured to perform convolution based on the 2D array of current feature map data, each processing element sub-block comprising a plurality of processing elements, and the weight memory block corresponding to the convolution processing block comprises a plurality of sets of weight memory sub-blocks communicatively coupled to the plurality of processing element blocks of the convolution processing block, respectively, each set of weight memory sub-blocks configured to store a set of weight parameters and supply the set of weight parameters to the corresponding processing element block communicatively coupled thereto for the corresponding processing element block to perform convolution based on the 2D array of current feature map data and the set of weight parameters.

20. A method of forming a convolution engine configured to perform neural network computations for a neural network, the method comprising: forming a plurality of convolution processing blocks, each convolution processing block being configured to perform convolution based on a two-dimensional (2D) array of first feature map data stored therein to produce a set of first convolution outputs, each set of first convolution outputs comprising a plurality of channels of first convolution outputs; and forming a plurality of weight memory blocks communicatively coupled to the plurality of convolution processing blocks, respectively, each weight memory block configured to store a plurality of weight parameters and supply the plurality of weight parameters to the corresponding convolution processing block communicatively coupled thereto for the corresponding convolution processing block to perform convolution based on the plurality of weight parameters, wherein each convolution processing block of the plurality of convolution processing blocks comprises: a data buffer block configured to store the corresponding 2D array of first feature map data; an input data processing block configured to process input data for storing in the data buffer block so as to form the 2D array of first feature map stored in a data buffer block; a plurality of feature map memory blocks, including a first feature map memory block, each feature map memory block configured to store a 2D array of second feature map data in relation to the convolution performed by the convolution processing block based on the 2D array of first feature map data stored in the data buffer block; a plurality of processing element blocks communicatively coupled to the plurality of feature map memory blocks, respectively, each processing element block being configured to perform convolution based on a 2D array of current feature map data, the 2D array of current feature map data being the 2D array of first feature map data stored in the data buffer block, the 2D array of second feature map data stored in the first feature map memory block or the 2D array of second feature map data stored in the corresponding feature map memory block to produce a set of second convolution outputs, the set of second convolution outputs comprising a plurality of channels of second convolution outputs; and a first convolution output combining block configured to channel-wise combine the plurality of sets of second convolution outputs produced by the plurality of processing element blocks to produce the corresponding set of first convolution outputs.