CN106203619B - Data optimized neural network traversal

Data optimized neural network traversal

Info

Publication number
CN106203619B
Authority
CN
China
Prior art keywords
layer
neural network
feature
tiles
weights
Prior art date
Legal status
Active
Application number
CN201610370892.3A
Other languages
Chinese (zh)
Other versions
CN106203619A
Inventor
John Brothers
Joohoon Lee
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Priority claimed from US15/148,627 (US10417555B2)
Application filed by Samsung Electronics Co Ltd
Publication of CN106203619A
Application granted
Publication of CN106203619B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

Data-optimized neural network traversal is provided. Executing the neural network includes generating an output tile of a first layer of the neural network by processing an input tile to the first layer, and storing the output tile of the first layer in an internal memory of a processor. The processor may be used to generate an output tile of a second layer of the neural network by processing the output tile of the first layer stored in the internal memory.

Description

Data optimized neural network traversal
Technical Field
The present disclosure relates to neural networks. More particularly, the present disclosure relates to the execution of neural networks.
Background
Neural networks refer to computational architectures that mimic biological brains. Within a neural network, nodes called neurons may be interconnected and operate collectively to process complex input data. Examples of different types of neural networks include, but are not limited to, convolutional neural networks, recurrent neural networks, deep belief networks, restricted Boltzmann machines, and the like. In a feedforward neural network, the neurons of the neural network have links to other neurons. These links extend through the neural network in only one direction, i.e., the forward direction.
Neural networks can be used to extract "features" from complex input data. A neural network may include a plurality of layers. Each layer may receive input data and generate output data by processing the input data provided to that layer. The output data may be a feature map generated by convolving the input image or an input feature map with a convolution kernel. Initial layers (e.g., convolutional layers) of the neural network operate to extract low-level features, such as edges and/or gradients, from an input such as an image. These initial layers are also referred to as feature extraction layers. Subsequent layers of the neural network, referred to as feature classification layers, may extract or detect progressively more complex features, such as eyes, a nose, and so forth. The feature classification layers are also referred to as "fully connected layers".
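As a concrete illustration of the layer processing just described, the following minimal Python sketch shows how one convolutional layer maps a set of input feature maps to a set of output feature maps, with every output map reading every input map. It is not taken from the patent; the shapes, random kernel values, and plain nested loops are illustrative assumptions, and the kernels are applied as a sliding-window cross-correlation, as is common for convolutional neural networks.

```python
import numpy as np

def conv_layer(inputs, kernels):
    """inputs:  (C_in, H, W) input feature maps
       kernels: (C_out, C_in, K, K) convolution kernels for one layer
       returns: (C_out, H-K+1, W-K+1) output feature maps ("valid" convolution)."""
    c_in, h, w = inputs.shape
    c_out, _, k, _ = kernels.shape
    out = np.zeros((c_out, h - k + 1, w - k + 1))
    for o in range(c_out):
        for i in range(c_in):                      # every output map reads every input map
            for y in range(h - k + 1):
                for x in range(w - k + 1):
                    out[o, y, x] += np.sum(inputs[i, y:y + k, x:x + k] * kernels[o, i])
    return out

# Hypothetical shapes: an RGB input produces 4 feature maps, which produce 6 feature maps.
maps_first  = conv_layer(np.random.rand(3, 32, 32), np.random.rand(4, 3, 3, 3))
maps_second = conv_layer(maps_first, np.random.rand(6, 4, 3, 3))
print(maps_first.shape, maps_second.shape)          # (4, 30, 30) (6, 28, 28)
```

Padding, strides, pooling, and activation functions are omitted for brevity.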
An external memory may be used to store a large amount of intermediate result data generated during execution of the neural network. External memory may also be used to store a number of weights used in the feature classification layer.
Disclosure of Invention
Embodiments include a method of executing a neural network. The method includes generating an output tile of a first layer of the neural network by processing an input tile to the first layer, and storing the output tile of the first layer in an internal memory of a processor. The method further includes generating, using the processor, an output tile of a second layer of the neural network by processing the output tile of the first layer stored in the internal memory.
Another embodiment includes an apparatus for executing a neural network. The apparatus includes an internal memory within a processor and a first computing unit within the processor, coupled to the internal memory and configured to initiate executable operations. The executable operations include generating an output tile of a first layer of the neural network by processing an input tile to the first layer, and storing the output tile of the first layer in the internal memory. The executable operations further include generating an output tile of a next layer of the neural network by processing the output tile of the first layer stored in the internal memory.
This summary is provided merely to introduce a selection of concepts and not to identify key features or essential features of the claimed subject matter. Many other features and embodiments of the invention will be apparent from the accompanying drawings and from the detailed description that follows.
Drawings
The drawings illustrate one or more embodiments; the drawings, however, should not be taken to limit the invention to the only embodiments shown. Various aspects and advantages will become apparent upon review of the following detailed description and upon reference to the drawings.
FIG. 1 is a diagram showing an example of processing by multiple layers of a neural network;
FIG. 2 is a block diagram illustrating processing performed by an example neural network engine;
FIG. 3 is a block diagram illustrating an example segmentation of a neural network with overlapping tiles;
FIGS. 4-1 and 4-2 are block diagrams illustrating further example segmentations of a neural network;
FIG. 5 is a flow diagram illustrating an example method of executing a neural network;
FIG. 6 is a flow diagram illustrating an example method of determining frustums of a neural network;
FIG. 7 is a diagram showing an example of batch processing for executing a neural network;
FIG. 8 is a flow diagram illustrating an example method of executing a neural network;
FIG. 9 is a diagram of an example data processing system.
Detailed Description
While the disclosure concludes with claims defining novel features, it is believed that the various features described herein will be better understood from a consideration of the description in conjunction with the drawings. The processes, machines, manufacture, and any variations thereof described in this disclosure are provided for illustrative purposes. Any specific structural and functional details described are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the features described in virtually any appropriately detailed structure. Further, the terms and phrases used in this disclosure are not intended to be limiting but rather to provide an understandable description of the feature being described.
The present disclosure relates to neural networks. More particularly, example embodiments disclosed herein relate to reducing memory access and intra-network bandwidth consumption of neural networks during operation. According to example arrangements disclosed herein, methods and systems for operating a neural network are provided. Example embodiments described herein can facilitate efficient use of internal memory and reduce the amount of data accesses to external memory or high-level cache memory performed during operation of a neural network.
For example, the example embodiments disclosed herein can reduce or eliminate all or almost all of the data traffic and associated storage for intermediate results during forward operation of the neural network. Example embodiments relate to execution of one or more stages of a neural network. For example, the execution of feature extraction layers (e.g., convolutional layers) is described in connection with FIGS. 1-6. Additionally or alternatively, as described in more detail in connection with FIGS. 7 and 8, example embodiments can eliminate many (e.g., all or nearly all) of the parameter reads from, and writes to, external memory associated with execution of the fully connected layers of the neural network. As used herein, the term "fully connected layer" means a feature classification layer of a neural network.
Reducing the data traffic for executing the neural network improves performance and reduces the power required to compute the same result (e.g., with no approximation). Example embodiments described herein can facilitate increased execution speed, reduced power consumption, and reduced memory load by reducing the reading and writing of intermediate results and/or feature classification layer weights to and from external storage.
Convolutional neural networks may be deployed for many applications including, but not limited to, object recognition in images, image reconstruction, semantic segmentation, scene recognition, and the like. Object recognition refers to image processing that detects or identifies particular objects (such as a cat, a car, a chair, etc.) in an image. Image reconstruction refers to image processing that attempts to correct an image. An example of image reconstruction is sharpening a blurred image. Semantic segmentation refers to image processing that annotates portions of an image. Scene recognition refers to image processing that determines a particular scene (such as an office, bedroom, stadium, etc.) represented in an image. Beyond these visual examples, there are many other application areas to which similar neural networks are effectively applied.
While neural networks can achieve excellent accuracy, they can be computationally intensive. For example, neural networks typically perform a large number of operations per image, require a large number of weights to execute, and produce a large amount of intermediate result traffic. For example, a typical neural network may perform on the order of billions of operations per image, use hundreds of millions to billions of weights, and generate hundreds of gigabytes of intermediate result data. In many implementations, the traffic for weights and intermediate results is costly in terms of energy. As the computational efficiency of neural networks increases, this traffic makes up a larger proportion of the power spent executing a neural network, thereby limiting the use of neural networks in power-constrained mobile devices and other power-constrained applications and/or computing environments. Accordingly, example embodiments disclosed herein can facilitate the deployment of neural networks and neural network-based applications on mobile electronic devices.
Fig. 1 is a diagram showing an example of processing by a plurality of layers of a neural network 100. Fig. 1 illustrates an input 102 and a plurality of feature map sets 104 and 106. For example, the input 102 may be an image to be processed through the neural network 100. As used herein, the term "feature map set" means one or more feature maps (e.g., data). The feature map set is received as input and processed by a layer of the neural network and/or generated as output by a layer of the neural network. In an example embodiment, the feature map sets 104 and 106 are generated by feature extraction layers or convolution layers of the neural network 100.
In general, the layers of the neural network 100 each define a mapping from input to output. In the case of a convolutional neural network, for example, the mapping defined by a layer is implemented as one or more convolution kernels to be applied to the input data (such as an image and/or particular feature maps) to generate the layer's output feature maps. Referring to FIG. 1, a layer (not shown) receives the input 102 during forward execution and produces the feature map set 104 as output. The next layer (not shown) receives the feature map set 104 as input and produces the feature map set 106 as output during forward execution. Yet another layer may receive the feature map set 106 as input during forward execution and produce a further feature map set as output. Thus, during forward execution, data flows up from the input 102 to the feature map set 104 and on to the feature map set 106. All, or one or more, of the layers receiving and/or generating the feature map sets 104 and 106 may be hidden layers (e.g., hidden convolutional layers). In addition to applying convolution kernels to map input feature maps to output feature maps, other processing operations may be performed. Examples of such processing operations include, but are not limited to, application of an activation function, pooling, and resampling.
In the example of FIG. 1, feature map set 104 includes four feature maps 104-1, 104-2, 104-3, and 104-4. The feature map set 106 includes six feature maps 106-1, 106-2, 106-3, 106-4, 106-5, and 106-6. It should be understood that the number of feature maps shown in each of feature map sets 104 and 106 is for illustration purposes. The example arrangements described in this disclosure are not intended to be limited by the particular number of feature maps in any feature map set of the neural network 100 and/or by the particular number of layers in the neural network 100.
The term "intermediate data" refers to the data of feature maps produced by hidden convolutional layers (e.g., layer 1 to layer N-1) of a neural network. For example, a neural network engine (NN engine) produces intermediate data during execution of a neural network, such as the neural network 100.
In general, each of the feature map sets 104 and 106 may consist of tens to hundreds of feature maps. In one example, each feature map is a 2D image map of 16-bit values representing the intensity of a given feature at each x, y location. To generate each feature map of layer N+1 of the neural network, the NN engine reads each feature map output by layer N of the neural network. For example, if layer N generates 10 feature maps that serve as inputs to layer N+1, and layer N+1 generates 20 feature maps as outputs, then each feature map of layer N must be read 20 times in the course of running layer N+1. Therefore, the NN engine must perform a total of 200 feature map reads from layer N.
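The read count in this example can be checked with a few lines of arithmetic (the layer sizes are the hypothetical ones from the text):

```python
maps_layer_n  = 10   # feature maps produced by layer N
maps_layer_n1 = 20   # feature maps produced by layer N+1

reads_per_input_map = maps_layer_n1                    # each layer N map is read once per output map
total_map_reads     = maps_layer_n * maps_layer_n1
print(reads_per_input_map, total_map_reads)            # 20, 200
```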
In one arrangement, the NN engine may use parallelism to rearrange the computations so that intermediate data is consumed soon after it is produced. By consuming intermediate data shortly after production, only a small amount of intermediate data needs to be stored at any one time. That small amount of intermediate data may fit into nearby on-chip memory (e.g., internal memory) rather than being stored in external random access memory (RAM) or another remote cache memory. Furthermore, in example embodiments, little intermediate data (if any) is moved any significant distance within the NN engine itself. The same local set of multiply-accumulate (MAC) units that produces the intermediate data may consume that intermediate data as input shortly after it is produced. This further reduces power because no long interconnects within the NN engine are required to transfer intermediate data.
In another example embodiment, the NN engine may be configured to rearrange the computations to reduce and/or eliminate, and to localize, the intermediate data by interleaving the generation of one or more, or possibly all, convolutional layers of the neural network. This is in contrast to executing an entire layer to produce feature map set 104, then executing the entire next layer to produce feature map set 106, and so on. Rather, according to the example arrangements described herein, the NN engine may execute a portion of one layer to produce a portion of the feature map set 104, then execute a portion of the next layer to produce a portion of the feature map set 106, and so on. For example, the NN engine may generate tile 110-1 of feature map set 104 followed by the corresponding tile 112-1 of feature map set 106, and so on. The NN engine may then generate tile 110-2 followed by tile 112-2, and so forth.
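The sketch below contrasts the two traversal orders described above. It is schematic only: `process_tile` is a hypothetical stand-in for the per-tile convolution work performed by the NN engine, and the layer and tile counts are arbitrary.

```python
def layer_at_a_time(num_layers, num_tiles, process_tile):
    for layer in range(num_layers):           # finish one whole layer...
        for tile in range(num_tiles):         # ...before starting the next
            process_tile(layer, tile)

def tile_at_a_time(num_layers, num_tiles, process_tile):
    for tile in range(num_tiles):             # e.g. frustum 1, then frustum 2, ...
        for layer in range(num_layers):       # push one tile up through all the layers
            process_tile(layer, tile)         # the output tile stays in internal memory
                                              # and directly feeds the next layer

order = []
tile_at_a_time(3, 4, lambda layer, tile: order.append((tile, layer)))
print(order[:6])   # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

In the tile-at-a-time order, the output tile produced for one layer is still resident in internal memory when it is consumed as the input tile of the next layer.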
For illustrative purposes, the neural network 100 of FIG. 1 may be visualized as a pyramid of layers. As described, execution may be performed starting from the bottom of the pyramid by processing the input 102 to produce a feature map set 104 having tiles 110 and processing the feature map set 104 to produce a feature map set 106 having tiles 112. As the neural network 100 is traversed upward, each next higher level may shrink in the x-y dimension while the number of feature maps for the next higher level may increase. For example, the x-y dimensions of the layers that produce feature map set 106 may be smaller than the x-y dimensions of the layers that produce feature map set 104. Feature map set 106 has more feature maps than feature map set 104. In other cases, the number of feature maps in the next higher layer of the neural network may remain unchanged.
According to another example embodiment, the 3D volume of the neural network 100 may be conceptually decomposed or partitioned into a plurality of rectangular frustums. Each rectangular frustum has rectangular surfaces of intersection with the input and/or with each feature map set used by the neural network 100, and these intersections define tiles. In this regard, a tile is a rectangular portion of the input data or of a feature map set of the neural network. In the example of FIG. 1, the neural network 100 is divided into four frustums, referred to as frustum 1, frustum 2, frustum 3, and frustum 4. The rectangular tiles are defined by the intersecting surfaces of the frustums within the input 102 and by the intersecting surfaces of the frustums within each of the feature map sets 104 and 106. Thus, each tile of a given feature map set comprises a portion of each feature map of that feature map set. For example, tile 110-1 includes the upper-left portion of each of feature maps 104-1, 104-2, 104-3, and 104-4. For purposes of discussion, the suffix of the reference number of each tile of a feature map set indicates the particular frustum to which that tile belongs. For example, frustum 1 may include tile 108-1 of input 102, tile 110-1 of feature map set 104, and tile 112-1 of feature map set 106. Frustum 2 may include tile 108-2 of input 102, tile 110-2 of feature map set 104, tile 112-2 of feature map set 106, and so on. Because the layers define the mapping of inputs to outputs using convolution kernels, it should be understood that each frustum also determines, in each layer, the particular convolution processing that operates on the input tile and produces the output tile. An example method for the segmentation is described in more detail in connection with FIG. 6.
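One way to picture the segmentation is to compute, for each frustum, the rectangle it cuts out of every layer. The sketch below returns those per-layer rectangles for frustums 1-4; the layer sizes and the 2x2 grid are illustrative assumptions rather than values from the patent.

```python
def frustum_tiles(layer_sizes, grid=(2, 2)):
    """layer_sizes: list of (height, width) per layer, from input upward.
       Returns {frustum_id: [(y0, y1, x0, x1) per layer]}."""
    gy, gx = grid
    frustums = {}
    for fy in range(gy):
        for fx in range(gx):
            fid = fy * gx + fx + 1                         # frustum 1, 2, 3, 4
            rects = []
            for h, w in layer_sizes:                       # the same frustum cuts a tile
                y0, y1 = fy * h // gy, (fy + 1) * h // gy  # out of every layer it crosses
                x0, x1 = fx * w // gx, (fx + 1) * w // gx
                rects.append((y0, y1, x0, x1))
            frustums[fid] = rects
    return frustums

# e.g. an input layer and two feature map sets whose x-y extent shrinks going upward
print(frustum_tiles([(64, 64), (32, 32), (16, 16)])[1])
# [(0, 32, 0, 32), (0, 16, 0, 16), (0, 8, 0, 8)]
```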
As used herein, "executing a layer" of a neural network and "processing data" using a layer of a neural network (e.g., by a processor, computing unit, NN engine, etc.) mean applying the convolution kernels of that layer of the neural network to the data provided as input to the layer to produce the layer's set of output feature maps. The data may be a feature map, a set of feature maps, or another input (such as one or more images). In this regard, it should be understood that portions of the neural network may be executed to process tiles. As used herein, "processing a tile" using a layer of the neural network means applying the subset of the layer's convolution kernels that corresponds to the tile provided as input to the layer to produce an output tile of the layer. For example, the convolution kernels of the layer that lie in the frustum of the tile provided as input may be applied to the input tile to produce the output tile.
In general, the processing in each frustum may be performed independently of the other frustums. In one example embodiment, a small amount of data may be shared between adjacent frustums. Furthermore, for a given tile of a layer, all of the feature map portions necessary to produce the corresponding tile in the next layer may be stored in a buffer local to the particular logic circuitry that produces the corresponding tile in the next layer. As defined in this disclosure, the term "corresponding tile" refers to a tile that is in the same frustum of the neural network as a reference or subject tile, but in an adjacent layer.
For example, for a given layer of the neural network 100, the portions of the feature maps consumed and produced by the processor of the NN engine may be stored in internal memory on the processor chip. The feature map portion generated by the processor for a tile is used to generate the output tile that is provided as input to the next layer. For example, the processor may consume a portion of the input 102 (e.g., tile 108-1) stored in the internal memory to generate the corresponding tile 110-1 of the feature map set 104. Tile 110-1 of the feature map set 104 may also be stored in the internal memory. The processor may then use tile 110-1 of the feature map set 104 in the internal memory to generate tile 112-1 of the feature map set 106. Tile 112-1 may also be stored in the internal memory. In one aspect, the total storage required of the internal memory to process a frustum is the maximum footprint (e.g., storage usage) of the corresponding tiles of the frustum in two adjacent layers of the neural network 100. For example, the data of tile 112-1 may overwrite the data of tile 108-1. It should be understood that the x and y dimensions of the tiles (e.g., the frustum size) may be reduced to the size needed to ensure that the intermediate results fit within the available internal memory.
For each frustum of the neural network 100, the NN engine may generate the portion of the feature maps defined by the corresponding tile of layer N+1 from the portion of the feature maps defined by the tile of layer N. In one embodiment, the NN engine may perform the necessary processing in any of a variety of different orders while holding all of the required data in an internal memory or buffer. For example, the NN engine may generate the portion of each feature map for the corresponding tile of layer N+1 by reading and convolving all of the input feature maps defined by the tile of layer N and summing the results. After the corresponding tile of layer N+1 is generated, the layer N tile data used to generate the layer N+1 tile is no longer needed. Thus, the NN engine may reclaim, delete, free, or overwrite the storage used for the layer N tile so that the results for layer N+2 (e.g., the corresponding tile) can be stored there, and so on. The NN engine may continue overwriting the intermediate data of earlier layers as the newly generated intermediate data of subsequent layers is produced, as described. An example method is described in more detail in connection with FIG. 5.
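The storage reclamation described above can be sketched as a two-buffer scheme: at any moment the internal memory holds only the tile being consumed and the tile being produced, and the buffer that has just been consumed is reused for the layer after next. `apply_layer` is a hypothetical stand-in for the per-tile convolution routine; in a real implementation both buffers would be preallocated to the largest tile footprint.

```python
def run_frustum(input_tile, layers, apply_layer):
    """Push one frustum's input tile up through all of the given layers,
       keeping only two tile buffers live at any time."""
    buf_a, buf_b = input_tile, None
    for layer in layers:
        buf_b = apply_layer(layer, buf_a)   # consume one buffer, fill the other
        buf_a, buf_b = buf_b, buf_a         # swap: the old input buffer is now free
                                            # and will be overwritten two layers on
    return buf_a                            # the frustum's final (topmost) tile

# e.g. run_frustum(tile_108_1, [layer_1, layer_2], apply_layer) -> tile_112_1
```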
By dividing the neural network into frustums that can be processed independently of each other, the NN engine may process the frustums in parallel using multiple computing units. For example, one computing unit of the NN engine may process tiles 108-1, 110-1, and 112-1; another computing unit of the NN engine may process tiles 108-2, 110-2, and 112-2; another computing unit of the NN engine may process tiles 108-3, 110-3, and 112-3; and another computing unit of the NN engine may process tiles 108-4, 110-4, and 112-4. Parallel processing is described in more detail in connection with FIG. 4.
In some cases, some data (e.g., very little data) is used by closely adjacent tiles within the same feature map set. Thus, while the frustums can be processed largely independently, a small portion of the intermediate data may be shared along the boundaries of adjacent tiles within the same feature map set when processed by the layers of the neural network. For example, a small portion of the data generated for tile 110-1 of the feature map set 104 may be shared with tile 110-2 of the feature map set 104. Because the processing is consistent within each frustum (the same number of calculations is performed and the data is held internally), the processing time is predictable. Thus, synchronization between tiles can be controlled simply, without any significant stalls. In an example embodiment, a computing unit will naturally complete processing of an input tile at substantially the same time as the computing unit operating on an immediately adjacent input tile, at which time the data at the edges of the adjacent tiles may be exchanged. In another example embodiment, synchronization and data exchange may be implemented at a finer granularity, on a per-feature-map basis. This frustum-based approach to neural network traversal makes scaling of the architecture simple and efficient.
In another example embodiment, data sharing between adjacent tiles may be eliminated by defining tiles that overlap each other at tile boundaries. In this case, the NN engine generates the data for a tile, including the boundary region of the tile, once per tile. Thus, in the case of overlapping tiles, data need not be shared between two adjacent tiles. An example of overlapping tiles is described in more detail in connection with FIG. 3.
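One rough way to size the overlap is shown below. The formula is an assumption for illustration only (it simply accumulates the per-layer halo of K//2 pixels per side for K x K kernels and ignores strides and pooling); the patent does not prescribe a specific calculation.

```python
def halo_per_side(kernel_sizes):
    """Extra pixels needed on every side of a tile so it can be pushed through
       the given stack of K x K convolutions without borrowing neighbour data."""
    return sum(k // 2 for k in kernel_sizes)

def padded_tile_size(tile_h, tile_w, kernel_sizes):
    halo = halo_per_side(kernel_sizes)
    return tile_h + 2 * halo, tile_w + 2 * halo

print(halo_per_side([3, 3]))             # two 3x3 layers -> 2-pixel halo per side
print(padded_tile_size(64, 64, [3, 3]))  # (68, 68)
```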
FIG. 2 is a block diagram illustrating processing performed by an example NN engine 200. As shown, the NN engine 200 may include a processor 205 and an external memory 215. The processor 205 may include one or more computing units 208. Where the processor 205 includes more than one computing unit 208, the computing units 208 may be configured to operate in parallel or concurrently with one another. Further, the computing units 208 may operate independently of one another. In one example, each computing unit 208 may be implemented as a core capable of executing instructions.
The processor 205 may be implemented as one or more hardware circuits. For example, the processor 205 may be implemented as an integrated circuit. In an example embodiment, the processor 205 may be configured to execute instructions (such as instructions 225). The instructions 225 may be embodied in program code. Example implementations of the processor 205 include, but are not limited to, a central processing unit (CPU), a multi-core CPU, an array processor, a vector processor, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic array (PLA), an application specific integrated circuit (ASIC), programmable logic circuitry, a controller, a graphics processing unit (GPU), and so forth. The NN engine 200 may be implemented using any of these varieties of processor in combination with the external memory 215.
The processor 205 may include an internal memory 210. The internal memory 210 may be an on-chip memory. For example, internal memory 210 may be a cache memory of processor 205. Internal memory 210 may be implemented as a simple buffer, a level 1 cache, a level 2 cache, or other type of on-chip memory of processor 205. As shown, the computing unit 208 may be connected to an internal memory 210. In an arrangement where the processor 205 includes a plurality of computing units 208, each computing unit 208 may have a dedicated internal memory 210. The internal memory 210, or individual internal memories as the case may be, may store the feature maps and/or portions thereof as feature map data 222-1, weights 220-1, and instructions 225.
As shown, the processor 205 may be connected to an external memory 215. In one example, external memory 215 may be implemented as one or more higher levels of cache memory for processor 205. However, the external memory 215 may not be located on the same chip as the processor 205. In another example, the external memory 215 may be implemented as RAM (e.g., DRAM, SRAM) or other memory. In another example, the processor 205 may be connected to the external memory 215 through a memory controller (not shown).
In one example embodiment, the external memory 215 stores the weights 220-2 for neurons of the neural network that are not currently in use. The external memory 215 can also store the final output tile data 222-2. Thus, while the external memory 215 stores the weights 220-2, the weights 220-1 may be stored in the internal memory 210. For example, the weights 220-1 are those weights required to process an input tile to the current layer of the neural network 100 to produce an output tile of the current layer that will be used as input for the next layer. In the example of FIG. 2, the weights 220-1 are the weights required to process tile 108-1 of the input 102 to produce tile 110-1 of the feature map set 104 as output. The weights 220-2 are the other weights of the neural network 100 that are not currently used or needed to process tile 108-1. In another example embodiment, the processor 205 may compress the weights 220-1 for storage in the internal memory 210.
In another example embodiment, where the processor 205 includes multiple computing units, each with its own internal memory, the weights may be loaded once per layer. Each computing unit uses the same weights as the other computing units when processing the same feature map set. The internal memory of each computing unit may store different weights, for example, for the layer of the neural network that is currently being processed. The portion of each computing unit's internal memory used to store weights may be shared with each other computing unit. That is, the internal memory of each computing unit is accessible to the other computing units so that the weights stored therein can be shared for processing the tiles.
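The per-layer weight residency described above might be sketched as follows. `load_layer_weights` and `run_layer` are hypothetical callbacks standing in for the external-memory read (plus any decompression) and for the tile processing performed by the computing units; they are not APIs from the patent.

```python
def run_layers(layers, load_layer_weights, run_layer):
    """Keep only the current layer's weights resident in internal memory."""
    for layer in layers:
        weights = load_layer_weights(layer)   # one external-memory read per layer
        run_layer(layer, weights)             # reused by all computing units / tiles
        del weights                           # free internal memory for the next layer
```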
FIG. 3 is a block diagram illustrating an example segmentation of a neural network with overlapping tiles. More specifically, fig. 3 illustrates a feature map set 104 of the neural network 100 implemented using overlapping tiles. As shown, the tiles 110-1, 110-2, 110-3, and 110-4 are defined to overlap one another. The overlap area 305 is shown shaded. The overlap region 305 is also shown separately without tiles 110-1, 110-2, 110-3, and 110-4. Defining tiles with overlap as described avoids having to share data between adjacent tiles.
At a coarser level, some number of computing units may each process a wider area of the entire neural network spanning multiple frustums. For example, a four-computing-unit configuration may divide the network into four quadrants, each quadrant having 16 frustums. In this case, each computing unit may traverse the quadrant assigned to it from the top-left corner to the bottom-right corner. In another example, the neural network may be divided into a checkerboard of frustums.
FIGS. 4-1 and 4-2 are block diagrams illustrating further example segmentations of a neural network. In the examples of FIGS. 4-1 and 4-2, the tiles do not overlap. Referring to FIG. 4-1, the feature map set 402 is partitioned into 16 tiles, shown as tiles 404-434. The NN engine that processes the feature map set 402 may include a plurality of different computing units or cores as described. For example, the NN engine may include four computing units A, B, C, and D. The NN engine may distribute the work across computing units A, B, C, and D to achieve good scaling with minimal (or at least reduced) cross traffic for the exchange of data between adjacent tiles.
In the example of FIG. 4-1, the portion 440 of the feature map set 402 may include tiles 404, 406, 408, and 410. The NN engine may process the feature map tiles of portion 440 as inputs to the next layer of the neural network. Computing unit A may process tile 404. Computing unit B may process tile 406. Computing unit C may process tile 408. Computing unit D may process tile 410. Computing units A, B, C, and D operate simultaneously.
Similarly, portion 442 includes tiles 412, 414, 416, and 418. The NN engine may process the feature map tiles of portion 442 as inputs to the next layer of the neural network. Computing unit A may process tile 414. Computing unit B may process tile 412. Computing unit C may process tile 418. Computing unit D may process tile 416. Computing units A, B, C, and D operate simultaneously.
Portion 444 includes tiles 420, 422, 424, and 426. The NN engine may process the feature map tiles of portion 444 as inputs to the next layer of the neural network. Computing unit A may process tile 424. Computing unit B may process tile 426. Computing unit C may process tile 420. Computing unit D may process tile 422. Computing units A, B, C, and D operate simultaneously.
Portion 446 includes tiles 428, 430, 432, and 434. The NN engine may process the feature map tiles of portion 446 as inputs to the next layer of the neural network. Computing unit A may process tile 434. Computing unit B may process tile 432. Computing unit C may process tile 430. Computing unit D may process tile 428. Computing units A, B, C, and D operate simultaneously.
In the process described with reference to FIG. 4-1, the NN engine may process a portion and then proceed to process the corresponding portion of the next layer, continuing upward through one or more further feature extraction layers of the neural network. In other arrangements, the NN engine may operate on portion 440, then portion 442, and then proceed with processing of the corresponding portions in the next layer or layers, as appropriate, before returning to the layer of feature map set 402 to process portions 444 and 446.
It should be understood that FIG. 4-1 is provided for illustrative purposes only. In one or more other embodiments, the NN engine can divide the neural network into bands, traverse in row-first order, in Z-order, and so forth. Other coarse-level subdivisions and traversals can also be used. One consideration is to reduce data exchange between frustums and/or computing units. For example, referring to FIG. 4-1, computing unit B operates on the adjacent tiles 406 and 412, which belong to different portions, and on the adjacent tiles 426 and 432, which belong to different portions. The same is true for computing unit C with the adjacent tiles 408 and 420 and the adjacent tiles 418 and 430. Computing unit D operates on the adjacent group formed by tiles 410, 416, 422, and 428.
FIG. 4-2 illustrates another example segmentation of the feature map set 402. In the example of FIG. 4-2, the feature map set 402 is partitioned into tiles 1002-1038. Tiles 1002-1008 are in row 1010. Tiles 1012-1018 are in row 1020. Tiles 1022-1028 are in row 1030. Tiles 1032-1038 are in row 1040. Each tile in FIG. 4-2 is also labeled with the specific computing unit of the NN engine that operates on that tile. As shown, computing unit A operates on each tile of row 1010. Computing unit B operates on each tile of row 1020. Computing unit C operates on each tile of row 1030. Computing unit D operates on each tile of row 1040. The arrows indicate that each computing unit is configured to traverse the tiles in its row from left to right. For example, referring to row 1010, computing unit A processes tile 1002, then tile 1004, then tile 1006, then tile 1008. Each of the other computing units may operate on the tiles of its row in the same manner. It will be appreciated that the tiles may instead be processed from right to left if desired.
The order in which the tiles are processed or traversed may be determined by bands. As defined herein, the term "band" means a collection of two or more adjacent tiles in the same row or column. In one example embodiment, a band is formed by the adjacent tiles of a row or column. Traversing the tiles from left to right on a row-by-row basis is an example of horizontal (or row-based) bands. In one example, each row may be a band, where row 1010 corresponds to band 1104; row 1020 corresponds to band 1106; row 1030 corresponds to band 1108; and row 1040 corresponds to band 1110. FIG. 4-2 thus illustrates an example embodiment in which each band is formed from one row. In other examples, however, each band may be formed from two, three, four, or more rows.
In the case of non-overlapping tiles, organizing and traversing tiles in bands provides a number of advantages. In one aspect, data need not be exchanged between computing units at the boundary of two tiles in a band when moving from one tile to the next adjacent tile in the same band. For example, computing unit A processes tile 1002 and then the next tile 1004 following tile 1002. As such, no data needs to be exchanged with a different computing unit to process the shared boundary or edge between tile 1002 and tile 1004.
On the other hand, data exchange between computing units A, B, C, and D is facilitated because the computing units complete operations on adjacent tiles of different bands (e.g., tiles in the same column as shown in FIG. 4-2) at approximately the same time. For example, band 1104 and band 1106 have a shared boundary region 1204. If computing unit A processes band 1104 and computing unit B processes band 1106, computing unit A and computing unit B share data in order to process the shared boundary region 1204. Processing of tile 1002 by computing unit A is concurrent with processing of tile 1012 by computing unit B. Computing unit A and computing unit B finish processing tiles 1002 and 1012, respectively, at approximately the same time, allowing computing unit A and computing unit B to more easily share the data of the shared boundary region 1204 (the shared edge between tiles 1002 and 1012). Computing unit A and computing unit B may then proceed to tiles 1004 and 1014, respectively, process tiles 1004 and 1014, share the data, and continue along their respective bands.
Similarly, band 1106 and band 1108 have a shared boundary region 1206. If computing unit B processes band 1106 and computing unit C processes band 1108, then computing unit B and computing unit C share data to process the shared boundary region 1206. Processing of tile 1012 by computing unit B is concurrent with processing of tile 1022 by computing unit C. Computing unit B and computing unit C finish processing tiles 1012 and 1022, respectively, at approximately the same time, allowing computing unit B and computing unit C to more easily share the data of the shared boundary region 1206 (the shared edge between tiles 1012 and 1022). Computing unit B and computing unit C may then move to tiles 1014 and 1024, respectively, process tiles 1014 and 1024, share the data, and continue along their respective bands.
Finally, band 1108 and band 1110 have a shared boundary region 1208. If computing unit C processes band 1108 and computing unit D processes band 1110, then computing unit C and computing unit D share data to process the shared boundary region 1208. Processing of tile 1022 by computing unit C is concurrent with processing of tile 1032 by computing unit D. Computing unit C and computing unit D finish processing tiles 1022 and 1032, respectively, at approximately the same time, allowing computing unit C and computing unit D to more easily share the data of the shared boundary region 1208 (the shared edge between tiles 1022 and 1032). Computing unit C and computing unit D may then proceed to tiles 1024 and 1034, respectively, process tiles 1024 and 1034, share the data, and continue along their respective bands.
While FIG. 4-2 is generally described as having bands formed from one or more rows of tiles, it should be understood that bands may instead be formed from one or more columns of tiles. For example, a first band may be formed by tiles 1002, 1012, 1022, and 1032, a second band by tiles 1004, 1014, 1024, and 1034, and so on. In that case, each computing unit may start at the top (or bottom) of its band and process the tiles moving from the top (or bottom) to the bottom (or top) of the band. Each band may also be formed from two, three, four, or more columns.
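The band assignment and traversal of FIG. 4-2 can be sketched as a simple schedule. The 4x4 grid, the unit names, and the lock-step advance of one column per step are illustrative assumptions.

```python
rows, cols, units = 4, 4, ["A", "B", "C", "D"]     # one row-band per computing unit

schedule = []                                      # (step, unit, tile) triples
for col in range(cols):                            # every unit advances one column per step
    for row, unit in enumerate(units):
        schedule.append((col, unit, (row, col)))

for step, unit, (row, col) in schedule:
    below = (row + 1, col) if row + 1 < rows else None
    msg = f"step {step}: unit {unit} processes tile ({row},{col})"
    if below:
        msg += f", then exchanges its bottom edge with tile ({below[0]},{below[1]})"
    print(msg)
```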
FIG. 5 is a flow diagram illustrating an example method 500 of executing a neural network. More specifically, method 500 illustrates an example method of executing the feature extraction layers of a neural network. With the exception of block 505, the method 500 may be performed by the NN engine described with reference to FIG. 2.
At block 505, the neural network may be partitioned into a plurality of frustums. The frustums may be rectangular. The neural network may be divided into rectangular frustums that project from higher layers to lower layers of the neural network. In one embodiment, the neural network is segmented into frustums using an offline process performed by a data processing system. The segmented neural network and the segmentation may be stored in memory as a data structure, or as part of a data structure, so that the NN engine and/or another system can read and/or determine the segmentation when executing the neural network.
For example, the system may partition the neural network according to the size of the internal memory of the processor of the NN engine. The system may scale the size of the frustums according to the amount of storage available internally to the processor for storing the weights used to process a tile of one layer, the feature map tiles of the adjacent layers, and the instructions for generating the tile of the next layer. An example method of implementing block 505 is described in connection with FIG. 6.
At block 510, the NN engine may select a layer of the neural network as the current layer. For example, the NN engine may select layer N as the current layer. At block 515, the NN engine may determine whether the current layer is designated as a stopping point for the execution method shown in FIG. 5. If so, in one aspect, the method 500 may end. In another aspect, the method 500 may restart at a selected layer of the neural network.
For example, at block 515, the NN engine may determine that a particular layer of the neural network has been reached. In response, the NN engine may restart processing from the starting layer if a next tile remains to be processed as input to the starting layer (e.g., at least one other frustum of the neural network still requires processing), and may end otherwise. If the designated layer has not been reached, the method 500 may continue to block 520.
At block 520, the NN engine may select a tile provided as input to the current layer as the current tile. At block 525, the NN engine may generate, for the current tile, the corresponding tile of the next or adjacent layer of the neural network. In other words, the NN engine processes the selected tile as an input tile to produce an output tile (e.g., the corresponding tile). At block 530, the system determines whether to process another tile of the input object (image or feature map set) to the current layer or to continue to the next layer of the neural network. In response to determining that a different or next tile of the input object is to be processed, the method 500 loops back to block 520 to select the next tile of the input object. In response to determining to continue to the next layer of the neural network, the method 500 loops back to block 510 to select the next adjacent layer of the neural network. In that case, the tile selected in the next iteration of block 520 may be the output tile produced in the previous iteration.
In one embodiment, depending on the segmentation, the bands, the number of computing units in the NN engine, the size of the internal memory, and so forth, the NN engine may process another tile in the current layer. For example, the NN engine may process only a subset of the tiles (e.g., one or more, but fewer than all, of the tiles) provided as input to the current layer. The NN engine may continue to the next layer of the neural network before processing all of the tiles in the previous layer. As discussed, intermediate data generated by processing an input tile to the current layer may be stored in internal memory and used for the next layer (e.g., as an input tile for generating an output tile of the next layer).
In another embodiment, the method 500 may be performed by a first computing unit of the NN engine, while one or more other computing units of the NN engine also implement the method 500 of fig. 5 concurrently with the first computing unit. The compute units may also operate in an asynchronous manner, so data at the edges of adjacent tiles in the same feature map set being processed by concurrently operating compute units may be shared. Alternatively, tiles may be defined in an overlapping manner to avoid sharing of data between computing units.
In another embodiment, the method 500 may be performed to process a first frustum of the neural network through a first plurality of adjacent layers. The method 500 may be iterated to process each other frustum through the first plurality of adjacent layers. The method 500 may then be performed again to process a first frustum of a second plurality of adjacent layers having a different segmentation than the first plurality of adjacent layers. The method 500 may be repeated to process the remaining frustums of the second plurality of adjacent layers using the different segmentation.
FIG. 6 is a flow diagram illustrating an example method 600 of determining the frustums of a neural network. The method 600 may be performed to partition the neural network into frustums that define the sizes of the tiles of the various feature extraction layers of the neural network. In one aspect, method 600 may be an example implementation of block 505 of FIG. 5. In one embodiment, the method 600 may be performed by a data processing system (system), such as a computer. For example, the method 600 may be performed as an offline process, i.e., prior to execution of the neural network. The determined segmentation may be stored as part of the neural network for subsequent execution by the NN engine.
At block 605, the system may select a set of adjacent feature extraction layers of the neural network to process together, keeping the intermediate data in the internal memory of the processors of the NN engine. For example, the system may select two adjacent feature extraction layers. As discussed, keeping the intermediate data in the internal memory of the processors of the NN engine reduces off-chip data traffic generated in the course of executing the neural network.
At block 610, the system may subtract the storage required to store the compressed weights for the set of layers from the determined internal memory size. The amount of storage required to store the compressed weights of each layer of the neural network may be determined from a training process performed prior to segmentation. At block 615, the system may determine the width and height of the tile based on the storage required for the number of feature maps in layer N of the set plus the corresponding storage requirement of the next layer (layer N+1) of the set. The storage required for layer N+1 is the product of the scaled width and height and the number of feature maps in layer N+1. The width and height are scaled from layer N and account for the additional neurons at the tile boundaries needed for the convolution kernel width and height.
For the set of layers selected in block 605, the system may determine a width and height for any given layer such that, after subtracting the compressed weights from the total available storage of the internal memory, the portions of the feature maps for the tiles (e.g., corresponding tiles) of the two adjacent layers fit in the remaining storage. Since the tile resolution is scaled at each layer, a single size can be chosen so that no layer exceeds the available storage.
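A rough sketch of this tile-sizing search is shown below. The 16-bit (2-byte) feature map elements, square tiles, the particular scaling rule for the layer N+1 tile, and the example memory sizes are all assumptions for illustration; the exact procedure of the patent may differ.

```python
def fits(side, maps_n, maps_n1, kernel, stride, budget_bytes, elem_bytes=2):
    """Do the layer N tile and its corresponding layer N+1 tile fit in the budget?"""
    side_n1 = side // stride + (kernel - 1)        # scaled size plus boundary neurons
    tile_n  = side * side * maps_n * elem_bytes
    tile_n1 = side_n1 * side_n1 * maps_n1 * elem_bytes
    return tile_n + tile_n1 <= budget_bytes

def max_tile_side(internal_mem, compressed_weights, maps_n, maps_n1, kernel, stride):
    budget = internal_mem - compressed_weights     # storage left after the weights
    side = 1
    while fits(side + 1, maps_n, maps_n1, kernel, stride, budget):
        side += 1
    return side

# e.g. 1 MiB of internal memory, 128 KiB of compressed weights, 64 -> 128 feature maps, 3x3 kernels
print(max_tile_side(1 << 20, 128 << 10, 64, 128, kernel=3, stride=1))   # 47
```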
FIG. 6 is presented for illustrative purposes only and is therefore not intended as a limitation of the inventive arrangements disclosed herein. FIG. 6 illustrates an example process of segmenting a neural network into frustums based on the size of the internal memory of the NN engine. In one arrangement, the process of FIG. 6 may be performed for different groups of adjacent feature extraction layers of a neural network to determine more than one segmentation of the neural network. In this way, a first segmentation may be used to execute a first portion of the neural network having multiple adjacent feature extraction layers, while a second (or further) different segmentation may be used to execute a second (or third or further) portion of the neural network having adjacent feature extraction layers.
In another aspect, example embodiments described herein may also address the reduction of parameter data traffic. For example, the reading of parameter data may consume a significant portion of the memory bandwidth in the feature classification layer of the neural network. As defined in this specification, the term "parameters" means weights applied to data read from an input feature map to produce an output feature map in a subsequent layer of the neural network. In this regard, the term "parameter" may be used interchangeably with "weight" in this disclosure. Typically, each weight is 8 bits or 16 bits, but the inventive arrangement is not intended to be limited by the particular bit width of the parameters. As an illustrative example, many convolutional neural networks include millions of weights, thereby generating a large amount of intra-network data traffic.
In many implementations, most of the weights (e.g., about 90%) belong to the final feature classification layers of the neural network. As neural networks evolve to classify more target classes (or handle more complex tasks) in the future, the number of parameters in the feature classification layers may increase, making the parameter traffic for implementing the network an even greater problem for power consumption. To reduce or nearly eliminate the overhead of parameter traffic, test cases may be processed in batches. For example, in some applications, tens or hundreds of images may be processed collectively by the network.
In many cases, the neural network narrows from the input layer toward the output side of the neural network. For example, a neural network may have a set of feature extraction layers followed by fully connected feature classification layers. Most of the weights belong to the feature classification layers. The amount of storage required for the intermediate data at the top of the feature extraction layers of the neural network may be small.
FIG. 7 is a diagram showing an example of batch processing for executing a neural network. In the example of FIG. 7, an NN engine (e.g., the NN engine of FIG. 2) includes a memory system 700. The memory system 700 includes internal memory 705 located on the same chip as the particular processor, logic, computing units, etc., of the NN engine. The memory system 700 also includes external (or off-chip) memory 710. The external memory 710 may be connected to the NN engine through a memory controller (not shown). As shown, the neural network includes a plurality of feature extraction layers 740 and a plurality of feature classification layers 745.
As shown, the NN engine may process N images through the feature extraction layers 740 of the neural network, as represented by feature maps 715, 720, and 725 and final feature map 730. In this example, N is an integer value greater than 1. The NN engine saves the intermediate results 735 for each of the N images in the internal memory 705.
Proceeding to the feature classification layers 745 of the neural network, the NN engine reads a portion or subset of the weights 750 of the first fully connected feature classification layer 755 and processes the intermediate result of each of the N images through the layer 755. During the batch processing, the NN engine stores the partial results for all N images in the internal memory 705. In some cases, the NN engine may save the partial results of layer 755 for the N images in the external memory 710.
For example, if layer 755 has 16 million 16-bit parameters and the internal memory 705 provides 32 KB of storage for weights 750, the NN engine reads a subset of the weights 750 (e.g., 16K of the weights 750) into the internal memory 705, applies the subset of weights 750 to all N images, and adds the contribution of the subset of weights 750 to the intermediate results of the N images stored in the internal memory 705. The NN engine then reads the next subset of the weights 750 (e.g., the next 16K of the weights 750) into the internal memory 705, overwriting the first subset, and so on.
In this example, this process may be repeated 1,000 times to process all 16 million parameters of layer 755. In this way, the overhead of reading the 16 million weights is amortized over the N images, reducing the associated read traffic to (1/N) × (16,000,000 weights × 2 bytes/weight). If N is 16, the weight read traffic is 2 MB per image instead of 32 MB per image. The NN engine then performs the described process for weights 760 and layer 765, and then for weights 770 and layer 775. The NN engine performs the final processing for layer 780 to produce an output classification. For example, layer 780 may be implemented as a softmax layer configured to find the classification with the greatest likelihood from the fully connected layer outputs. Thus, the NN engine reads the weights of each of the feature classification layers 745 once for the entire batch of N images, rather than once for each of the N images. A batch of 16 images processed as described herein with reference to the feature classification layers 745 requires 1/16 the number of weight reads for the feature classification layers 745 compared with processing each image individually. If the feature classification layers 745 account for about 90% of the weights of the neural network, the example embodiments described in this disclosure save about 84% (15/16 × 9/10) of the weight traffic for executing the neural network.
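The batched fully connected processing and the traffic arithmetic above can be sketched as follows. Expressing the layer as a matrix product and chunking the weights by input rows are illustrative assumptions; the point being demonstrated is that each chunk of weights is read from external memory once per batch and its contribution is accumulated for all N images.

```python
import numpy as np

def fc_layer_batched(acts, weights, chunk_rows):
    """acts: (N, in_features) intermediate results for the whole batch, held on-chip.
       weights: (in_features, out_features), resident in external memory.
       Each chunk of weight rows is read once and applied to all N images before
       the next chunk is read (overwriting the previous one on-chip)."""
    n, in_f = acts.shape
    out = np.zeros((n, weights.shape[1]))
    for r0 in range(0, in_f, chunk_rows):
        w_chunk = weights[r0:r0 + chunk_rows, :]       # one external-memory read per chunk
        out += acts[:, r0:r0 + chunk_rows] @ w_chunk   # add this chunk's contribution for all N images
    return out

# Weight-read traffic for the worked example in the text (16M 16-bit weights, batch of 16):
num_weights, bytes_per_weight, batch = 16_000_000, 2, 16
total_weight_bytes = num_weights * bytes_per_weight      # 32 MB, read once for the whole batch
per_image_amortized = total_weight_bytes / batch         # ~2 MB per image once amortized
print(total_weight_bytes / 1e6, per_image_amortized / 1e6)   # 32.0, 2.0 (MB)
```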
Fig. 8 is a flow diagram illustrating an example method 800 of executing a neural network. More specifically, fig. 8 illustrates an example method 800 of executing the feature classification layers of a neural network. The method 800 may begin in a state where the NN engine has processed a plurality of images through the feature extraction layers of the neural network as described herein. The feature extraction layers may be executed using the segmentation and tiling techniques described herein. In one example, the method 800 may be performed by an NN engine as described with reference to fig. 2.
At block 805, the NN engine selects a layer of the neural network (i.e., a feature classification layer) as the current layer for processing. At block 810, the NN engine loads a set of weights for the current layer into the internal memory. The weights loaded into the internal memory may be a subset of the weights of the current layer. Furthermore, the subset of weights loaded for the current layer may be limited to a number of weights that still allows the NN engine to keep the intermediate results for each of the N images in the internal memory.
At block 815, the NN engine applies the set of weights loaded at block 810 to the intermediate result of an image in the batch of images. At block 820, the NN engine adds the contribution from the set of weights to the intermediate result for that image in the internal memory.
At block 825, the NN engine determines whether intermediate results remain for other images in the batch that have yet to be processed through the current layer. If so, the method 800 loops back to block 815 to process the intermediate result for the next image. If not (e.g., all images in the batch have been processed using the set of weights loaded at block 810), the method 800 continues to block 830.
At block 830, the NN engine determines whether additional weights of the current layer remain to be applied. If so, the method 800 loops back to block 810 to load the next set (e.g., subset) of weights for the current layer. If not, the method 800 continues to block 835. At block 835, the NN engine determines whether any other layers of the neural network (e.g., feature classification layers) remain to be executed. If so, the method 800 loops back to block 805 to select the next feature classification layer and continue processing. If not, the method 800 ends.
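The control flow of method 800 can be summarized with a minimal, runnable Python/NumPy sketch. The helper name run_feature_classification and the column-wise chunking of each layer's weight matrix are assumptions made for illustration; the patent itself only requires that each loaded subset of weights fit in internal memory alongside the intermediate results for the N images.

    import numpy as np

    # Minimal sketch of the loop structure of method 800 (blocks 805-835).
    # All names and the chunk size are illustrative assumptions.
    def run_feature_classification(fc_weights, intermediates, chunk_cols=16_000):
        # fc_weights: list of (out_dim, in_dim) weight matrices, one per layer
        # intermediates: (N, in_dim) array of per-image intermediate results
        for W in fc_weights:                                # block 805: select current layer
            out = np.zeros((intermediates.shape[0], W.shape[0]))
            for start in range(0, W.shape[1], chunk_cols):  # block 810: load a weight subset
                W_sub = W[:, start:start + chunk_cols]      # subset small enough for internal memory
                x_sub = intermediates[:, start:start + chunk_cols]
                for i in range(intermediates.shape[0]):     # blocks 815-825: every image in the batch
                    out[i] += W_sub @ x_sub[i]              # block 820: accumulate the contribution
                # block 830: loop continues while weight subsets remain
            intermediates = out                             # block 835: output feeds the next layer
        return intermediates

Chunking by input columns is only one way to form the weight subsets; chunking by output rows would follow the same loop structure.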
Fig. 9 is a diagram illustrating an example data processing system (system) 900 for determining the frustums of a neural network. For example, the system 900 may be used to implement the segmentation process described herein with reference to fig. 6. In another example embodiment, the system 900 is used to execute a neural network.
As shown, the system 900 includes at least one processor (e.g., a central processing unit (CPU)) 905 coupled to memory elements 910 through a system bus 915 or other suitable circuitry. The system 900 stores computer-readable instructions (also referred to as "program code") within the memory elements 910. The memory elements 910 may be viewed as an example of a computer-readable storage medium. The processor 905 executes the program code accessed from the memory elements 910 via the system bus 915. In one example, the processor 905 may be implemented as described with reference to fig. 2.
The memory elements 910 include one or more physical memory devices (e.g., local memory 920 and one or more mass storage devices 925). Local memory 920 refers to RAM or other non-persistent memory device(s) generally used during actual execution of the program code. The mass storage device 925 may be implemented as a hard disk drive (HDD), solid state drive (SSD), or other persistent data storage device. The system 900 may also include one or more cache memories (not shown) that provide temporary storage of at least some program code in order to reduce the number of times program code must be retrieved from the mass storage device 925 during execution.
Input/output (I/O) devices such as a keyboard 930, a display device 935, and a pointing device 940, as well as one or more network adapters 945, may be connected to the system 900. The I/O devices may be connected to the system 900 either directly or through intervening I/O controllers. In some cases, one or more of the I/O devices may be combined, as where a touch screen is used as the display device 935. In that case, the display device 935 may also implement the keyboard 930 and the pointing device 940. The network adapter 945 may be used to connect the system 900 to other systems, computer systems, remote printers, and/or remote storage devices through intervening private or public networks. Modems, cable modems, Ethernet cards, and wireless transceivers are examples of the different types of network adapter 945 that may be used with the system 900. The particular type of network adapter, or network adapters as the case may be, will vary depending upon the particular implementation of the system 900.
As shown in fig. 9, the memory element 910 may store an operating system 950 and one or more applications 955. For example, the application 955 may be a neural network utility that, when executed, segments and/or executes a neural network. For example, the application 955 may include program code that causes the processor 905 to perform one or more of the methods 500, 600, and/or 800. In this manner, the processor 905 is a special purpose processor for performing the functions defined by one or more computer programs.
In one aspect, an operating system 950 and applications 955, implemented in the form of executable program code, are executed by the system 900, and in particular by the processor 905. As such, operating system 950 and applications 955 can be viewed as integral portions of system 900. The operating system 950, applications 955, and any data items used, generated, and/or operated on by the system 900 are functional data structures that when utilized by the system 900 impart functionality.
In one aspect, system 900 can be a computer or other apparatus adapted to store and/or execute program code. System 900 may represent any of a variety of computer systems and/or devices including a processor and memory and capable of performing the operations described in this disclosure. Examples of such systems may include mobile devices, smart phones, and/or other portable computing and/or communication devices. In some cases, a particular computer system and/or apparatus may include fewer or more components than those described. System 900 may be implemented as a single system as shown or as multiple networked or interconnected systems, each having the same or similar architecture as system 900.
In one example, the system 900 may receive a neural network as an input. The system 900, in executing the operating system 950 and the applications 955, may segment the neural network and store the segmented neural network in memory or a computer-readable storage medium for subsequent execution.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. Notwithstanding, several definitions that apply throughout this document will now be presented.
As defined herein, the singular forms are intended to include the plural forms as well, unless expressly stated otherwise.
As defined herein, the term "another" means at least a second or more.
As defined herein, unless expressly stated otherwise, the terms "at least one," "one or more," and "and/or" are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions "at least one of A, B, and C," "at least one of A, B, or C," "one or more of A, B, and C," "one or more of A, B, or C," and "A, B, and/or C" means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together.
As defined herein, the term "automatically" means without user intervention.
As defined herein, the term "computer-readable storage medium" means a storage medium that contains or stores program code for use by or in connection with an instruction execution system, apparatus, or device. As defined herein, a "computer-readable storage medium" is not a transitory propagating signal per se. The computer readable storage medium may be, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. The memory elements as described herein are examples of computer-readable storage media. A non-exhaustive list of more specific examples of the computer-readable storage medium may include: portable computer diskette, hard disk, Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), Static Random Access Memory (SRAM), portable compact disc read-only memory (CD-ROM), Digital Versatile Disc (DVD), memory stick, floppy disk and the like.
As defined herein, unless otherwise indicated, the term "connected" means connected directly without any intervening elements or indirectly with one or more intervening elements. Two elements may be connected mechanically, connected electrically, or communicatively linked through a communication channel, pathway, network, or system.
As defined herein, unless the context indicates otherwise, the terms "executable operation" and "operation" mean a task performed by a data processing system or a processor within a data processing system. Examples of executable operations include, but are not limited to, "processing," "computing," "calculating," "determining," "displaying," "comparing," and the like. In this regard, operations refer to the actions and/or processes of a data processing system (e.g., a computer system or similar electronic computing device) that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and/or memories into other data similarly represented as physical quantities within the computer system's memories and/or registers or other such information storage, transmission, or display devices.
As defined herein, the terms "comprises" and/or "comprising" specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As defined herein, the term "if" means "when … …" or "at … …" or "in response to … …", depending on the context. Thus, depending on the context, the phrase "if it is determined" or "if [ stated condition or event ] is detected" may be interpreted to mean "upon determination … …" or "in response to determination … …" or "upon detection of [ stated condition or event ]" or "in response to detection of [ stated condition or event ].
As defined herein, the terms "one embodiment," "an embodiment," or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment described in the present disclosure. Thus, appearances of the phrases "in one embodiment," "in an embodiment," and similar language throughout this disclosure may, but do not necessarily, all refer to the same embodiment. The terms "embodiment" and "arrangement" are used interchangeably in this disclosure.
As defined herein, the term "output" means stored in a physical memory element (e.g., device), written to a display or other peripheral output device, sent or transmitted to another system, an outlet, or the like.
The term "plurality", as defined herein, means two or more than two.
The term "processor," as defined herein, means at least one hardware circuit configured to execute instructions contained in program code. The hardware circuit may be an integrated circuit. Examples of processors include, but are not limited to, Central Processing Units (CPUs), array processors, vector processors, Digital Signal Processors (DSPs), Field Programmable Gate Arrays (FPGAs), Programmable Logic Arrays (PLAs), Application Specific Integrated Circuits (ASICs), programmable logic circuits, Graphics Processors (GPUs), controllers, and the like.
As defined herein, the term "real-time" means the degree of processing responsiveness that a particular process or determination made is perceived by a user or system to be sufficiently timely or to enable the processor to keep up with some external process.
As defined herein, the term "in response to … …" means to respond or react quickly to an action or event. Thus, if the second action is performed "in response to the first action," the occurrence of the first action and the occurrence of the second action are causally related. The term "in response to … …" indicates the causal relationship.
As defined herein, the term "user" means a person.
The terms "first," "second," and the like may be used herein to describe various elements. These elements should not be limited by these terms, as these terms are only used to distinguish one element from another, unless otherwise stated or the context clearly indicates otherwise.
The computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for causing a processor to carry out aspects of the inventive arrangements described herein. In this disclosure, the term "program code" is used interchangeably with the term "computer-readable program instructions." The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device, via a network (e.g., the Internet, a LAN, a WAN, and/or a wireless network). The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge devices including edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
The computer-readable program instructions for carrying out operations of the inventive arrangements described herein may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine dependent instructions, microcode, firmware instructions, or source code or object code written in any combination of one or more programming languages, including an object oriented language and/or a procedural programming language. The computer-readable program instructions may specify state setting data. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a LAN or a WAN, or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some cases, an electronic circuit comprising, for example, a programmable logic circuit, FPGA, or PLA, can execute computer-readable program instructions to perform aspects of the inventive arrangements described herein by personalizing the electronic circuit with state information of the computer-readable program instructions.
Some aspects of the inventive arrangements are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions (e.g., program code).
These computer-readable program instructions may be provided to a processor of a special purpose computer or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having the instructions stored therein comprises an article of manufacture including instructions which implement aspects of the operations specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operations to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various aspects of the present inventive arrangements. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions which comprises one or more executable instructions for implementing the specified operation(s). In some alternative implementations, the operations noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
For simplicity and clarity of illustration, elements illustrated in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals have been repeated among the figures to indicate corresponding, analogous or identical features.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements that may be found in the claims are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed.
The description of the present inventive arrangements has been presented for purposes of illustration and is not intended to be exhaustive or limited to the forms and examples disclosed. The terminology used herein was chosen to explain the principles of the inventive arrangements, practical applications, or technical improvements in technology found in the marketplace, and/or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. Modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described inventive arrangements. Accordingly, reference should be made to the claims, rather than to the foregoing disclosure, as indicating the scope of such features and implementations.

Claims (12)

1. A method of performing a neural network, comprising:
generating a plurality of output tiles of a first layer by processing a plurality of input tiles to the first layer of the neural network;
storing a plurality of output tiles of a first layer in an internal memory of a processor;
generating, using a processor, a plurality of output tiles of a second layer of the neural network by processing a plurality of output tiles of the first layer stored in an internal memory,
wherein output tiles of different bands of the plurality of output tiles of the first layer are processed by different ones of a plurality of computing units in the processor,
wherein the band comprises a set of two or more adjacent tiles in at least one row or at least one column of the plurality of output tiles of the first layer,
wherein the neural network is partitioned into a plurality of frustums that are processed independently, each frustum having rectangular intersecting surfaces that define a tile with the input and/or each set of feature maps used by the neural network.
2. The method of claim 1, wherein each tile is comprised of a portion of each feature map of a set of feature maps.
3. The method of claim 1, wherein the plurality of computing units process the plurality of frustums in parallel.
4. The method of claim 1, wherein the first layer and the second layer are feature extraction layers configured to process a plurality of images to produce a plurality of output feature maps, the method further comprising:
batch processing the plurality of output feature maps of the plurality of images through a feature classification layer of a neural network.
5. The method of claim 4, wherein processing the plurality of output feature maps for the plurality of images through a feature classification layer comprises:
loading a plurality of first weights of a feature classification layer from an external memory into an internal memory of a processor;
processing each of the plurality of output feature maps using the plurality of first weights of a feature classification layer prior to loading a plurality of second weights of the feature classification layer or weights of a next feature classification layer from external memory.
6. The method of claim 5, further comprising:
loading the plurality of second weights of the feature classification layer into an internal memory in response to processing each of the plurality of output feature maps using the plurality of first weights of the feature classification layer;
wherein the plurality of second weights for the feature classification layer override the plurality of first weights for the feature classification layer.
7. An apparatus for performing a neural network, comprising:
an internal memory within the processor;
a plurality of computing units within the processor coupled to the internal memory and configured to initialize executable operations comprising: generating a plurality of output tiles of a first layer of the neural network by processing a plurality of input tiles of the first layer; storing a plurality of output tiles of a first layer in an internal memory; generating a plurality of output tiles of a second layer of the neural network by processing a plurality of output tiles of the first layer stored in an internal memory,
wherein output tiles of different bands of the plurality of output tiles of the first layer are processed by different ones of the plurality of computing units,
wherein the band comprises a set of two or more adjacent tiles in at least one row or at least one column of the plurality of output tiles of the first layer,
wherein the neural network is partitioned into a plurality of frustums that are processed independently, each frustum being a rectangular frustum having rectangular intersecting surfaces that define a tile with the input and/or each set of feature maps used by the neural network.
8. The apparatus of claim 7, wherein each tile is comprised of a portion of each feature map of a set of feature maps.
9. The apparatus of claim 7, wherein the plurality of compute units process the plurality of frustums in parallel.
10. The apparatus of claim 7, wherein the first layer and the second layer are feature extraction layers configured to process a plurality of images to produce a plurality of output feature maps, wherein the plurality of computing units are configured to initialize executable operations further comprising:
batch processing the plurality of output feature maps for the plurality of images through a feature classification layer of a neural network.
11. The apparatus of claim 10, further comprising:
an external memory coupled to the plurality of computing units;
wherein processing the plurality of output feature maps for the plurality of images by a feature classification layer comprises: loading a plurality of first weights of a feature classification layer from an external memory into an internal memory; processing each of the plurality of output feature maps using the plurality of first weights of a feature classification layer prior to loading a plurality of second weights of the feature classification layer or weights of a next feature classification layer from external memory.
12. The apparatus of claim 11, wherein the plurality of computing units are programmed to initialize executable operations further comprising:
loading the plurality of second weights of the feature classification layer into an internal memory in response to processing each of the plurality of output feature maps using the plurality of first weights of the feature classification layer;
wherein the plurality of second weights for the feature classification layer override the plurality of first weights for the feature classification layer.
CN201610370892.3A 2015-05-29 2016-05-30 Data optimized neural network traversal Active CN106203619B (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201562168489P 2015-05-29 2015-05-29
US62/168,489 2015-05-29
US15/148,627 2016-05-06
US15/148,627 US10417555B2 (en) 2015-05-29 2016-05-06 Data-optimized neural network traversal
KR1020160060895A KR20160140394A (en) 2015-05-29 2016-05-18 Method and apparatus for executing neural network
KR10-2016-0060895 2016-05-18

Publications (2)

Publication Number Publication Date
CN106203619A CN106203619A (en) 2016-12-07
CN106203619B true CN106203619B (en) 2022-09-13

Family

ID=57453299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610370892.3A Active CN106203619B (en) 2015-05-29 2016-05-30 Data optimized neural network traversal

Country Status (1)

Country Link
CN (1) CN106203619B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10373049B2 (en) * 2016-12-20 2019-08-06 Google Llc Generating an output for a neural network output layer
CN108255775A (en) * 2016-12-28 2018-07-06 上海磁宇信息科技有限公司 For the cellular array bus broadcast method of cellular array computing system
TWI607389B (en) * 2017-02-10 2017-12-01 耐能股份有限公司 Pooling operation device and method for convolutional neural network
US10795836B2 (en) * 2017-04-17 2020-10-06 Microsoft Technology Licensing, Llc Data processing performance enhancement for neural networks using a virtualized data iterator
CN109214507B (en) * 2017-06-29 2024-07-12 上海寒武纪信息科技有限公司 Computing device and method
GB2566702B (en) * 2017-09-20 2021-11-03 Imagination Tech Ltd Hardware implementation of a deep neural network with variable output data format
DE102017217733A1 (en) * 2017-10-05 2019-04-11 Conti Temic Microelectronic Gmbh Checking a neural network
CN107704923B (en) * 2017-10-19 2024-08-20 珠海格力电器股份有限公司 Convolutional neural network operation circuit
CN109754359B (en) 2017-11-01 2021-12-07 腾讯科技(深圳)有限公司 Pooling processing method and system applied to convolutional neural network
CN110651273B (en) * 2017-11-17 2023-02-14 华为技术有限公司 Data processing method and equipment
CN107798382B (en) * 2017-11-21 2020-09-01 南京地平线机器人技术有限公司 Method and apparatus for adapting feature data in convolutional neural networks
US10872291B2 (en) * 2017-12-22 2020-12-22 Alibaba Group Holding Limited On-chip communication system for neural network processors
CN108074211B (en) * 2017-12-26 2021-03-16 浙江芯昇电子技术有限公司 Image processing device and method
US11119915B2 (en) * 2018-02-08 2021-09-14 Samsung Electronics Co., Ltd. Dynamic memory mapping for neural networks
CN108446758B (en) * 2018-02-11 2021-11-30 江苏金羿智芯科技有限公司 Artificial intelligence calculation-oriented neural network data serial flow processing method
CN111886605B (en) * 2018-03-22 2024-03-22 亚马逊技术股份有限公司 Processing for multiple input data sets
US11475306B2 (en) 2018-03-22 2022-10-18 Amazon Technologies, Inc. Processing for multiple input data sets
US11461631B2 (en) 2018-03-22 2022-10-04 Amazon Technologies, Inc. Scheduling neural network computations based on memory capacity
WO2019181137A1 (en) * 2018-03-23 2019-09-26 ソニー株式会社 Information processing device and information processing method
US11775815B2 (en) * 2018-08-10 2023-10-03 Samsung Electronics Co., Ltd. System and method for deep memory network
US11526759B2 (en) 2018-11-05 2022-12-13 International Business Machines Corporation Large model support in deep learning
FR3094104A1 (en) * 2019-03-20 2020-09-25 Stmicroelectronics (Rousset) Sas Method and device for determining the overall memory size of a global memory area allocated to data from a neural network taking into account its topology
US11556798B2 (en) * 2019-06-18 2023-01-17 Qualcomm Incorporated Optimizing machine learning model performance
CN110874813B (en) * 2020-01-16 2020-05-05 湖南极点智能科技有限公司 Image processing method, device and equipment and readable storage medium
CN111291716B (en) * 2020-02-28 2024-01-05 深圳市瑞图生物技术有限公司 Sperm cell identification method, sperm cell identification device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279958A (en) * 2013-05-31 2013-09-04 电子科技大学 Image segmentation method based on Spiking neural network
CN104346607A (en) * 2014-11-06 2015-02-11 上海电机学院 Face recognition method based on convolutional neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1298581A1 (en) * 2001-09-27 2003-04-02 C.S.E.M. Centre Suisse D'electronique Et De Microtechnique Sa Method and device for calculating the values of neurons in a neural network
US20070005530A1 (en) * 2005-05-26 2007-01-04 International Business Machines Corporation Selecting grid executors via a neural network
US7747070B2 (en) * 2005-08-31 2010-06-29 Microsoft Corporation Training convolutional neural networks on graphics processing units
US9811775B2 (en) * 2012-12-24 2017-11-07 Google Inc. Parallelizing neural networks during training

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279958A (en) * 2013-05-31 2013-09-04 电子科技大学 Image segmentation method based on Spiking neural network
CN104346607A (en) * 2014-11-06 2015-02-11 上海电机学院 Face recognition method based on convolutional neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Accelerating Deep Convolutional Neural Networks Using Specialized Hardware; Kalin Ovtcharov et al.; Miscellaneous; 2015-02-22; pp. 1-3 *
ImageNet Classification with Deep Convolutional Neural Networks; Alex Krizhevsky et al.; NIPS 25; 2012-12-31; pp. 1-9 *
Kalin Ovtcharov et al.; Accelerating Deep Convolutional Neural Networks Using Specialized Hardware; Miscellaneous; 2015 *

Also Published As

Publication number Publication date
CN106203619A (en) 2016-12-07

Similar Documents

Publication Publication Date Title
CN106203619B (en) Data optimized neural network traversal
EP3098762B1 (en) Data-optimized neural network traversal
US10475228B2 (en) Allocation of tiles to processing engines in a graphics processing system
WO2017215622A1 (en) Object segmentation method and apparatus and computing device
US9747527B2 (en) Performing object detection operations via random forest classifier
CN110546611A (en) Reducing power consumption in a neural network processor by skipping processing operations
US20180174349A1 (en) Adaptive partition mechanism with arbitrary tile shape for tile based rendering gpu architecture
US11455781B2 (en) Data reading/writing method and system in 3D image processing, storage medium and terminal
EP3324367B1 (en) Identifying primitives in input index stream
Kim et al. PNNPU: A 11.9 TOPS/W high-speed 3D point cloud-based neural network processor with block-based point processing for regular DRAM access
GB2557657A (en) Mipmap rendering
JP7492555B2 (en) Processing for multiple input data sets
WO2021198809A1 (en) Feature reordering based on sparsity for improved memory compression transfers during machine learning jobs
CN114565501A (en) Data loading method and device for convolution operation
EP3671654A1 (en) Tile-based scheduling
EP3843080A1 (en) Methods and systems for storing variable length data blocks in memory
CN111062473B (en) Data calculation method, image processing method and device in neural network model
CN103403671A (en) Stream compaction for rasterization
CN111340790A (en) Bounding box determination method and device, computer equipment and storage medium
CN111914988A (en) Neural network device, computing system and method of processing feature map
Carabaño et al. Efficient implementation of a fast viewshed algorithm on SIMD architectures
US9183435B2 (en) Feature generalization using topological model
Saidi et al. Implementation of a real‐time stereo vision algorithm on a cost‐effective heterogeneous multicore platform
Hart et al. SelectionConv: convolutional neural networks for non-rectilinear image data
Goswami et al. Asynchronous Liquids: Regional Time Stepping for Faster SPH and PCISPH

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant