WO2024072472A1

WO2024072472A1 - Gradient-free efficient class activation map generation

Info

Publication number: WO2024072472A1
Application number: PCT/US2022/080818
Authority: WO
Inventors: Seok-Yong Byun; Wonju Lee
Original assignee: Intel Corporation
Priority date: 2022-09-26
Filing date: 2022-12-02
Publication date: 2024-04-04

Abstract

A class activation map (CAM) network generates a saliency map for a particular class of a multi-class classifier used for classifying images. The saliency map for a class highlights pixels in the image where the classifier focuses on when identifying objects of the class in the image. The CAM network described herein generates a feature map based on an input image using a first convolutional neural network (CNN) and applies a set of spatial masks to the feature map to generate a set of masked feature maps. A second CNN processes the set of masked feature maps to determine probabilities that different portions of the image correspond to particular classes. These probabilities are used to create the saliency map.

Description

GRADIENT-FREE EFFICIENT CLASS ACTIVATION MAP GENERATION

Cross-Reference to Related Application

[0001] This application claims the benefit of U.S. Provisional Application No. 63/377,074, filed September 26, 2022, which is incorporated by reference in its entirety.

Technical Field

[0002] This disclosure relates generally to neural networks, and more specifically, to an explainable AI solution for neural networks.

Background

[0003] Convolutional Neural Network (CNN) models have achieved popularity because of their performance in certain applications, such as computer vision. CNNs are typically black box models, which means that while inputs are known and outputs are observed, the internal workings of the CNNs are unknown. CNN models sometimes do not perform in the expected manner, e.g., for unexpected inputs such as edge cases or out-of-distribution inputs. In the face of such performance issues, because CNNs operate as black boxes, it can be difficult to determine a root cause of the performance issues.

[0004] Explainable AI (XAI) techniques have emerged to determine root causes of performance issues in AI networks, such as CNNs. The XAI techniques provide explainable reasons for performance issues, which AI developers can use to improve their models. However, existing XAI techniques have architectural limitations and/or poor computational efficiency.

Brief Description of the Drawings

[0005] Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

[0006] FIG. 1 illustrates an example class activation map (CAM) network, in accordance with various embodiments.

[0007] FIG. 2 illustrates an example CNN, in accordance with various embodiments.

[0008] FIG. 3 illustrates an example convolution, in accordance with various embodiments.

[0009] FIG. 4 illustrates an example set of spatial masks applied to a feature map, in accordance with various embodiments.

[0010] FIG. 5 illustrates an example relationship between sets of masked feature maps and saliency maps for different classes, in accordance with various embodiments.

[0011] FIGs. 6A and 6B illustrate example relationships between activations and portions of an input image, in accordance with various embodiments.

[0012] FIG. 7 illustrates probabilities determined by a multi-class classifier, in accordance with various embodiments.

[0013] FIG. 8 is a flowchart illustrating a process for generating a saliency map, in accordance with various embodiments.

[0014] FIG. 9 is a block diagram of an example deep neural network (DNN) accelerator, in accordance with various embodiments.

[0015] FIG. 10 illustrates a processing element (PE) array, in accordance with various embodiments.

[0016] FIG. 11 is a block diagram of a PE, in accordance with various embodiments.

[0017] FIG. 12 illustrates a deep learning environment, in accordance with various embodiments.

[0018] FIG. 13 is a block diagram of an example DNN system, in accordance with various embodiments.

[0019] FIG. 14 is a block diagram of an example computing device, in accordance with various embodiments.

Detailed Description

General Overview

[0020] Prior XAI approaches include the CAM method and various modified CAM methods, such as Grad-CAM and Score-CAM. In general, CAM methods provide class-specific saliency maps. A saliency map can be visualized as an image in which the brightness of a pixel represents how salient the pixel is, i.e., the brightness of a pixel is proportional to its saliency. In a computer-vision application, a saliency map for a particular class highlights the pixels on which the AI model focuses when identifying objects of the class in the image. For example, if an image includes a bird and a tree, a saliency map for the bird class highlights pixels in the image corresponding to the bird, and a saliency map for the tree class highlights pixels in the image corresponding to the tree. More specifically, for the bird example, the saliency map highlights pixels in the image that the AI model strongly associates with features used by the model to identify the bird. Analyzing saliency maps, particularly for cases in which an AI model misclassifies or mis-identifies an object in an image, can help AI designers improve computer vision models.

[0021] The CAM method is relatively simple and provides fairly good saliency map results, but the CAM method imposes architectural limitations on a target neural network which may be undesirable. In particular, the CAM method requires a global average/max pooling (GAP/GMP) layer and a single fully connected layer. The Grad-CAM method removes these architectural restrictions but imposes a new gradient-related restriction, which involves creating a trainable network for its operation. The Score-CAM method provides an alternative mechanism for removing the gradient restriction from the CAM operation using a black box-like approach, and the Score-CAM method also does not have the architectural limitations of CAM. The Score-CAM method involves generating several new input images from a specified convolution layer's activation map and the original input image to generate a saliency map; this causes Score-CAM's execution performance to depend on input image resolution, convolution channel dimension, and network capacity. As a result, Score-CAM is significantly slower than the CAM or Grad-CAM methods.

[0022] In addition to these CAM approaches, various black-box approaches have been explored. Black-box approaches can generate a saliency map without a target network's architectural information. However, black-box methods typically have much longer execution times than white box methods, and they show relatively lower accuracy.

[0023] Disclosed here is a CAM method that provides an alternative gradient-free and efficient implementation of saliency map generation. The CAM method described herein may be referred to as reciprocal CAM or Recipro-CAM. The Recipro-CAM method does not have the architectural requirements of CAM or the gradient restriction of Grad-CAM. The Recipro-CAM method also has better computational performance than Score-CAM and black-box saliency map generation methods. In the Recipro-CAM method, a convolution layer's activation map has a reciprocal relationship with the network's output value, directly and indirectly. As described in detail below, Recipro-CAM involves extracting a feature map having (C x H x W) dimensions from a specified convolution layer and applying a number of (H x W) spatial masks to the extracted feature map to generate a number N = H * W of new input feature maps from the extracted feature map. These new input feature maps are input to a next layer of the neural network. The Recipro-CAM method then measures prediction scores of a specified class for the feature location (x, y) imposed by the spatial mask of each new input feature map.

[0024] More specifically, a method for generating a saliency map includes generating a feature map based on an input image. In a CNN for computer vision, a first convolutional layer often receives raw pixel values as the input and extracts various features (e.g., color, edge, gradient orientation, etc.) from the input. Following the first convolutional layer, CNNs typically include additional convolutional layers, pooling layers, activation operations, and fully connected layers. The feature map generated by the first convolutional layer, or a first set of convolutional layers, includes one or more channels. Each of the channels includes a number N of activations, which are arranged in a matrix having a height and a width. The elements of this matrix are referred to as activations. The feature map for an input image may generally be understood as a three-dimensional tensor, in which one dimension is the channel dimension, one is the height, and one is the width. For a particular channel, each activation may correspond to a particular portion of the input image. The portions of the input image may be overlapping.

[0025] A number of spatial masks are applied to the feature map to generate masked feature maps. The number of spatial masks is equal to the number of activations N in a given channel, and each spatial mask has the same dimensions (height and width) as a particular channel of the feature map. The spatial masks are applied to the feature map by elementwise multiplication.

[0026] The masked feature maps are fed into a second portion of the CNN, e.g., at least one additional convolutional layer. The second portion of the CNN outputs further feature maps to a scoring module, which computes a vector based on the masked feature maps. If the CNN and scoring module comprise a multi-class classifier, the scoring module provides a vector for each class of the multi-class classification, e.g., a first vector for a tree class, a second vector for a bird class, etc. Each vector includes the number N values, and each value corresponds to one of the spatial masks. In other words, each value in the vector corresponds to a different portion of the input image.

[0027] This vector is used to generate a saliency map for a given class. The vector may be normalized and reshaped, e.g., from a one-dimensional vector to a two-dimensional matrix that corresponds to the input image. Each element of the saliency map indicates a likelihood of an activation of the feature map falling into the class represented by the saliency map, e.g., a likelihood that the region of the input image corresponding to the element is a member of the class.

[0028] For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the illustrative implementations. However, it will be apparent to one skilled in the art that the present disclosure may be practiced without the specific details or/and that the present disclosure may be practiced with only some of the described aspects. In other instances, well known features are omitted or simplified in order not to obscure the illustrative implementations.

[0029] Further, references are made to the accompanying drawings that form a part hereof, and in which is shown, by way of illustration, embodiments that may be practiced. It is to be understood that other embodiments may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense.

[0030] Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. Operations described may be performed in a different order from the described embodiment. Various additional operations may be performed or described operations may be omitted in additional embodiments.

[0031] For the purposes of the present disclosure, the phrase "A and/or B" means (A), (B), or (A and B). For the purposes of the present disclosure, the phrase "A, B, and/or C" means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C). The term "between," when used with reference to measurement ranges, is inclusive of the ends of the measurement ranges.

[0032] The description uses the phrases "in an embodiment" or "in embodiments," which may each refer to one or more of the same or different embodiments. The terms "comprising," "including," "having," and the like, as used with respect to embodiments of the present disclosure, are synonymous. The disclosure may use perspective-based descriptions such as "above," "below," "top," "bottom," and "side" to explain various features of the drawings, but these terms are simply for ease of discussion, and do not imply a desired or required orientation. The accompanying drawings are not necessarily drawn to scale. Unless otherwise specified, the use of the ordinal adjectives "first," "second," and "third," etc., to describe a common object, merely indicates that different instances of like objects are being referred to and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking or in any other manner.

[0033] In the following detailed description, various aspects of the illustrative implementations will be described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art.

[0034] The terms "substantially," "close," "approximately," "near," and "about," generally refer to being within +/- 20% of a target value based on the input operand of a particular value as described herein or as known in the art. Similarly, terms indicating orientation of various elements, e.g., "coplanar," "perpendicular," "orthogonal," "parallel," or any other angle between the elements, generally refer to being within +/- 5-20% of a target value based on the input operand of a particular value as described herein or as known in the art.

[0035] In addition, the terms "comprise," "comprising," "include," "including," "have," "having" or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a method, process, device, or DNN accelerator that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such method, process, device, or DNN accelerators. Also, the term "or" refers to an inclusive "or" and not to an exclusive "or."

[0036] The systems, methods and devices of this disclosure each have several innovative aspects, no single one of which is solely responsible for all desirable attributes disclosed herein. Details of one or more implementations of the subject matter described in this specification are set forth in the description below and the accompanying drawings.

Example CAM Network

[0037] FIG. 1 illustrates an example CAM network 100, in accordance with various embodiments. The CAM network 100 receives an input image 105 that includes various objects, e.g., a person, a car, and a tree. The input image 105 is input into CNN 110, which may be a first CNN or a first portion of a CNN. CNN 110 may include a sequence of layers comprising one or more convolutional layers. In some embodiments, CNN 110 may further include one or more pooling layers. Layers of CNN 110 may execute tensor computations that include tensor operations, such as convolution (e.g., multiply-accumulate (MAC) operations, etc.), pooling operations, elementwise operations (e.g., elementwise addition, elementwise multiplication, etc.), other types of tensor operations, or some combination thereof. FIG. 2 provides an example CNN and describes example components and operations of CNNs in greater detail.

[0038] In general, CNN 110 includes convolutional layers that summarize the presence of features in the input image 105. The convolutional layers function as feature extractors. For example, CNN 110 includes a first convolutional layer that performs a convolution on an input tensor, also referred to as an input feature map, and a filter. The input tensor may be the input image 105 or a tensor generated based on the input image 105. The input tensor may be a three-dimensional tensor that includes multiple channels. In an example where the input tensor is the input image 105, the channels in the input tensor may correspond to levels of red, green, and blue in pixels of the input image 105. An example of an input tensor is shown in FIG. 3 and described below. Each channel is a two-dimensional matrix having a height and a width. The filter is a second three-dimensional tensor. The filter may include multiple kernels, each of which may correspond to a different input channel of the input feature map. A kernel is a two-dimensional matrix of weights, where the weights are arranged in columns and rows. The weights may be initialized and updated by backpropagation using gradient descent. The magnitudes of the weights can indicate importance of the filter in extracting particular features from the input feature map.

[0039] CNN 110 outputs a feature map based on the input image 105. The feature map output by CNN 110 typically includes multiple channels, and may include more channels than the input tensor based on the input image 105. The number of channels in the feature map output by CNN 110 is referred to as C. Each channel of the feature map output by CNN 110 is a two-dimensional matrix having a height H and a width W. Each element of the two-dimensional matrix may be referred to as an activation. An example of a feature map output by CNN 110 is illustrated in FIG. 4.

[0040] The feature masking module 120 receives the feature map output by CNN 110. The feature masking module 120 has or creates a set of spatial masks, where the number N of spatial masks is N = H * W; the number N is also the number of activations in a given channel of the feature map output by CNN 110. Each spatial mask is a two-dimensional matrix having a height H and a width W. A single element of a spatial mask is set to 1, while the other elements of the spatial mask are set to 0. In the set of spatial masks, each spatial mask has a different element set to 1. An example set of spatial masks is illustrated in FIG. 4. The spatial masks may be created by the feature masking module 120 based on the size of the feature map (H and W). Alternatively, the feature masking module 120 may retrieve the spatial masks from a memory.
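The mask set described above can be sketched in a few lines. This is a minimal illustration rather than the disclosed implementation; it assumes NumPy, and the function name make_spatial_masks and the row-major ordering of the masks are hypothetical choices not specified in the text.

```python
import numpy as np

def make_spatial_masks(height, width):
    """Return N = height * width one-hot masks, each of shape (height, width)."""
    n = height * width
    # Row k of the N x N identity matrix is a flattened mask whose only 1 sits
    # at flat index k; reshaping restores the (height, width) layout
    # (row-major order, an assumption of this sketch).
    return np.eye(n, dtype=np.float32).reshape(n, height, width)

masks = make_spatial_masks(2, 3)
print(masks.shape)  # (6, 2, 3)
# masks[4] has its single 1 at row 1, column 1, since 4 = 1 * 3 + 1.
```

Because the masks depend only on H and W, they can be computed once and reused for every image, which is consistent with the option of retrieving them from a memory.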

[0041] The feature masking module 120 applies the spatial masks to the feature map output by CNN 110 to generate a set of masked feature maps. In particular, the feature masking module 120 performs an elementwise multiplication, also referred to as a Hadamard product, of each spatial mask and the two-dimensional matrix of each channel of the feature map. A visual example of the elementwise multiplication is illustrated in FIG. 4. The set of masked feature maps includes a set of N * C (= H * W * C) two-dimensional matrices, where each two-dimensional matrix has the height H and the width W.
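The Hadamard product of every spatial mask with every channel can be expressed as a single broadcast multiplication. The following is a sketch under stated assumptions: NumPy, a channel-first (C, H, W) layout, and the hypothetical function name apply_spatial_masks.

```python
import numpy as np

def apply_spatial_masks(feature_map, masks):
    """feature_map: (C, H, W); masks: (N, H, W). Returns (N, C, H, W)."""
    # Broadcasting pairs each of the N masks with each of the C channels and
    # multiplies them elementwise, giving N * C matrices of size H x W.
    return masks[:, None, :, :] * feature_map[None, :, :, :]

C, H, W = 2, 2, 2
feature_map = np.arange(1, C * H * W + 1, dtype=np.float32).reshape(C, H, W)
masks = np.eye(H * W, dtype=np.float32).reshape(H * W, H, W)
masked = apply_spatial_masks(feature_map, masks)
print(masked.shape)  # (4, 2, 2, 2)
```

Each masked feature map keeps the activations of all C channels at exactly one spatial location and zeros everywhere else, so summing the N masked maps recovers the original feature map.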

[0042] The masked feature maps are input to CNN 130, which may be a second CNN, or a second portion of a CNN. CNN 130 includes one or more convolutional layers, as described with respect to CNN 110 and further described with respect to FIG. 2. CNN 130 may further include one or more pooling layers. In general, pooling layers down-sample feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in the patches of the feature maps. Pooling layers may be placed between two convolutional layers, as illustrated in FIG. 2. A pooling layer receives feature maps generated by the preceding convolution layer and applies a pooling operation to the feature maps. The pooling operation reduces the size of the feature maps while preserving their important characteristics.

[0043] The scoring module 140 generates scores representing probabilities that different portions of the input image 105 correspond to different classes. The scoring module 140 may include one or more fully connected layers following CNN 130. In some embodiments, the scoring module 140 may be considered the last portion of CNN 130. The fully connected layers of the scoring module 140 may be convolutional or not. The fully connected layers receive an input that includes the values of the last feature map generated by the last convolutional layer or the last pooling layer in the sequence of CNN 130. The fully connected layers apply a linear combination and an activation function to the input operand. The fully connected layers may generate a set of vectors, where each vector represents a different class, and each element of a particular vector corresponds to different locations within the input image 105. Thus, each vector may include N elements, where N = H * W (as defined above).

[0044] In some embodiments, the vectors may be arranged as a two-dimensional matrix, where a first dimension corresponds to different activations of the feature map, i.e., different locations within the input image 105, and a second dimension corresponds to different classes. For example, moving across the first dimension, a first row corresponds to a first portion of the input image, a second row corresponds to a second portion of the input image, etc. Moving across the second dimension within a particular row, a first element in the row represents a probability that the portion of the image represented by the row belongs to a first class, a second element in the row represents a probability that the portion of the image represented by the row belongs to a second class, etc. An example of a matrix output by the scoring module 140 is illustrated in FIG. 7.

[0045] To calculate the probabilities, the scoring module 140 may multiply each input element from the CNN 130 by a weight, make a sum, and then apply an activation function (e.g., a softmax function). This is equivalent to multiplying the input operand by the matrix containing the weights. In an example, the scoring module 140 outputs three vectors: a first vector in which each element indicates the probability of each activation corresponding to a tree, a second vector in which each element indicates the probability of each activation corresponding to a car, and a third vector in which each element indicates the probability of each activation corresponding to a person.
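The weighted sum followed by a softmax can be sketched as a matrix product. This is an illustrative example only: it assumes NumPy, assumes that CNN 130 has already reduced each of the N masked feature maps to a D-element vector, and adds a bias term that the text does not mention. The names score, features, weights, and bias are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row maximum leaves the result unchanged but avoids overflow.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def score(features, weights, bias):
    """features: (N, D); weights: (D, K); bias: (K,). Returns (N, K) probabilities."""
    # One matrix product applies the weighted sum to all N inputs at once.
    return softmax(features @ weights + bias)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 5))   # N = 4 locations, D = 5 features (toy values)
weights = rng.normal(size=(5, 3))    # K = 3 classes: tree, car, person
bias = np.zeros(3)                   # bias is an assumption of this sketch

probs = score(features, weights, bias)
tree_vector = probs[:, 0]            # one probability per spatial location
```

Each row of probs sums to 1, and each column is the per-class vector of N values that is later normalized and reshaped into a saliency map.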

[0046] The scoring module 140 may further perform a normalization of the vector for a particular class. For example, if the vector for a particular class c is given by y, and the elements of the vector y have a minimum value min(y) and a maximum value max(y), the normalized vector y_norm can be calculated as follows:

y_norm = (y - min(y)) / (max(y) - min(y))

[0047] The reshaping module 150 reshapes the vectors output by the scoring module 140 to generate one or more saliency maps 160 (individually referred to as "saliency map 160"), e.g., one saliency map for each class of object identified in the input image 105. The saliency map 160 for a given class includes N elements, and each element in the saliency map 160 indicates a likelihood of an activation of the feature map falling into the class. The reshaping module 150 reshapes the normalized vector y_norm into a two-dimensional saliency map 160. The saliency map 160 is a two-dimensional matrix, where a position of an element in the matrix is determined based on a position of a corresponding activation in the feature map. For example, a first set of W elements in the normalized vector y_norm are used as values for a first row of the matrix, a second set of W elements in the normalized vector y_norm are used as values for a second row of the matrix, etc. An example is shown in FIG. 5. In some embodiments, prior to the reshaping, the reshaping module 150 (rather than the scoring module 140) performs the vector normalization described above.
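The normalization and row-wise reshaping can be sketched as follows. The example assumes NumPy and assumes min-max normalization, which the references to min(y) and max(y) suggest. The function name to_saliency_map and the handling of a constant vector (returning all zeros) are hypothetical choices made only for this illustration.

```python
import numpy as np

def to_saliency_map(y, height, width):
    """Min-max normalize an N-element class vector and reshape it to (height, width)."""
    y = np.asarray(y, dtype=np.float64)
    span = y.max() - y.min()
    if span == 0:
        # Assumption: a constant vector carries no spatial preference.
        return np.zeros((height, width))
    y_norm = (y - y.min()) / span
    # Row-major reshape: the first `width` elements form the first row,
    # the next `width` elements form the second row, and so on.
    return y_norm.reshape(height, width)

y = [0.1, 0.5, 0.3, 0.9, 0.1, 0.7]   # toy scores for N = 2 * 3 locations
saliency = to_saliency_map(y, 2, 3)
print(saliency.shape)  # (2, 3)
```

In this toy input the largest score is the fourth element, so the brightest element of the saliency map lands at row 1, column 0.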

Example CNN

[0048] FIG. 2 illustrates an example CNN 200. In general, CNN 200 is trained to receive images and output classifications of objects in the images. In the embodiments of FIG. 2, the CNN 200 receives an input image 205 that includes objects 215, 225, and 235. The input image 205 is similar to the input image 105 described above. The CNN 200 includes a sequence of layers comprising a plurality of convolutional layers 210 (individually referred to as "convolutional layer 210") and a plurality of pooling layers 220 (individually referred to as "pooling layer 220"). In some embodiments, a CNN 200 further includes one or more fully connected layers. In the example described with respect to FIG. 1, the scoring module 140 includes one or more fully connected layers.

[0049] CNN 200 may be an example of CNN 110 and/or CNN 130. As noted above, in some embodiments, CNN 110 refers to a first portion of a CNN and CNN 130 refers to a second portion of the same CNN. For example, CNN 110 may refer to a first set of the convolutional layers 210 or a first set of the convolutional layers 210 and pooling layers 220, e.g., the convolutional layer 210 that receives the input image 205 and one or more layers to the right of this layer. In this example, CNN 130 refers to a second portion of the CNN 200, e.g., a second set of the convolutional layers 210 and pooling layers 220 after (i.e., to the right of) the first set of layers. In other words, the feature masking module 120 described above may be inserted in the set of convolutional layers 210 and pooling layers 220 illustrated in FIG. 2, e.g., at position 201 (after the first two convolutional layers 210) or at position 202 (after four convolutional layers 210 and one pooling layer 220).

[0050] In various embodiments, the CNN 200 may include fewer, more, or different layers from those illustrated in FIG. 2. In an inference of the CNN 200, the layers of the CNN 200 execute tensor computation that includes many tensor operations, such as convolution (e.g., multiply-accumulate (MAC) operations, etc.), pooling operations, elementwise operations (e.g., elementwise addition, elementwise multiplication, etc.), other types of tensor operations, or some combination thereof.

[0051] The convolutional layers 210 summarize the presence of features in the input image 205. The convolutional layers 210 function as feature extractors. The first layer of the CNN 200 is a convolutional layer 210. In an example, a convolutional layer 210 performs a convolution on an input tensor 240 (also referred to as input feature map (IFM) 240) and a filter 250. As shown in FIG. 2, the IFM 240 is represented by a 7x7x3 three-dimensional (3D) matrix. The IFM 240 includes 3 input channels, each of which is represented by a 7x7 two-dimensional (2D) matrix. The 7x7 2D matrix includes 7 input elements (also referred to as input points) in each row and 7 input elements in each column. The filter 250 is represented by a 3x3x3 3D matrix. The filter 250 includes 3 kernels, each of which may correspond to a different input channel of the IFM 240. A kernel is a 2D matrix of weights, where the weights are arranged in columns and rows. A kernel can be smaller than the IFM. In the embodiments of FIG. 2, each kernel is represented by a 3x3 2D matrix. The 3x3 kernel includes 3 weights in each row and 3 weights in each column. Weights can be initialized and updated by backpropagation using gradient descent. The magnitudes of the weights can indicate importance of the filter 250 in extracting features from the IFM 240.

[0052] The convolution includes MAC operations with the input elements in the IFM 240 and the weights in the filter 250. The convolution may be a standard convolution 263 or a depthwise convolution 283. In the standard convolution 263, the whole filter 250 slides across the IFM 240. All the input channels are combined to produce an output tensor 260 (also referred to as output feature map (OFM) 260). The OFM 260 is represented by a 5x5 2D matrix. The 5x5 2D matrix includes 5 output elements (also referred to as output points) in each row and 5 output elements in each column. For purpose of illustration, the standard convolution includes one filter in the embodiments of FIG. 2. In embodiments where there are multiple filters, the standard convolution may produce multiple output channels in the OFM 260.
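The shape arithmetic of the standard convolution (a 7x7x3 IFM and a 3x3x3 filter producing a 5x5 OFM) can be checked with a naive sketch. The example assumes NumPy, a (height, width, channel) layout, a stride of 1, and no padding; these settings are assumptions chosen to match the 7-to-5 size reduction. As is customary for CNNs, the kernel is not flipped. The function name standard_conv2d is hypothetical.

```python
import numpy as np

def standard_conv2d(ifm, filt):
    """ifm: (H, W, C); filt: (F, F, C). Stride 1, no padding. Returns (H-F+1, W-F+1)."""
    h, w, _ = ifm.shape
    f = filt.shape[0]
    ofm = np.zeros((h - f + 1, w - f + 1), dtype=ifm.dtype)
    for i in range(ofm.shape[0]):
        for j in range(ofm.shape[1]):
            patch = ifm[i:i + f, j:j + f, :]
            # Multiply-accumulate over the whole patch; all input channels
            # are combined into a single output element.
            ofm[i, j] = np.sum(patch * filt)
    return ofm

ifm = np.ones((7, 7, 3))
filt = np.ones((3, 3, 3))
ofm = standard_conv2d(ifm, filt)
print(ofm.shape)  # (5, 5)
```

With all-ones inputs every output element equals 27, the number of multiply-accumulate operations per output point (3 x 3 x 3).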

[0053] The multiplication applied between a kernel-sized patch of the IFM 240 and a kernel may be a dot product. A dot product is the elementwise multiplication between the kernel-sized patch of the IFM 240 and the corresponding kernel, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as the "scalar product." Using a kernel smaller than the IFM 240 is intentional as it allows the same kernel (set of weights) to be multiplied by the IFM 240 multiple times at different points on the IFM 240. Specifically, the kernel is applied systematically to each overlapping part or kernel-sized patch of the IFM 240, left to right, top to bottom. The result from multiplying the kernel with the IFM 240 one time is a single value. As the kernel is applied multiple times to the IFM 240, the multiplication result is a 2D matrix of output elements. As such, the 2D output matrix (i.e., the OFM 260) from the standard convolution 263 is referred to as an OFM.

[0054] In the depthwise convolution 283, the input channels are not combined. Rather, MAC operations are performed on an individual input channel and an individual kernel and produce an output channel. As shown in FIG. 2, the depthwise convolution 283 produces a depthwise output tensor 280. The depthwise output tensor 280 is represented by a 5x5x3 3D matrix. The depthwise output tensor 280 includes 3 output channels, each of which is represented by a 5x5 2D matrix. The 5x5 2D matrix includes 5 output elements in each row and 5 output elements in each column. Each output channel is a result of MAC operations of an input channel of the IFM 240 and a kernel of the filter 250. For instance, the first output channel (patterned with dots) is a result of MAC operations of the first input channel (patterned with dots) and the first kernel (patterned with dots), the second output channel (patterned with horizontal stripes) is a result of MAC operations of the second input channel (patterned with horizontal stripes) and the second kernel (patterned with horizontal stripes), and the third output channel (patterned with diagonal stripes) is a result of MAC operations of the third input channel (patterned with diagonal stripes) and the third kernel (patterned with diagonal stripes). In such a depthwise convolution, the number of input channels equals the number of output channels, and each output channel corresponds to a different input channel. The input channels and output channels are referred to collectively as depthwise channels. After the depthwise convolution, a pointwise convolution 293 is then performed on the depthwise output tensor 280 and a 1x1x3 tensor 290 to produce the OFM 260.
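The depthwise convolution followed by the pointwise convolution can be sketched in the same style. The example assumes NumPy, a (height, width, channel) layout, a stride of 1, and no padding; the function names depthwise_conv2d and pointwise_conv2d are hypothetical, and the 1x1x3 tensor is represented as a 3-element weight vector.

```python
import numpy as np

def depthwise_conv2d(ifm, filt):
    """ifm: (H, W, C); filt: (F, F, C). Returns (H-F+1, W-F+1, C)."""
    h, w, c = ifm.shape
    f = filt.shape[0]
    out = np.zeros((h - f + 1, w - f + 1, c), dtype=ifm.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = ifm[i:i + f, j:j + f, :]
            # Sum over the two spatial axes only, so each input channel is
            # combined with its own kernel and the channels stay separate.
            out[i, j, :] = np.sum(patch * filt, axis=(0, 1))
    return out

def pointwise_conv2d(tensor, weights):
    """tensor: (H, W, C); weights: (C,), i.e., a 1x1xC tensor. Returns (H, W)."""
    # A 1x1 convolution is a weighted sum across channels at each position.
    return tensor @ weights

ifm = np.ones((7, 7, 3))
filt = np.ones((3, 3, 3))
depthwise = depthwise_conv2d(ifm, filt)          # shape (5, 5, 3)
ofm = pointwise_conv2d(depthwise, np.ones(3))    # shape (5, 5)
```

With all-ones inputs each depthwise output element is 9 (one 3x3 kernel), and the pointwise step sums the three channels to 27 per output element.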

[0055] The OFM 260 is then passed to the next layer in the sequence. In some embodiments, the OFM 260 is passed through an activation function. An example activation function is the rectified linear activation function (ReLU). ReLU is a calculation that returns the value provided as input directly, or the value zero if the input is zero or less. The convolutional layer 210 may receive several images as input and calculate the convolution of each of them with each of the kernels. This process can be repeated several times. For instance, the OFM 260 is passed to the subsequent convolutional layer 210 (i.e., the convolutional layer 210 following the convolutional layer 210 generating the OFM 260 in the sequence). The subsequent convolutional layer 210 performs a convolution on the OFM 260 with new kernels and generates a new feature map. The new feature map may also be normalized and resized. The new feature map can be convolved again by a further subsequent convolutional layer 210, and so on.

[0056] In some embodiments, a convolutional layer 210 has 4 hyperparameters: the number of kernels, the size F of the kernels (e.g., a kernel has dimensions of FxFxD pixels), the step S with which the window corresponding to the kernel is dragged across the image (e.g., a step of one means moving the window one pixel at a time), and the zero-padding P (e.g., adding a black contour with a thickness of P pixels to the input image of the convolutional layer 210). The convolutional layers 210 may perform various types of convolutions, such as 2-dimensional convolution, dilated or atrous convolution, spatial separable convolution, depthwise separable convolution, transposed convolution, and so on. The CNN 200 includes 16 convolutional layers 210. In other embodiments, the CNN 200 may include a different number of convolutional layers.
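
The effect of the hyperparameters F, S, and P on the output size can be illustrated with the commonly used relationship (W - F + 2P) / S + 1 for one spatial axis. This relationship is standard for convolutions but is not stated in this disclosure, and the function name below is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial axis for kernel size F, step S, padding P."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    padded = input_size + 2 * padding
    if kernel_size > padded:
        raise ValueError("kernel does not fit in the padded input")
    # Floor division drops window positions that would run past the edge.
    return (padded - kernel_size) // stride + 1


size_plain = conv_output_size(7, 3)              # 5
size_padded = conv_output_size(7, 3, padding=1)  # 7
size_strided = conv_output_size(7, 3, stride=2)  # 3
```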

[0057] The pooling layers 220 down-sample feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in the patches of the feature maps. A pooling layer 220 is placed between 2 convolutional layers 210: a preceding convolutional layer 210 (the convolutional layer 210 preceding the pooling layer 220 in the sequence of layers) and a subsequent convolutional layer 210 (the convolutional layer 210 subsequent to the pooling layer 220 in the sequence of layers). In some embodiments, a pooling layer 220 is added after a convolutional layer 210, e.g., after an activation function (e.g., ReLU) has been applied to the OFM 260.

[0058] A pooling layer 220 receives feature maps generated by the preceding convolutional layer 210 and applies a pooling operation to the feature maps. The pooling operation reduces the size of the feature maps while preserving their important characteristics. Accordingly, the pooling operation improves the efficiency of the CNN and avoids over-learning. The pooling layers 220 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map), max pooling (calculating the maximum value for each patch of the feature map), or a combination of both. The size of the pooling operation is smaller than the size of the feature maps. In various embodiments, the pooling operation is 2x2 pixels applied with a stride of 2 pixels, so that the pooling operation reduces each dimension of a feature map by a factor of 2, i.e., the number of pixels or values in the feature map is reduced to one quarter. In an example, a pooling layer 220 applied to a feature map of 6x6 results in an output pooled feature map of 3x3. The output of the pooling layer 220 is inputted into the subsequent convolutional layer 210 for further feature extraction. In some embodiments, the pooling layer 220 operates upon each feature map separately to create a new set of the same number of pooled feature maps.
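
The 6x6 to 3x3 example above may be sketched as follows for both max pooling and average pooling. The function name, the mode strings, and the sample feature map are hypothetical.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Pool a 2D feature map with a size x size window moved by `stride`."""
    summarize = max if mode == "max" else (lambda patch: sum(patch) / len(patch))
    h, w = len(feature_map), len(feature_map[0])
    return [
        [summarize([feature_map[y + i][x + j]
                    for i in range(size) for j in range(size)])
         for x in range(0, w - size + 1, stride)]
        for y in range(0, h - size + 1, stride)
    ]


# Hypothetical 6x6 feature map holding the values 0..35 in row-major order.
fm = [[r * 6 + c for c in range(6)] for r in range(6)]

pooled_max = pool2d(fm)              # 3x3, top-left value 7
pooled_avg = pool2d(fm, mode="avg")  # 3x3, top-left value 3.5
```

The 36 input values become 9 output values, i.e., one quarter of the original count.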

Example Convolution

[0059] FIG. 3 illustrates an example convolution, in accordance with various embodiments. The convolution may be a convolution in a convolutional layer of a CNN, e.g., a convolutional layer 210 in FIG. 2. The convolution can be executed on an input tensor 310 and filters 320 (individually referred to as "filter 320"). A result of the convolution is an output tensor 330.

[0060] In the embodiments of FIG. 3, the input tensor 310 includes activations (also referred to as "input activations," "elements," or "input elements") arranged in a 3D matrix. An input element is a data point in the input tensor 310. The input tensor 310 has a spatial size H_in x W_in x C_in, where H_in is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of activations in a column in the 2D matrix of each input channel), W_in is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of activations in a row in the 2D matrix of each input channel), and C_in is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of input channels). For purpose of simplicity and illustration, the input tensor 310 has a spatial size of 7x7x3, i.e., the input tensor 310 includes three input channels and each input channel has a 7x7 2D matrix. Each input element in the input tensor 310 may be represented by a (X, Y, Z) coordinate. In other embodiments, the height, width, or depth of the input tensor 310 may be different.

[0061] Each filter 320 includes weights arranged in a 3D matrix. The values of the weights may be determined through training the CNN. A filter 320 has a spatial size H_f x W_f x C_f, where H_f is the height of the filter (i.e., the length along the Y axis, which indicates the number of weights in a column in each kernel), W_f is the width of the filter (i.e., the length along the X axis, which indicates the number of weights in a row in each kernel), and C_f is the depth of the filter (i.e., the length along the Z axis, which indicates the number of channels). In some embodiments, C_f equals C_in. For purpose of simplicity and illustration, each filter 320 in FIG. 3 has a spatial size of 3x3x3, i.e., the filter 320 includes 3 convolutional kernels with a spatial size of 3x3. In other embodiments, the height, width, or depth of the filter 320 may be different. The spatial size of the convolutional kernels is smaller than the spatial size of the 2D matrix of each input channel in the input tensor 310.

[0062] An activation or weight may take one or more bytes in a memory. The number of bytes for an activation or weight may depend on the data format. For example, when the activation or weight has an INT8 format, the activation or weight takes one byte. When the activation or weight has an FP16 format, the activation or weight takes two bytes. Other data formats may be used for activations or weights.
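
As a simple worked example of the byte counts above, the following computes the storage needed for the 7x7x3 input tensor of FIG. 3 in each format. The lookup table and function name are hypothetical, and the sketch ignores any alignment or padding that a real memory layout might add.

```python
# Bytes per element for the two formats named in the text.
BYTES_PER_ELEMENT = {"INT8": 1, "FP16": 2}


def tensor_bytes(height, width, channels, data_format):
    """Raw storage for a dense tensor, with no alignment or padding."""
    return height * width * channels * BYTES_PER_ELEMENT[data_format]


int8_size = tensor_bytes(7, 7, 3, "INT8")  # 147 bytes
fp16_size = tensor_bytes(7, 7, 3, "FP16")  # 294 bytes
```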

[0063] In the convolution, each filter 320 slides across the input tensor 310 and generates a 2D matrix for an output channel in the output tensor 330. In the embodiments of FIG. 3, the 2D matrix has a spatial size of 5x5. The output tensor 330 includes activations (also referred to as "output activations," "elements," or "output elements") arranged in a 3D matrix. An output activation is a data point in the output tensor 330. The output tensor 330 has a spatial size H_out x W_out x C_out, where H_out is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of output activations in a column in the 2D matrix of each output channel), W_out is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of output activations in a row in the 2D matrix of each output channel), and C_out is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of output channels). C_out may equal the number of filters 320 in the convolution. H_out and W_out may depend on the heights and widths of the input tensor 310 and each filter 320.

[0064] As a part of the convolution, MAC operations can be performed on a 3x3x3 input operand 315 (which is highlighted with a dotted pattern in FIG. 3) in the input tensor 310 and each filter 320. The result of the MAC operations on the input operand 315 and one filter 320 is an output activation. In some embodiments (e.g., embodiments where the convolution is an integral convolution), an output activation may include 8 bits, e.g., one byte. In other embodiments (e.g., embodiments where the convolution is a floating-point convolution), an output activation may include more than one byte. For instance, an output element may include two bytes.

[0065] After the MAC operations on the input operand 315 and all the filters 320 are finished, a vector 335 is produced. The vector 335 is highlighted with slashes in FIG. 3. The vector 335 includes a sequence of output activations, which are arranged along the Z axis. The output activations in the vector 335 have the same (X, Y) coordinate, but the output activations correspond to different output channels and have different Z coordinates. The dimension of the vector 335 along the Z axis may equal the total number of output channels in the output tensor 330. After the vector 335 is produced, further MAC operations are performed to produce additional vectors until the output tensor 330 is produced.
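
The convolution of FIG. 3 may be sketched as follows for a 7x7x3 input tensor and 3x3x3 filters. The sketch assumes a stride of one and no padding, which produces the 5x5 output channels noted above; the use of two filters, the channel-first list layout, and all numeric values are hypothetical.

```python
def conv(input_tensor, filters):
    """Standard convolution with stride 1 and no padding.

    input_tensor[c][y][x]: C_in channels of H_in x W_in activations.
    filters[f][c][i][j]: each filter holds C_in kernels of H_f x W_f weights.
    Returns output[f][y][x], one output channel per filter.
    """
    c_in = len(input_tensor)
    h_in, w_in = len(input_tensor[0]), len(input_tensor[0][0])
    h_f, w_f = len(filters[0][0]), len(filters[0][0][0])
    h_out, w_out = h_in - h_f + 1, w_in - w_f + 1
    return [
        [[sum(input_tensor[c][y + i][x + j] * f[c][i][j]
              for c in range(c_in) for i in range(h_f) for j in range(w_f))
          for x in range(w_out)]
         for y in range(h_out)]
        for f in filters
    ]


# Hypothetical data: an all-ones input and two filters (all ones, all twos).
input_tensor = [[[1] * 7 for _ in range(7)] for _ in range(3)]
filters = [[[[k] * 3 for _ in range(3)] for _ in range(3)] for k in (1, 2)]

output = conv(input_tensor, filters)  # 2 output channels, each 5x5

# Output activations sharing one (X, Y) position, one per output channel,
# analogous to the vector 335.
vector_at_origin = [channel[0][0] for channel in output]  # [27, 54]
```

Each output activation is the sum of 27 products (3x3x3), so the first filter yields 27 and the second yields 54 at every position.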

Example Masked Feature Map Generation

[0066] FIG. 4 illustrates an example set of spatial masks applied to a feature map to produce masked feature maps, in accordance with various embodiments. FIG. 4 illustrates an example feature map 410, which may be output by CNN 110. The feature map 410 is a tensor, similar to the input tensor 310, that includes activations arranged in a 3D matrix. The feature map 410 has a size H x W x C, where H is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of activations in a column in the 2D matrix of each input channel), W is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of activations in a row in the 2D matrix of each input channel), and C is the depth of the 3D matrix (i.e., the length along the Z axis, which also refers to the number of channels). In this example, for purpose of simplicity and illustration, the feature map 410 has a spatial size of 7x7x4, i.e., the feature map 410 includes four channels and each channel has a 7x7 2D matrix.

[0067] FIG. 4 illustrates an example set of spatial masks 420. Each spatial mask is a 2D matrix that has a spatial size of H x W, where H and W are the same as H and W in the feature map 410. In this example, H and W of the spatial masks 420 are both 7. The number N of spatial masks is equal to H * W. Each spatial mask 420 has a single element of the 2D matrix set to 1; this element is highlighted in white in the visualization of FIG. 4. The other elements of the 2D matrix are set to 0; these elements are illustrated in gray in the visualization of FIG. 4. As noted above, the feature masking module 120 may create the spatial masks 420 by creating a set of N matrices, each having a size H x W, and each of the N matrices having a different element set to 1. For example, the spatial mask labelled 420a has the upper-left element (e.g., the element at position (1, 1) in the example x-y coordinate system) set to 1, and the other elements set to 0. The spatial mask labelled 420b has the element at position (2, 1) set to 1. The spatial mask labelled 420N has the element at position (7,7) set to 1. The feature masking module 120 may generate the spatial masks 420, e.g., based on the size (H and W) of the feature map 410 received from CNN 110.

[0068] The feature masking module 120 performs an element-wise multiplication (also referred to as a Hadamard product) of the feature map 410 and each of the spatial masks 420 to generate a set of masked feature maps 430. In an element-wise multiplication, two matrices having the same dimensions are used to produce another matrix of the same dimensions, where each element (i,j) is the product of the elements (i,j) of the original two matrices. For example, the element-wise multiplication of two two-by-two matrices is illustrated below:

$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \circ \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} \\ a_{21} b_{21} & a_{22} b_{22} \end{bmatrix}$$

[0069] In the example shown in FIG. 4, element-wise multiplication of the feature map 410 and the spatial mask 420a results in a first subset of masked feature maps 430a. Similarly, element-wise multiplication of the feature map 410 and the spatial mask 420b results in a second subset of masked feature maps 430b, and element-wise multiplication of the feature map 410 and the spatial mask 420N results in an Nth subset of masked feature maps 430N. More specifically, to generate each subset of masked feature maps, the two-dimensional matrix for each channel of the feature map 410 is element-wise multiplied by one of the spatial masks 420. Each subset of the masked feature maps (e.g., the subset 430a, the subset 430b, etc.) includes C two-dimensional matrices. The masked feature maps 430 include N * C two-dimensional matrices.

[0070] As visually illustrated in FIG. 4, because each spatial mask 420 has a value of 1 for only one element, each of the two-dimensional matrices in a set of masked feature maps 430 includes data for only one element of the H x W matrices of the feature map 410, while the other elements are set to zero. For example, the first subset of masked feature maps 430a includes C two-dimensional matrices, each of which has data at the position (1,1) (indicated by the white square). The other elements of the matrices are set to 0 (indicated by the gray shading) by the element-wise multiplication.
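
One possible way to build the N = H * W spatial masks and apply them to a 7x7x4 feature map is sketched below. The function names and the feature map values are hypothetical, the masks are generated in row-major order as an assumption, and indices are zero-based (so index [0][0] corresponds to the position (1,1) in the text).

```python
def make_spatial_masks(h, w):
    """Create h * w masks of size h x w, each with a different single element set to 1."""
    masks = []
    for row in range(h):
        for col in range(w):
            mask = [[0] * w for _ in range(h)]
            mask[row][col] = 1
            masks.append(mask)
    return masks


def apply_mask(feature_map, mask):
    """Element-wise (Hadamard) product of every channel with one mask.

    feature_map[c][y][x]: C channels of H x W activations.
    """
    return [
        [[v * m for v, m in zip(f_row, m_row)]
         for f_row, m_row in zip(channel, mask)]
        for channel in feature_map
    ]


H, W, C = 7, 7, 4
# Hypothetical feature map with distinct nonzero activations.
feature_map = [[[c * 100 + y * W + x + 1 for x in range(W)]
                for y in range(H)]
               for c in range(C)]

masks = make_spatial_masks(H, W)                         # N = 49 masks
masked = [apply_mask(feature_map, m) for m in masks]     # 49 subsets of C matrices
total_matrices = sum(len(subset) for subset in masked)   # N * C = 196
```

Every two-dimensional matrix in `masked` keeps exactly one activation of the original channel and is zero elsewhere.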

Example Saliency Map Generation

[0071] FIG. 5 illustrates an example relationship between sets of masked feature maps and saliency maps for different classes, in accordance with various embodiments. As described with respect to FIG. 1, CNN 130 receives a set of masked feature maps (e.g., the masked feature maps 430), extracts features from the set of masked feature maps 430, and outputs a further feature map based on the extracted features. The scoring module 140 generates scores based on the feature map from CNN 130, the scores representing probabilities that different portions of the input image correspond to different classes. In some embodiments, the scoring module 140 outputs the scores as a set of vectors, where each vector represents a different class, and each element of a particular vector corresponds to a different location within the input image 105. Thus, each vector may include N elements, where N = H * W.

[0072] While the scoring module 140 outputs a vector for each class, the elements of the vector correspond to different portions of the input image 105, and to different portions of the saliency map 160. FIG. 5 illustrates the relationship between sets of masked feature maps 430 and two example saliency maps 510a and 510b. The first saliency map 510a is a saliency map for a first class, e.g., a tree object, and the second saliency map 510b is a second saliency map for a second class, e.g., a person object. As illustrated, the first subset of masked feature maps 430a are used to calculate a score for the (1,1) element of the two saliency maps 510a and 510b, the second subset of masked feature maps 430b are used to calculate a score for the (1,2) element of the two saliency maps 510a and 510b, etc. As described with respect to FIG. 1, the reshaping module 150 may reshape the vectors from the scoring module 140 to generate the saliency maps 510a and 510b.
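
The reshaping of a per-class vector of N scores into an H x W saliency map may be sketched as follows. The function name and the score values are hypothetical, a small 2x3 map is used for brevity, and row-major ordering of the vector elements is an assumption.

```python
def reshape_to_saliency_map(scores, h, w):
    """Arrange a flat vector of h * w scores into an h x w matrix (row-major)."""
    if len(scores) != h * w:
        raise ValueError("expected exactly h * w scores")
    return [scores[r * w:(r + 1) * w] for r in range(h)]


# Hypothetical per-class score vectors for a 2x3 grid of image portions.
tree_scores = [0.1, 0.8, 0.3, 0.2, 0.9, 0.4]
person_scores = [0.7, 0.1, 0.2, 0.6, 0.0, 0.5]

tree_map = reshape_to_saliency_map(tree_scores, 2, 3)
person_map = reshape_to_saliency_map(person_scores, 2, 3)
```

The k-th element of each vector, which is derived from the k-th subset of masked feature maps, lands at the same position in every class-specific map.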

Example Sliding Window Activations

[0073] FIGs. 6A and 6B illustrate example relationships between spatial masks and portions of an input image, in accordance with various embodiments. FIG. 6A illustrates the first spatial mask 420a, which has an element at the (1,1) position set to 1, and the other elements set to 0. As illustrated in FIG. 4, the spatial mask 420a is used to select portions of the matrices of the feature map 410, in particular, the data at the position (1,1) of the 2D matrices of the feature map 410. The data in this position corresponds to a portion 610a of the input image 105. In particular, the data at the position (1,1) of the matrices of the feature map 410 describes features within this portion 610a of the input image 105.

[0074] FIG. 6B illustrates the second spatial mask 420b, which has an element at the (1,2) position set to 1, and the other elements set to 0. As illustrated in FIG. 4, the second spatial mask 420b is used to select data at the position (1,2) of the 2D matrices of the feature map 410. The data in this position corresponds to a portion 610b of the input image 105. In particular, the data at the position (1,2) of the matrices of the feature map 410 describes features within this portion 610b of the input image 105. The two portions 610a and 610b partially overlap. For example, the portion 610a includes the car in the input image 105, and the portion 610b includes most of the car and a portion of the tree. The portions 610 may be sliding windows moving across and down the input image 105. In some embodiments, each portion 610 represented by a spatial mask 420 represents a same geometric area of the input image 105, e.g., the sizes of 610a and 610b are equal. In other embodiments, different portions 610 may represent differently sized areas of the input image 105.

Example Probabilities

[0075] FIG. 7 illustrates probabilities determined by a multi-class classifier, e.g., the CAM network 100, in accordance with various embodiments. For purpose of illustration, FIG. 7 shows example detection of objects in an input 710 including three portions 715A-715C (collectively referred to as "portions 715"). For example, each portion 715 may be a portion 610 of the input image 105, as illustrated in FIGs. 6A and 6B. It should be understood that the input 710 may have more than 3 portions 715, e.g., 49 (=7 * 7) portions as illustrated in FIGs. 4 and 5.

[0076] In another example where the input 710 is an image, the three portions 715 may be different pixels in the input 710. Each image 710 may include one or more objects. The multi-class classification model may classify an object in an image 710 into one or more classes, and more specifically, classify an object in any of the portions 715 into one or more classes. In the embodiments of FIG. 7, the multi-class classification model processes three classes: tree, car, and person.

[0077] An output 720 includes three vectors 725A, 725B, and 725C (collectively referred to as "vectors 725" or individually as "vector 725"). The vector 725A is generated from the portion 715A, e.g., by hidden layers in the multi-class classifier. The vector 725B is generated from the portion 715B, e.g., by hidden layers in the multi-class classifier. The vector 725C is generated from the portion 715C, e.g., by hidden layers in the multi-class classifier. Each vector 725 may be a row of the two-dimensional matrix describing activations of the feature map output by the scoring module 140, as described with respect to FIG. 1. Each vector 725 includes three elements corresponding to the three classes, respectively.

[0078] As described with respect to FIG. 1, the scoring module 140 may perform an activation function, such as a softmax function, on the output 720. An example softmax output 730 includes a matrix generated from the output 720 by using a softmax function. The softmax output 730 includes three vectors 735A, 735B, and 735C (collectively referred to as "vectors 735" or individually as "vector 735"). The vector 735A is generated from the vector 725A. The vector 735B is generated from the vector 725B. The vector 735C is generated from the vector 725C. Each vector 735 includes three elements corresponding to the three classes, respectively. Each element indicates a probability of an object in the corresponding portion 715 falling into the corresponding class. The probability is a confidence score of the class of the object. In some embodiments (e.g., embodiments before or without confidence calibration), the class with the highest probability may be determined as the class of the portion 715, and the highest probability may be determined as the confidence score of the multi-class classifier. Taking the portion 715A for example, the multi-class classifier may determine that the class of the portion 715A is tree, with a confidence score of 0.81. For the portion 715B, the multi-class classifier may determine that the class is person, with a confidence score of 0.96. For the portion 715C, the multi-class classifier may determine that the class is car, with a confidence score of 0.64.
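
A minimal sketch of a softmax function followed by selection of the highest-probability class is shown below. The input values are hypothetical and are not the numbers shown in FIG. 7; the function names and the class ordering are likewise assumptions.

```python
import math

CLASSES = ["tree", "car", "person"]


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def classify(values):
    """Return the most probable class and its probability (confidence score)."""
    probs = softmax(values)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


# Hypothetical raw scores for one portion of the input.
raw_scores = [2.0, 0.5, 0.1]
probabilities = softmax(raw_scores)
label, confidence = classify(raw_scores)  # label is "tree"
```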

[0079] The numbers shown in FIG. 7 are for illustration only. The input 710, output 720, and softmax output 730 of the multi-class classifier may include different numbers. Also, the number of portions in the input 710 or the number of classes processed by the multi-class classifier may be different in other embodiments. Also, the multi-class classifier may process other types of data than images.

Example Process for Generating a Saliency Map

[0080] FIG. 8 is a flowchart illustrating a process for generating a saliency map, in accordance with various embodiments. The CAM network 100 (e.g., CNN 110) generates 810 a feature map based on an input image, e.g., the input image 105. The feature map may include multiple channels, and each channel has N activations arranged in a matrix, where N is equal to the matrix width W times the matrix height H.

[0081] The CAM network 100 (e.g., the feature masking module 120) applies 820 spatial masks (e.g., the spatial masks 420) to the feature map (e.g., the feature map 410) to generate a set of masked feature maps (e.g., the masked feature maps 430). Each spatial mask may correspond to a different one of the N activations of the feature map, e.g., as shown in FIG. 4, each spatial mask has a 1 at a different element of a H by W matrix. Each of the set of masked feature maps includes the same number of channels as the feature map generated at 810.

[0082] The CAM network 100 (e.g., the scoring module 140) generates a probability vector based on the masked feature maps. The CAM network 100 may generate multiple class-specific probability vectors, e.g., a first probability vector for a first class, and a second probability vector for a second class. Each vector includes N values, and each value of a given vector corresponds to a different one of the spatial masks, i.e., a different activation.

[0083] The CAM network 100 (e.g., the reshaping module 150) generates a saliency map based on the probability vector. The saliency map includes N elements arranged in a two-dimensional matrix. The CAM network 100 may produce multiple saliency maps, where a given saliency map applies to a particular class. Each element of a given saliency map indicates a likelihood of a different activation of the feature map falling into the class described by the saliency map.
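
The overall flow of masking, scoring, and reshaping may be sketched end to end as follows. This is an illustration under strong assumptions: the feature map is supplied directly rather than produced by CNN 110, and `toy_head` is a hypothetical stand-in for CNN 130 and the scoring module 140 (a sum over each channel, a linear layer, and a softmax), not the networks described in this disclosure.

```python
import math


def softmax(values):
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def toy_head(masked_map, class_weights):
    """Hypothetical stand-in for CNN 130 plus scoring: pool, linear layer, softmax."""
    pooled = [sum(sum(row) for row in channel) for channel in masked_map]
    logits = [sum(w * p for w, p in zip(weights, pooled))
              for weights in class_weights]
    return softmax(logits)


def saliency_maps(feature_map, class_weights):
    """Return one h x w saliency map per class, with no gradient computation."""
    h, w = len(feature_map[0]), len(feature_map[0][0])
    vectors = [[] for _ in class_weights]  # one probability vector per class
    for row in range(h):
        for col in range(w):
            # Masked feature map: keep only the activation at (row, col).
            masked = [[[channel[y][x] if (y, x) == (row, col) else 0
                        for x in range(w)] for y in range(h)]
                      for channel in feature_map]
            for k, p in enumerate(toy_head(masked, class_weights)):
                vectors[k].append(p)
    # Reshape each N-element vector into an h x w matrix (row-major).
    return [[vec[r * w:(r + 1) * w] for r in range(h)] for vec in vectors]


# Hypothetical 2x2x2 feature map and weights for two classes.
feature_map = [[[1, 0], [0, 0]],
               [[0, 0], [0, 1]]]
class_weights = [[1.0, 0.0], [0.0, 1.0]]
maps = saliency_maps(feature_map, class_weights)
```

In this toy setup, the first class scores highest at the top-left position and the second class scores highest at the bottom-right position, while positions with no activation score 0.5 for both classes.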

Example DNN Accelerator

[0084] FIG. 9 is a block diagram of an example DNN accelerator 1800, in accordance with various embodiments. The DNN accelerator 1800 can run DNNs, e.g., the CNNs 110 and 130 in FIG. 1, and the CNN 200 in FIG. 2. The DNN accelerator 1800 includes a memory 1810, a DMA (direct memory access) engine 1820, and compute blocks 1830. In other embodiments, alternative configurations, different or additional components may be included in the DNN accelerator 1800. For instance, the DNN accelerator 1800 may include more than one memory 1810 or more than one DMA engine 1820. Further, functionality attributed to a component of the DNN accelerator 1800 may be accomplished by a different component included in the DNN accelerator 1800 or by a different system.

[0085] The memory 1810 stores data to be used by the compute blocks 1830 to perform deep learning operations in DNN models. Example deep learning operations include convolutions (also referred to as "convolutional operations"), pooling operations, elementwise operations, other types of deep learning operations, or some combination thereof. The memory 1810 may be a main memory of the DNN accelerator 1800. In some embodiments, the memory 1810 includes one or more DRAMs (dynamic random-access memory). For instance, the memory 1810 may store the input tensor, convolutional kernels, or output tensor of a convolution in a convolutional layer of a DNN, e.g., the convolutional layer 180. The output tensor can be transmitted from a local memory of a compute block 1830 to the memory 1810 through the DMA engine 1820.

[0086] The DMA engine 1820 facilitates data transfer between the memory 1810 and local memories of the compute blocks 1830. For example, the DMA engine 1820 can read data from the memory 1810 and write data into a local memory of a compute block 1830. As another example, the DMA engine 1820 can read data from a local memory of a compute block 1830 and write data into the memory 1810. The DMA engine 1820 provides a DMA feature that allows the compute block 1830 to initiate data transfer between the memory 1810 and the local memories of the compute blocks 1830 and to perform other operations while the data transfer is being conducted. In some embodiments, the DMA engine 1820 may read tensors from the memory 1810 and modify the tensors in a way that is optimized for the compute block 1830 before it writes the tensors into the local memories of the compute blocks 1830.

[0087] The compute blocks 1830 perform computation for deep learning operations. A compute block 1830 may run the operations in a DNN layer, or a portion of the operations in the DNN layer. A compute block 1830 may perform convolutions, e.g., standard convolution or depthwise convolution. In some embodiments, the compute block 1830 receives an input tensor and one or more convolutional kernels and performs a convolution with the input tensor and the convolutional kernels. The result of the convolution may be an output tensor, which can be further computed, e.g., by the compute block 1830 or another compute block. In some embodiments, the operations of the DNN layers may be run by multiple compute blocks 1830 in parallel. For instance, multiple compute blocks 1830 may each perform a portion of a workload for a convolution. Data may be shared between the compute blocks 1830.

Example PE Array

[0088] FIG. 10 illustrates a PE array 1900, in accordance with various embodiments. The PE array 1900 may be an element of a compute block 1830. The PE array 1900 includes a plurality of PEs 1910 (individually referred to as "PE 1910"). The PEs 1910 perform MAC operations. The PEs 1910 may also be referred to as neurons in the DNN. Each PE 1910 has two input signals 1950 and 1960 and an output signal 1970. The input signal 1950 is at least a portion of an IFM to the layer. The input signal 1960 is at least a portion of a filter of the layer. In some embodiments, the input signal 1950 of a PE 1910 includes one or more input operands, and the input signal 1960 includes one or more weight operands.

[0089] Each PE 1910 performs a MAC operation on the input signals 1950 and 1960 and outputs the output signal 1970, which is a result of the MAC operation. Some or all of the input signals 1950 and 1960 and the output signal 1970 may be in an integer format, such as INT8, or floating-point format, such as FP16 or BF16. For purpose of simplicity and illustration, the input signals and output signal of all the PEs 1910 have the same reference numbers, but the PEs 1910 may receive different input signals and output different output signals from each other. Also, a PE 1910 may be different from another PE 1910, e.g., including more, fewer, or different components.

[0090] As shown in FIG. 10, the PEs 1910 are connected to each other, as indicated by the dashed arrows in FIG. 10. The output signal 1970 of a PE 1910 may be sent to many other PEs 1910 (and possibly back to itself) as input signals via the interconnections between PEs 1910. In some embodiments, the output signal 1970 of a PE 1910 may incorporate the output signals of one or more other PEs 1910 through an accumulate operation of the PE 1910 to generate an internal partial sum of the PE array.

[0091] In the embodiments of FIG. 10, the PEs 1910 are arranged into columns 1905 (individually referred to as "column 1905"). The input and weights of the layer may be distributed to the PEs 1910 based on the columns 1905. Each column 1905 has a column buffer 1920. The column buffer 1920 stores data provided to the PEs 1910 in the column 1905 for a short amount of time. The column buffer 1920 may also store data output by the last PE 1910 in the column 1905. The output of the last PE 1910 may be a sum of the MAC operations of all the PEs 1910 in the column 1905, which is a column-level internal partial sum of the PE array 1900. In other embodiments, input and weights may be distributed to the PEs 1910 based on rows in the PE array 1900. The PE array 1900 may include row buffers in lieu of column buffers 1920. A row buffer may store input signals of the PEs in the corresponding row and may also store a row-level internal partial sum of the PE array 1900.
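
The column-level internal partial sum may be sketched as follows. The sketch assumes that each PE adds its own MAC result to the partial sum passed from the preceding PE in the column, which is one possible reading of the accumulation described above; the function names and operand values are hypothetical.

```python
def pe_mac(input_operand, weight_operand, partial_sum_in=0):
    """One PE: multiply matching elements, accumulate, and add the incoming partial sum."""
    return partial_sum_in + sum(a * w for a, w in zip(input_operand, weight_operand))


def column_partial_sum(input_operands, weight_operands):
    """Chain the PEs of one column; the last PE outputs the column-level partial sum."""
    partial = 0
    for input_operand, weight_operand in zip(input_operands, weight_operands):
        partial = pe_mac(input_operand, weight_operand, partial)
    return partial


# Hypothetical operands for a column of two PEs.
input_operands = [[1, 2, 3], [4, 5, 6]]
weight_operands = [[1, 0, 1], [0, 1, 0]]
column_sum = column_partial_sum(input_operands, weight_operands)  # 4 + 5 = 9
```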

[0092] As shown in FIG. 10, each column buffer 1920 is associated with a load 1930 and a drain 1940. The data provided to the column 1905 is transmitted to the column buffer 1920 through the load 1930, e.g., from upper memory hierarchies, e.g., the memory 1810 in FIG. 9. The data generated by the column 1905 is extracted from the column buffers 1920 through the drain 1940. In some embodiments, data extracted from a column buffer 1920 is sent to upper memory hierarchies, e.g., the memory 1810 in FIG. 9, through the drain operation. In some embodiments, the drain operation does not start until all the PEs 1910 in the column 1905 have finished their MAC operations. In some embodiments, the load 1930 or drain 1940 may be controlled by the controlling module 340. Even though not shown in FIG. 10, one or more columns 1905 may be associated with an external adder assembly.

[0093] FIG. 11 is a block diagram of a PE 2000, in accordance with various embodiments. The PE 2000 may be an embodiment of the PE 1910 in FIG. 10. The PE 2000 includes input register files 2010 (individually referred to as "input register file 2010"), weight register files 2020 (individually referred to as "weight register file 2020"), multipliers 2030 (individually referred to as "multiplier 2030"), an internal adder assembly 2040, and an output register file 2050. In other embodiments, the PE 2000 may include fewer, more, or different components. For instance, the PE 2000 may include multiple output register files 2050.

[0094] The input register files 2010 temporarily store input operands for MAC operations by the PE 2000. In some embodiments, an input register file 2010 may store a single input operand at a time. In other embodiments, an input register file 2010 may store multiple input operands or a portion of an input operand at a time. An input operand includes a plurality of input elements in an IFM. The input elements of an input operand may be stored sequentially in the input register file 2010 so the input elements can be processed sequentially. In some embodiments, each input element in the input operand may be from a different input channel of the IFM. The input operand may include an input element from each of the input channels of the IFM, and the number of input elements in an input operand may equal the number of the input channels. The input elements in an input operand may have the same XY coordinates, which may be used as the XY coordinates of the input operand. For instance, all the input elements of an input operand may be X0Y0, X0Y1, X1Y1, etc.

[0095] The weight register file 2020 temporarily stores weight operands for MAC operations by the PE 2000. The weight operands include weights in the filters of the DNN layer. In some embodiments, the weight register file 2020 may store a single weight operand at a time. In other embodiments, a weight register file 2020 may store multiple weight operands or a portion of a weight operand at a time. A weight operand may include a plurality of weights. The weights of a weight operand may be stored sequentially in the weight register file 2020 so the weights can be processed sequentially. In some embodiments, for a multiplication operation that involves a weight operand and an input operand, each weight in the weight operand may correspond to an input element of the input operand. The number of weights in the weight operand may equal the number of the input elements in the input operand.

[0096] In some embodiments, a weight register file 2020 may be the same as or similar to an input register file 2010, e.g., having the same size, etc. The PE 2000 may include a plurality of register files, some of which are designated as the input register files 2010 for storing input operands, some of which are designated as the weight register files 2020 for storing weight operands, and some of which are designated as the output register file 2050 for storing output operands. In other embodiments, register files in the PE 2000 may be designated for other purposes, e.g., for storing scale operands used in elementwise add operations, etc.

[0097] The multipliers 2030 perform multiplication operations on input operands and weight operands. A multiplier 2030 may perform a sequence of multiplication operations on a single input operand and a single weight operand and generate a product operand including a sequence of products. Each multiplication operation in the sequence includes multiplying an input element in the input operand and a weight in the weight operand. In some embodiments, a position (or index) of the input element in the input operand matches the position (or index) of the weight in the weight operand. For instance, the first multiplication operation is a multiplication of the first input element in the input operand and the first weight in the weight operand, the second multiplication operation is a multiplication of the second input element in the input operand and the second weight in the weight operand, the third multiplication operation is a multiplication of the third input element in the input operand and the third weight in the weight operand, and so on. The input element and weight in the same multiplication operation may correspond to the same depthwise channel, and their product may also correspond to the same depthwise channel.

[0098] Multiple multipliers 2030 may perform multiplication operations simultaneously. These multiplication operations may be referred to as a round of multiplication operations. In a round of multiplication operations by the multipliers 2030, each of the multipliers 2030 may use a different input operand and a different weight operand. The different input operands or weight operands may be stored in different register files of the PE 2000. For instance, a first multiplier 2030 uses a first input operand (e.g., stored in a first input register file 2010) and a first weight operand (e.g., stored in a first weight register file 2020), a second multiplier 2030 uses a second input operand (e.g., stored in a second input register file 2010) and a second weight operand (e.g., stored in a second weight register file 2020), a third multiplier 2030 uses a third input operand (e.g., stored in a third input register file 2010) and a third weight operand (e.g., stored in a third weight register file 2020), and so on. For an individual multiplier 2030, the round of multiplication operations may include a plurality of cycles. A cycle includes a multiplication operation on an input element and a weight.

[0099] The multipliers 2030 may perform multiple rounds of multiplication operations. A multiplier 2030 may use the same weight operand but different input operands in different rounds. For instance, the multiplier 2030 performs a sequence of multiplication operations on a first input operand stored in a first input register file in a first round, and on a second input operand stored in a second input register file in a second round. In the second round, a different multiplier 2030 may use the first input operand and a different weight operand to perform another sequence of multiplication operations. That way, the first input operand is reused in the second round. The first input operand may be further reused in additional rounds, e.g., by additional multipliers 2030.

[0100] The internal adder assembly 2040 includes adders inside the PE 2000, i.e., internal adders. The internal adder assembly 2040 may perform accumulation operations on two or more product operands from multipliers 2030, and produce an output operand of the PE 2000. In some embodiments, the internal adders are arranged in a sequence of tiers. A tier includes one or more internal adders. For the first tier of the internal adder assembly 2040, an internal adder may receive product operands from two or more multipliers 2030 and generate a sum operand through a sequence of accumulation operations. Each accumulation operation produces a sum of two or more products, each of which is from a different multiplier 2030. The sum operand includes a sequence of sums, each of which is a result of an accumulation operation and corresponds to a depthwise channel. For the other tier(s) of the internal adder assembly 2040, an internal adder in a tier receives sum operands from the preceding tier in the sequence. Each of these sum operands may be generated by a different internal adder in the preceding tier. A ratio of the number of internal adders in a tier to the number of internal adders in a subsequent tier may be 2:1. In some embodiments, the last tier of the internal adder assembly 2040 may include a single internal adder, which produces the output operand of the PE 2000.
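The multiply-then-accumulate flow described above can be illustrated with a minimal Python sketch. It is not the PE 2000 hardware design: the function names (`multiply`, `add_operands`, `adder_tree`, `pe_output`) are hypothetical, each internal adder is assumed to take at most two operands, and operands are modeled as plain lists with one element per depthwise channel.

```python
from typing import List, Sequence


def multiply(input_operand: Sequence[float], weight_operand: Sequence[float]) -> List[float]:
    """Model one multiplier: position-matched products, one per channel."""
    if len(input_operand) != len(weight_operand):
        raise ValueError("input and weight operands must have the same length")
    return [x * w for x, w in zip(input_operand, weight_operand)]


def add_operands(operands: Sequence[Sequence[float]]) -> List[float]:
    """Model one internal adder: element-wise sum of the operands it receives."""
    return [sum(values) for values in zip(*operands)]


def adder_tree(product_operands: Sequence[Sequence[float]]) -> List[float]:
    """Reduce product operands tier by tier until one output operand remains.

    Assumption: each adder combines up to two operands, so each tier has
    roughly half as many adders as the tier before it (the 2:1 ratio).
    """
    if not product_operands:
        raise ValueError("at least one product operand is required")
    tier = [list(op) for op in product_operands]
    while len(tier) > 1:
        tier = [add_operands(tier[i:i + 2]) for i in range(0, len(tier), 2)]
    return tier[0]


def pe_output(input_operands, weight_operands) -> List[float]:
    """One round: every multiplier runs, then the adder tree accumulates."""
    products = [multiply(i, w) for i, w in zip(input_operands, weight_operands)]
    return adder_tree(products)


if __name__ == "__main__":
    inputs = [[1, 2], [3, 4], [5, 6], [7, 8]]    # four input operands, two channels each
    weights = [[1, 1], [2, 2], [3, 3], [4, 4]]   # four matching weight operands
    print(pe_output(inputs, weights))            # [50, 60]
```

With four multipliers, the sketch uses two adders in the first tier and one in the last, and each element of the result stays associated with its own channel.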

[0101] The output register file 2050 stores output operands of the PE 2000. In some embodiments, the output register file 2050 may store a single output operand at a time. In other embodiments, the output register file 2050 may store multiple output operands or a portion of an output operand at a time. An output operand includes a plurality of output elements in an OFM. The output elements of an output operand may be stored sequentially in the output register file 2050 so the output elements can be processed sequentially. In some embodiments, each output element in the output operand corresponds to a different depthwise channel and is an element of a different output channel of the OFM of the depthwise convolution. The number of output elements in an output operand may equal the number of the depthwise channels of the depthwise convolution.

Example Deep Learning Environment

[0102] FIG. 12 illustrates a deep learning environment 2100, in accordance with various embodiments. The deep learning environment 2100 includes a deep learning server 2110 and a plurality of client devices 2120 (individually referred to as client device 2120). The deep learning server 2110 is connected to the client devices 2120 through a network 2130. In other embodiments, the deep learning environment 2100 may include fewer, more, or different components.

[0103] The deep learning server 2110 trains deep learning models using neural networks. A neural network is structured like the human brain and consists of artificial neurons, also known as nodes. These nodes are stacked next to each other in 3 types of layers: input layer, hidden layer(s), and output layer. Data provides each node with information in the form of inputs. The node multiplies the inputs with random weights, sums the weighted inputs, and adds a bias. Finally, nonlinear functions, also known as activation functions, are applied to determine which neuron to fire. The deep learning server 2110 can use various types of neural networks, such as DNN, CNN, recurrent neural network (RNN), generative adversarial network (GAN), long short-term memory network (LSTMN), and so on. During the process of training the deep learning models, the neural networks use unknown elements in the input distribution to extract features, group objects, and discover useful data patterns. The deep learning models can be used to solve various problems, e.g., making predictions, classifying images, and so on. The deep learning server 2110 may build deep learning models specific to particular types of problems that need to be solved. A deep learning model is trained to receive an input and output the solution to the particular problem. In particular, the deep learning server 2110 may build a CNN used by the CAM network 100 (e.g., a single CNN that encompasses CNN 110 and CNN 130), or a pair of CNNs used by the CAM network 100 (e.g., CNN 110 and CNN 130).

[0104] In FIG. 12, the deep learning server 2110 includes a DNN system 2140, a database 2150, and a distributer 2160. The DNN system 2140 trains DNNs. The DNNs can be used to process images, e.g., images captured by autonomous vehicles, medical devices, satellites, and so on. In an embodiment, a DNN receives an input image and outputs classifications of objects in the input image. An example of the DNNs is the CNN 200 described above in conjunction with FIG. 2. In some embodiments, the DNN system 2140 trains DNNs through knowledge distillation, e.g., dense-connection based knowledge distillation. The trained DNNs may be used on low-memory systems, like mobile phones, IoT edge devices, and so on.

[0105] The database 2150 stores data received, used, generated, or otherwise associated with the deep learning server 2110. For example, the database 2150 stores a training dataset that the DNN system 2140 uses to train DNNs. In an embodiment, the training dataset is an image gallery that can be used to train a CNN for classifying images. The training dataset may include data received from the client devices 2120. As another example, the database 2150 stores hyperparameters of the neural networks built by the deep learning server 2110.

[0106] The distributer 2160 distributes deep learning models generated by the deep learning server 2110 to the client devices 2120. In some embodiments, the distributer 2160 receives a request for a DNN from a client device 2120 through the network 2130. The request may include a description of a problem that the client device 2120 needs to solve. The request may also include information of the client device 2120, such as information describing available computing resources on the client device 2120. The information describing available computing resources on the client device 2120 can be information indicating network bandwidth, information indicating available memory size, information indicating processing power of the client device 2120, and so on. In an embodiment, the distributer 2160 may instruct the DNN system 2140 to generate a DNN in accordance with the request. The DNN system 2140 may generate a DNN based on the information in the request. For instance, the DNN system 2140 can determine the structure of the DNN and/or train the DNN in accordance with the request.

[0107] In another embodiment, the distributer 2160 may select the DNN from a group of preexisting DNNs based on the request. The distributer 2160 may select a DNN for a particular client device 2120 based on the size of the DNN and available resources of the client device 2120. In embodiments where the distributer 2160 determines that the client device 2120 has limited memory or processing power, the distributer 2160 may select a compressed DNN for the client device 2120, as opposed to an uncompressed DNN that has a larger size. The distributer 2160 then transmits the DNN generated or selected for the client device 2120 to the client device 2120.
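One way to picture the selection from a group of preexisting DNNs is the following Python sketch. The disclosure states only that selection is based on the size of the DNN and the available resources of the client device; the specific policy here (pick the largest model that fits in the reported memory) and the names `ModelInfo` and `select_model` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelInfo:
    """Hypothetical catalog entry for a preexisting DNN."""
    name: str
    size_mb: float
    compressed: bool


def select_model(candidates: Sequence[ModelInfo],
                 available_memory_mb: float) -> Optional[ModelInfo]:
    """Return the largest candidate that fits in the client's memory.

    Assumed policy: a larger model is preferred when it fits; a smaller
    (e.g., compressed) model is chosen for constrained clients. Returns
    None when no candidate fits.
    """
    fitting = [m for m in candidates if m.size_mb <= available_memory_mb]
    if not fitting:
        return None
    return max(fitting, key=lambda m: m.size_mb)


if __name__ == "__main__":
    catalog = [
        ModelInfo("classifier-full", size_mb=400.0, compressed=False),
        ModelInfo("classifier-compressed", size_mb=50.0, compressed=True),
    ]
    print(select_model(catalog, available_memory_mb=64.0).name)   # classifier-compressed
    print(select_model(catalog, available_memory_mb=512.0).name)  # classifier-full
```

A fuller policy could also weigh network bandwidth and processing power, which the request may carry alongside memory size.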

[0108] In some embodiments, the distributer 2160 may receive feedback from the client device 2120. For example, the distributer 2160 receives new training data from the client device 2120 and may send the new training data to the DNN system 2140 for further training the DNN. As another example, the feedback includes an update of the available computing resource on the client device 2120. The distributer 2160 may send a different DNN to the client device 2120 based on the update. For instance, after receiving the feedback indicating that the computing resources of the client device 2120 have been reduced, the distributer 2160 sends a DNN of a smaller size to the client device 2120.

[0109] The client devices 2120 receive DNNs from the distributer 2160 and apply the DNNs to perform machine learning tasks, e.g., to solve problems or answer questions. In various embodiments, the client devices 2120 input images into the DNNs and use the output of the DNNs for various applications, e.g., visual reconstruction, augmented reality, robot localization and navigation, medical diagnosis, weather prediction, and so on. A client device 2120 may be one or more computing devices capable of receiving user input as well as transmitting and/or receiving data via the network 2130. In one embodiment, a client device 2120 is a conventional computer system, such as a desktop or a laptop computer. Alternatively, a client device 2120 may be a device having computer functionality, such as a personal digital assistant (PDA), a mobile telephone, a smartphone, an autonomous vehicle, or another suitable device. A client device 2120 is configured to communicate via the network 2130. In one embodiment, a client device 2120 executes an application allowing a user of the client device 2120 to interact with the deep learning server 2110 (e.g., the distributer 2160 of the deep learning server 2110). The client device 2120 may request DNNs or send feedback to the distributer 2160 through the application. For example, a client device 2120 executes a browser application to enable interaction between the client device 2120 and the deep learning server 2110 via the network 2130. In another embodiment, a client device 2120 interacts with the deep learning server 2110 through an application programming interface (API) running on a native operating system of the client device 2120, such as IOS® or ANDROID™.

[0110] In an embodiment, a client device 2120 is an integrated computing device that operates as a standalone network-enabled device. For example, the client device 2120 includes a display, speakers, a microphone, a camera, and an input device. In another embodiment, a client device 2120 is a computing device for coupling to an external media device such as a television or other external display and/or audio output system. In this embodiment, the client device 2120 may couple to the external media device via a wireless interface or wired interface (e.g., an HDMI (High-Definition Multimedia Interface) cable) and may utilize various functions of the external media device such as its display, speakers, microphone, camera, and input devices. Here, the client device 2120 may be configured to be compatible with a generic external media device that does not have specialized software, firmware, or hardware specifically for interacting with the client device 2120.

[0111] The network 2130 supports communications between the deep learning server 2110 and client devices 2120. The network 2130 may comprise any combination of local area and/or wide area networks, using both wired and/or wireless communication systems. In one embodiment, the network 2130 may use standard communications technologies and/or protocols. For example, the network 2130 may include communication links using technologies such as Ethernet, 802.11, worldwide interoperability for microwave access (WiMAX), 3G, 4G, code division multiple access (CDMA), digital subscriber line (DSL), etc. Examples of networking protocols used for communicating via the network 2130 may include multiprotocol label switching (MPLS), transmission control protocol/Internet protocol (TCP/IP), hypertext transport protocol (HTTP), simple mail transfer protocol (SMTP), and file transfer protocol (FTP). Data exchanged over the network 2130 may be represented using any suitable format, such as hypertext markup language (HTML) or extensible markup language (XML). In some embodiments, all or some of the communication links of the network 2130 may be encrypted using any suitable technique or techniques.

Example DNN System

[0112] FIG. 13 is a block diagram of an example DNN system 2200, in accordance with various embodiments. The whole DNN system 2200 or a part of the DNN system 2200 may be implemented in the computing device 2300 in FIG. 14. The DNN system 2200 trains DNNs for various tasks, such as image classification, learning relationships between biological cells (e.g., DNA, proteins, etc.), control behaviors for devices (e.g., robots, machines, etc.), and so on. The DNN system 2200 includes an interface module 2210, a training module 2220, a validation module 2230, an inference module 2240, and a memory 2250. In other embodiments, alternative configurations, different or additional components may be included in the DNN system 2200. Further, functionality attributed to a component of the DNN system 2200 may be accomplished by a different component included in the DNN system 2200 or a different system. The DNN system 2200 or a component of the DNN system 2200 (e.g., the training module 2220 or inference module 2240) may include the computing device 2300.

[0113] The interface module 2210 facilitates communications of the DNN system 2200 with other systems. For example, the interface module 2210 establishes communications between the DNN system 2200 and an external database to receive data that can be used to train DNNs or input into DNNs to perform tasks. As another example, the interface module 2210 enables the DNN system 2200 to distribute DNNs to other systems, e.g., computing devices configured to apply DNNs to perform tasks.

[0114] The training module 2220 trains DNNs by using a training dataset. The training module 2220 forms the training dataset. In an embodiment where the training module 2220 trains a DNN to recognize objects in images, the training dataset includes training images and training labels. The training labels describe ground-truth classifications of objects in the training images. In some embodiments, each label in the training dataset corresponds to an object in a training image. In some embodiments, a part of the training dataset may be used to initially train the DNN, and the rest of the training dataset may be held back as a validation subset used by the validation module 2230 to validate performance of a trained DNN. The portion of the training dataset not including the tuning subset and the validation subset may be used to train the DNN.

[0115] The training module 2220 also determines hyperparameters for training the DNN. Hyperparameters are variables specifying the DNN training process. Hyperparameters are different from parameters inside the DNN (e.g., weights of filters). In some embodiments, hyperparameters include variables determining the architecture of the DNN, such as number of hidden layers, etc. Hyperparameters also include variables which determine how the DNN is trained, such as batch size, number of epochs, etc. A batch size defines the number of training samples to work through before updating the parameters of the DNN. The batch size is the same as or smaller than the number of samples in the training dataset. The training dataset can be divided into one or more batches. The number of epochs defines how many times the entire training dataset is passed forward and backward through the entire network, i.e., the number of times that the deep learning algorithm works through the entire training dataset. One epoch means that each training sample in the training dataset has had an opportunity to update the parameters inside the DNN. An epoch may include one or more batches. The number of epochs may be 22, 220, 500, 2200, or even larger.
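The relationship between batch size, number of epochs, and parameter updates can be checked with a short calculation. The sketch below assumes one parameter update per batch and that a final, smaller batch is kept rather than dropped; neither detail is specified in the disclosure, and the function names are hypothetical.

```python
import math


def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to cover the training dataset once.

    Assumption: the last batch may be smaller than batch_size and is kept.
    """
    if not 1 <= batch_size <= num_samples:
        raise ValueError("batch_size must be between 1 and num_samples")
    return math.ceil(num_samples / batch_size)


def total_updates(num_samples: int, batch_size: int, num_epochs: int) -> int:
    """Total parameter updates, assuming one update per batch."""
    return batches_per_epoch(num_samples, batch_size) * num_epochs


if __name__ == "__main__":
    # 1000 samples in batches of 32 gives 32 batches per epoch (31 full, 1 partial).
    print(batches_per_epoch(1000, 32))   # 32
    print(total_updates(1000, 32, 10))   # 320
```

When the batch size equals the number of samples, each epoch consists of a single batch and therefore a single update.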

[0116] The training module 2220 defines the architecture of the DNN, e.g., based on some of the hyperparameters. The architecture of the DNN includes an input layer, an output layer, and a plurality of hidden layers. The input layer of a DNN may include tensors (e.g., a multidimensional array) specifying attributes of the input image, such as the height of the input image, the width of the input image, and the depth of the input image (e.g., the number of bits specifying the color of a pixel in the input image). The output layer includes labels of objects in the input layer. The hidden layers are layers between the input layer and output layer. The hidden layers include one or more convolutional layers and one or more other types of layers, such as pooling layers, fully connected layers, normalization layers, softmax or logistic layers, and so on. The convolutional layers of the DNN abstract the input image to a feature map that is represented by a tensor specifying the feature map height, the feature map width, and the feature map channels (e.g., red, green, blue images include 3 channels). A pooling layer is used to reduce the spatial volume of the input image after convolution. It is used between 2 convolutional layers. A fully connected layer involves weights, biases, and neurons. It connects neurons in one layer to neurons in another layer. It is trained to classify images into different categories.

[0117] In the process of defining the architecture of the DNN, the training module 2220 also adds an activation function to a hidden layer or the output layer. An activation function of a layer transforms the weighted sum of the input of the layer to an output of the layer. The activation function may be, for example, a rectified linear unit activation function, a tangent activation function, or other types of activation functions.
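As a minimal illustration of an activation function transforming a weighted sum into a layer output, the Python sketch below computes the weighted sum for a single node and applies two activations. The names are hypothetical, and the hyperbolic tangent is used here on the assumption that it is the kind of "tangent" activation intended; the disclosure does not specify.

```python
import math
from typing import Callable, Sequence


def weighted_sum(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Weighted sum of a node's inputs plus its bias."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def relu(z: float) -> float:
    """Rectified linear unit: passes positive values, clamps negatives to 0."""
    return max(0.0, z)


def node_output(inputs: Sequence[float], weights: Sequence[float], bias: float,
                activation: Callable[[float], float]) -> float:
    """Apply the chosen activation function to the node's weighted sum."""
    return activation(weighted_sum(inputs, weights, bias))


if __name__ == "__main__":
    x, w, b = [1.0, 2.0], [0.5, -1.0], 0.25      # weighted sum is -1.25
    print(node_output(x, w, b, relu))            # 0.0
    print(node_output(x, w, b, math.tanh))       # a negative value between -1 and 0
```

The same weighted sum yields different outputs depending on the activation: the rectified linear unit suppresses the negative value entirely, while the hyperbolic tangent maps it into the range (-1, 1).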

[0118] After the training module 2220 defines the architecture of the DNN, the training module 2220 inputs a training dataset into the DNN. The training dataset includes a plurality of training samples. An example of a training sample includes an object in an image and a ground-truth label of the object. The training module 2220 modifies the parameters inside the DNN ("internal parameters of the DNN") to minimize the error between labels of the training objects that are generated by the DNN and the ground-truth labels of the objects. The internal parameters include weights of filters in the convolutional layers of the DNN. In some embodiments, the training module 2220 uses a cost function to minimize the error.

[0119] The training module 2220 may train the DNN for a predetermined number of epochs. The number of epochs is a hyperparameter that defines the number of times that the deep learning algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update internal parameters of the DNN. After the training module 2220 finishes the predetermined number of epochs, the training module 2220 may stop updating the parameters in the DNN. The DNN having the updated parameters is referred to as a trained DNN.

[0120] The validation module 2230 verifies accuracy of trained DNNs. In some embodiments, the validation module 2230 inputs samples in a validation dataset into a trained DNN and uses the outputs of the DNN to determine the model accuracy. In some embodiments, a validation dataset may be formed of some or all the samples in the training dataset. Additionally or alternatively, the validation dataset includes additional samples, other than those in the training sets. In some embodiments, the validation module 2230 may determine an accuracy score measuring the precision, recall, or a combination of precision and recall of the DNN. The validation module 2230 may use the following metrics to determine the accuracy score: Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where precision may be how many objects the reference classification model correctly predicted (TP, or true positives) out of the total it predicted (TP + FP, where FP denotes false positives), and recall may be how many objects the reference classification model correctly predicted (TP) out of the total number of objects that did have the property in question (TP + FN, where FN denotes false negatives). The F-score (F-score = 2 * PR / (P + R)) unifies precision and recall into a single measure.

[0121] The validation module 2230 may compare the accuracy score with a threshold score. In an example where the validation module 2230 determines that the accuracy score of the augmented model is lower than the threshold score, the validation module 2230 instructs the training module 2220 to re-train the DNN. In one embodiment, the training module 2220 may iteratively re-train the DNN until the occurrence of a stopping condition, such as the accuracy measurement indicating that the DNN may be sufficiently accurate, or a number of training rounds having taken place.
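The precision, recall, and F-score formulas above, together with the threshold comparison, can be written as a short Python sketch. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the disclosure does not address.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); 0.0 if nothing was predicted positive (assumed)."""
    total = tp + fp
    return tp / total if total else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); 0.0 if there were no actual positives (assumed)."""
    total = tp + fn
    return tp / total if total else 0.0


def f_score(p: float, r: float) -> float:
    """F-score = 2 * P * R / (P + R); 0.0 if both P and R are zero (assumed)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def needs_retraining(accuracy_score: float, threshold: float) -> bool:
    """True when the accuracy score is lower than the threshold score."""
    return accuracy_score < threshold


if __name__ == "__main__":
    tp, fp, fn = 8, 2, 8
    p, r = precision(tp, fp), recall(tp, fn)   # 0.8 and 0.5
    f = f_score(p, r)                          # 0.8 / 1.3, about 0.615
    print(p, r, round(f, 3))
    print(needs_retraining(f, threshold=0.7))  # True
```

In this example the model is precise but misses half of the actual positives, so the F-score falls below the illustrative threshold of 0.7 and re-training would be requested.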

[0122] The inference module 2240 applies the trained or validated DNN to perform tasks. For instance, the inference module 2240 inputs images into the DNN. The DNN outputs classifications of objects in the images. As an example, the DNN may be provisioned in a security setting to detect malicious or hazardous objects in images captured by security cameras. As another example, the DNN may be provisioned to detect objects (e.g., road signs, hazards, humans, pets, etc.) in images captured by cameras of an autonomous vehicle. The input to the DNN may be formatted according to a predefined input structure mirroring the way that the training dataset was provided to the DNN. The DNN may generate an output structure which may be, for example, a classification of the image, a listing of detected objects, a boundary of detected objects, or the like. In some embodiments, the inference module 2240 distributes the DNN to other systems, e.g., computing devices in communication with the DNN system 2200, for the other systems to apply the DNN to perform the tasks.

[0123] The memory 2250 stores data received, generated, used, or otherwise associated with the DNN system 2200. For example, the memory 2250 stores the datasets used by the training module 2220 and validation module 2230. The memory 2250 may also store data generated by the training module 2220 and validation module 2230, such as the hyperparameters for training DNNs, internal parameters of trained DNNs (e.g., values of tunable parameters of activation functions, such as Fractional Adaptive Linear Units (FALUs)), etc. In the embodiment of FIG. 13, the memory 2250 is a component of the DNN system 2200. In other embodiments, the memory 2250 may be external to the DNN system 2200 and communicate with the DNN system 2200 through a network.

Example Computing Device

[0124] FIG. 14 is a block diagram of an example computing device 2300, in accordance with various embodiments. In some embodiments, the computing device 2300 can be used as the DNN system 2200 in FIG. 13. A number of components are illustrated in FIG. 14 as included in the computing device 2300, but any one or more of these components may be omitted or duplicated, as suitable for the application. In some embodiments, some or all of the components included in the computing device 2300 may be attached to one or more motherboards. In some embodiments, some or all of these components are fabricated onto a single system on a chip (SoC) die. Additionally, in various embodiments, the computing device 2300 may not include one or more of the components illustrated in FIG. 14, but the computing device 2300 may include interface circuitry for coupling to the one or more components. For example, the computing device 2300 may not include a display device 2306, but may include display device interface circuitry (e.g., a connector and driver circuitry) to which a display device 2306 may be coupled. In another set of examples, the computing device 2300 may not include an audio input device 2318 or an audio output device 2308, but may include audio input or output device interface circuitry (e.g., connectors and supporting circuitry) to which an audio input device 2318 or audio output device 2308 may be coupled.

[0125] The computing device 2300 may include a processing device 2302 (e.g., one or more processing devices). The processing device 2302 processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory. The computing device 2300 may include a memory 2304, which may itself include one or more memory devices such as volatile memory (e.g., DRAM), nonvolatile memory (e.g., read-only memory (ROM)), high bandwidth memory (HBM), flash memory, solid state memory, and/or a hard drive. In some embodiments, the memory 2304 may include memory that shares a die with the processing device 2302. In some embodiments, the memory 2304 includes one or more non-transitory computer-readable media storing instructions executable to perform operations for deep learning, e.g., the method 800 described above in conjunction with FIG. 8 or some operations performed by the CAM network 100 described above in conjunction with FIG. 1. The instructions stored in the one or more non-transitory computer-readable media may be executed by the processing device 2302.

[0126] In some embodiments, the computing device 2300 may include a communication chip 2312 (e.g., one or more communication chips). For example, the communication chip 2312 may be configured for managing wireless communications for the transfer of data to and from the computing device 2300. The term "wireless" and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a nonsolid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not.

[0127] The communication chip 2312 may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment), Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as "3GPP2"), etc.). IEEE 802.16 compatible Broadband Wireless Access (BWA) networks are generally referred to as WiMAX networks, an acronym that stands for worldwide interoperability for microwave access, which is a certification mark for products that pass conformity and interoperability tests for the IEEE 802.16 standards. The communication chip 2312 may operate in accordance with a Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Evolved HSPA (E-HSPA), or LTE network. The communication chip 2312 may operate in accordance with Enhanced Data for GSM Evolution (EDGE), GSM EDGE Radio Access Network (GERAN), Universal Terrestrial Radio Access Network (UTRAN), or Evolved UTRAN (E-UTRAN). The communication chip 2312 may operate in accordance with CDMA, Time Division Multiple Access (TDMA), Digital Enhanced Cordless Telecommunications (DECT), Evolution-Data Optimized (EV-DO), and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond. The communication chip 2312 may operate in accordance with other wireless protocols in other embodiments. The computing device 2300 may include an antenna 2322 to facilitate wireless communications and/or to receive other wireless communications (such as AM or FM radio transmissions).

[0128] In some embodiments, the communication chip 2312 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., the Ethernet). As noted above, the communication chip 2312 may include multiple communication chips. For instance, a first communication chip 2312 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 2312 may be dedicated to longer-range wireless communications such as global positioning system (GPS), EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others. In some embodiments, a first communication chip 2312 may be dedicated to wireless communications, and a second communication chip 2312 may be dedicated to wired communications.

[0129] The computing device 2300 may include battery/power circuitry 2314. The battery/power circuitry 2314 may include one or more energy storage devices (e.g., batteries or capacitors) and/or circuitry for coupling components of the computing device 2300 to an energy source separate from the computing device 2300 (e.g., AC line power).

[0130] The computing device 2300 may include a display device 2306 (or corresponding interface circuitry, as discussed above). The display device 2306 may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD), a light-emitting diode display, or a flat panel display, for example.

[0131] The computing device 2300 may include an audio output device 2308 (or corresponding interface circuitry, as discussed above). The audio output device 2308 may include any device that generates an audible indicator, such as speakers, headsets, or earbuds, for example.

[0132] The computing device 2300 may include an audio input device 2318 (or corresponding interface circuitry, as discussed above). The audio input device 2318 may include any device that generates a signal representative of a sound, such as microphones, microphone arrays, or digital instruments (e.g., instruments having a musical instrument digital interface (MIDI) output).

[0133] The computing device 2300 may include a GPS device 2316 (or corresponding interface circuitry, as discussed above). The GPS device 2316 may be in communication with a satellite-based system and may receive a location of the computing device 2300, as known in the art.

[0134] The computing device 2300 may include another output device 2310 (or corresponding interface circuitry, as discussed above). Examples of the other output device 2310 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, or an additional storage device.

[0135] The computing device 2300 may include another input device 2320 (or corresponding interface circuitry, as discussed above). Examples of the other input device 2320 may include an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a radio frequency identification (RFID) reader.

[0136] The computing device 2300 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cell phone, a smart phone, a mobile internet device, a music player, a tablet computer, a laptop computer, a netbook computer, an ultrabook computer, a personal digital assistant (PDA), an ultramobile personal computer, etc.), a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, or a wearable computer system. In some embodiments, the computing device 2300 may be any other electronic device that processes data.

Select Examples

[0137] The following paragraphs provide various examples of the embodiments disclosed herein.

[0138] Example 1 provides a computer-implemented method, including generating a feature map based on an input image, the feature map including one or more channels, each of the one or more channels having a number of activations arranged in a matrix; applying one or more spatial masks to the feature map to generate one or more masked feature maps, each spatial mask corresponding to a different one of the number of activations of the feature map, and each masked feature map including the one or more channels; generating a vector based on the one or more masked feature maps, the vector including a number of values, each value corresponding to a respective one of the spatial masks; and generating a saliency map based on the vector, wherein an element in the saliency map indicates a likelihood of an activation of the feature map falling into a class.

[0139] Example 2 provides the method of example 1, where generating the saliency map includes generating a normalized vector based on a minimum value in the vector and a maximum value in the vector, where the activation of the feature map corresponds to the respective one of the spatial masks for the particular value in the vector.

[0140] Example 3 provides the method of example 2, where generating the saliency map further includes reshaping the normalized vector into the saliency map, where a position of an element in the saliency map is determined based on a position of a corresponding activation in the feature map.
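
As an illustrative, non-limiting sketch of the normalization and reshaping recited in examples 2 and 3, the following Python code scales a vector using its minimum and maximum values and then reshapes it into a two-dimensional map. The function names `min_max_normalize` and `reshape_to_saliency_map`, the specific min-max formula, the row-major ordering, and the handling of a constant vector are assumptions made for illustration and are not taken from the disclosure.

```python
def min_max_normalize(vector):
    """Scale values into [0, 1] using the vector's minimum and maximum."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        # Assumption: a constant vector maps to all zeros to avoid dividing by zero.
        return [0.0 for _ in vector]
    return [(v - lo) / (hi - lo) for v in vector]


def reshape_to_saliency_map(vector, height, width):
    """Reshape a vector of length height * width into a row-major 2-D map."""
    if len(vector) != height * width:
        raise ValueError("vector length must equal height * width")
    return [vector[r * width:(r + 1) * width] for r in range(height)]


# Hypothetical per-mask scores for a 2 x 2 feature map (one value per spatial mask).
scores = [2.0, 4.0, 6.0, 10.0]
saliency = reshape_to_saliency_map(min_max_normalize(scores), 2, 2)
print(saliency)
```

In this sketch, the element at row r and column c of the resulting map comes from the vector value whose spatial mask selects the activation at the same row and column of the feature map.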

[0141] Example 4 provides the method of example 1, where applying the one or more spatial masks to the feature map includes applying a particular spatial mask of the one or more spatial masks to a particular channel of the feature map by performing an elementwise multiplication of the particular spatial mask and the particular channel of the feature map.
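
As an illustrative, non-limiting sketch of the elementwise multiplication recited in example 4, the following Python code multiplies a spatial mask with each channel of a small feature map. The function names, the nested-list representation of channels, and the sample values are assumptions made for illustration only.

```python
def apply_mask_to_channel(mask, channel):
    """Return the elementwise product of a spatial mask and one channel."""
    same_rows = len(mask) == len(channel)
    if not same_rows or any(len(m) != len(c) for m, c in zip(mask, channel)):
        raise ValueError("mask and channel must have the same shape")
    return [[m * c for m, c in zip(mask_row, channel_row)]
            for mask_row, channel_row in zip(mask, channel)]


def apply_mask_to_feature_map(mask, feature_map):
    """Apply the same spatial mask to every channel of a feature map."""
    return [apply_mask_to_channel(mask, channel) for channel in feature_map]


# Hypothetical feature map with two channels, each a 2 x 2 matrix of activations.
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0
    [[5.0, 6.0], [7.0, 8.0]],  # channel 1
]
mask = [[0, 1], [0, 0]]  # keeps only the activation at row 0, column 1
masked = apply_mask_to_feature_map(mask, feature_map)
print(masked)
```

The masked feature map in this sketch keeps the same number of channels as the feature map, with every activation other than the selected one set to zero.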

[0142] Example 5 provides the method of example 1, where a first activation of the number of activations corresponds to a first portion of the input image, and a second activation of the number of activations corresponds to a second portion of the input image, the first portion of the input image partially overlapping the second portion of the input image.

[0143] Example 6 provides the method of example 1, where generating the feature map based on the input image includes performing one or more convolutions on the input image, where the feature map may be a result of the one or more convolutions.

[0144] Example 7 provides the method of example 1, where generating the vector based on the one or more masked feature maps includes performing one or more convolutions on the one or more masked feature maps to generate a second one or more masked feature maps; and applying an activation function to the second one or more masked feature maps.

[0145] Example 8 provides the method of example 1, where the class is a first class of a multi-class classification, the method further including generating a second vector based on the one or more masked feature maps, a value of the second vector corresponding to a respective one of the spatial masks; and generating a second saliency map based on the second vector, where an element in the second saliency map indicates a likelihood of an activation of the feature map falling into a second class different from the first class.
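
As an illustrative, non-limiting sketch of how a first vector and a second vector for two different classes may be obtained, the following Python code assumes that processing the masked feature maps yields one row of per-class scores for each spatial mask. The table `per_mask_scores`, its values, and the function `class_vector` are hypothetical and are not taken from the disclosure.

```python
def class_vector(per_mask_scores, class_index):
    """Collect the score for one class across all spatial masks."""
    return [scores[class_index] for scores in per_mask_scores]


# Hypothetical scores: one row per spatial mask, one column per class.
per_mask_scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
]
first_vector = class_vector(per_mask_scores, 0)   # values for the first class
second_vector = class_vector(per_mask_scores, 1)  # values for the second class
print(first_vector, second_vector)
```

Under this assumption, each vector has one value per spatial mask, so a separate saliency map may be generated for each class from the same set of masked feature maps.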

[0146] Example 9 provides the method of example 1, where the one or more masked feature maps includes a number M masked feature maps, where M is equal to the number of activations times the number of channels.

[0147] Example 10 provides the method of example 1, where each spatial mask is a matrix having dimensions matching the width and the height of the channels of the feature map, a particular spatial mask has one element that has a value of one and has a position in the particular spatial mask, and the position matches a position of a particular activation in the feature map.
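
As an illustrative, non-limiting sketch of the spatial masks recited in example 10, the following Python code builds one mask per activation position, each mask having a single element with a value of one. Setting the remaining elements to zero, the row-major ordering of the masks, and the function name `one_hot_spatial_masks` are assumptions made for illustration.

```python
def one_hot_spatial_masks(height, width):
    """Build height * width masks, each with a single one at a distinct position."""
    masks = []
    for row in range(height):
        for col in range(width):
            # Assumption: every element other than the selected one is zero.
            mask = [[0] * width for _ in range(height)]
            mask[row][col] = 1
            masks.append(mask)
    return masks


# Hypothetical feature map whose channels are 2 rows by 3 columns.
masks = one_hot_spatial_masks(2, 3)
print(len(masks))
print(masks[4])
```

In this sketch, the mask at index row * width + col has its one at position (row, col), which matches the position of the corresponding activation in the feature map.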

[0148] Example 11 provides one or more non-transitory computer-readable media storing instructions executable to perform operations, the operations including generating a feature map based on an input image, the feature map including one or more channels, each of the one or more channels having a number of activations arranged in a matrix; applying one or more spatial masks to the feature map to generate one or more masked feature maps, each spatial mask corresponding to a different one of the number of activations of the feature map, and each masked feature map including the one or more channels; generating a vector based on the one or more masked feature maps, the vector including a number of values, each value corresponding to a respective one of the spatial masks; and generating a saliency map based on the vector, where an element in the saliency map indicates a likelihood of an activation of the feature map falling into a class.

[0149] Example 12 provides the one or more non-transitory computer-readable media of example 11, where generating the saliency map includes generating a normalized vector based on a minimum value in the vector and a maximum value in the vector, where the activation of the feature map corresponds to the respective one of the spatial masks for the particular value in the vector.

[0150] Example 13 provides the one or more non-transitory computer-readable media of example 12, where generating the saliency map further includes reshaping the normalized vector into the saliency map, where a position of an element in the saliency map is determined based on a position of a corresponding activation in the feature map.

[0151] Example 14 provides the one or more non-transitory computer-readable media of example 11, where applying the one or more spatial masks to the feature map includes applying a particular spatial mask of the one or more spatial masks to a particular channel of the feature map by performing an elementwise multiplication of the particular spatial mask and the particular channel of the feature map.

[0152] Example 15 provides the one or more non-transitory computer-readable media of example 11, where a first activation of the number of activations corresponds to a first portion of the input image, and a second activation of the number of activations corresponds to a second portion of the input image, the first portion of the input image partially overlapping the second portion of the input image.

[0153] Example 16 provides the one or more non-transitory computer-readable media of example 15, where generating the feature map based on the input image includes performing one or more convolutions on the input image, where the feature map may be a result of the one or more convolutions.

[0154] Example 17 provides the one or more non-transitory computer-readable media of example 11, where generating the vector based on the one or more masked feature maps includes performing one or more convolutions on the one or more masked feature maps to generate a second one or more masked feature maps; and applying an activation function to the second one or more masked feature maps.

[0155] Example 18 provides the one or more non-transitory computer-readable media of example 11, where the class is a first class of a multi-class classification, and the operations further include generating a second vector based on the one or more masked feature maps, a value of the second vector corresponding to a respective one of the spatial masks; and generating a second saliency map based on the second vector, where an element in the second saliency map indicates a likelihood of an activation of the feature map falling into a second class different from the first class.

[0156] Example 19 provides the one or more non-transitory computer-readable media of example 11, where the one or more masked feature maps includes a number M masked feature maps, where M is equal to the number of activations times the number of channels.

[0157] Example 20 provides the one or more non-transitory computer-readable media of example 11, where each spatial mask is a matrix having dimensions matching the width and the height of the channels of the feature map, a particular spatial mask has one element that has a value of one and has a position in the particular spatial mask, and the position matches a position of a particular activation in the feature map.

[0158] Example 21 provides an apparatus including a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations including generating a feature map based on an input image, the feature map including one or more channels, each of the one or more channels having a number of activations arranged in a matrix; applying one or more spatial masks to the feature map to generate one or more masked feature maps, each spatial mask corresponding to a different one of the number of activations of the feature map, and each masked feature map including the one or more channels; generating a vector based on the one or more masked feature maps, the vector including a number of values, each value corresponding to a respective one of the spatial masks; and generating a saliency map based on the vector, where an element in the saliency map indicates a likelihood of an activation of the feature map falling into a class.

[0159] Example 22 provides the apparatus of example 21, where generating the saliency map includes generating a normalized vector based on a minimum value in the vector and a maximum value in the vector, where the activation of the feature map corresponds to the respective one of the spatial masks for the particular value in the vector.

[0160] Example 23 provides the apparatus of example 22, where generating the saliency map further includes reshaping the normalized vector into the saliency map, where a position of an element in the saliency map is determined based on a position of a corresponding activation in the feature map.

[0161] Example 24 provides the apparatus of example 21, where applying the one or more spatial masks to the feature map includes applying a particular spatial mask of the one or more spatial masks to a particular channel of the feature map by performing an elementwise multiplication of the particular spatial mask and the particular channel of the feature map.

[0162] Example 25 provides the apparatus of example 21, where a first activation of the number of activations corresponds to a first portion of the input image, and a second activation of the number of activations corresponds to a second portion of the input image, the first portion of the input image partially overlapping the second portion of the input image.

[0163] The above description of illustrated implementations of the disclosure, including what is described in the Abstract, is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. While specific implementations of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize. These modifications may be made to the disclosure in light of the above detailed description.

Claims

What is claimed is:

1. A computer-implemented method, comprising: generating a feature map based on an input image, the feature map comprising one or more channels, each of the one or more channels having a number of activations arranged in a matrix; applying one or more spatial masks to the feature map to generate one or more masked feature maps, each spatial mask corresponding to a different one of the number of activations of the feature map, and each masked feature map comprising the one or more channels; generating a vector based on the one or more masked feature maps, the vector comprising a number of values, each value corresponding to a respective one of the spatial masks; and generating a saliency map based on the vector, wherein an element in the saliency map indicates a likelihood of an activation of the feature map falling into a class.

2. The method of claim 1, wherein generating the saliency map comprises: generating a normalized vector based on a minimum value in the vector and a maximum value in the vector, wherein the activation of the feature map corresponds to the respective one of the spatial masks for the particular value in the vector.

3. The method of claim 2, wherein generating the saliency map further comprises: reshaping the normalized vector into the saliency map, wherein a position of an element in the saliency map is determined based on a position of a corresponding activation in the feature map.

4. The method of claim 1, wherein applying the one or more spatial masks to the feature map comprises: applying a particular spatial mask of the one or more spatial masks to a particular channel of the feature map by performing an elementwise multiplication of the particular spatial mask and the particular channel of the feature map.

5. The method of claim 1, wherein a first activation of the number of activations corresponds to a first portion of the input image, and a second activation of the number of activations corresponds to a second portion of the input image, the first portion of the input image partially overlapping the second portion of the input image.

6. The method of claim 1, wherein generating the feature map based on the input image comprises performing one or more convolutions on the input image.

7. The method of claim 1, wherein generating the vector based on the one or more masked feature maps comprises: performing one or more convolutions on the one or more masked feature maps to generate a second one or more masked feature maps; and applying an activation function to the second one or more masked feature maps.

8. The method of claim 1, wherein the class is a first class of a multi-class classification, the method further comprising: generating a second vector based on the one or more masked feature maps, a value of the second vector corresponding to a respective one of the spatial masks; and generating a second saliency map based on the second vector, wherein an element in the second saliency map indicates a likelihood of an activation of the feature map falling into a second class different from the first class.

9. The method of claim 1, wherein the one or more masked feature maps comprises a number M masked feature maps, where M is equal to the number of activations times the number of channels.

10. The method of claim 1, wherein each spatial mask is a matrix having dimensions matching the width and the height of the channels of the feature map, a particular spatial mask has one element that has a value of one and has a position in the particular spatial mask, and the position matches a position of a particular activation in the feature map.

11. One or more non-transitory computer-readable media storing instructions executable to perform operations, the operations comprising: generating a feature map based on an input image, the feature map comprising one or more channels, each of the one or more channels having a number of activations arranged in a matrix; applying one or more spatial masks to the feature map to generate one or more masked feature maps, each spatial mask corresponding to a different one of the number of activations of the feature map, and each masked feature map comprising the one or more channels; generating a vector based on the one or more masked feature maps, the vector comprising a number of values, each value corresponding to a respective one of the spatial masks; and generating a saliency map based on the vector, wherein an element in the saliency map indicates a likelihood of an activation of the feature map falling into a class.

12. The one or more non-transitory computer-readable media of claim 11, wherein generating the saliency map comprises: generating a normalized vector based on a minimum value in the vector and a maximum value in the vector, wherein the activation of the feature map corresponds to the respective one of the spatial masks for the particular value in the vector.

13. The one or more non-transitory computer-readable media of claim 12, wherein generating the saliency map further comprises: reshaping the normalized vector into the saliency map, wherein a position of an element in the saliency map is determined based on a position of a corresponding activation in the feature map.

14. The one or more non-transitory computer-readable media of claim 11, wherein applying the one or more spatial masks to the feature map comprises: applying a particular spatial mask of the one or more spatial masks to a particular channel of the feature map by performing an elementwise multiplication of the particular spatial mask and the particular channel of the feature map.

15. The one or more non-transitory computer-readable media of claim 11, wherein a first activation of the number of activations corresponds to a first portion of the input image, and a second activation of the number of activations corresponds to a second portion of the input image, the first portion of the input image partially overlapping the second portion of the input image.

16. The one or more non-transitory computer-readable media of claim 15, wherein generating the feature map based on the input image comprises performing one or more convolutions on the input image.

17. The one or more non-transitory computer-readable media of claim 11, wherein generating the vector based on the one or more masked feature maps comprises: performing one or more convolutions on the one or more masked feature maps to generate a second one or more masked feature maps; and applying an activation function to the second one or more masked feature maps.

18. The one or more non-transitory computer-readable media of claim 11, wherein the class is a first class of a multi-class classification, and the operations further comprise: generating a second vector based on the one or more masked feature maps, a value of the second vector corresponding to a respective one of the spatial masks; and generating a second saliency map based on the second vector, wherein an element in the second saliency map indicates a likelihood of an activation of the feature map falling into a second class different from the first class.

19. The one or more non-transitory computer-readable media of claim 11, wherein the one or more masked feature maps comprises a number M masked feature maps, where M is equal to the number of activations times the number of channels.

20. The one or more non-transitory computer-readable media of claim 11, wherein each spatial mask is a matrix having dimensions matching the width and the height of the channels of the feature map, a particular spatial mask has one element that has a value of one and has a position in the particular spatial mask, and the position matches a position of a particular activation in the feature map.

21. An apparatus, comprising: a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations comprising: generating a feature map based on an input image, the feature map comprising one or more channels, each of the one or more channels having a number of activations arranged in a matrix; applying one or more spatial masks to the feature map to generate one or more masked feature maps, each spatial mask corresponding to a different one of the number of activations of the feature map, and each masked feature map comprising the one or more channels; generating a vector based on the one or more masked feature maps, the vector comprising a number of values, each value corresponding to a respective one of the spatial masks; and generating a saliency map based on the vector, wherein an element in the saliency map indicates a likelihood of an activation of the feature map falling into a class.

22. The apparatus of claim 21, wherein generating the saliency map comprises: generating a normalized vector based on a minimum value in the vector and a maximum value in the vector, wherein the activation of the feature map corresponds to the respective one of the spatial masks for the particular value in the vector.

23. The apparatus of claim 22, wherein generating the saliency map further comprises: reshaping the normalized vector into the saliency map, wherein a position of an element in the saliency map is determined based on a position of a corresponding activation in the feature map.

24. The apparatus of claim 21, wherein applying the one or more spatial masks to the feature map comprises: applying a particular spatial mask of the one or more spatial masks to a particular channel of the feature map by performing an elementwise multiplication of the particular spatial mask and the particular channel of the feature map.

25. The apparatus of claim 21, wherein a first activation of the number of activations corresponds to a first portion of the input image, and a second activation of the number of activations corresponds to a second portion of the input image, the first portion of the input image partially overlapping the second portion of the input image.