WO2020077525A1

WO2020077525A1 - Systems and methods for model for incremental and adaptive object recognition using hierarchical representations

Info

Publication number: WO2020077525A1
Application number: PCT/CN2018/110403
Authority: WO
Inventors: Zheng Zhang
Original assignee: Shanghai New York University
Priority date: 2018-10-16
Filing date: 2018-10-16
Publication date: 2020-04-23

Abstract

Systems and methods are disclosed that can execute an adaptive object recognition model inspired by the foveation mechanism of human visual cortex. The model can tightly integrates the resolution of "where" and "what" in a series of glimpses, and output incrementally better predictions. The model can adapt to data complexity and trade-off resource consumption and performance on-demand. Systems and methods as disclosed herein can explore the graphical hierarchy of visual features, deriving a clean and effective architecture that is fast, efficient and robust. Glimpses at different levels of the feature hierarchy are processed with convolutional feature extractors with the same capacity but do not share parameters. As such, systems and methods as disclosed herein can attend to statistics of different granularity, but their limited capacities act as an information bottleneck, leading to automatic discovery of structures.

Description

[Title established by the ISA under Rule 37.2] SYSTEMS AND METHODS FOR MODEL FOR INCREMENTAL AND ADAPTIVE OBJECT RECOGNITION USING HIERARCHICAL REPRESENTATIONS

BACKGROUND

The present disclosure relates generally to the field of computer vision. More particularly, the present disclosure relates to systems and methods for a model for incremental and adaptive object recognition using hierarchical representation.

One of the more matured domain applying deep learning technologies is computer vision. Computer vision can implement convolutional neural networks ( “CNNs” ) and related solutions, including architectures based CNNs and using the availability of large amount of data and high performing GPU resources. Existing networks typically go deep, with neurons at higher layers extracting more complex features of a larger receptive fields. However, as such networks go deeper, they may lose finer details and errors become increasingly harder to propagate to earlier layers.

SUMMARY

The present disclosure relates to systems and methods that can execute an adaptive object recognition model inspired by the foveation mechanism of human visual cortex. The model can tightly integrate the resolution of “where” and “what” in a series of glimpses, and output incrementally better predictions. The model can adapt to data complexity and trade-off resource consumption and performance on-demand. Systems and methods as disclosed herein can explore the graphical hierarchy of visual features, deriving a clean and effective architecture that is fast, efficient and robust. In some embodiments, glimpses at different levels of the feature hierarchy are processed with convolutional feature extractors with the same capacity but do not share parameters. As such, systems and methods as disclosed herein can attend to statistics of different granularity, but their limited capacities act as an information bottleneck, leading to automatic discovery of structures.

In some embodiments, an object recognition system includes one or more processors and a memory storing computer-readable instructions that cause the one or more processors to provide, to an image recognition model, an input image; generate, by the image recognition model, a plurality of patches of the image using a plurality of corresponding glimpses of the image; extract, from each patch, a plurality of features of the patch; and predict, by the image recognition model, an object represented by the input image based on a sequence of the glimpses and each corresponding feature of each glimpse.

This summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the devices and/or processes described herein, will become apparent in the detailed description set forth herein, taken in conjunction with the accompanying figures, wherein like reference numerals refer to like elements.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an object recognition system, according to an embodiment of the present disclosure.

FIGS. 2A-2B are schematic diagrams of computation of a binary tree including per-node computations (FIG. 2A) and an overall tree structure and prediction (FIG. 2B) , according to an embodiment of the present disclosure.

FIG. 3 illustrates charts of training a model for incremental and adaptive object recognition using a single digit with structured background noise dataset, according to an embodiment of the present disclosure.

FIG. 4 illustrates training curves of the training ofFIG. 3.

FIG. 5 illustrates charts of training a model for incremental and adaptive object recognition using a MNIST-Lego dataset, according to an embodiment of the present disclosure.

FIG. 6 illustrates training curves of the training ofFIG. 5.

FIGS. 7A-7B illustrates examples of bounding box visualizes of testing a model for incremental and adaptive object recognition using a CIFAR-10 dataset.

FIGS. 8A-8C illustrates examples of testing a model for incremental and adaptive object recognition using a CUB-200-2011 dataset.

DETAILED DESCRIPTION

Before turning to the figures, which illustrate the exemplary embodiments in detail, it should be understood that the present disclosure is not limited to the details or methodology set forth in the description or illustrated in the figures. It should also be understood that the terminology used herein is for the purpose of description only and should not be regarded as limiting.

The present solution can improve upon existing systems, including existing multi-glimpse systems, by fully exploring the graphical hierarchy of visual features, deriving a clean and effective architecture that is also fast in speed, and can automatically discover structures of the data by a constant capacity feature extractor that acts as an information bottleneck.

Existing systems, including existing one-pass models, may be operate in a manner contrary to everyday experience. The human visual cortex employs a foveation mechanism, takes a series of glimpses, engages the resolution of “where” and “what” simultaneously, and is robust against adversarial examples. The serial nature of the process does not necessarily mean that it is slower: processing 10 glimpses each with a 10-layer CNN is equivalent to a one-pass processing through a 100-layer CNN. Furthermore, since a glimpse restricts feature extraction at local region of the input canvas, possibly down-sampled for a coarse representation if a larger field is required, it can be substantially cheaper. Some existing systems attempt to add skip connections to address performance issues; however, such attempts may not allow for networks to go deeper.

The present solution performs object recognition by learning a parameterized model p _θ (y|x) to map an input image

to a probability distribution y～ [0, 1] ^|y|. Here, H and W are image size in pixels, θ is the model parameters, and |y| is number of candidate classes. In some embodiments, the class with the largest probability is taken (e.g.,

) if only the most probable class is needed. If

describes the difference between ground truth y ^* and class prediction

then the learning objective of the model can be to find the model parametersθ that minimize the expected loss

Existing models typically transform the image through many layers, which typically consist of linear or non-linear functions such as convolution, max-pooling, and/or batch- normalization. Each layer can extract richer features with larger receptive fields as it goes deeper. Finally, the features are mapped into a vector of scores for the candidate classes.

As such, existing models may lack adaptivity, efficiency, and robustness. For example, existing models may lack adaptivity due to having a fixed cost rather than adapting its computation cost against complexity of the data and/or resolution of the image. Efficiency may be related to adaptivity; some context information is useful in identifying the object, but computation over regions of unrelated background context can be wasteful. With respect to robustness, an assumption of typical systems is that the objects lie in a manifold distribution with an inherently much lower dimension, despite the fact that the input dimension is much higher (e.g., 3-by-1024-by-1024, roughly 3 million) . By consuming all pixels, the model approximates an manifold distribution that is much more brittle than focusing only on the most information-carrying regions.

The present solution can improve over such existing models by incorporating design principles based on how the human visual cortex works. The entire human vision system can be adaptive in that a series of glimpses, the receptive fields of which are limited, can be induced from the input. Each glimpse can have a fixed computational cost, and can choose to either extract features of a down-sampled and large region to build context, or zoom in to focus on smaller regions for finer details. By computing at where matters more, the present solution simultaneously more efficient and robust.

The present solution can generate an understanding of an image, such as by recognition what features are located where, by inducing a dynamic graph of glimpses. The present solution can be bio-inspired but does not constrain itself to be bio-plausible. The present solution can decompose each glimpse into finding where to look next and what to extract. The present solution can leverage existing models by taking their feature extraction pipelines as off-the-shelf components that we can freely plug-in.

In some embodiments, the present solution does not require location labels which are expensive to obtain. The present solution can require only images of single object with correct label only. The present solution can identify multiple objects essentially by layering another graph atop, with branches that focus on one object only.

Referring now to FIG. 1, an object recognition system 100 in accordance with the present disclosure can include a processing circuit. The processing circuit 104 can be used to execute various functions described herein, including training and operating models for incremental and adaptive object recognition using hierarchical representations.

The processing circuit 104 includes a processor 108 and memory 112. The processor 108 may be implemented as a specific purpose processor, an application specific integrated circuit ( “ASIC” ) , one or more field programmable gate arrays ( “FPGAs” ) , a group of processing components, or other suitable electronic processing components. The memory 112 is one or more devices (e.g., RAM, ROM, flash memory, hard disk storage) for storing data and computer code for completing and facilitating the various user or client processes, layers, and modules described in the present disclosure. The memory 112 may be or include volatile memory or non-volatile memory and may include database components, object code components, script components, or any other type of information structure for supporting the various activities and information structures of the inventive concepts disclosed herein. The memory 112 is communicably connected to the processor 108 and includes computer code or instruction modules for executing one or more processes described herein. The memory 112 includes various circuits, software engines, and/or modules that cause the processor 108 to execute the systems and methods described herein.

The object recognition system 100 includes a model 116 and a machine learning engine 120. The model 116 can be a model for incremental and adaptive object recognition using hierarchical representations. The machine learning engine 120 can train the model 116, such as by providing inputs of training data to the model 116, executing the model 116 using the inputs, comparing an output of the model 116 to outputs of the training data, and modifying the model 116 based on the comparison.

The model 116 can execute a prediction process as a sequential decision process. At each step, the model 116 can observe a patch of the image

with the glimpse g _t, which is a small set of dynamically generated parameters (e.g., four in the case of a bounding box) . A feature extractor (e.g., a tower of CNN) can extract the associated features

The feature extractor may have different parameters at different time step, but has a constant capacity irrespective of the glimpse parameters. Therefore, a large patch may need to be pooled (e.g., downsized) before feeding into the extractor. The sequence of the glimpses and the associated features

allows the model to make a prediction sequence. The frequencies and boundaries of the predictions of the model 116 can be freely determined.

The model 116 can have dependency assumptions for glimpse parameters generation; the next glimpse parametersg _i+1 may only condition on a subset of the historical glimpses and glimpse parameters. Such dependencies can be drawn as a directed graph

where each node u corresponds to a glimpse. This can enable the model 116 to induce hierarchical representations, each processed with a “bottleneck” feature extractor.

can be a tree.

With respect to topology, in some embodiments, the tree is K-ary with depth L. A glimpse of tree may be a natural representation of an object, where deeper nodes focusing increasingly on more local regions and finer details, other topologies are possible. Likewise, K and L can be set as hyper-parameters, or can be random variables that are stochastically determined.

With respect to traversal and generation order, in some embodiments, a top-down model can be considered, and all K children generated at once from a given node (unless it is a leaf node) . Bottom-up or sequential generation of the nodes can be used.

With respect to deterministic locations, in some embodiments, a parent node needs to decide the locations of its children’s glimpses. This decision can be stochastic (e.g., sampled from a predicted parameterized distribution) , or deterministic. The latter can be considered as a stochastic decision with zero variance. While deterministic decisions allows a fully-differentiable model to be trained with stochastic gradient descent, stochastic models can actively explore different strategies and potentially utilize non-differentiable policies. The model in this paper uses deterministic policy.

With respect to prediction sequence: at any t, all past history of glimpses can be gathered to make a prediction. Among numerous options, levelwise-prediction was chosen: a new prediction is made only when a level of the tree has grown.

Referring to FIGS. 2A-2B, computation of a full binary tree with depth three is shown. FIG. 2A shows how the node representations and the glimpse parameters of its children are computed, given its own glimpse parameters and the image x. FIG. 2B shows how levelwise-prediction is computed by averaging the node representations of all levels generated so far.

The model 116 can incorporate features analogous to the cognitive process of the human visual system. The model 116 can induce a dynamic tree of glimpses from a still image; the root has a glimpse over the entire canvas with coarse resolution, whereas later glimpses behave the opposite. Every patch can be pooled to feed into a CNN tower of a fixed receptive field and constant capacity. By doing so, top-level patches can examine the larger contexts, and lower-level patches attend to details, which can enable the “where” and “what” pathways to be integrated.

To emulate the “where” and “what” behavior, an approach may be that each node first computes its glimpse and then extracts the associated features. However, given that the root node has the default glimpse of the entire canvas already, the present solution executes a different decomposition. A node u takes two inputs, the image x and the glimpse g _u to look at. Its task is three-folds. First, the node extracts the visual features of the patch, by pooling the patch and feeding into a CNN tower with a n×n receptive size. The next two actions can happen in parallel: computing the glimpses

of the children, unless the node is itself a leaf node, and forming the hidden representation h as another output of the node. Per node computational flow and a cartoon example of a two-level tree are shown as discussed above in FIGS. 2A-2B.

In some embodiments, a glimpse Γ (g _u, x) is a function which takes in a set of glimpse parameters g _u, as well as the original imagex, and returns a patch of the image x _u within the reception field parameterized by g _u; the patch has a constant size n×n. A glimpse may be implemented in various ways. In some embodiments, a glimpse is implemented by adopting a bank of Gaussian kernels applied over the image into a fixed-size output patch.

Where

includes six parameters corresponding to the center, the stride and the standard deviation along the x and y dimension, respectively. That is, the model 116 can generate two filterbanks

and

on x-and y-axis respectively. The filterbanks can be normalied so that the rows sum up to one, and the filters are applied with matrix multiplication.

The visual feature of the patch

is obtained from the function

such as by using a CNN as discussed further below.

The final hidden representation h _u of node u can then be a function of

the glimpse parameters g _u, and the parent node representation h _Par (u), computed with the function f _H (·) . Various such functions are discussed below.

The glimpse parameters for children

can be a function of g _u, h _u. In some embodiments, the glimpse parameter prediction can be treated as sequential modeling, such that the glimpse parameters of a specific child

can depend on the glimpse parameters of all previously computed children

In some embodiments, the node u emits the representations h _u and the glimpse parameters

for all K children v ₁, ..., v _K, given the image x and its own glimpse parameters g _u:x _u = Γ (g _u, x) (1)

where f. are functions parameterized by a (learnable) neural network.

After obtaining the hidden representations for all nodes, the readout stage can be executed, which can take in all the hidden representations into account and produces the final task-specific output.

Choices of feature extractor

Various CNNs can be used as a feature extractor. As compared to existing systems, which may employ a global average pooling layer which reduces any feature map size to 1x1 (which may cause loss of spatial information) , the present solution can replace the last layer with an adaptive average pooling to a predefined size (e.g., 2x2) . In addition to the option of having a single CNN producing a set of feature maps, the present solution can use two separate CNNs and obtain two feature map sets having different responsibilities

When predicting the children’s glimpse parameters, only the first set

may be used. The second set

and optionally the first set

can take part in the readout stage for downstream tasks.

Another option is to induce

from

using a simple transformation, by normalizing each channel:

The rational being that

contains enough information, but needs to normalize actions across features of different images in order to make the parameterization stable. We found this to be effective on real-world images.

Choices of hidden representation extractor f _H

The present solution can address various factors to select the hidden representation extractor, including the form of h _u, and how dependent it is with its parent hidden representation h _Par (u).

In some embodiments, h _u needs to integrate the image patch feature

the glimpse parameters g _u, and optionally the hidden representation of parent h _Par (u). In some embodiments, h _u can be vector-based. For example,

can be flattened, into a 1-dimensional vector, concatenated with g _u, and projected non-linearly into h _u. In some embodiments, non-linear projections of

and g _u can be added before the concatenation. As such, the model 116 to learn the spatial association of the glimpsed feature and the location. In some embodiments, g _u is a very small set of parameters and the model 116 may have difficulty mapping it to a proper association with

In some embodiments, the Gaussian kernels can be reversed and mapped to a 2-D zero-initialized matrix where bits corresponding to g _u are turned on.

In some embodiments, h _u can be feature map-based. For example, the features can be explicitly embedded spatially, such as starting with an empty canvas in feature-map space (e.g., a 3D tensor of zeros) , finding the relative location of the patch using g _u and setting them with

In addition to removing the need of learning the spatial association, such embodiments can allow the hidden representations from different nodes to be integrated similar to alpha-channel composite in photo editing.

In various embodiments, the dependencies of h _u with its parent can be either weak or strong. In an isolated approach, h _Par (u) is essentially ignored, and the only dependency to the parent is the glimpse parameters. In a message-passing approach, the parents can influence the children more strongly, such as by when computing h _u, including h _Par (u) in the nonlinear projection, which can either be an MLP, or a recurrent network such as GRU or LSTM taking h _u as recurrent state.

Choices of predictor for next glimpses

The predictor could be an MLP that generates all

simultaneously.

The predictor could be a recurrent network that generates

one by one.

h ₀ = MLP (h _u)

g′ ₀ = (0, 0, 0, 0, 0, 0)

h _i = RNNCell (Δg′ _i-1, h _i-1)

g′ _i = MLP (h _i)

Each

is expressed by a relative translation and scaling to the current glimpse parameters g _u (taking MLP as an example) :

Choices of readout stage

In some embodiments, level-wise image classification is performed. The sequence h _＜=t = (h ₁, ..., h _t) can include visual features of various patches up to time t. To guarantee that progresses will be made, some or all of the h _＜=t, t∈ [1, N] can drive predictions and subject them to classification losses. An additional benefit is at test time, the model can stop predicting when the prediction rises above certain confidence level, thereby exhibiting adaptive behaviour. The model 116 can use use levelwise-prediction, predicting once per tree level. In total, the same number of predictions as the tree depth can be made.

y ^(l)= f _pred (Summary (h ₁, ..., h _m) )

where

is the number of nodes at tree level 1 (e.g., root) to l.

Various summary functions can be used depending on whether the representations are vector-based or feature-map-based. In vector-based embodiments, the summary function includes an average of all h _u:

In vector-based with global attention embodiments, the summary function includes an average of all h _u:

The attention weight can be learned. In some embodiments, the contribution of a glimpse can be weighed according to how detailed it is (e.g., α _i ∝ Area (g _i) ^-1) . A normalization constraint can be imposed on the attention weight so that ∑ _iα _i = 1.

In vector-based with levelwise attention and max-pooling, the summary is computed by first computing a weight for the nodes in each level. Then we compute a weighted average of hidden representations for each level, and perform a max-pooling across the levels.

Mathematically, we define a scalar γ _v for each node v as follows:

· The root of the tree always has a γ _root = 1.

· For all children of u, γ _v = γ _usoftmax (MLP (h _v) ) .

· The readout vector for a single tree level

is

·

In feature-map-based embodiments, all hidden representations can have set their features at the corresponding positions on a feature-space canvas to be painted together. Thus, for each

can be overlaid on top of the canvas at the location dictated by g _u sequentially:

η ₀ = 0

Summary (h ₁, ..., h _m) = η _m

The mathematical formulation of function Overlay (·, ·, ·) is described further below. In some embodiments, the process includes as alpha compositing, such as to compose a photo with segments of different toplogies by manipulating a sequence of opacity maps.

As the model 116 builds more nodes and levels, it can accumulate more features with finer details. As such, the model 116 can spend more computational resources to extract finer details progressively. The behavior of diminishing returns is by design, and the underpinning of the adaptive nature of the model 116.

Parameter sharing

The decomposition into tree nodes and the isomorphic sub-tasks within a node yields a variety of ways to construct model parameters from are sharing across all nodes or not to share at all. It has been found that it may be crucial to have different CNN parameters at different levels, as they can deal with features at different scale and resolutions. In some embodiments, these parameters can be simple transformation from a base set.

Therefore, the model 116 can share the same

at the same level, while having different f’s across levels, meaning that Equations (1) - (4) becomes the following for node u at level l:

x _u = Γ (g _u, x) (5)

The same f _pred can be used for all levelwise predictions.

Learning

Loss components

Aside of the cross-entropy loss for prediction error, a number of regularizations can be used to restrict model 116 behavior, as follows:

A prediction loss can ensure all predictions are relevant:

A sibling glimpse overlap penalty can encourage children glimpses to look at different regions:

A parent-children containment penalty can encourage children to stay within the parent’s region so as to induce the “zoom-in” effect:

A resolution penalty can force the glimpses on the last level to have the resolution as close to the original image as possible.

can be used to regularize the parameters of the entire model; the L2 weight decay and also optionally dropout in f _pred can be used.

Each of these regularization has its own coefficient, leading to the final loss of the following form:

The model 116 can be trained by minimizing

where D ^*is the training set. The loss is end-to-end differentiable, and the model 116 can be trained with gradient descent. Note that the actual coefficients depend on the model variant being used.

Controlling the gradient propagation

The entire model can be trained end-to-end. However, the flow of the gradients can be stopped for various embodiments, including to implement monotonic prediction improvement and a single source of influence on glimpse. With respect to monotonic prediction improvement, the model 116 can make forward progress at any prediction boundary, in a greedy fashion. A parent prediction should not sacrifice itself so that its children glimpses make better predictions. This is important since at test time the model might quit at any level (e.g., for energy saving) . Thus, to emulate this behavior, when the summary function is computed at leveli, error gradient may not flow back to earlier levels.

With respect to single source of influence on glimpse, h _u can encode both g _u and

and

is dependent on g _u. The gradient can be stopped from flowing through h _u to g _u, thus forcing g _u to optimize

for the purpose of getting the maximum information. As such, “what” can precede “where. ”

Computation cost analysis

Suppose that a given CNN would take C FLOPS to compute over an M×M image (considering only a single channel (e.g., greyscale) ) . If the same CNN is used as the feature extractor

since the CNN computes over a glimpse whose resolution M _g×M _g is usually much lower than the original image, the CNN would take

FLOPS, wherein where N is the number of nodes and M is the ratio between image size and glimpse size. For example, residual networks often operate on 224 × 224 images on ImageNet dataset, so a glimpse size of 50 x 50 and a binary tree with depth-3 would take

of the original computation over the entire image considering the CNN alone.

The analysis above did not count the FLOPS of glimpses. The glimpse computation can include matrix multiplication among a filter bank with size M _g×M, the image sized M×M, and another filter bank M × M _g. This can take about

FLOPS counting both multiplications and additions, which can usually be significantly less than C. The glimpse computation can include Gaussian filter bank generation, which can include elementwise computations of M _g×M matrices, which may also be much less than C. For numerical stability, a small constant∈ can be added on both the numerator and denominator for the two fractions in our implementation. The same goes for the Parent-children containment penalty below.

From the discussion above, the number of FLOPS to execute the model 116 should be much less than that from the CNN over the entire image.

As discussed herein, the present solution can learn from the human visual cortex and adopt its design principles instead of implementations. The present solution can mimic how the human fovea system works by looking at the most information-carrying regions progressively, deriving a hierarchy of glimpses with increasingly finer details, and process them with a constant cost. It can adapt to data complexity and trade-off resource consumption and performance on-demand.

The present solution can be complementary to all existing CNN models in the sense that they can be used as an off-the-shelf plug-in module to extract features. The model 116 may not be deep, but rather adjust to scales and resolutions adaptively and dynamically, by zooming in (or out) at different nodes. As such, the model 116 can be decoupled with input image’s size and resolution. The model 116 can predict not only the class but also a progressively refined location map of the object. As such, the model 1166 can inherently integrate “what” and “where. ” The model 116 can be adaptive, making predictions with increasingly higher accuracy. This trade-off can be important when the model is under time or resource pressure, as it can quit earlier or extend the search of new evidences as needed. The model 116 can operate such that per-step visual feature extraction is constant. This allows the system 100 to control both computational resources and memory footprint. In some embodiments, this improvement of efficiency can be analogous to imposing sparsity at a higher granularity (e.g., regions of input) , rather than doing so at neuron or weight level. In some embodiments, the constant capacity of the feature extractor can force the discovery of structures within the data. By progressively focusing on the most information-carrying regions, the model 116 can ignore irrelevant background noises and therefore be less sensitive to adversarial attacks. The model 116 can rely on images containing one object, without requiring bounding box labels which are both error-prone and expensive to acquire, and can therefore be easier to improve.

As compared to one-pass models, the model 116 can be similar to neuroscience findings. For example, the model 116 can be analogous to a visual cortex that is not deep, but instead has abundant top-to-bottom and lateral connections. During the process, steps to resolve “where” and “what” are tightly intertwined, instead of separate.

The model 116 can fully explore the hierarchical and graphical view of feature extractions. Instead of relying on a serial recurrent network to compute subsequent glimpses, the model 116 can explicitly model the parent-to-children relationship in a top-down, coarse-to-finer manner. Predictions can be made by reading out the summary from a subset of graph nodes. Organizing glimpses into an hierarchy can significantly improve speed: for N glimpses the processing time is O (log N) instead of O (N) . Working out a hierarchy of features with an extractor with constant capacity also can lead to discovery of structures.

In some embodiments, each level adopts a different set of parameters. This can enable each feature extractor, while sharing the same architecture, to focus on capturing different scales. The present solution can integrate predicting glimpses of different sizes and shapes, applying Gaussian kernels to smoothly pool into a fixed input for feature extraction, and using deterministic actions instead of stochastic ones.

Single digit with structured background noises

An embodiment of the model 116 was evaluated with MNIST-cluttered dataset where a digit is placed at a random location of a n×n canvas, spread with k random 8×8 subpatches from other random MNIST digits.

The model 116 results were compared against RAM and DRAW with the same setting: for experiments on n=60, we set the k to 4, and for experiments on n=100, we set k to 8. The sizes of training, valid and test set are set to 50000, 10000 and 10000 respectively. The CNN the model 116 is a simple 5-layer with 12-by-12 input size.

Two different structures were selected, both are with 3 levels, in which one is a chain having 3 glimpses and the other is a full binary tree having 7 glimpses. The model with vector-based with levelwise attention and max-pooling was the readout module, and sequential module as the choice of predictor for the next glimpses, performed the best. For that model, sibling glimpse overlap penalty was disabled (i.e., λ _cc = 0) as well as parent-children containment penalty λ _pc was set to 0. The model 116 was optimized with RMSProp optimizer with learning rate 10 ^-4. Early stopping was adopted to avoid over-fitting.

The model 116 can also give levelwise predictions.

Table 1: Experiment results on Cluttered Translated MNIST task; shown with percentage of errors.

As shown in Table 1, the model 116 can outperform both RAM and DRAM, even with fewer number of glimpses. Furthermore, levelwise predictions of the model 116 can have progressively better accuracies.

Note that in the case of a single digit, having 3 glimpses can generalize better than having 7. We conjecture that with a fixed tree structure, the model is unable to selectively discard individual nodes or tree layers. The model will instead try to utilize these redundant nodes and layers to fit the training set. The simplest way to do so is to focus on background noises for these nodes to memorize the training examples better, which will result in overfitting. We note that naive gating and attention mechanisms could not appropriately mitigate this issue, since the model is in theory still able to look into background noises. We leave the resolution of node and level redundancy to future work.

FIG. 3 presents several visualizations of 100 × 100 cases, with 2-level (1 branch per node, 3 glimpses in total) and 3-level (2 branches per node, 7 glimpses in total) on the left half and right half, receptively. Level-2 and level-3 glimpses are colored with yellow and red boxes. Each sample is paired with an image reconstructed by overlaying appropriate up-sampling contents in the glimpses; this image thus shows the accumulated result of all the glimpses. As shown in FIG. 3, the glimpses focus on the object while ignoring the background noises. FIG. 3 shows original image and glimpse boxes (3 levels, 3 glimpses) in the first column, glimpse content (3 levels, 3 glimpses) in the second column, original image and glimpse boxes (3 levels, 7 glimpses) in the third column, and glimpse content (3 levels, 7 glimpses) in the fourth column.

The training curve of the 3-level tree model with 3 glimpses on 100 × 100 Cluttered Translated MNIST dataset is shown in FIG. 4 (level 1: 62.99; level 2: 95.85; level 3: 97.07) . As training progresses, predictions from all three levels can rise in tandem.

MNIST-Lego

The previous experiment discussed above can measure the model 116’s ability to detect “where” and “what” (e.g., to both classify and localize) . To test the ability of the model 116 to leverage and extract hierarchical features, a new dataset was made where two digits float freely in the upper and bottom half of the image, each with variable size. This makes up a total of 100 classes; class ab means digits a and b are at the top and the bottom half, respectively.

Experiments were performed with background with no noise, random Gaussian noise with various intensity, and clutters as described above.

A 3-level full binary tree structure with glimpse size 12 × 12 was picked, and the model 116 trained on two datasets with different background noise types described above. A vector-based with levelwise attention and max-pooling was selected as the readout module, and sequential model was the choice of predictor for the next glimpses. All penalties were disabled by setting their coefficients to zero (i.e., λ _pc = 0, λ _cc = 0, λ _res = 0) . The model 116 was optimized with RMSProp optimizer and the learning rate set to 10 ^-4. Early stopping was adopted to avoid over-fitting. The model 116 achieved an accuracy of 96.24%and 96.20%, with cluttered and Gaussian noise, respectively.

Training curves are shown in FIG. 6 (level 1: 46.04; level 2: 94.23; level 3: 96.20) . By extracting feature hierarchy, predictions made by two-and three-level improve consistently over time. FIG. 5 presents several visualization cases on the MNIST Lego dataset (Left: cluttered noise, Right: Gaussian noise) . In both cases, substructures were shown to be automatically discovered. To fight against more structured noises, more levels may be needed to localize parts (in this case each digit is one part of the overall class) , even though the end performances do not differ much. As such, focusing on prediction alone may not tell the whole story.

Table 2: Classification result on CUB-200-2011 dataset. The CNN baselines resize the input image to a fixed size (e.g., 200x200) , whereas our model does so adaptively and dynamically, the 100x100 model thus takes roughly half of the computational cost to achieve the same level of performance.

Real Image

The model 116 was trained on the CIFAR-10 dataset. The dataset consists of 60,000 32 ×32 RGB images, 50,000 for training and 10,000 for testing. We chose the 3-level full binary tree with glimpse size 15 × 15. The object takes up a large portion of the image, and the glimpse size is large compared to that of the images, such that requiring the children to cover different image parts may not be effective. Disabled parent-children containment penalty was disabled as well as sibling glimpse overlap penalty (i.e., λ _pc = λ _cc = 0) as a result. f _φ was initialized using a pretrained ResNet-18.92%accuracy was achieved on the test set. Some of the examples are visualized in FIGS. 7A-7B. FIGS. 7A-7B show glimpse bounding box visualizations (FIG. 7A) in which the two yellow boxes correspond to the glimpses on the second level, while the four red ones are for the glimpses on the third (last) level; FIG. 7B shows glimpse content visualizations on the same image; this figure is obtained by applying the generated Gaussian filter banks as masks on the original image.

The model 116 was tested on fine-grained classification task as shown in FIGS. 8A-C using the CUB-200-2011 dataset. FIG. 8A shows glimpse bounding box of structure 1-1; FIG. 8B shows glimpse bounding box of structure 1-2; FIG. 8C shows glimpse bounding box of structure 1-1-1. This figure was obtained by applying the generated Gaussian filter banks as masks on the original image. The dataset consists of 11, 788 images of birds with 200 types. We resized all images to 500x500 for both training and evaluating. We chose several structures: 2-level (1-1) , 2-level (1-2) , 3-level (1-1-1) trees for this task. The f _φ module in our model was initialized with pretrained ResNet-50,

is channel-se normalized from

We again selected vector-based with levelwise attention and max-pooling as our readout module, and used sequential model as the choice of predictor for next glimpses. We achieved comparable results on test set compared to directly applying the same CNN on the image with less FLOPs, details are listed in Table 2.

The model 116 can incorporate static topology with fine-grained control. For example, in some embodiments, having redundant nodes and levels hurt generalization. Some of the future works include adaptively growing one node at a time during the training procedure by looking at the best validation metric, in similar spirit to, or imposing sparsity constraints on gate activations or attention masks.

The model 116 can incorporate dynamic topology. For example, in some embodiments, the branch factor K and the depth of the tree L are both hyper-parameters. In some embodiments, a flexible model shall zoom-in where needed, while stop growing when statistics of the current glimpse is already sufficient. However, such stochastic decisions can make the model hard to learn. The model 116 can have metrics established such that an overgrown tree can be truncated at test time without much loss of performance.

The model 116 can incorporate dynamic CNN parameters. Sharing feature extractor parameters within a level may not be very principled, since there is no guarantee that glimpses at a given level are inspecting the same level of details. This assumption may not be true across images, and parameters may be refined on-demand.

The model 116 can incorporate multiple iterations. The model 116 may offer a plausible explanation to why there are abundant top-down connection in our visual cortex. At least partially, it is a feedback signal to direct next focus. Lateral connection can be explained by introducing multiple iterations over the same graph. This way, node representation can become increasingly richer. This may be implemented using Message-Passing Neural Network ( “MPNN” ) . The model 116 can offer a new perspective to explain how brain works.

The model 116 can integrate with multi-object detection. The tree of the model 116 is for one object. To parse a scene with multiple objects, the model 116 can impose one more level on top. In some embodiments, since “objectiveness” is far easier to detect than classification, the model 116 can leverage the localization part of other models, such as YOLO, to propose the top-level branches. In some embodiments, the model 116 can predict a multiset to learn from scratch.

The model 116 can be extended to image sequences. The multiple-glimpses induced by the model 116 can map out the most salient points about an object. As such, the locations and features can be more informative than a single bounding box, as many the video tracking systems do.

Overlay function

A Gaussian glimpse function can have the following form:

The corresponding inverse Gaussian glimpse function can be defined as

Equivalently, the transpose of the Gaussian filter banks can be taken, normalized by column, and the resulting filters applied on the glimpse x _u to recover to the size of original canvas.

The formulation of overlaying a new C-channel image x _u on top of an old C-channel image x at the canvas location g _u can be given. Alpha-channel overlaying can be performed as follows:

Concatenating with the new image x _u another feature map filled with 1’s:

The additional channel of 1’s can be referred to as an alpha channel.

Compute the feature map to overlay:

where c = 1, ..., C+1.

Separate the alpha channel:

The overlay is obtained by

Definitions

As utilized herein, the terms “approximately, ” “about, ” “substantially, ” and similar terms are intended to have a broad meaning in harmony with the common and accepted usage by those of ordinary skill in the art to which the subject matter of this disclosure pertains. It should be understood by those of skill in the art who review this disclosure that these terms are intended to allow a description of certain features described and claimed without restricting the scope of these features to the precise numerical ranges provided. Accordingly, these terms should be interpreted as indicating that insubstantial or inconsequential modifications or alterations of the subject matter described and claimed are considered to be within the scope of the disclosure as recited in the appended claims.

It should be noted that the term “exemplary” and variations thereof, as used herein to describe various embodiments, are intended to indicate that such embodiments are possible examples, representations, and/or illustrations of possible embodiments (and such terms are not intended to connote that such embodiments are necessarily extraordinary or superlative examples) .

The term “coupled, ” as used herein, means the joining of two members directly or indirectly to one another. Such joining may be stationary (e.g., permanent or fixed) or moveable (e.g., removable or releasable) . Such joining may be achieved with the two members coupled directly to each other, with the two members coupled to each other using a separate intervening member and any additional intermediate members coupled with one another, or with the two members coupled to each other using an intervening member that is integrally formed as a single unitary body with one of the two members. Such members may be coupled mechanically, electrically, and/or fluidly.

The term “or, ” as used herein, is used in its inclusive sense (and not in its exclusive sense) so that when used to connect a list of elements, the term “or” means one, some, or all of the elements in the list. Conjunctive language such as the phrase “at least one of X, Y, and Z, ” unless specifically stated otherwise, is understood to convey that an element may be either X, Y, Z; X and Y; X and Z; Y and Z; or X, Y, and Z (i.e., any combination of X, Y, and Z) . Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of X, at least one of Y, and at least one of Z to each be present, unless otherwise indicated.

References herein to the positions of elements (e.g., “top, ” “bottom, ” “above, ” “below, ” etc. ) are merely used to describe the orientation of various elements in the FIGURES. It should be noted that the orientation of various elements may differ according to other exemplary embodiments, and that such variations are intended to be encompassed by the present disclosure.

The hardware and data processing components used to implement the various processes, operations, illustrative logics, logical blocks, modules and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose single-or multi-chip processor, a digital signal processor ( “DSP” ) , an ASIC, an FPGA, or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, or, any conventional processor, controller, microcontroller, or state machine. A processor also may be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In some embodiments, particular processes and methods may be performed by circuitry that is specific to a given function. The memory (e.g., memory, memory unit, storage device, etc. ) may include one or more devices (e.g., RAM, ROM, Flash memory, hard disk storage, etc. ) for storing data and/or computer code for completing or facilitating the various processes, layers and modules described in the present disclosure. The memory may be or include volatile memory or non-volatile memory, and may include database components, object code components, script components, or any other type of information structure for supporting the various activities and information structures described in the present disclosure. According to an exemplary embodiment, the memory is communicably connected to the processor via a processing circuit and includes computer code for executing (e.g., by the processing circuit and/or the processor ) the one or more processes described herein.

The present disclosure contemplates methods, systems and program products on any machine-readable media for accomplishing various operations. The embodiments of the present disclosure may be implemented using existing computer processors, or by a special purpose computer processor for an appropriate system, incorporated for this or another purpose, or by a hardwired system. Embodiments within the scope of the present disclosure include program products comprising machine-readable media for carrying or having machine-executable instructions or data structures stored thereon. Such machine-readable media can be any available media that can be accessed by a general purpose or special purpose computer or other machine with a processor. By way of example, such machine-readable media can comprise RAM, ROM, EPROM, EEPROM, or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer or other machine with a processor. Combinations of the above are also included within the scope of machine-readable media. Machine-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing machines to perform a certain function or group of functions.

Although the figures and description may illustrate a specific order of method steps, the order of such steps may differ from what is depicted and described, unless specified differently above. Also, two or more steps may be performed concurrently or with partial concurrence, unless specified differently above. Such variation may depend, for example, on the software and hardware systems chosen and on designer choice. All such variations are within the scope of the disclosure. Likewise, software implementations of the described methods could be accomplished with standard programming techniques with rule-based logic and other logic to accomplish the various connection steps, processing steps, comparison steps, and decision steps.

Claims

An object recognition system, comprising:

one or more processors; and

a memory storing computer-readable instructions that when executed by the one or more processors, cause the one or more processors to:

provide, to an image recognition model, an input image;

generate, by the image recognition model, a plurality of patches of the image using a plurality of corresponding glimpses of the image;

extract, from each patch, a plurality of features of the patch; and

predict, by the image recognition model, an object represented by the input image based on a sequence of the glimpses and each corresponding feature of each glimpse.
The object recognition system of claim 1, comprising instructions that cause the one or more processors to train the image recognition model by evaluating an expected loss function based on the predicted object and a ground truth corresponding to the input image.
The object recognition system of claim 1, comprising instructions that cause the one or more processors to extract the plurality of features using a convolutional neural network (CNN) .
The object recognition system of claim 3, wherein the CNN has a fixed receptive field and constant capacity.
The object recognition system of claim 1, comprising instructions that cause the one or more processors to generate a plurality of dependencies corresponding to glimpses at different time steps as a directed graph comprising a plurality of nodes, each node corresponding to a respective glimpse of the plurality of glimpses.
The object recognition system of claim 1, comprising instructions that cause the one or more processors to predict the object using levelwise prediction.
The object recognition system of claim 1, comprising instructions that cause the one or more processors to generate a tree of the plurality of glimpses, a root node of the tree having a glimpse over the entire image.
The object recognition system of claim 1, comprising instructions that cause the one or more processors to generate each glimpse using a Gaussian kernel.
The object recognition system of claim 1, comprising instructions that cause the one or more processors to generate a plurality of predictions of the object at a plurality of time steps, each prediction using data of each previous time step.
The object recognition system of claim 1, comprising instructions that cause the one or more processors to use a recurrent network to predict the object.