NZ786061A - Structure learning in convolutional neural networks - Google Patents

Structure learning in convolutional neural networks

Info

Publication number
NZ786061A
NZ786061A
Authority
NZ
New Zealand
Prior art keywords
layer
layers
neural network
extraneous
network
Prior art date
Application number
NZ786061A
Inventor
Douglas Bertram Lee
Tomasz J Malisiewicz
Andrew Rabinovich
Vijay Badrinarayanan
Daniel DeTone
Srivignesh Rajendran
Original Assignee
Magic Leap Inc
Filing date
Publication date
Application filed by Magic Leap Inc filed Critical Magic Leap Inc
Publication of NZ786061A publication Critical patent/NZ786061A/en


Abstract

Disclosed is a method implemented with a processor comprising creating a neural network comprising a plurality of layers. The neural network performs a processing task to generate an output. An extraneous layer is identified from the plurality of layers, and the extraneous layer is removed from the neural network. The method includes identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified. The neural network undergoes both vertical splitting and horizontal splitting.

Description

STRUCTURE LEARNING IN CONVOLUTIONAL NEURAL NETWORKS FIELD OF THE INVENTION This disclosure pertains to computing networks, and more particularly to neural networks configured to learn hierarchical representations from data.
BACKGROUND Neural networks pertain to computational approaches that are loosely modeled after the neural structures of biological brain processing and that can be used for solving complex computational problems. Neural networks are normally organized as a set of layers, where each layer includes a set of interconnected nodes that include various functions. Weighted connections implement functions that are processed within the network to perform various analytical operations. Learning processes may be employed to construct and modify the networks and the associated weights for the connectors within the network. By modifying the connector weights, this allows the network to learn over time from past analysis to improve future analysis results.
Neural networks may be employed to perform any appropriate type of data analysis, but are particularly suitable to be applied to complex analysis tasks such as pattern analysis and classification. Direct application of these techniques is therefore suitable, for example, to implement machine vision functions such as recognition and classification of specific objects and object classes from image data captured by digital imaging devices.
There are numerous types of neural networks that are known in the art. A deep neural network is a type of neural network where deep learning techniques are applied to implement a cascade of many layers of nonlinear processing to perform analytical functions.
Deep learning algorithms transform their inputs through more layers than shallow learning algorithms. At each layer, the signal is transformed by a processing unit, such as an artificial neuron, whose parameters are learned through training.
A convolutional neural network is a type of neural network where the connectivity pattern in the network is inspired by biological visual cortex functioning. Visual fields are constructed through the network, where the response of an individual artificial neuron to an input stimulus can be approximated mathematically by a convolution operation.
Convolutional deep neural networks have been implemented in the known art.
LeNet (LeCun et al., 1998), AlexNet (Krizhevsky et al., 2012), GoogLeNet (Szegedy et al., 2015), and VGGNet (Simonyan & Zisserman, 2015) are all examples of ConvNet architectures that implement different types of deep neural networks. These models are quite different (e.g., different depth, width, and activation functions). However, these models are all the same in one key respect – each one is a hand-designed structure which embodies the architects' insights about the problem at hand.
These networks follow a relatively straightforward recipe, starting with a convolutional layer that learns low-level features resembling Gabor filters or some representations thereof. The later layers encode higher-level features such as object parts (parts of faces, cars, and so on). Finally, at the top, there is a layer that returns a probability distribution over classes. While this approach provides some structure, in the label space, to the output that is produced by a trained network, the issue is that this structure is seldom utilized when these networks are designed and trained. Structure learning in probabilistic graphical models has been suggested, where the conventional algorithms for structure learning in deep convolutional networks typically fall into one of two categories: those that make the nets smaller, and those that make the nets better.
One suggested approach focuses on taking unwieldy pretrained networks and squeezing them into networks with a smaller memory footprint, thus requiring fewer computational resources.
This class of techniques follows the "teacher-student" paradigm, where the goal is to create a student network which mimics the teacher. This means that one needs to start with both an Oracle architecture and its learned weights – training the student only happens later. When distilling an ensemble of specialists on very large datasets, the computationally expensive ensemble training step must be performed first.
Feng et al., "Learning the Structure of Deep Convolutional Networks", is an example of a technique for automatically learning aspects of the structure of a deep model. This approach uses an Indian Buffet Process to propose a new convolutional neural network model to identify a structure, where after the structure is determined, pruning is performed to create a more compact representation of the network. However, one drawback with this approach is that the number of layers remains static, where it is only the known individual layers within the static number of layers that is adjusted to be more or less complex through the structure learning process. As such, this approach is unable to identify any new layers that may be needed to optimize the structure.
Therefore, there is a need for an improved approach to implement structure learning for convolutional neural networks.
Some embodiments of the invention are directed to an improved approach to implement structure learning for neural networks. The approach starts out with a network, provides the network with a problem having labeled data, and then reviews the structure of the output produced by this network. The network's architecture is then modified to obtain a better solution for the specific problem. Rather than having experts come up with highly complicated and domain-specific network architectures, this approach allows the data to drive the architecture of the network that will be used for a specific task.
According to some embodiments, a neural network can be improved by (a) identifying the information gain bottleneck in its structure, (b) applying the structure of the predictions to alleviate the bottleneck, and finally (c) determining the depth of specialist pathways.
Some embodiments implement structure learning of neural networks by exploiting correlations in the data/problem the networks aim to solve, where a greedy approach is performed to find bottlenecks of information gain from the bottom convolutional layers all the way to the fully connected layers. In some embodiments, a network is created at an initial point in time, and a set of outputs are generated from the network when applied to a designated task, e.g., to perform image recognition/object classification tasks. Next, the various layers within the network model are analyzed to identify the lowest performing layer within the model.
Additional structures are then injected into the model to improve the performance of the model.
In particular, new specialist layers are inserted into the model at the identified vertical position to augment the performance of the model. Rather than just having one general purpose pathway to perform classification for multiple types of objects, a first new specialist layer may be added just to address classification of a first type of object and a second new specialist layer may be added just to address classification of a second type of object. By taking this action, over time, each of these specialist components becomes highly knowledgeable about its dedicated area of expertise, since the specialist is forced to learn extensive levels of detail about the specific subdomain assigned to that specialist component. In this way, the model is improved by adding new layers that will directly address areas of classification that have been specifically identified as being sub-optimal compared to other parts of the network. This same process continues through the rest of the model to identify any additional layers that should be modified and/or augmented.
[0013A] In one aspect there is provided a method implemented with a processor, comprising: creating a neural network; generating output from the neural network; identifying a low performing layer from the neural network, the low performing layer having a relatively lower performance than a performance of another layer in the neural network; inserting a new specialist layer at the low performing layer; and repeating the act of identifying and the act of inserting until a top of the neural network is reached.
[0013B] In another aspect there is provided a system, comprising: a processor; a memory for holding programmable code; and wherein the programmable code includes instructions for creating a neural network; generating output from the neural network; identifying a low performing layer from the neural network, the low performing layer having a relatively lower performance than a performance of another layer in the neural network; inserting a new specialist layer at the low performing layer; and repeating the act of identifying and the act of inserting until a top of the neural network is reached.
[0013C] In another aspect there is provided a computer program product embodied on a non-transitory computer readable medium, the non-transitory computer readable medium having stored thereon a sequence of instructions which, when executed by a processor, causes the processor to execute a method comprising: creating a neural network; generating output from the neural network; identifying a low performing layer from the neural network, the low performing layer having a relatively lower performance than a performance of another layer in the neural network; inserting a new specialist layer at the low performing layer; and repeating the act of identifying and the act of inserting until a top of the neural network is reached.
[0013D] In another aspect there is provided a method implemented with a processor, comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers; removing the extraneous layer from the neural network; and identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified, wherein the neural network undergoes both vertical splitting and horizontal splitting.
[0013E] In another aspect there is provided a method implemented with a processor, comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers, wherein an all-or-nothing highway network is employed to identify the extraneous layer in the neural network to be removed; removing the extraneous layer from the neural network; identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified; and associating a penalty with use of the extraneous layer, wherein the penalty is set to 0 when the processor is part of a cloud computing platform.
[0013F] In another aspect there is provided a method implemented with a processor, comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers, wherein an all-or-nothing highway network is employed to identify the extraneous layer in the neural network to be removed; removing the extraneous layer from the neural network; identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified; and the all-or-nothing highway network introducing a mixing matrix to determine how to transform a skip connection corresponding to the extraneous layer.
[0013G] In another aspect there is provided a method implemented with a processor, comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers; removing the extraneous layer from the neural network; identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified; adding a plurality of loss layers to the neural network; and generating predictions at one of the loss layers, and converting the predictions to one or more confusion matrices forming a tensor T.
[0013H] In another aspect there is provided a method implemented with a processor, comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers; removing the extraneous layer from the neural network; identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified, wherein each layer of the neural network is addressed independently, and a given layer of the neural network undergoes splitting by performing a greedy choice to split the given layer which provides a best improvement on a training loss.
[0013I] In another aspect there is provided a system, comprising: a processor; a memory for holding programmable code; and wherein the programmable code includes instructions for creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers; removing the extraneous layer from the neural network; and identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified, wherein the neural network undergoes both vertical splitting and horizontal splitting.
[0013J] In another aspect there is provided a computer program product embodied on a non-transitory computer readable medium, the non-transitory computer readable medium having stored thereon a sequence of instructions which, when executed by a processor, causes the processor to execute a method comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers; removing the extraneous layer from the neural network; and identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified, wherein the neural network undergoes both vertical splitting and horizontal splitting.
In certain embodiments, a "loss" mechanism (e.g., a loss layer, a loss function, and/or cost function) is included at each layer of the network. Instead of just having a single top-level loss layer, additional loss layers are added to the other layers within the network, e.g., where a deep neural network has multiple loss layers at intermediate, and final, stages of feature extraction, where each loss layer measures the performance of the network up to that point in depth. Predictions can be generated at each loss layer and converted to the respective confusion matrices, forming a tensor T containing all confusion matrices for the network. By analyzing the structure of T and its elements, the aim is to modify and augment the existing structure of the network both in terms of depth and breadth. To maximize feature sharing and reduce computation on one hand, yet to increase accuracy on the other, the aim is to restructure the existing network's structure. To do so, the approach partitions the network's depth as well as breadth according to its current performance. Therefore, vertical splitting is performed in some embodiments, e.g., by computing the dot product between the different layers. To partition the architecture in depth, some embodiments compare the neighboring subspaces that correspond to the consecutive loss function evaluations at neighboring layers. In addition, horizontal splitting is performed, e.g., by performing K-way Bifurcation. To improve the performance of the network at a particular layer, its structure (e.g., fully convolutional) may require augmentation.
Parts of the network focus on general knowledge (generalist), while others concentrate on a small subset of labels that have high similarity among each other (specialist). Knowledge achieved by layer i will be used to perform the first horizontal partitioning of the network. The processing continues (e.g., in a recursive manner) until the top of the network is reached. At this point, the final model is stored into a computer readable medium.
Some embodiments pertain to the deep learning of the specialists. While the structure of the generalist is known to perform well on general knowledge, it is not guaranteed that this same structure will perform well in a specialist, where the task of the specialist may require a more simple or complex representation. Some embodiments allow the structure of each specialist to deviate from the structure of the generalist via depth-wise splitting, in a data-driven manner.
Additional variations of these techniques may be applied in alternate embodiments. For example, for every pair of splits (vertical or horizontal), a network can be retrained to get classification at a given pathway. Techniques can be applied in certain embodiments for speeding this up and/or avoiding it altogether, such as by agglomerative clustering and/or splitting. Further, given confusion matrix Ci and its partitioning K, agglomerative clustering may be performed on each of the K parts of Ci to estimate further splits. This leads to the cost Xu. Cost Xs is the cost of supervised grouping, learning new confusion matrices at high levels of the network. Xu is less than or equal to Xs + Tau, where Tau is the upper bound on the clustering error.
In some embodiments, variations are considered with respect to convolutional layers versus fully-connected (1x1 convolution) layers. If splitting is required among the convolutional layers (even fully convolutional layers, such as in the case of semantic segmentation), then instead of changing the linear size of the layer (fc in this case), the depth of the dimension may be changed to reflect the number of classes (this is the extension to FCN).
Further variations and embodiments may be produced using collapsing or adding of vertical layers per pathway, changing the size of a layer as a function of label space, and/or extension to detection and RNNs (unrolling in the same way by comparing confusions).
In yet another embodiment, techniques may be applied to identify when there may be too many layers in the network, such that fewer layers would be adequate for the required processing tasks. As noted above, one can reliably add depth to a network and see an improvement in performance given enough training data. However, this added boost in performance may come at a cost in terms of FLOPs and memory consumption. In some embodiments, the network is optimized with this tradeoff in mind with the usage of an all-or-nothing highway network, which learns whether or not a given layer of computation in the network is used via a binary decision. If a given computational block is used, a penalty is incurred. By varying this penalty term, one can customize the learning process with a target architecture in mind: an embedded system would prefer a much leaner architecture than a cloud-based system.
Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims. Both the foregoing general description and the following detailed description are exemplary and explanatory, and are not intended to be limiting as to the scope of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS The drawings illustrate the design and utility of various embodiments of the present invention. It should be noted that the figures are not drawn to scale and that elements of similar structures or functions are represented by like reference numerals throughout the figures.
In order to better appreciate how to obtain the above-recited and other advantages and objects of various embodiments of the invention, a more detailed description of the present inventions briefly described above will be rendered by reference to specific embodiments thereof, which are illustrated in the accompanying drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which: Fig. 1 illustrates an example system which may be employed in some embodiments of the invention to implement structure learning for neural networks.
Fig. 2 shows a flowchart of an approach to implement structure learning for neural networks according to some embodiments of the invention.
Fig. 3 illustrates a more detailed flowchart of an approach to implement structure learning for neural networks according to some embodiments.
Figs. 4A-4F illustrate various embodiments of the invention.
Figs. 5A-5B illustrate an approach to identify when there may be too many layers in the network.
Figs. 6A-6D illustrate general AR system component options for various embodiments.
Fig. 7 depicts a computerized system on which some embodiments of the invention can be implemented. DETAILED DESCRIPTION Some embodiments of the invention are directed to an improved approach to implement structure learning for neural networks. The approach starts out with a network, provides the network with a problem having labeled data, and then reviews the structure of the output produced by this network. The network's architecture is then modified to obtain a better solution for the specific problem. Rather than having experts come up with highly complicated and domain-specific network architectures, this approach allows the data to drive the architecture of the network that will be used for a specific task.
Fig. 1 illustrates an example system which may be employed in some embodiments of the invention to implement structure learning for neural networks. The system may include one or more users that interface with and operate a computing system 107 or 115 to control and/or interact with the system. The system comprises any type of computing station that may be used to operate, interface with, or implement a neural network computing device 107 or user computing device 115. Examples of such computing systems include, for example, servers, workstations, personal computers, or remote computing terminals connected to a networked or cloud-based computing platform. The computing system may comprise one or more input devices for the user to provide operational control over the activities of the system, such as a mouse or keyboard to manipulate a pointing object. The computing system may also be associated with a display device, such as a display monitor, for presenting control interfaces and/or analysis results to users of the computing system.
In some embodiments, the system is employed to implement computer vision functionality. As such, the system may include one or more image capture devices, such as camera 103, to capture image data 101 for one or more objects 105 in the environment at which the system operates. The image data 101 and/or any analysis results (e.g., classification output data 113) may be stored in one or more computer readable storage mediums. The computer readable storage medium includes any combination of hardware and/or software that allows for ready access to the data that is located at the computer readable storage medium. For example, the computer readable storage medium could be implemented as computer memory and/or hard drive storage operatively managed by an operating system, and/or remote storage in a networked storage device, such as networked attached storage (NAS), a storage area network (SAN), or cloud storage. The computer readable storage medium could also be implemented as an electronic database system having storage on persistent and/or non-persistent storage.
The neural network computing device 107 includes a structure learning module 109 to modify an original model 1 into an improved model n, where model n is the result of possibly multiple iterative processes to modify the layers within the model. The model n preferably includes a depth and breadth of knowledge, essentially a mixture of experts. The model should understand the difference between coarse categories, yet at the same time understand the difference between fine grained classes across various domains. New specialist layers 111 are added to the model as necessary to implement these goals. The design of such a system is governed by the constraint of adding resources solely where they are required. Simply expanding the network by making it arbitrarily deeper and wider does not scale due to computational constraints, and thus the present approach avoids the need for extra regularization tricks.
Fig. 2 shows a flowchart of an approach to implement structure learning for neural networks according to some embodiments of the invention. The present approach implements structure learning of neural networks by exploiting correlations in the data/problem the networks aim to solve. A greedy approach is described that finds bottlenecks of information gain from the bottom convolutional layers all the way to the fully connected layers. Rather than simply making the architecture deeper arbitrarily, additional computation and capacitance is only added where it is required.
At 131, a network is created at an initial point in time. Any suitable approach can be used to create the network. For example, conventional AlexNet or GoogLeNet approaches may be used to generate the network.
Next, at 133, a set of outputs are generated from the network when applied to a designated task, e.g., to perform image recognition/object classification tasks. For example, assume that a number of people and animals are within an environment, and the assigned task is to analyze the image data to classify the different people and types of animals that can be observed within the environment. Each layer of the model provides certain outputs for the activities performed within that layer. The output has certain structure to it which can be reviewed to ascertain relationships between classes in the classification problem being solved.
At 135, the various layers within the network model are analyzed to identify the lowest performing layer within the model. For example, assume a model having ten layers, where layers 1 through 3 and layers 5 through 10 each provide a 10% improvement in classification accuracy, but layer 4 only provides a 1% improvement. In this situation, layer 4 would be identified as the lowest performing layer.
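As a minimal illustration of this step (the accuracy figures below are made up to mirror the ten-layer example above, not measured values, and the helper names are assumptions), the lowest performing layer can be located by comparing per-layer accuracy gains:

import numpy as np

# Illustrative per-layer cumulative accuracies measured at each intermediate loss layer.
layer_accuracy = np.array([0.10, 0.20, 0.30, 0.31, 0.41, 0.51, 0.61, 0.71, 0.81, 0.91])

# Improvement contributed by each layer over the previous one
# (the first layer's gain is measured against a zero baseline).
gains = np.diff(layer_accuracy, prepend=0.0)

lowest = int(np.argmin(gains))   # index 3 here, i.e., layer 4 in the example above
print(f"lowest performing layer: {lowest + 1}, gain: {gains[lowest]:.2f}")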
Next, at 137, additional structures are injected into the model to improve the performance of the model. In particular, new specialist layers are inserted into the model at the identified vertical position to augment the performance of the model.
To explain this aspect of the inventive embodiment, assume that the model is intended to perform classifications of the people and animals in the environment as illustrated in Fig. 4A. Here, the image capture device captures images of different people (e.g., a woman 401, a man 403, and a child 405). In addition, the environment includes multiple animals (e.g., a cat 407, a dog 409, and a mouse 411). Further assume that the existing model is able to successfully distinguish the people (401, 403, 405) from the animals (407, 409, 411), but appears to have a more difficult time distinguishing the different people from each other or distinguishing the different types of animals from one another. If one reviews the actual structure that can be learned from a network (e.g., an Oracle network), it is clear that the network includes learning dependence between the predictions that it is making. However, in traditional deep-learning architectures, this is not utilized. If one looks even closer at this structure, it is evident that the system is learning concepts that are actually visually similar to one another. Referring to Fig. 4B, an example scatter plot in 3D of classes is shown to illustrate an example structure of predictions for a fully-trained AlexNet, clustered into multiple groups. The distance between points corresponds to the visual similarity between concepts. Here, it can be seen that there is a first tight clustering of the points relative to the people objects and a second tight clustering of points relative to the animal objects. It is this phenomenon that may contribute to difficulties in a model being able to distinguish one person from another or one animal from another.
In this situation, in some embodiments of the invention, rather than just having one general purpose pathway to perform classification for all of these types of objects, a first new specialist layer may be added just to address classification of people and a second new specialist layer may be added just to address classification of animals. One specialist (the people specialist layer) would therefore be assigned to handle data for portion 413 of the chart in Fig. 4B, while the second specialist (the animal specialist layer) would be assigned to handle data for portion 415 in Fig. 4B. By taking this action, over time, each of these specialist components becomes highly knowledgeable about its dedicated area of expertise, since the specialist is forced to learn extensive levels of detail about the specific subdomain assigned to that specialist component. In this way, the model is improved by adding new layers that will directly address areas of classification that have been specifically identified as being sub-optimal compared to other parts of the network.
This same process continues through the rest of the model to identify any additional layers that should be modified and/or augmented. Therefore, a determination is made at 139 whether the processing has reached the top of the network. If so, then the model is finalized at 141. If not, then the process returns to 133 to continue the process until the top of the network is reached.
This approach can be taken to modify and improve the architecture of any off-the-shelf convolutional neural network. By following the inventive approach of the present disclosure, any neural network can be improved by (a) identifying the information gain bottleneck in its structure, (b) applying the structure of the predictions to alleviate the bottleneck, and finally (c) determining the depth of specialist pathways.
Fig. 3 illustrates a more detailed flowchart of an approach to implement structure learning for neural networks according to some embodiments. For the purposes of this flow, assume that a network (e.g., a monolithic network) has already been created pursuant to any suitable approach such as AlexNet or GoogLeNet.
At 151, a "loss" mechanism (e.g., a loss layer, a loss function, and/or cost function) is included at each layer of the network. A loss ism corresponds to a function that maps an event or value to a entation of a cost or error value associated with processing within the neural network. As shown in Fig. 4C, instead of just having a single top-level loss layer 421, additional loss layers 423 are added to the other layer within the network. Therefore, this figure shows an example of a deep neural network with le loss layers at intermediate, and final, stages of feature extraction, where each loss layer measures the performance of the network up to that point in depth. Recall that the goal is to augment and modify the network architecture to solve a given problem as best as possible by modifying its architecture to best fit the task. Therefore, the approach analyzes the predictions, formed at the various loss layers hout the network, and groups neuron activations based on the confusion between them.
As illustrated in Figs. 4D and 4E, predictions are generated at each loss layer and converted to the respective confusion matrices (as shown in Fig. 4D), forming a tensor T containing all confusion matrices for the network, e.g., the Oracle network (as shown in Fig. 4E).
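The conversion from per-layer predictions to the confusion tensor T can be sketched as follows; this is a minimal example assuming the predictions at each loss layer are available as score arrays, and the helper names and row normalization are illustrative assumptions rather than the patent's implementation:

import numpy as np

def confusion_matrix(pred_labels, true_labels, num_classes):
    C = np.zeros((num_classes, num_classes), dtype=np.float64)
    for p, t in zip(pred_labels, true_labels):
        C[t, p] += 1.0
    # Row-normalize so each row is a distribution over predicted classes.
    C /= np.maximum(C.sum(axis=1, keepdims=True), 1.0)
    return C

def build_confusion_tensor(predictions_per_layer, labels, num_classes):
    # predictions_per_layer: list of (num_samples, num_classes) score arrays,
    # one per loss layer attached to the network.
    matrices = [confusion_matrix(np.argmax(p, axis=1), labels, num_classes)
                for p in predictions_per_layer]
    return np.stack(matrices)   # tensor T of shape (num_loss_layers, C, C)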
By analyzing the structure of T and its elements, the aim is to modify and augment the existing structure of the network both in terms of depth and breadth.
To explain, let Ci be the confusion matrix of classes at loss layer i. Then:
Ai = 1/2 (Ci + CiT) (1)
Di(m, m) = Σn Ai(m, n) (2)
Li = Di^(-1/2) Ai Di^(-1/2) (3)
where Ai is the affinity matrix at loss layer i, Di is the diagonal degree matrix, Li is the graph Laplacian, and Ĉi is a subspace spanned by the leading eigenvectors of the graph Laplacian of the affinity matrix produced by Ci. Consequently, the tensor T collects the subspaces Ĉ1, …, Ĉn computed at the n loss layers. To maximize feature sharing and reduce computation on one hand, yet to increase accuracy on the other, the aim is to restructure the existing network's structure. To do so, the approach partitions the network's depth as well as breadth according to its current performance. Therefore, at 153, vertical splitting is performed, e.g., by computing the dot product between the different layers. To partition the architecture in depth, some embodiments compare the neighboring subspaces that correspond to the consecutive loss function evaluations at neighboring layers using the following equation:
ɸ(i, i+1) = 1 − (1/NE) ||ĈiT Ĉi+1||F^2
Here, Ĉi and Ĉi+1 denote the approximate leading eigenvectors of the confusion matrices for loss functions at levels i and i + 1, and F denotes the Frobenius norm. Formally, Ĉi and Ĉi+1 represent NE-dimensional subspaces and ɸ(i, i+1) is the normalized complement angle between them. It is important to note that this measure ɸ only depends on the subspace spanned by the columns of Ĉi and Ĉi+1 and thus is invariant to rotations of the eigenvectors. Also, ɸ is constrained within [0, 1], with levels i and i + 1 deemed similar in structure if ɸ(i, i+1) is close to zero, and ɸ is nearly 1 when Ĉi and Ĉi+1 are orthogonal. To construct a complete similarity relation between levels of scale space, all neighboring pairs of loss layers are compared using ɸ. With the established similarity relations it is now possible to address the problem of partitioning the monolithic network architecture.
Let ϕ be the vector of all sequential pairs of i and i+1, where ϕi = ɸ (i, i+1).
Values of ϕi closest to zero indicate the lowest information gain between layers i and i+1. Thus, argmin(ϕ) gives the optimal initial split of the monolithic architecture. Splitting the architecture in depth facilitates feature sharing while identifying points of redundancy (zero information gain).
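A hedged sketch of this vertical-split test is given below. It assumes the standard spectral constructions consistent with equations (1)-(3) above (symmetrized affinity, degree matrix, normalized graph Laplacian); the helper names and the choice of NE are illustrative, and the exact forms used by the patent may differ:

import numpy as np

def leading_subspace(C, num_eigvecs):
    A = 0.5 * (C + C.T)                          # affinity matrix Ai
    d = A.sum(axis=1)                            # degree (row sums)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = D_inv_sqrt @ A @ D_inv_sqrt              # graph Laplacian Li (normalized form)
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, -num_eigvecs:]             # Ĉi: leading eigenvectors

def subspace_angle(C_hat_i, C_hat_j):
    # Normalized complement angle in [0, 1]; 0 = same subspace, 1 = orthogonal.
    n_e = C_hat_i.shape[1]
    return 1.0 - (np.linalg.norm(C_hat_i.T @ C_hat_j, "fro") ** 2) / n_e

def best_vertical_split(T, num_eigvecs=3):
    # T: stacked confusion matrices, one per loss layer.
    subspaces = [leading_subspace(C, num_eigvecs) for C in T]
    phi = np.array([subspace_angle(subspaces[i], subspaces[i + 1])
                    for i in range(len(subspaces) - 1)])
    return int(np.argmin(phi)), phi              # split after layer argmin(phi)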
At 155, horizontal splitting is performed, e.g., by performing K-way Bifurcation.
To improve the performance of the network at a particular layer, its structure (e.g., fully convolutional) may require augmentation. Parts of the network focus on general knowledge (generalist), while others concentrate on a small subset of labels that have high similarity among each other (specialist). Knowledge achieved by layer i will be used to perform the first horizontal partitioning of the network.
Formally, given Ci, compute Li as per equations (1), (2), and (3) as disclosed above. An Eigengap is determined by analyzing the leading eigenvalues of the graph Laplacian Li to determine the number of new pathways (specialists). Spectral data is projected onto the top N leading eigenvectors of Li; in RN, the data is further clustered into k classes, where k equals the Eigengap. An example of such projection and grouping is illustrated in Fig. 4B. This procedure will lead to the modification of the architecture as shown in Fig. 4F, which illustrates a network 407 after the first split.
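The K-way bifurcation step can be sketched as follows, under the assumption that a standard eigengap heuristic and k-means clustering of the spectral embedding are acceptable stand-ins; the exact eigengap rule and clustering method used by the patent may differ:

import numpy as np
from sklearn.cluster import KMeans

def eigengap_k(L, max_k=10):
    eigvals = np.sort(np.linalg.eigvalsh(L))[::-1]   # leading eigenvalues, descending
    gaps = np.diff(eigvals[:max_k])                   # consecutive drops (non-positive)
    return int(np.argmax(-gaps)) + 1                  # k = position of the largest drop

def horizontal_split(C):
    A = 0.5 * (C + C.T)
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = D_inv_sqrt @ A @ D_inv_sqrt
    k = eigengap_k(L)
    eigvals, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, -k:]                       # project classes into R^k
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
    return k, labels                                  # class -> specialist pathway assignment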
Once the first split has been established, all new pathways are treated as the original network. The splitting procedure is applied until no more labels are left to split or 100% accuracy is achieved.
At 157, the above processing continues (e.g., in a recursive manner) until the top of the network is reached. At this point, the final model is stored into a computer readable medium.
This portion of the disclosure pertains to the deep learning of the specialists.
While the structure of the generalist is known to perform well on general knowledge, it is not guaranteed that this same structure will perform well in a specialist, where the task of the specialist may require a more simple or complex representation. Some embodiments allow the structure of each specialist to deviate from the structure of the generalist via depth-wise splitting, in a data-driven manner.
Let L = {L1, L2, …, Ln} be a set of fully-connected layers to be considered for further splitting. Consider a layer Li in L that produces an output y. One can write the transformation that it applies to its input as y = σ(f(x)), where σ( ) applies a non-linearity such as ReLU and f(x) = Wx, where W is a learned weight matrix of dimensions M x N and x is the input to this layer having dimensions N x 1. To perform a split, the approach decomposes the transformation of Li into y = σ1(g(σ2(h(x)))), where σ1( ) and σ2( ) are activation functions and g(x) = W1x and h(x) = W2x, in which W1 has dimensions M x N and W2 has dimensions N x N. The approach chooses: W1 = UΣ, W2 = VT, and σ2 = I (8). Here, W = UΣVT is the SVD factorization of W and I is the identity matrix. With this change, the transformation of layer Li is unchanged. To increase the complexity of the learned representation of Li, one could set σ2 as a non-linear activation function, such as ReLU.
However, adding this non-linearity causes an abrupt change in the learned representation of Li and may cause the network to restart much of its learning from scratch. Instead, one can insert a PReLU non-linearity and initialize its single parameter a to be 1, which is equivalent to I in equation 8. This provides the specialist with a smooth mechanism for introducing a new non-linearity at this layer.
Given the set of layers L, one can apply the above strategy to each layer Li independently and greedily choose the split which provides the best improvement on the training loss. This process can be repeated recursively to grow the set of layers Lnew = {L1, L2, …, Ln, Ln+1}.
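A possible PyTorch sketch of this depth-wise split is shown below, assuming a reduced SVD is acceptable in place of the full factorization; split_linear is an illustrative helper, not part of the patent:

import torch
import torch.nn as nn

def split_linear(layer: nn.Linear) -> nn.Sequential:
    W = layer.weight.data                                   # shape (M, N)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)     # U: (M, r), S: (r,), Vh: (r, N)
    r = S.shape[0]

    inner = nn.Linear(W.shape[1], r, bias=False)            # h(x) = W2 x with W2 = V^T
    outer = nn.Linear(r, W.shape[0], bias=layer.bias is not None)  # g(x) = W1 x with W1 = U diag(S)
    inner.weight.data = Vh.clone()
    outer.weight.data = U @ torch.diag(S)
    if layer.bias is not None:
        outer.bias.data = layer.bias.data.clone()

    # PReLU with its parameter initialized to 1 behaves as the identity, so the new
    # non-linearity can be introduced smoothly during subsequent training.
    sigma2 = nn.PReLU(num_parameters=1, init=1.0)
    return nn.Sequential(inner, sigma2, outer)

Calling split_linear on an existing fully-connected layer leaves its input-to-output mapping numerically unchanged at the moment of the split, so training of the specialist can continue from the current weights.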
Additional variations of these techniques may be applied in alternate embodiments. For example, for every pair of splits (vertical or horizontal), a network can be retrained to get classification at a given pathway. Techniques can be applied in certain embodiments for speeding this up and/or avoiding it altogether, such as by agglomerative clustering and/or splitting. Further, given confusion matrix Ci and its partitioning K, agglomerative clustering may be performed on each of the K parts of Ci to estimate further splits. This leads to the cost Xu. Cost Xs is the cost of supervised grouping, learning new confusion matrices at high levels of the network. Xu is less than or equal to Xs + Tau, where Tau is the upper bound on the clustering error.
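As an illustration of the agglomerative shortcut just described, the sketch below (a hedged example, not the patent's implementation) clusters the classes inside each of the K parts of Ci, using each class's row of the symmetrized confusion block as its feature vector; the partition format and the choice of two sub-clusters are assumptions:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def propose_subsplits(Ci, partition, n_subclusters=2):
    # partition: list of index arrays, one per group of classes (the K parts of Ci).
    proposals = []
    for idx in partition:
        block = Ci[np.ix_(idx, idx)]
        affinity = 0.5 * (block + block.T)   # each class described by its confusion profile
        labels = AgglomerativeClustering(n_clusters=n_subclusters).fit_predict(affinity)
        proposals.append(labels)
    return proposals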
In some embodiments, variations are considered with respect to convolutional layers versus fully-connected (1x1 convolution) layers. If splitting is required among the convolutional layers (even fully convolutional layers, such as in the case of semantic segmentation), then instead of changing the linear size of the layer (fc in this case), the depth of the dimension may be changed to reflect the number of classes (this is the extension to FCN). Further variations and embodiments may be produced using collapsing or adding of vertical layers per pathway, changing the size of a layer as a function of label space, and/or extension to detection and RNNs (unrolling in the same way by comparing confusions).
In yet another embodiment, techniques may be applied to identify when there may be too many layers in the network, such that fewer layers would be adequate for the required processing tasks. As noted above, one can reliably add depth to a network and see an improvement in performance given enough training data. However, this added boost in performance may come at a cost in terms of FLOPs and memory consumption. In some embodiments, the network is optimized with this tradeoff in mind with the usage of an all-or-nothing highway network, which learns whether or not a given layer of computation in the network is used via a binary decision. If a given computational block is used, a penalty is incurred. By varying this penalty term, one can customize the learning process with a target architecture in mind: an embedded system would prefer a much leaner architecture than a cloud-based system.
The issue addressed by this embodiment is to determine how deep a network should be given a computational budget for a given problem X. With the approach of using an all-or-nothing highway network, highway networks introduce a mixing matrix to learn how the skip connection from the previous layer should be transformed before mixing with the output of the current computational block. Consider the following equation: y = F(x, Wi) + Wsx (10) Residual networks can find success in using the identity mapping to combine the skip connection. Although the identity mapping is less representative, it is more efficient and easier to optimize: y = F(x, Wi) + x (11) The current approach instead parameterizes the mixing matrix by a single scalar α which gates the output of the computational block (see Fig. 5A): y = α F(x, Wi) + x (12) When α = 0, y = x and the input is simply passed to the output. When α = 1, (eqn 12) becomes (eqn 11) and a residual unit is used for computation.
Fig. 5A illustrates a chart 501 for a network with an all-or-nothing highway connection. In this figure, a computational block is fed an input and later joined via a residual connection (elementwise addition). Before the addition, the output of the computation block is scaled by a learned parameter α which penalizes the use of this computational block. This loss is described below.
Learning is performed to determine whether or not to use a computation block. It is desirable to impose a prior on the α parameter, which controls the behavior of a given layer in a deep network, and to optimize this parameter jointly with the model parameters and its objective function. During training, it is desirable to encourage a binary decision for α, choosing either 0 or 1 for each depth independently. If a computational block is learned to be skipped, then one can simply remove that computation block from the model at inference time.
In a residual network, consecutive layers in general have small mappings, where the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning. This suggests that transitioning between the residual unit of (eqn 11) and an identity layer, and vice versa, should not cause a catastrophic change in the objective function. Thus the present approach introduces a piecewise smooth loss function on the α parameter which gates the output of the computational block at various depths.
In addition, it is desirable to parameterize the loss function on the α parameters such that, for different scenarios, a higher penalty is assigned to models which use more computation. In the case of a light embedded platform such as a smartphone, one might want a high penalty for choosing a layer. In the case of a cloud computing platform, no such penalty for using a computation block might be wanted. Given these criteria, one can use the piecewise smooth polynomial/linear function shown in Fig. 5B, which can be parameterized by the following:
if x < 0.: y = np.absolute(x) * self.steepness
elif x > 1.: y = (x - 1.) * self.steepness + self.peak * 0.125
elif x < 0.5: y = -self.peak * (x**2. - x)
else: y = -self.peak/2. * (x**2. - x) + self.peak * 0.125
For various selections of the peak shown in Fig. 5B, a varying usage penalty is given to the model.
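A hedged PyTorch sketch of an all-or-nothing highway unit is shown below, combining the α gate of (eqn 12) with the piecewise penalty above; the module name is illustrative and steepness and peak are assumed hyperparameters:

import torch
import torch.nn as nn

class AllOrNothingHighway(nn.Module):
    def __init__(self, block: nn.Module, steepness: float = 10.0, peak: float = 1.0):
        super().__init__()
        self.block = block                        # F(x, Wi): the gated computational block
        self.alpha = nn.Parameter(torch.ones(1))  # alpha = 1 starts out as a residual unit
        self.steepness = steepness
        self.peak = peak

    def forward(self, x):
        return self.alpha * self.block(x) + x     # y = alpha * F(x, Wi) + x   (eqn 12)

    def penalty(self):
        # Piecewise smooth prior pushing alpha toward 0 (skip) or 1 (use); the cost of
        # using the block is controlled by `peak` (set peak to 0 for a cloud-style budget).
        a = self.alpha
        if a < 0.0:
            return a.abs() * self.steepness
        if a > 1.0:
            return (a - 1.0) * self.steepness + self.peak * 0.125
        if a < 0.5:
            return -self.peak * (a ** 2 - a)
        return -self.peak / 2.0 * (a ** 2 - a) + self.peak * 0.125

During training, the task loss would be summed with penalty() over all gated blocks; blocks whose α settles at 0 can then be removed at inference time, mirroring the removal of extraneous layers described above.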
AUGMENTED REALITY AND COMPUTING SYSTEMS ARCHITECTURE(S) The above-described techniques are particularly applicable to machine vision applications for virtual reality and augmented reality systems. The inventive neural network classification device may be implemented independently of AR systems, but many embodiments below are described in relation to AR systems for illustrative purposes only.
Disclosed are devices, methods and systems for classification and recognition for various computer systems. In one embodiment, the computer system may be a head-mounted system configured to facilitate user interaction with various other computer systems (e.g., financial computer systems). In other embodiments, the computer system may be a stationary device (e.g., a merchant terminal or an ATM) configured to facilitate user financial transactions.
Various embodiments will be described below in the context of an AR system (e.g., head-mounted display), but it should be appreciated that the embodiments disclosed herein may be used independently of any existing and/or known AR systems.
Referring now to Figs. 6A-6D, some general AR system component options are illustrated according to various embodiments. It should be appreciated that although the embodiments of Figs. 6A-6D illustrate head-mounted displays, the same components may be incorporated in stationary computer systems as well, and Figs. 6A-6D should not be seen as limiting. As shown in Fig. 6A, a head-mounted device user 60 is depicted wearing a frame 64 structure coupled to a display system 62 positioned in front of the eyes of the user 60. The frame 64 may be permanently or temporarily coupled to one or more user identification specific subsystems depending on the required level of security. A speaker 66 may be coupled to the frame 64 in the depicted configuration and positioned adjacent the ear canal of the user 60. In an alternative embodiment, another speaker (not shown) is positioned adjacent the other ear canal of the user 60 to provide for stereo/shapeable sound control. In one or more embodiments, the user identification device may have a display 62 that is operatively coupled, such as by a wired lead or wireless connectivity, to a local processing and data module 70, which may be mounted in a variety of configurations, such as fixedly attached to the frame 64, fixedly attached to a helmet or hat 80 as shown in the embodiment depicted in Fig. 6B, embedded in headphones, removably attached to the torso 82 of the user 60 in a backpack-style configuration as shown in the embodiment of Fig. 6C, or removably attached to the hip 84 of the user 60 in a belt-coupling style configuration as shown in the embodiment of Fig. 6D.
The local processing and data module 70 may comprise a power-efficient processor or controller, as well as digital memory, such as flash memory, both of which may be utilized to assist in the processing, caching, and storage of data. The data may be captured from sensors which may be operatively coupled to the frame 64, such as image capture devices (such as cameras), microphones, inertial measurement units, accelerometers, compasses, GPS units, radio devices, and/or gyros. Alternatively or additionally, the data may be acquired and/or processed using the remote processing module 72 and/or remote data repository 74, possibly for passage to the display 62 after such processing or retrieval. The local processing and data module 70 may be operatively coupled 76, 78, such as via wired or wireless communication links, to the remote processing module 72 and the remote data repository 74 such that these remote modules 72, 74 are operatively coupled to each other and available as resources to the local processing and data module 70.
In one embodiment, the remote processing module 72 may comprise one or more relatively powerful processors or controllers configured to analyze and process data and/or image information. In one embodiment, the remote data repository 74 may comprise a relatively large-scale digital data storage facility, which may be available through the internet or other networking configuration in a "cloud" resource configuration. In one embodiment, all data is stored and all computation is performed in the local processing and data module, allowing fully autonomous use from any remote modules.
In some embodiments, identification devices (or AR systems having identification applications) similar to those described in Figs. 6A-6D provide unique access to a user's eyes.
Given that the identification/AR device interacts crucially with the user's eye to allow the user to perceive 3-D virtual content, and in many embodiments, tracks various biometrics related to the user's eyes (e.g., iris patterns, eye vergence, eye motion, patterns of cones and rods, patterns of eye movements, etc.), the resultant tracked data may be advantageously used in identification applications. Thus, this unprecedented access to the user's eyes naturally lends itself to various identification applications.
Fig. 7 is a block diagram of an illustrative computing system 1400 suitable for implementing an embodiment of the present invention. Computer system 1400 includes a bus 1406 or other communication mechanism for communicating information, which interconnects subsystems and devices, such as processor 1407, system memory 1408 (e.g., RAM), static storage device 1409 (e.g., ROM), disk drive 1410 (e.g., magnetic or optical), communication interface 1414 (e.g., modem or Ethernet card), display 1411 (e.g., CRT or LCD), input device 1412 (e.g., keyboard), and cursor control.
According to one embodiment of the invention, computer system 1400 performs specific operations by processor 1407 executing one or more sequences of one or more instructions contained in system memory 1408. Such instructions may be read into system memory 1408 from another computer readable/usable medium, such as static storage device 1409 or disk drive 1410. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and/or software. In one embodiment, the term "logic" shall mean any combination of software or hardware that is used to implement all or part of the invention.
The term "computer readable medium" or "computer usable medium" as used herein refers to any medium that participates in providing instructions to processor 1407 for execution. Such a medium may take many forms, including but not limited to, non-volatile media and le media. Non-volatile media includes, for example, optical or magnetic disks, such as disk drive 1410. Volatile media includes dynamic memory, such as system memory 1408.
Common forms of computer readable media include, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, or any other medium from which a computer can read.
In an embodiment of the invention, execution of the sequences of instructions to practice the invention is performed by a single computer system 1400. According to other embodiments of the invention, two or more computer systems 1400 coupled by communication link 1415 (e.g., LAN, PSTN, or wireless network) may perform the sequence of instructions required to practice the invention in coordination with one another.
Computer system 1400 may transmit and receive messages, data, and instructions, including program, e.g., application code, through communication link 1415 and communication interface 1414. Received program code may be executed by processor 1407 as it is received, and/or stored in disk drive 1410, or other non-volatile storage for later execution. Computer system 1400 may communicate through a data interface 1433 to a database 1432 on an external storage device 1431.
In the foregoing specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. For example, the described process flows are described with reference to a particular ordering of process actions. However, the ordering of many of the described process actions may be changed without affecting the scope or operation of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than restrictive sense.

Claims (21)

1. A method implemented with a processor, comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers; removing the extraneous layer from the neural network; and identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified, wherein the neural network undergoes both vertical splitting and horizontal splitting.
2. The method of claim 1, wherein identifying the extraneous layer comprises determining that the extraneous layer does not improve the performance of the processing task by the neural network.
3. The method of claim 1 or 2, wherein an all-or-nothing highway network is employed to identify the extraneous layer in the neural network to be removed.
4. The method of claim 3, further comprising associating a penalty with use of the extraneous layer.
5. The method of claim 4, wherein a size of the penalty corresponds to an amount of computational power used by the extraneous layer.
6. The method of claim 5, wherein a size of the penalty is varied.
7. A method implemented with a processor, comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers, wherein an all-or-nothing highway network is employed to identify the extraneous layer in the neural network to be removed; removing the extraneous layer from the neural network; identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified; and associating a penalty with use of the extraneous layer, wherein the penalty is set to 0 when the processor is part of a cloud computing platform.
8. The method of claim 3, wherein the all-or-nothing highway network generates a binary decision to not use the extraneous layer.
9. A method implemented with a processor, comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers, wherein an all-or-nothing highway network is employed to identify the extraneous layer in the neural network to be removed; removing the extraneous layer from the neural network; identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified; and the all-or-nothing highway network introducing a mixing matrix to determine how to transform a skip connection corresponding to the extraneous layer.
10. The method of claim 9, further comprising parameterizing the mixing matrix by a scalar value.
11. The method of claim 10, wherein when the scalar value is zero, an input to the extraneous layer is passed onto an output from the extraneous layer, thereby skipping the extraneous layer before removing the extraneous layer from the neural network.
12. The method of claim 3, further comprising using identity mapping to transform a skip connection corresponding to the extraneous layer.
13. The method of any one of claims 1 to 6, further comprising updating a model for the neural network to obtain an updated model, wherein the extraneous layer is removed from the model to obtain the updated model.
14. The method of any one of claims 1 to 13, wherein a plurality of loss layers are added to the neural network.
15. A method implemented with a processor, comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers; removing the extraneous layer from the neural network; identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified; adding a plurality of loss layers to the neural network; and generating predictions at one of the loss layers, and converting the predictions to one or more confusion matrices forming a tensor T.
16. The method of claim 15, wherein a structure of T is analyzed to modify and augment an existing structure of the neural network both in terms of depth and breadth.
17. The method of any one of claims 1 to 16, wherein K-way Bifurcation is performed to implement the horizontal splitting.
18. A method implemented with a processor, comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers; removing the extraneous layer from the neural network; identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified, wherein each layer of the neural network is addressed independently, and a given layer of the neural network undergoes splitting by performing a greedy choice to split the given layer which provides a best improvement on a training loss.
19. The method of any one of claims 1 to 18, wherein the neural network is employed to classify images captured for a virtual reality or augmented reality system.
20. A system, comprising: a processor; a memory for holding programmable code; and wherein the programmable code includes instructions for creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers; removing the extraneous layer from the neural network; and identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified, wherein the neural network undergoes both vertical splitting and horizontal splitting.
21. A computer program product embodied on a non-transitory computer readable medium, the non-transitory computer readable medium having stored thereon a sequence of instructions which, when executed by a processor, causes the processor to execute a method comprising: creating a neural network comprising a plurality of layers; the neural network performing a processing task to generate an output; identifying an extraneous layer from the plurality of layers; removing the extraneous layer from the neural network; and identifying and removing other extraneous layers from the plurality of layers until no other extraneous layer is identified, wherein the neural network undergoes both vertical splitting and horizontal splitting.
NZ786061A 2017-03-13 Structure learning in convolutional neural networks NZ786061A (en)

Publications (1)

Publication Number Publication Date
NZ786061A true NZ786061A (en) 2022-03-25

Family

ID=

Similar Documents

Publication Publication Date Title
US10963758B2 (en) Structure learning in convolutional neural networks
Kauffmann et al. From clustering to cluster explanations via neural networks
Prieto et al. Neural networks: An overview of early research, current frameworks and new challenges
Kietzmann et al. Deep neural networks in computational neuroscience
Li et al. 2-D stochastic configuration networks for image data analytics
US20200242736A1 (en) Method for few-shot unsupervised image-to-image translation
Buscema et al. Artificial Adaptive Systems Using Auto Contractive Maps
Verma et al. Age prediction using image dataset using machine learning
Sikka Elements of Deep Learning for Computer Vision: Explore Deep Neural Network Architectures, PyTorch, Object Detection Algorithms, and Computer Vision Applications for Python Coders (English Edition)
NZ786061A (en) Structure learning in convolutional neural networks
Cole Hands-on neural network programming with C#: Add powerful neural network capabilities to your C# enterprise applications
Sobolu et al. Automated Recognition Systems: Theoretical and Practical Implementation of Active Learning for Extracting Knowledge in Image-based Transfer Learning of Living Organisms
Hermans Expanding the theoretical framework of reservoir computing
Tigreat Sparsity, redundancy and robustness in artificial neural networks for learning and memory
Pataskar FACE DETECTION USING FPGA
Alanno Machine learning implementation in face recognition and identification
Malik Improving Object Tracking and Recognition in Machines With Insights From Biological Vision
Tian Multimodal Data Analytics and Fusion for Data Science
Pham Understanding Human Imagination Through Diffusion Model
Nthabiseng et al. Using Weight Matrices to build a Polynomial Representation to assist with Interpretability and Explainability of Deep Learning Models
Panda Learning and Design Methodologies for Efficient, Robust Neural Networks
Xu Exploring Multi-Modal and Structured Representation Learning for Visual Image and Video Understanding
Ghosal Deep learning for human engineered systems: Weak supervision, interpretability and knowledge embedding
Greenwood NASA JSC neural network survey results
Saifullah Exploring Biologically-Inspired Interactive Networks for Object Recognition