GB2599180A - Method for improved binarized neural networks

Publication number
GB2599180A
Authority
GB
United Kingdom
Prior art keywords
expert, binary, input data, training, image
Legal status
Pending
Application number
GB2103967.2A
Other versions
GB202103967D0 (en)
Inventor
Bulat Adrian
Tzimiropoulos Georgios
Martinez Brais
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Publication of GB202103967D0
Publication of GB2599180A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Abstract

A method of training a machine learning model is disclosed, such as training a binarized convolutional neural network to analyse or capture images. The machine learning model comprises convolutional layers, each layer comprising expert binary filters. A gating function of each layer is trained to select one of the expert binary filters based on data input to the model. Each expert binary filter may be trained to specialise in a subset of training data. Each expert binary filter may be trained to learn real-value weights and generate an intermediate trained model. The intermediate trained model may then be trained to learn binary-value weights. The real-value weight may be learned by training one expert binary filter and applying that real-value weight to all remaining expert binary filters. The convolutional layers may be grouped, and the expert binary filters in each group may be trained together. Also disclosed is a method of analysing data using the trained machine learning model. The expert binary filters may be selected to perform image classification, gesture recognition, emotion recognition, action recognition, to identify an image matching a query, or to determine when to capture an image.

Description

Method for Improved Binarized Neural Networks
Field
[001] The present application generally relates to an improved binarized neural network, and in particular to methods for training binarized convolutional neural networks and using the trained neural network to analyse input data on resource-constrained devices.
Background
[2] Generally speaking, Deep Neural Networks (DNNs) have established themselves as the de facto solution for a wide variety of computer vision tasks. Despite their unrivalled accuracy, DNN models are typically computationally expensive and, as such, their deployment on resource-constrained devices (such as edge devices, certain smartphones, Internet of Things devices, etc.) is often prohibitive. A promising hardware-friendly solution is offered by network quantisation, which accelerates the model by performing operations on operands represented using fewer bits. The most interesting case is that of network binarization, where convolutions can be implemented with bitwise operations. While very efficient, these binarized models are typically highly inaccurate. In one example of network binarization, all values are restricted to two states only. This comes with two big advantages. Firstly, it compresses the weights by a factor of 32x via bit-packing, and secondly, it replaces the computationally expensive multiply-add with bitwise XNOR and popcount operations, offering in practice a speed-up of approximately 58x on a CPU. Despite this, how to reduce the difference in accuracy between a binarized model and its real-valued counterpart remains an open problem, and this is currently the major impediment to their wide-scale adoption.
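To make the speed-up concrete, the sketch below (an illustration, not part of the patent) shows how a dot product between two vectors with entries in {-1, +1} reduces to an XNOR and a popcount once the signs are bit-packed; all function and variable names are illustrative only.

```python
import numpy as np

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors packed as integers
    (bit = 1 encodes +1, bit = 0 encodes -1)."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR + popcount
    return 2 * agree - n                               # agreements minus disagreements

# Check against the floating-point dot product.
rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=64)
b = rng.choice([-1, 1], size=64)
pack = lambda v: sum(1 << i for i, x in enumerate(v) if x == 1)
assert binary_dot(pack(a), pack(b), 64) == int(a @ b)
```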
[3] Current quantized models tend to learn a generic set of filters, which do not exploit the particularities present in diverse data.
[004] It is unclear how to adapt existing models to custom requirements while preserving two key factors: efficiency (both in power consumption and runtime) and accuracy.
[005] Existing conditional computing methods are unsuitable for binary networks, since they rely on a weighted average of filters or activations which, for the case of binary convolutions, will either break the binarization or result in an unstable and more expensive double binarization.
[006] The present applicant has recognised the need for a technique for improving the accuracy of binarized neural network models.
Summary
[7] In a first approach of the present techniques, there is provided a computer-implemented method for training a machine learning, ML, model comprising a plurality of convolutional layers, the method comprising: obtaining training data representative of an input space; training each convolutional layer of the ML model using the training data, the layer comprising a plurality of expert binary filters; and training a gating function of each convolutional layer to select one of the plurality of expert binary filters based on input data to be processed by the ML model.
[8] The present techniques solve the above-mentioned problem by applying conditional computing to binary networks, with the purpose of increasing or expanding the capacity of a ML model. This is called Expert Binary Convolution. For each convolutional layer of the ML model, rather than learning a weight tensor that is expected to generalise well across the entire input space, a set of N experts are trained such that each expert specialises to a portion of the input space. A very light-weight gating function is trained to dynamically select a single expert for an input and use the selected expert to process features of the input. Learning to select a single expert that is tuned to the input data makes the present techniques suitable for binary networks.
[9] Training the plurality of expert binary filters of each convolutional layer may comprise training each expert binary filter to specialise to a subset of the training data. In other words, each expert binary filter may specialise to a portion of an input space.
[010] The gating function is preferably trained to select the best or most appropriate expert binary filter from the plurality of expert binary filters to process the input data.
[011] Training the plurality of expert binary filters of each convolutional layer may comprise: training each expert binary filter using the training data to learn real-value weights, and generating an intermediate trained model; and training the intermediate model using the training data to learn binary-value weights.
[012] Training the plurality of expert binary filters of each convolutional layer may comprise, prior to training each expert binary filter: training one binary filter using the training data to learn a real-value weight; and applying the real-value weight to all remaining expert binary filters of the plurality of expert binary filters of each convolutional layer to initialise the training of each expert binary filter.
[013] The plurality of convolutional layers of the ML model may be grouped into two or more groups. In this case, the training of the plurality of expert binary filters may comprise training the expert binary filters in each group together.
[14] In a second approach of the present techniques, there is provided a computer-implemented method for analysing input data using a machine learning, ML, model comprising a plurality of convolutional layers, the method comprising: receiving at least one input data item for analysis; inputting the input data item into the ML model; selecting, using a trained gating function of each convolutional layer, an expert binary filter of each convolutional layer to process the input data item; and processing the input data item using the selected expert binary filters.
[15] The method for analysing input data may further comprise receiving instructions on the analysis to be performed. The selecting may comprise selecting expert binary filters based on the analysis to be performed.
[16] The input data item may be an image or a frame of a video. The instructions may specify how the image/frame is to be processed.
[17] For example, the input data item may be an image or image frame, and the instructions may be to perform image classification. The selecting may therefore comprise selecting expert binary filters to classify the input data item.
[18] In another example, the input data item may be an image or image frame, and the instructions may be to perform gesture recognition. The selecting may therefore comprise selecting expert binary filters to identify gestures in the input data item.
[019] In another example, the input data item may be an image or image frame, and the instructions may be to perform emotion or action recognition. The selecting may comprise selecting expert binary filters to identify emotions or actions in the input data item.
[20] The instructions may be to identify images matching a user query, wherein the at least one input data item is a set of images stored on a user device. In this case, the selecting may comprise selecting expert binary filters based on the user query. The method may further comprise outputting at least one image matching the user query.
[21] The instructions may be to determine when to capture an image matching a user requirement, wherein the at least one input data item is a set of images viewed by an image capture device. In this case, the selecting may comprise selecting expert binary filters based on the user requirements. The method may further comprise sending instructions to the image capture device to capture an image that matches the user requirement.
[22] In a third approach of the present techniques, there is provided an electronic apparatus comprising: at least one processor, coupled to memory, arranged to analyse input data using a machine learning, ML, model comprising a plurality of convolutional layers, by: receiving at least one input data item for analysis; inputting the input data item into the ML model; selecting, using a trained gating function of each convolutional layer, an expert binary filter of each convolutional layer to process the input data item; and processing the input data item using the selected expert binary filters.
[023] The apparatus may further comprise at least one image capture device for capturing images or videos to be processed by the ML model.
[024] The apparatus may further comprise storage storing images or videos to be processed by the ML model.
[25] The apparatus may further comprise a user interface for receiving a user query. In this case, the selecting may comprise selecting expert binary filters based on the user query.
[26] The processor may receive instructions on the analysis to be performed. The selecting may comprise selecting expert binary filters based on the analysis to be performed.
[27] The input data item may be an image or image frame captured by the at least one image capture device. The instructions received by the processor may be instructions to perform image classification. Thus, the selecting may comprise selecting expert binary filters to classify the input data item.
[28] Alternatively, the instructions received by the processor may be to perform gesture recognition. Thus, the selecting may comprise selecting expert binary filters to identify gestures in the input data item.
[29] In another example, the input data item may be an image or image frame captured by the at least one image capture device, and the user instructions may be to perform emotion or action recognition. Here, the selecting may comprise selecting expert binary filters to identify emotions or actions in the input data item.
[30] In another example, the user query received via the user interface may be to identify images matching search criteria, and the at least one input data item may be a set of images stored on a user device. Here, the selecting may comprise selecting expert binary filters based on the user query. The processor may output at least one image matching the user query.
[31] In another case, the user query received via the user interface may be to capture an image using the image capture device matching a user requirement. Thus, the at least one input data item may be a set of images obtained by an image capture device. Here, the selecting may comprise selecting expert binary filters based on the user requirements. The processor may control the image capture device to capture an image when the image obtained by the image capture device matches the user requirement.
[32] In a related approach of the present techniques, there is provided a non-transitory data carrier carrying processor control code to implement the methods described herein.
[33] As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
[34] Furthermore, the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
[35] Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise subcomponents which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
[36] Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
[037] The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
[038] It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
[039] In an embodiment, the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
[040] The methods described above may be wholly or partly performed on an apparatus, i.e. an electronic device, using a machine learning or artificial intelligence model. The model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing. The artificial intelligence model may be obtained by training. Here, "obtained by training" means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
[041] As mentioned above, the present techniques may be implemented using an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. At this time, the one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning. Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in the device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
[42] The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through calculation of a previous layer and an operation of a plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural networks (CNN), deep neural networks (DNN), recurrent neural networks (RNN), restricted Boltzmann machines (RBM), deep belief networks (DBN), bidirectional recurrent deep neural networks (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
[43] The learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
Brief description of drawings
[44] Implementations of the present techniques will now be described, by way of example only, with reference to the accompanying drawings, in which:
[45] Figure 1 shows a schematic diagram of the Expert Binary Convolution operation of the present techniques;
[046] Figure 2 shows a network architecture for implementing Expert Binary Convolution;
[047] Figures 3A and 3B are graphs showing, respectively, the effect of depth and groups on accuracy as a function of the number of binary operations (BOPs) on ImageNet;
[48] Figure 4 is a graph showing the effect of adding a 1x1 binary convolutional layer after the grouped convolutional layers;
[49] Figure 5 is a flowchart of example steps for training a machine learning, ML, model comprising a plurality of convolutional layers;
[50] Figure 6 is a flowchart of example steps for analysing input data using a machine learning, ML, model comprising a plurality of convolutional layers;
[051] Figures 7A to 7D show example applications of the method for analysing input data; and
[052] Figure 8 shows an apparatus for analysing input data using a machine learning, ML, model comprising a plurality of convolutional layers.
Detailed description of drawings
[53] Broadly speaking, the present techniques provide an improved binarized neural network. In particular, the present techniques provide methods for training binarized convolutional neural networks and methods for using the trained neural network to analyse input data on resource-constrained devices.
[54] The main building block of a Convolutional Neural Network (CNN) is a module called convolution. This module comes with a set of parameters. A convolution processes some input features and outputs new features. In standard CNNs, the parameters of the convolution and the features can take real values e.g. 0.0025, -0.33 etc. In binary CNNs, the parameters and the features can take only two values +1, -1. This has huge computational advantages (in terms of speed of execution), however it also comes at a cost: the accuracy of a binary CNN is significantly reduced. The present techniques concern transforming a largely exploratory topic, that of binary CNNs, into a real contender for practical applications through a series of technical novelties.
[55] In order to perform analysis such as object recognition, action/emotion recognition, and person recognition, it is necessary to extract features from the image first. This feature extraction process is typically performed using a CNN-based feature extractor in an ML model.
The extracted features can then be input into a task-specific module of the ML model to perform the specific task (e.g. object recognition). However, typical CNN feature extractors are costly in terms of the time required to extract features, hardware requirements (e.g. processing power), and battery usage. This can make it difficult to implement ML models on consumer devices such as smartphones. In contrast, the task-specific module may be quite lightweight as it is typically performing a single, specific task. Thus, the present techniques provide a new design of CNN feature extractor that is better suited to implementing ML models on end-user/consumer devices. Specifically, the present techniques use binary convolution, with improved power and accuracy. The architecture and training are described in more detail below.
[56] Conditional computing is a very general data processing framework which refers to using different models or different parts of a model (a model here is a CNN) conditioned on the input data or input features. This has the advantage of making the model adaptive to the input features and therefore more powerful, introducing only negligible computational cost. For the case of CNNs, a more powerful model means a more accurate one.
[57] The present techniques provide a method that constitutes the first successful application of conditional computing to binary CNNs.
[58] One way to increase the accuracy of a standard real convolution (i.e. not binary) is by using a Mixture-of-Experts approach (B Yang, G Bender, QV Le, J Ngiam, "CondConv: Conditionally Parameterized Convolutions for Efficient Inference", Advances in Neural Information Processing Systems, 2019). Usually in a convolution, there is one set of parameters. Yang et al propose to use N (e.g. N=8) sets of parameters, also called experts, which are firstly combined with each other using N weights. Their combination gives rise to a new set of parameters which can then be used to perform the convolution. Different weights result in different combinations. It is then important to consider how best to find the optimal weights. Yang et al propose that these weights can be calculated (through another module) from the input features. Since the weights are computed from (i.e. conditioned on) the input features, this technique is called Conditional Convolution.
[059] However, the above Conditional Convolution technique will not work for binary convolutions. One reason is that even if the N experts are binary, when combined with each other with some weights, as described above, their combination will no longer be binary.
[60] The present techniques propose a new mechanism to overcome this problem, which is called "Best-of-Experts" or "Expert Binary Convolution". For each convolutional layer of the ML model, a set of N experts are trained such that each expert specialises to a portion of the input space. Rather than combining them with each other, the ML model comprises a gating function that is trained to select the most appropriate expert for the input data (i.e. the best expert), again from the input features. As a result, the output of the module is guaranteed to be a set of binary weights, as required by a binary operation.
[61] The present techniques show how to successfully train a binary CNN from training data when the binary convolution is replaced by an Expert Binary Convolution. This advantageously results in 3-4% gain in accuracy (with negligible additional computational cost) over the standard binary convolution.
[62] A binary convolution is defined as:

$$\mathrm{BConv}(x, \theta) = (\mathrm{sign}(x) \circledast \mathrm{sign}(\theta)) \odot \alpha \qquad \text{(Equation 1)}$$

where $x$ is the input, $\theta$ is the weights, $\circledast$ denotes the binary convolutional operation, $\odot$ the Hadamard product, and $\alpha \in \mathbb{R}^C$ is learned via back-propagation.
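As a concrete illustration of Equation 1, the PyTorch sketch below implements a binary convolution with a straight-through estimator for the sign function; this is a minimal sketch under common binary-network conventions, not the patent's reference implementation, and the class names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """sign() in the forward pass; straight-through gradient in the backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)  # clip gradients outside [-1, 1]

class BConv(nn.Module):
    """Equation 1: convolve sign(x) with sign(theta), then scale by alpha."""
    def __init__(self, c_in, c_out, k, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)
        self.alpha = nn.Parameter(torch.ones(c_out))  # learned via back-propagation
        self.padding = padding

    def forward(self, x):
        y = F.conv2d(SignSTE.apply(x), SignSTE.apply(self.weight), padding=self.padding)
        return y * self.alpha.view(1, -1, 1, 1)       # per-channel Hadamard scaling
```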
[63] The binarization is performed in two stages. In Stage 1, a network is trained with binary activations and real-valued weights. The accuracy of the Stage 1 models is very representative of the accuracy of the final fully binary model. In Stage 2, a network is trained with both binary weights and activations, initialised from Stage 1. When reporting results, if no stage is specified, the model (weights and activations) is fully binary.
[064] The baseline is set as the Strong Baseline model (denoted as SBaseline herein), on top of which the proposed method is implemented.
[065] Expert Binary Convolution. Assume a binary convolutional layer with input $x \in \mathbb{R}^{C_{in} \times W \times H}$ and weight tensor $\theta \in \mathbb{R}^{C_{in} \times C_{out} \times k_H \times k_W}$. In contrast to a normal convolution that applies the same weights to all input features, the present techniques learn a set of expert weights (or simply experts) $\{\theta_0, \theta_1, \ldots, \theta_{N-1}\}$, $\theta_i \in \mathbb{R}^{C_{in} \times C_{out} \times k_H \times k_W}$, alongside a selector gating function which, given input $x$, selects only one expert to be applied to it. The proposed EBConv layer is depicted in Figure 1.
[066] Figure 1 shows how binary features are input into the EBConv layer and binary features are output. The EBConv layer comprises a convolution operation and a gating function. The convolution operation is a binary convolutional operation that convolves the input binary features of shape $C_{in} \times W \times H$ using the weights selected by the gating function. The produced output is of shape $C_{out} \times W \times H$. The gating function takes in the input features of an image and produces a vector that selects one expert at a time. The gating function has the following structure: global average pooling, linear layer, and the proposed variation of argsoftmax (as explained below). The theta parameters are a set of $N$ binary filters (as mentioned above), each of shape $C_{in} \times C_{out} \times k_H \times k_W$, learned via backpropagation jointly with the gating function. The EBConv layer is now described in more detail.
[067] To learn the experts, they are first stacked in a matrix $\Theta \in \mathbb{R}^{N \times C_{in}C_{out}k_Hk_W}$. It is proposed to learn the following function:

$$\mathrm{EBConv}(x, \theta) = \mathrm{BConv}\big(x, (\varphi(\psi(x))^T \Theta)_r\big) \qquad \text{(Equation 2)}$$

where $\varphi(\cdot)$ is a gating function (returning an $N$-dimensional vector as explained below) that implements the expert selection mechanism using as input $\psi(x)$, which is an aggregation function of the input tensor $x$, and $(\cdot)_r$ simply reshapes its argument to a tensor of appropriate dimensions.
[068] Gating function $\varphi$: A crucial component of the proposed approach is the gating function that implements the expert selection mechanism. An obvious solution would be to use a Winner-Take-All (WTA) function; however, this is not differentiable. A candidate that comes to mind to solve this problem is the softargmax with temperature $\tau$: as $\tau \to 0$, the entry corresponding to the max will tend to 1 while the rest tend to 0. However, as $\tau \to 0$ the derivative of the softargmax converges to the Dirac function $\delta$, which provides poor gradients and hence hinders the training process. This could be mitigated if a high $\tau$ is used; however, this would require hard thresholding at test time which, for the case of binary networks, given that the models are trained using Equation 2, leads to large errors.
[069] To mitigate the above, and distancing from reinforcement learning techniques often deployed when discrete decisions need to be made, the present techniques propose, for the forward pass, to use a WTA function for defining $\varphi(\cdot)$, as follows:

$$\varphi(z)_i = \begin{cases} 1, & \text{if } i = \mathrm{argmax}(z) \\ 0, & \text{otherwise} \end{cases} \qquad \text{(Equation 3)}$$

Note that $\varphi$ is defined as $\varphi: \mathbb{R}^N \to \mathbb{R}^N$, i.e. as a function that returns an $N$-dimensional vector which is used to multiply (element-wise) $\Theta$ in Equation 2. This is crucial as, during training, it is desirable to back-propagate gradients for the non-selected experts. To this end, it is proposed, for the backward pass, to use the softmax function for approximating the gradients of $\varphi(\cdot)$, that is:

$$\frac{\partial \varphi_k}{\partial z_{k'}} = p_k(\delta_{kk'} - p_{k'}) \qquad \text{(Equation 4)}$$

where $p = \mathrm{softmax}(z)$, $k, k' \in \{0, 1, \ldots, N-1\}$, and $\delta_{kk'}$ is the Kronecker delta function. Overall, the present techniques (WTA for the forward pass and softmax for the backward pass) effectively address the mismatch between training and testing at inference time while, at the same time, allowing meaningful gradients to flow to all experts during training.
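The forward/backward asymmetry above maps naturally onto a custom autograd function. The sketch below is one hedged way to realise it in PyTorch (illustrative, not the patent's code): a one-hot winner-take-all output in the forward pass, and the softmax Jacobian of Equation 4 in the backward pass.

```python
import torch

class WTAGate(torch.autograd.Function):
    """Equation 3 in the forward pass; Equation 4 in the backward pass."""
    @staticmethod
    def forward(ctx, z):
        p = torch.softmax(z, dim=-1)
        ctx.save_for_backward(p)
        one_hot = torch.zeros_like(z)
        one_hot.scatter_(-1, z.argmax(dim=-1, keepdim=True), 1.0)
        return one_hot  # hard selection of a single expert

    @staticmethod
    def backward(ctx, grad_out):
        (p,) = ctx.saved_tensors
        # Softmax Jacobian-vector product: p_j * (g_j - sum_k g_k p_k),
        # so all experts receive meaningful gradients during training.
        inner = (grad_out * p).sum(dim=-1, keepdim=True)
        return p * (grad_out - inner)
```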
[70] Aggregation function $\psi$: The purpose of this function is to give a summary of the input feature tensor which will be used to select the expert. To avoid overfitting and to keep the computational cost low, a simple and fast linear function is used:

$$\psi(x) = \bar{x}\,\omega, \qquad \bar{x}[i] = \frac{1}{HW} \sum_{j,k} x[i, j, k] \qquad \text{(Equation 5)}$$

where $\bar{x}[i]$ is the spatial average of the $i$-th channel and $\omega \in \mathbb{R}^{C \times N}$ is a learnable projection matrix. Note that no other non-linearity was used, as the WTA function is already a non-linear function.
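Combining the gating and aggregation functions with the expert weights gives the complete layer. Below is a hedged end-to-end sketch of an EBConv layer following Figure 1 and Equations 2 to 5, reusing the SignSTE and WTAGate classes from the earlier sketches; the per-sample convolution loop is for clarity rather than efficiency, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EBConv(nn.Module):
    def __init__(self, c_in, c_out, k, num_experts=4, padding=1):
        super().__init__()
        # N expert weight tensors theta_0 ... theta_{N-1}, stacked as Theta.
        self.theta = nn.Parameter(torch.randn(num_experts, c_out, c_in, k, k) * 0.01)
        self.alpha = nn.Parameter(torch.ones(c_out))
        self.proj = nn.Linear(c_in, num_experts, bias=False)  # omega in Equation 5
        self.padding = padding

    def forward(self, x):
        # psi: global average pooling followed by a linear projection (Equation 5).
        z = self.proj(x.mean(dim=(2, 3)))      # (B, N)
        gate = WTAGate.apply(z)                # (B, N) one-hot (Equations 3 and 4)
        # (phi(psi(x))^T Theta)_r: pick one expert's weights per sample (Equation 2).
        w = torch.einsum('bn,noikl->boikl', gate, self.theta)
        out = []
        for b in range(x.size(0)):             # per-sample expert weights
            xb = SignSTE.apply(x[b:b + 1])
            wb = SignSTE.apply(w[b])
            out.append(F.conv2d(xb, wb, padding=self.padding))
        return torch.cat(out, dim=0) * self.alpha.view(1, -1, 1, 1)
```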
[71] Data-specific experts: One expected property of EBConv implied by the proposed design is that the experts should specialize on portions of data. This is because, for each data sample, a single expert is chosen per convolutional layer.
[072] Optimization policy: The present techniques adopt a two-stage training policy where firstly the input signal is binarized while learning real-valued weights, and then both the signal and the weights are binarized. Note that the aggregation function $\psi$ is kept real across all steps, since its computational cost is insignificant. Furthermore, due to the discrete decision making process early on, the training can be unstable. Therefore, to stabilize the training, firstly one expert is trained, and this is then used to initialize the training of all N experts. This ensures that early on in the process any decision made by the gating function is a good decision.
Overall, the optimization policy can be summarized as follows:
1. Train one expert, parametrized by $\theta_0$, using real weights and binary activations.
2. Replicate $\theta_0$ to all $\theta_i$, $i \in \{1, \ldots, N-1\}$, to initialize the matrix $\Theta$.
3. Train the model initialized in step 2 using real weights and binary activations.
4. Train the model obtained from step 3 using binary weights and activations.
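Sketched as a training script, the four steps might look as follows; train(), make_model(), binarize_weights() and the ebconv_layers() accessor are hypothetical helpers standing in for an ordinary supervised training loop, model construction, weight-mode switching, and layer iteration respectively.

```python
def train_ebconv_network(make_model, data, num_experts=4):
    # Step 1: one expert, real weights, binary activations.
    single = make_model(num_experts=1, binary_weights=False)
    train(single, data)

    # Step 2: replicate theta_0 across all N experts to initialise Theta.
    model = make_model(num_experts=num_experts, binary_weights=False)
    for src, dst in zip(single.ebconv_layers(), model.ebconv_layers()):
        dst.theta.data.copy_(src.theta.data.expand_as(dst.theta))

    # Step 3: all experts, real weights, binary activations (Stage 1).
    train(model, data)

    # Step 4: binary weights and binary activations (Stage 2).
    model = binarize_weights(model)
    train(model, data)
    return model
```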
[73] Figure 2 shows a network architecture for implementing Expert Binary Convolution. The network comprises a plurality of macro modules, and each macro module comprises expert binary convolution layers. In the example shown in Figure 2, the network has four macro modules, but it will be understood that this is simply an illustrative example and the network may have more or fewer macro modules.
[74] The macro module depth controls the number of convolutional blocks within each of the macro modules.
[75] The number of groups within each macro module may be determined via grid search to maximise speed and accuracy. The number of groups may be the same or different per macro module.
[076] The width of the network may be increased by a factor of two compared with a baseline ResNet network architecture. The computational cost is controlled via grouped convolution.
[77] The present techniques also address the problem of the representation capacity of the binary activations. This issue arises due to the fact that only two states are used to characterize each feature, resulting in an information bottleneck which hinders the learning of highly accurate binary networks.
[78] The solution of the present techniques is surprisingly simple yet effective: the only parameters one can adjust in order to increase the representational power of binary features are the resolution and the width (i.e. number of channels). The former is largely conditioned on the resolution of the data, and is as such problem dependent. Hence, the present techniques propose to increase the network width. For example, a width expansion of $k = 2$ can increase the number of unique configurations for a $32 \times 7 \times 7$ binary feature tensor from $2^{32 \times 7 \times 7} = 2^{1568}$ to $2^{3136}$. However, increasing the network width directly causes a quadratic increase in complexity with respect to $k$. Hence, in order to keep the number of binary operations (BOPs) constant, grouped convolutions are used with group size $G$ proportional to the width expansion, i.e. $G = k^2$. Finally, since grouped convolutions are used, features across groups need to be somehow combined throughout the network. This can be achieved at no extra cost through the $1 \times 1$ convolutions used for downsampling at the end of each stage when a change of spatial resolution occurs (standard convolutions are used for downsampling).
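The BOP accounting behind the choice $G = k^2$ can be checked with a few lines; this is a back-of-the-envelope sketch, not code from the patent.

```python
def conv_bops(c_in, c_out, k, w, h, groups=1):
    """Binary operations of a k x k convolution on a w x h feature map."""
    return (c_in // groups) * c_out * k * k * w * h

base = conv_bops(256, 256, 3, 14, 14, groups=1)  # width 1, no groups
wide = conv_bops(512, 512, 3, 14, 14, groups=4)  # width k = 2, G = k^2 = 4
assert base == wide  # doubling the width with G = 4 keeps the BOPs constant
```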
Table 1:

Expansion      # experts    Top-1 (%)    Top-5 (%)
1 (SBaseline)  1            60.9         83.0
2              1            64.6         85.6
4              1            65.1         86.0
1              4            63.8         85.1
1              8            64.0         85.3
2              4            66.0         86.4
2              8            66.3         86.6

[79] Table 1 compares models having different numbers of experts and expansion rates against a baseline model ("SBaseline", the model described by Brais Martinez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos in "Training binary neural networks with real-to-binary convolutions", ICLR 2020). Each model, including SBaseline, has the same number of binary operations (BOPs). As Table 1 clearly shows, models trained with a width multiplier higher than 1 offer consistent accuracy gains, notably without increasing complexity. Importantly, these gains also add up with the ones obtained by using the proposed EBConv. This is not surprising, as width expansion improves representation capacity while the experts increase model capacity.
[80] In general, there is little work in network design for binary networks. Recently, a few binary Neural Architecture Search (NAS) techniques have been proposed. Despite reporting good performance, these methods have the same limitations typical of NAS methods, for example, having to search for an optimal cell using a predefined network architecture, or having to hand pick the search space. The present techniques propose a mixed semi-automated approach that draws from the advantages of both automatic and manual network designing techniques. Specifically, setting the standard ResNet-18 network as a starting point, the focus is on searching for optimal binary network structures, gradually exploring a set of different directions (width, depth, groups, layer arrangement).
[081] Effect of block arrangement: Starting from a ResNet-based topology, a network with $N_i$ blocks at each of the four resolutions is denoted as $N_0N_1N_2N_3$, with each block having two convolutional layers. Table 2 shows a comparison on ImageNet of Stage 1 models with different block arrangements, i.e. whether re-arranging the blocks, mainly by using a network which is heavier at later stages, can have an impact on accuracy. Note that since the number of features is doubled between stages, this re-arrangement preserves the same complexity. It can be seen that the accuracy remains largely unchanged when the layers are re-distributed.
Table 2:

N0N1N2N3    Top-1 (%)    Top-5 (%)
1133        63.8         86.5
1142        63.8         86.8
1124        63.7         87.4
2222        63.9         87.4

[82] Depth vs width: As explained above, an efficient width expansion mechanism based on grouped convolutions was proposed, which is found to increase the accuracy of binary networks without increasing complexity. The effect of increasing depth by adding more blocks is also investigated. Figure 3A shows the results of depth expansion. Each constellation represents a different architecture in which only the number of blocks, i.e. the depth, is varied (as shown by the text on the graph). It can be seen that the returns of increasing depth diminish as complexity also rapidly increases, resulting in very heavy models. The results show that, for a fixed computation budget, the proposed wide binary models with grouped convolutions actually outperform the deep ones by a large margin.
[83] Effect of groups: As explained above, the present techniques propose grouped convolutions as a mechanism for keeping the computations under control as the network width is increased. The effect of different group sizes and their placement across the network is also explored. This, in turn, allows the computational budget to be varied at various points in the network with a high degree of granularity, while preserving the width and, as such, the information flow. To describe the space of network structures explored, the following naming convention is used: a network with $N_i$ blocks at each resolution, a corresponding width expansion $E$ (the same $E$ is used for all blocks) and group size $G_i$ for each convolution in these blocks is denoted as $N_0N_1N_2N_3$-$E$-$G_0$:$G_1$:$G_2$:$G_3$.
[084] Figure 3B and Table 3 show the effect of the number of groups and their placement on accuracy. Networks with the same structure are connected with the same type of line. It can be seen that increasing the number of groups (especially for the last two stages) results in significantly more efficient models which maintain the high accuracy (with only a small decrease) compared to much larger models having the same network structure but fewer groups. That is, increasing the group size drastically reduces the binary operations (BOPs) with little impact on accuracy. In particular, the results suggest that group sizes of 16 or 32 for the last two stages provide the best trade-off.
Table 3: Architectures (in the $N_0N_1N_2N_3$-$E$-$G_0$:$G_1$:$G_2$:$G_3$ notation, e.g. 1242-2-4:4:8:16 and 1262-2-4:8:8:16) compared in terms of BOPs and Stage I/Stage II top-1 and top-5 accuracy on ImageNet.

[085] Effect of aggregation over groups: The efficient width expansion mechanism described above uses a very weak way of aggregating the information across different groups. A better way is to explicitly use a $1 \times 1$ binary convolutional layer (with no groups) after each block. The effect of adding that layer is shown in Figure 4, in which the dashed lines connect the same models with and without the $1 \times 1$ convolutional layer. Clearly, aggregation across groups via $1 \times 1$ convolutions offers significant accuracy gains, adding at the same time a reasonable amount of complexity.
[086] Training: The training procedure largely follows that described in Brais Martinez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos, "Training binary neural networks with real-to-binary convolutions", ICLR, 2020. In particular, the networks are trained using the Adam optimizer (Diederik P Kingma and Jimmy Ba, "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980, 2014) for 75 epochs using a learning rate of $10^{-3}$ that is decreased by a factor of 10 at epochs 40, 55 and 65. During Stage I, the weight decay is set to $10^{-5}$, and to 0 during Stage II. Furthermore, during the first 10 epochs a learning rate warm-up is applied (as described in Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour", arXiv preprint arXiv:1706.02677, 2017). The images are augmented by randomly scaling and cropping them to a resolution of 224 x 224px. In addition, to avoid overfitting on the given expert filters, Mixup is used with $\alpha = 0.2$ (Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz, "mixup: Beyond empirical risk minimization", arXiv preprint arXiv:1710.09412, 2017). For testing, the standard procedure of scaling the images to a resolution of 256px first and then center cropping them was used. All models were trained on 4 V100 GPUs and implemented using PyTorch.
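The optimizer and schedule described above translate directly into PyTorch; the sketch below is a hedged reading of those hyperparameters, with model, criterion and train_loader as placeholders.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=1e-5)  # 1e-5 in Stage I, 0 in Stage II
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 55, 65], gamma=0.1)

for epoch in range(75):
    if epoch < 10:  # linear learning-rate warm-up over the first 10 epochs
        for group in optimizer.param_groups:
            group['lr'] = 1e-3 * (epoch + 1) / 10
    for images, targets in train_loader:
        lam = torch.distributions.Beta(0.2, 0.2).sample()  # mixup, alpha = 0.2
        perm = torch.randperm(images.size(0))
        mixed = lam * images + (1 - lam) * images[perm]
        output = model(mixed)
        loss = (lam * criterion(output, targets)
                + (1 - lam) * criterion(output, targets[perm]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```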
[87] Table 4 below compares the model of the present techniques against state-of-the-art methods with similar capacity, based on ImageNet classification. The first four architectures include models that increase the network size/capacity; the last column in the table shows the capacity scaling. The fifth, sixth and seventh architectures are binary NAS methods. The present techniques are listed in the table as EBConv, EBConv* and EBConv**. In the table, * denotes models trained using AT+KD, as described in Brais Martinez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos, "Training binary neural networks with real-to-binary convolutions", ICLR, 2020, and ** denotes an improved training scheme, which is described below.
[88] It can be seen from Table 4 that the present techniques improve on the currently best performing method (Real-to-Bin, as described in Brais Martinez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos, "Training binary neural networks with real-to-binary convolutions", ICLR, 2020) by almost 6% in terms of top-1 accuracy. In other words, the present techniques improve upon prior work by around 6% with no increase in computational cost, reaching a groundbreaking ~71% accuracy on ImageNet classification.
[89] Furthermore, as shown in Table 4, the present techniques surpass the accuracy of significantly larger and slower networks (ABC-Net, Struct. Approx., CBCN, and Ensemble) by a wide margin.
[90] The present techniques were also compared against two recent works that use NAS to search for binary networks (BATS, BNAS-F and BNAS-G). Table 4 shows that the present techniques outperform these, again by a large margin, while being significantly more efficient.
[91] In terms of computational requirements, the present techniques maintain the same overall budget, having an equal or slightly lower number of FLOPs and BOPs, as shown in Table 4. Although the present techniques increase the storage space used, by two times for a model that uses four experts, the run-time memory largely remains the same.
Table 4:

Architecture        Top-1 (%)  Top-5 (%)  BOPs x 10^9  FLOPs x 10^8  # bits (W/A)
ABC-Net (M,N = 5)   65.0       85.9       42.5         1.3           (1/1)x25
Struct. Approx.     66.3       86.6       -            -             (1/1)x4
CBCN                61.4       82.8       -            -             (1/1)x4
Ensemble            61.0       -          10.6         7.8           (1/1)x6
BATS                66.1       87.0       2.1          1.2           (1/1)
BNAS-F              58.9       80.9       1.7          1.5           (1/1)
BNAS-G              62.2       83.9       3.6          1.5           (1/1)
BNN                 42.2       69.2       1.7          1.3           1/1
XNOR-Net            51.2       73.2       1.7          1.3           1/1
CCNN                54.2       77.9       1.7          1.3           1/1
Bi-Real Net         56.4       79.5       1.7          1.5           1/1
Rethink. BNN        56.6       79.4       1.7          1.3           1/1
XNOR-Net++          57.1       79.9       1.7          1.4           1/1
IR-Net              58.1       80.0       1.7          1.3           1/1
CI-Net              59.9       84.2       -            -             1/1
Real-to-Bin*        65.4       86.2       1.7          1.5           1/1
EBConv              67.5       87.5       1.7          1.1           1/1
EBConv*             70.0       89.2       1.7          1.1           1/1
EBConv**            71.2       90.1       1.7          1.1           1/1

ABC-Net is described in Xiaofan Lin, Cong Zhao, and Wei Pan, "Towards accurate binary convolutional neural network", NIPS, pp. 345-353, 2017. Struct. Approx. is described in Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid, "Structured binary neural networks for accurate image classification and semantic segmentation", CVPR, pp. 413-422, 2019. CBCN is described in Chunlei Liu, Wenrui Ding, Xin Xia, Baochang Zhang, Jiaxin Gu, Jianzhuang Liu, Rongrong Ji, and David Doermann, "Circulant binary convolutional networks: Enhancing the performance of 1-bit DCNNs with circulant back propagation", CVPR, pp. 2691-2699, 2019. Ensemble is described in Shilin Zhu, Xin Dong, and Hao Su, "Binary ensemble neural network: More bits per network or more networks per bit?", CVPR, pp. 4923-4932, 2019. BATS is described in Adrian Bulat, Brais Martinez, and Georgios Tzimiropoulos, "BATS: Binary architecture search", arXiv preprint arXiv:2003.01711, 2020. BNAS-F and BNAS-G are described in Kunal Pratap Singh, Dahyun Kim, and Jonghyun Choi, "Learning architectures for binary networks", arXiv preprint arXiv:2002.06963, 2020. BNN is described in Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1", arXiv preprint arXiv:1602.02830, 2016. XNOR-Net is described in Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks", European Conference on Computer Vision, pp. 525-542, Springer, 2016. CCNN is described in Zhe Xu and Ray CC Cheung, "Accurate and compact convolutional neural networks with trained binarization", arXiv preprint arXiv:1909.11366, 2019. Bi-Real Net is described in Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng, "Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm", ECCV, pp. 747-763, 2018. Rethink. BNN is described in Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, and Roeland Nusselder, "Latent weights do not exist: Rethinking binarized neural network optimization", NeurIPS, 2019. XNOR-Net++ is described in Adrian Bulat and Georgios Tzimiropoulos, "XNOR-Net++: Improved binary neural networks", BMVC, 2019. IR-Net is described in Haotong Qin, Ruihao Gong, Xianglong Liu, Mingzhu Shen, Ziran Wei, Fengwei Yu, and Jingkuan Song, "Forward and backward information retention for accurate binary neural networks", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2250-2259, 2020. CI-Net is described in Ziwei Wang, Jiwen Lu, Chenxin Tao, Jie Zhou, and Qi Tian, "Learning channel-wise interactions for binary convolutional neural networks", CVPR, pp. 568-577, 2019.
[092] The improved training scheme mentioned above (EBConv**) is based on the idea that the increase in representational capacity offered by the proposed model could benefit from a stronger teacher. To validate this hypothesis, two real-valued teacher models of different capacity (controlled by depth) were trained. One scored 72.5% top-1 accuracy on ImageNet classification, and a larger one scored 76.0%. Table 5 shows the impact of the teacher on the final accuracy of the model of the present techniques on ImageNet classification. As shown in Table 5 below, the model of the present techniques can exploit the knowledge contained in a stronger teacher network, improving the overall performance by 1.2%. In Step I, a full precision model with a structure that matches the binary network of the present techniques is trained. In Step II, the previous model is used as a teacher to train a student with binary activations and real-valued weights. At the end of this step, the weight expansion strategy is also performed, which propagates the trained weights across all experts as described above. Finally, the model produced at the previous step is used as a teacher for training a fully binary network.
Table 5:

FP32 Teacher    Binary Student
72.5%           70.0%
76.0%           71.2%

[93] Figure 5 is a flowchart of example steps for training a machine learning, ML, model comprising a plurality of convolutional layers. The method shown in Figure 5 is a computer-implemented method for training a machine learning, ML, model comprising a plurality of convolutional layers. The method comprises obtaining training data representative of an input space (step S100). The training data may be obtained from any suitable source, including ImageNet. The training data and process are described above in more detail.
[94] The method comprises training each convolutional layer of the ML model using the training data, the layer comprising a plurality of expert binary filters (step S102). Training the plurality of expert binary filters of each convolutional layer may comprise training each expert binary filter to specialise to a subset of the training data. In other words, each expert binary filter may specialise to a portion of an input space.
[95] Training the plurality of expert binary filters of each convolutional layer may comprise: training each expert binary filter using the training data to learn real-value weights, and generating an intermediate trained model; and training the intermediate model using the training data to learn binary-value weights.
[96] Training the plurality of expert binary filters of each convolutional layer may comprise, prior to training each expert binary filter: training one binary filter using the training data to learn a real-value weight; and applying the real-value weight to all remaining expert binary filters of the plurality of expert binary filters of each convolutional layer to initialise the training of each expert binary filter.
[097] The plurality of convolutional layers of the ML model may be grouped into two or more groups. In this case, the training of the plurality of expert binary filters may comprise training the expert binary filters in each group together.
[098] The method comprises training a gating function of each convolutional layer to select one of the plurality of expert binary filters based on input data to be processed by the ML model (step S104). The gating function is preferably trained to select the best or most appropriate expert binary filter from the plurality of expert binary filters to process the input data.
[99] Figure 6 is a flowchart of example steps for analysing input data using a machine learning, ML, model comprising a plurality of convolutional layers. The method shown in Figure 6 is a computer-implemented method for analysing input data using a machine learning, ML, model comprising a plurality of convolutional layers. The method comprises receiving at least one input data item for analysis (step S200). The at least one input data item may be an image or a frame of a video, which may be stored on a device which implements this method and/or may be captured by a device which implements this method.
[100] The method comprises inputting the input data item into the ML model (step S202).
[101] The method comprises selecting, using a trained gating function of each convolutional layer, an expert binary filter of each convolutional layer to process the input data item (step S204), and processing the input data item using the selected expert binary filters (step S206). The specific expert binary filter which is selected may vary depending on the input data item.
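At inference time the expert selection is internal to each EBConv layer, so steps S204 and S206 amount to a single forward pass; a brief usage sketch, with model and input_image as placeholders:

```python
import torch

model.eval()
with torch.no_grad():
    # Each EBConv layer's gating function picks one expert for this input.
    logits = model(input_image.unsqueeze(0))
    prediction = logits.argmax(dim=-1)
```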
[102] The method for analysing input data may further comprise receiving instructions on the analysis to be performed. The selecting may comprise selecting expert binary filters based on the analysis to be performed.
[103] The input data item may be an image or a frame of a video. The instructions may specify how the image/frame is to be processed.
[104] For example, the input data item may be an image or image frame, and the instructions may be to perform image classification. The selecting may therefore comprise selecting expert binary filters to classify the input data item.
[105] In another example, the input data item may be an image or image frame, and the instructions may be to perform gesture recognition. The selecting may therefore comprise selecting expert binary filters to identify gestures in the input data item.
[106] In another example, the input data item may be an image or image frame, and the instructions may be to perform emotion or action recognition. The selecting may comprise selecting expert binary filters to identify emotions or actions in the input data item.
[107] The instructions may be to identify images matching a user query, wherein the at least one input data item is a set of images stored on a user device. In this case, the selecting may comprise selecting expert binary filters based on the user query. The method may further comprise outputting at least one image matching the user query.
[108] The instructions may be to determine when to capture an image matching a user requirement, wherein the at least one input data item is a set of images viewed by an image capture device. In this case, the selecting may comprise selecting expert binary filters based on the user requirements. The method may further comprise sending instructions to the image capture device to capture an image that matches the user requirement.
[109] Figures 7A to 7D show example applications of the method for analysing input data. In Figure 7A, an apparatus (such as a smartphone) comprises an image capture device for taking photographs or videos. The captured image or video may be processed by the trained ML model to perform near-real-time or real-time analysis of the image/video. For example, the trained ML model may analyse the image/video to determine what objects are present in the image/video or whether a specific object is present in the image/video, or what type of scenes the image capture device should select, or who is present in the image/video.
[110] Figure 7B shows how the trained ML model may be used to identify images/videos that match a user query. For example, a user may want to see images that were taken on a sunny day or find images with a specific person which show that person smiling. Thus, the ML model may be used to perform gallery search. Similarly, the trained ML model may be used to provide real-time or near-real-time AI-assisted image/video capture. For example, a user may want to capture a photo when everyone is smiling or to capture a photo when someone is cooking. Thus, the user may specify user requirements/conditions that the ML model uses to control when a photo or video is captured.
[111] Figure 7C shows how the trained ML model may be used to perform on-device image classification. For example, each input image (which may be captured by an image capture device) may be analysed by the trained ML model to identify and classify objects within the image.
[112] Figure 7D shows how a device such as a robotic assistant device may use the trained ML model to perform image recognition and other analysis tasks on-device. As the model runs on the robotic device itself, it is cheaper to implement (no cloud resources are required), faster, and more private (the data being analysed remains local to the device). For example, the robotic device may perform user recognition and identification using the trained ML model to, among other things, automatically recognise the user who interacts with the robot and automatically identify the user's pose. In another example, the robotic device may perform face or body tracking using the trained ML model to automatically localise face and body landmarks in an input image or video stream. In another example, the robotic device may perform action or emotion recognition using the trained ML model, to identify a user's action or displayed emotion so that it can react appropriately.
[113] Figure 8 shows an apparatus 100 for analysing input data using a machine learning, ML, model comprising a plurality of convolutional layers. The apparatus 100 may be any one of: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a vehicle, a drone, an autonomous vehicle, a robot or robotic device, a robotic assistant, image capture system or device, an augmented reality system or device, a virtual reality system or device, a gaming system, an Internet of Things device, or a smart consumer device (such as a smart fridge). It will be understood that this is a non-exhaustive and non-limiting list of example devices.
[114] The apparatus 100 comprises a trained machine learning, ML, model 106 for analysing input data, such as for the purposes described above with respect to Figures 7A to 7D.
[115] The apparatus comprises at least one processor 102 coupled to memory 104. The at least one processor 102 may comprise one or more of: a microprocessor, a microcontroller, and an integrated circuit. The memory 104 may comprise volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.
[116] The at least one processor 102 may be arranged to analyse input data using the machine learning, ML, model 106 comprising a plurality of convolutional layers, by: receiving at least one input data item for analysis; inputting the input data item into the ML model; selecting, using a trained gating function of each convolutional layer, an expert binary filter of each convolutional layer to process the input data item; and processing the input data item using the selected expert binary filters.
[117] The apparatus 100 may further comprise at least one image capture device 108 for capturing images or videos to be processed by the ML model.
[118] The apparatus 100 may further comprise storage 112 storing images or videos to be processed by the ML model.
[119] The apparatus may further comprise at least one user interface 110 for receiving a user query. In this case, the selecting may comprise selecting expert binary filters based on the user query. The user interface 110 may be, for example, a display screen, a touch screen, or a microphone to capture spoken instructions.
[120] The processor 102 may receive instructions on the analysis to be performed (e.g. via the user interface 110). The selecting may comprise selecting expert binary filters based on the analysis to be performed.
[121] The input data item may be an image or image frame captured by the at least one image capture device 108. The instructions received by the processor 102 may be instructions to perform image classification. Thus, the selecting may comprise selecting expert binary filters to classify the input data item.
[122] Alternatively, the instructions received by the processor 102 may be to perform gesture recognition. Thus, the selecting may comprise selecting expert binary filters to identify gestures in the input data item.
[123] In another example, the input data item may be an image or image frame captured by the at least one image capture device 108, and the user instructions may be to perform emotion or action recognition. Here, the selecting may comprise selecting expert binary filters to identify emotions or actions in the input data item.
[124] In another example, the user query received via the user interface 110 may be to identify images matching search criteria, and the at least one input data item may be a set of images stored in storage 112 on the user device. Here, the selecting may comprise selecting expert binary filters based on the user query. The processor 102 may output at least one image matching the user query.
[125] In another case, the user query received via the user interface 110 may be to capture an image using the image capture device 108 matching a user requirement. Thus, the at least one input data item may be a set of images obtained by an image capture device 108. Here, the selecting may comprise selecting expert binary filters based on the user requirement. The processor 102 may control the image capture device 108 to capture an image when the image obtained by the image capture device matches the user requirement.
[126] Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and, where appropriate, other modes of performing the present techniques, the present techniques should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that the present techniques have a broad range of applications, and that the embodiments may be modified in a wide range of ways without departing from any inventive concept as defined in the appended claims.

Claims (23)

1. A computer-implemented method for training a machine learning, ML, model comprising a plurality of convolutional layers, the method comprising: obtaining training data representative of an input space; training each convolutional layer of the ML model using the training data, the layer comprising a plurality of expert binary filters; and training a gating function of each convolutional layer to select one of the plurality of expert binary filters based on input data to be processed by the ML model.
2. The method as claimed in claim 1 wherein training the plurality of expert binary filters of each convolutional layer comprises training each expert binary filter to specialise to a subset of the training data.
3. The method as claimed in claim 1 or 2 wherein training the plurality of expert binary filters of each convolutional layer comprises: training each expert binary filter using the training data to learn real-value weights, and generating an intermediate trained model; and training the intermediate model using the training data to learn binary-value weights.
4. The method as claimed in claim 3 wherein training the plurality of expert binary filters of each convolutional layer comprises, prior to training each expert binary filter: training one binary filter using the training data to learn a real-value weight; and applying the real-value weight to all remaining expert binary filters of the plurality of expert binary filters of each convolutional layer to initialise the training of each expert binary filter.
5. The method as claimed in any preceding claim wherein the plurality of convolutional layers of the ML model are grouped into two or more groups, and wherein the training of the plurality of expert binary filters comprises training the expert binary filters in each group together.
6. A computer-implemented method for analysing input data using a machine learning, ML, model comprising a plurality of convolutional layers, the method comprising: receiving at least one input data item for analysis; inputting the input data item into the ML model; selecting, using a trained gating function of each convolutional layer, an expert binary filter of each convolutional layer to process the input data item; and processing the input data item using the selected expert binary filters.
7. The method as claimed in claim 6 further comprising receiving instructions on the analysis to be performed, and wherein the selecting comprises selecting expert binary filters based on the analysis to be performed.
8. The method as claimed in claim 7 wherein the input data item is an image or image frame, and the instructions are to perform image classification, and wherein the selecting comprises selecting expert binary filters to classify the input data item.
9. The method as claimed in claim 7 wherein the input data item is an image or image frame, and the instructions are to perform gesture recognition, and wherein the selecting comprises selecting expert binary filters to identify gestures in the input data item.
10. The method as claimed in claim 7 wherein the input data item is an image or image frame, and the instructions are to perform emotion or action recognition, and wherein the selecting comprises selecting expert binary filters to identify emotions or actions in the input data item.
11. The method as claimed in claim 7 wherein the instructions are to identify images matching a user query, wherein the at least one input data item is a set of images stored on a user device, and wherein the selecting comprises selecting expert binary filters based on the user query, the method further comprising: outputting at least one image matching the user query.
12. The method as claimed in claim 7 wherein the instructions are to determine when to capture an image matching a user requirement, wherein the at least one input data item is a set of images viewed by an image capture device, and wherein the selecting comprises selecting expert binary filters based on the user requirements, the method further comprising: sending instructions to the image capture device to capture an image that matches the user requirement.
13. A non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out the method of any of claims 1 to 12.
14. An electronic apparatus comprising: at least one processor, coupled to memory, arranged to analyse input data using a machine learning, ML, model comprising a plurality of convolutional layers, by: receiving at least one input data item for analysis; inputting the input data item into the ML model; selecting, using a trained gating function of each convolutional layer, an expert binary filter of each convolutional layer to process the input data item; and processing the input data item using the selected expert binary filters.
15. The apparatus as claimed in claim 14 further comprising at least one image capture device for capturing images or videos to be processed by the ML model.
16. The apparatus as claimed in claim 14 or 15 further comprising storage storing images or videos to be processed by the ML model.
17. The apparatus as claimed in claim 14, 15 or 16 further comprising a user interface for receiving a user query, and wherein the selecting comprises selecting expert binary filters based on the user query.
18. The apparatus as claimed in any of claims 14 to 17 wherein the processor receives instructions on the analysis to be performed, and wherein the selecting comprises selecting expert binary filters based on the analysis to be performed.
19. The apparatus as claimed in claim 18 wherein: the input data item is an image or image frame captured by the at least one image capture device, the instructions are to perform image classification, and the selecting comprises selecting expert binary filters to classify the input data item.
20. The apparatus as claimed in claim 18 wherein: the input data item is an image or image frame captured by the at least one image capture device, the instructions are to perform gesture recognition, and the selecting comprises selecting expert binary filters to identify gestures in the input data item.
21. The apparatus as claimed in claim 18 wherein: the input data item is an image or image frame captured by the at least one image capture device, the user instructions are to perform emotion or action recognition, and the selecting comprises selecting expert binary filters to identify emotions or actions in the input data item.
22. The apparatus as claimed in claim 17 wherein: the user query received via the user interface is to identify images matching search criteria, the at least one input data item is a set of images stored on a user device, and the selecting comprises selecting expert binary filters based on the user query, wherein the processor: outputs at least one image matching the user query.
23. The apparatus as claimed in claim 17 wherein: the user query received via the user interface is to capture an image using the image capture device matching a user requirement, the at least one input data item is a set of images obtained by an image capture device, and the selecting comprises selecting expert binary filters based on the user requirements, wherein the processor: controls the image capture device to capture an image when the image obtained by the image capture device matches the user requirement.
GB2103967.2A 2020-06-01 2021-03-22 Method for improved binarized neural networks Pending GB2599180A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GR20200100298 2020-06-01

Publications (2)

Publication Number Publication Date
GB202103967D0 GB202103967D0 (en) 2021-05-05
GB2599180A true GB2599180A (en) 2022-03-30

Family

ID=75689728

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2103967.2A Pending GB2599180A (en) 2020-06-01 2021-03-22 Method for improved binarized neural networks

Country Status (1)

Country Link
GB (1) GB2599180A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537462A (en) * 2021-06-30 2021-10-22 Huawei Technologies Co Ltd Data processing method, neural network quantization method and related device
CN117058755A (en) * 2023-08-09 2023-11-14 Chongqing Yongchuan Vocational Education Center Thermal imaging gesture recognition method based on binary neural network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018085643A1 (en) * 2016-11-04 2018-05-11 Google Llc Mixture of experts neural networks

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018085643A1 (en) * 2016-11-04 2018-05-11 Google Llc Mixture of experts neural networks

Non-Patent Citations (25)

* Cited by examiner, † Cited by third party
Title
ADRIAN BULAT ET AL: "High-Capacity Expert Binary Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 7 October 2020 (2020-10-07), XP081780866 *
ADRIAN BULAT, BRAIS MARTINEZ, GEORGIOS TZIMIROPOULOS: "BATS: Binary architecture search", ARXIV:2003.01711, 2020
ADRIAN BULAT, GEORGIOS TZIMIROPOULOS: "XNOR-Net++: Improved binary neural networks", BMVC, 2019
BOHAN ZHUANG, CHUNHUA SHEN, MINGKUI TAN, LINGQIAO LIU, IAN REID: "Structured binary neural networks for accurate image classification and semantic segmentation", CVPR, 2019, pages 413-422, XP033686953, DOI: 10.1109/CVPR.2019.00050
BRAIS MARTINEZ, JING YANG, ADRIAN BULAT, GEORGIOS TZIMIROPOULOS: "Training binary neural networks with real-to-binary convolutions", ICLR, 2020
BRANDON YANG ET AL: "CondConv: Conditionally Parameterized Convolutions for Efficient Inference", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 10 April 2019 (2019-04-10), XP081502686 *
CHEN ZHOURONG ET AL: "You Look Twice: GaterNet for Dynamic Filter Selection in CNNs", 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 15 June 2019 (2019-06-15), pages 9164 - 9172, XP033687244, DOI: 10.1109/CVPR.2019.00939 *
CHUNLEI LIU, WENRUI DING, XIN XIA, BAOCHANG ZHANG, JIAXIN GU, JIANZHUANG LIU, RONGRONG JI, DAVID DOERMANN: "Circulant binary convolutional networks: Enhancing the performance of 1-bit dcnns with circulant back propagation", CVPR, 2019, pages 2691-2699
DIEDERIK P KINGMA, JIMMY BA: "Adam: A method for stochastic optimization", ARXIV:1412.6980, 2014
HAOTONG QIN, RUIHAO GONG, XIANGLONG LIU, MINGZHU SHEN, ZIRAN WEI, FENGWEI YU, JINGKUAN SONG: "Forward and backward information retention for accurate binary neural networks", PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2020, pages 2250-2259
HONGYI ZHANG, MOUSTAPHA CISSE, YANN N DAUPHIN, DAVID LOPEZ-PAZ: "mixup: Beyond empirical risk minimization", ARXIV:1710.09412, 2017
JIAO XIE ET AL: "Training convolutional neural networks with cheap convolutions and online distillation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 28 September 2019 (2019-09-28), XP081484622 *
KOEN HELWEGEN, JAMES WIDDICOMBE, LUKAS GEIGER, ZECHUN LIU, KWANG-TING CHENG, ROELAND NUSSELDER: "Latent weights do not exist: Rethinking binarized neural network optimization", NEURIPS, 2019
KUNAL PRATAP SINGH, DAHYUN KIM, JONGHYUN CHOI: "Learning architectures for binary networks", ARXIV:2002.06963, 2020
MATTHIEU COURBARIAUX, ITAY HUBARA, DANIEL SOUDRY, RAN EL-YANIV, YOSHUA BENGIO: "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1", ARXIV:1602.02830, 2016
MISHRA SHRIJA ET AL: "Emotion Recognition Through Facial Gestures - A Deep Learning Approach", 28 November 2017, ADVANCES IN BIOMETRICS : INTERNATIONAL CONFERENCE, ICB 2007, SEOUL, KOREA, AUGUST 27 - 29, 2007 ; PROCEEDINGS; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, PAGE(S) 11 - 21, ISBN: 978-3-540-74549-5, XP047456032 *
MOHAMMAD RASTEGARI, VICENTE ORDONEZ, JOSEPH REDMON, ALI FARHADI: "XNOR-Net: ImageNet classification using binary convolutional neural networks", EUROPEAN CONFERENCE ON COMPUTER VISION, SPRINGER, 2016, pages 525-542
PRIYA GOYAL, PIOTR DOLLAR, ROSS GIRSHICK, PIETER NOORDHUIS, LUKASZ WESOLOWSKI, AAPO KYROLA, ANDREW TULLOCH, YANGQING JIA, KAIMING HE: "Accurate, large minibatch SGD: Training ImageNet in 1 hour", ARXIV:1706.02677, 2017
SHILIN ZHU, XIN DONG, HAO SU: "Binary ensemble neural network: More bits per network or more networks per bit?", CVPR, 2019, pages 4923-4932
VIEIRA JOAO ET AL: "A Product Engine for Energy-Efficient Execution of Binary Neural Networks Using Resistive Memories", 2019 IFIP/IEEE 27TH INTERNATIONAL CONFERENCE ON VERY LARGE SCALE INTEGRATION (VLSI-SOC), IEEE, 6 October 2019 (2019-10-06), pages 160 - 165, XP033644241, DOI: 10.1109/VLSI-SOC.2019.8920343 *
XIAOFAN LIN, CONG ZHAO, WEI PAN: "Towards accurate binary convolutional neural network", NIPS, 2017, pages 345-353
YU CHENG ET AL: "A Survey of Model Compression and Acceleration for Deep Neural Networks", IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING, 13 December 2017 (2017-12-13), XP055507257, Retrieved from the Internet <URL:https://arxiv.org/pdf/1710.09282.pdf> [retrieved on 20180914] *
ZECHUN LIU, BAOYUAN WU, WENHAN LUO, XIN YANG, WEI LIU, KWANG-TING CHENG: "Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm", ECCV, 2018, pages 747-763, XP047489046, DOI: 10.1007/978-3-030-01267-0_44
ZHE XU, RAY CC CHEUNG: "Accurate and compact convolutional neural networks with trained binarization", ARXIV:1909.11366, 2019
ZIWEI WANG, JIWEN LU, CHENXIN TAO, JIE ZHOU, QI TIAN: "Learning channel-wise interactions for binary convolutional neural networks", CVPR, 2019, pages 568-577, XP033686698, DOI: 10.1109/CVPR.2019.00066

Also Published As

Publication number Publication date
GB202103967D0 (en) 2021-05-05

Similar Documents

Publication Publication Date Title
Campos et al. Skip rnn: Learning to skip state updates in recurrent neural networks
US20230342616A1 (en) Systems and Methods for Contrastive Learning of Visual Representations
Meng et al. Ar-net: Adaptive frame resolution for efficient action recognition
US11270124B1 (en) Temporal bottleneck attention architecture for video action recognition
Govorkova et al. Autoencoders on field-programmable gate arrays for real-time, unsupervised new physics detection at 40 MHz at the Large Hadron Collider
WO2020063715A1 (en) Method and system for training binary quantized weight and activation function for deep neural networks
US20230046066A1 (en) Method and apparatus for video recognition
Sarabu et al. Human action recognition in videos using convolution long short-term memory network with spatio-temporal networks
KR102051706B1 (en) Method, Device and Recording medium for detrending gated recurrent neural network
GB2599180A (en) Method for improved binarized neural networks
US20230351203A1 (en) Method for knowledge distillation and model genertation
US20230267307A1 (en) Systems and Methods for Generation of Machine-Learned Multitask Models
WO2021234365A1 (en) Optimising a neural network
Budden et al. Gaussian gated linear networks
Zhang et al. Basisnet: Two-stage model synthesis for efficient inference
Koh et al. Towards efficient video-based action recognition: context-aware memory attention network
US20220284240A1 (en) Methods for training and analysing input data using a machine learning model
Singh et al. Deep active transfer learning for image recognition
Rossi et al. Recursively refined r-cnn: Instance segmentation with self-roi rebalancing
Ye Emotion recognition of online education learners by convolutional neural networks
Alsarhan et al. Collaborative Positional-Motion Excitation Module for Efficient Action Recognition
EP4239590A1 (en) Method for performing image or video recognition using machine learning
Passov et al. Gator: customizable channel pruning of neural networks with gating
Zhang et al. Learning Spatiotemporal-Selected Representations in Videos for Action Recognition
KR102429804B1 (en) Electronic device control method and program for filter conversion of artificial intelligence model