WO2020225772A1 - Method and system for initializing a neural network - Google Patents

Method and system for initializing a neural network

Info

Publication number
WO2020225772A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
training
trained neural
trained
output
Prior art date
Application number
PCT/IB2020/054350
Other languages
French (fr)
Inventor
Farsheed VARNO
Behrouz Haji SOLEIMANI
Marzie SAGHAYI
Lisa DI JORIO
Stan Matwin
Original Assignee
Imagia Cybernetics Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Imagia Cybernetics Inc. filed Critical Imagia Cybernetics Inc.
Priority to CA3134565A priority Critical patent/CA3134565A1/en
Priority to US17/609,296 priority patent/US20220215252A1/en
Priority to JP2021565987A priority patent/JP2022531882A/en
Priority to EP20801909.1A priority patent/EP3966741A1/en
Priority to CN202080034485.XA priority patent/CN113795850A/en
Publication of WO2020225772A1 publication Critical patent/WO2020225772A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks

Definitions

  • One or more embodiments of the invention pertain to artificial intelligence. More precisely, one or more embodiments of the invention pertain to a method and a system for initializing a neural network.
  • ANNs: Artificial Neural Networks
  • In feature extraction, the pre-trained features are only used in inference mode and the corresponding parameters remain intact during training. This protects the learned representations from undesired contamination but also prevents the required new task-specific features from being learned.
  • Fine-tuning lets the pre-trained features and augmented parameters learn the target task together. Fine-tuning usually performs better than feature extraction and than training from scratch with random initialization [6]. However, the pre-trained features are substantially contaminated due to noise flowing from random layers to the loss and, from there, back-propagated toward the features.
  • a method for initializing a pre-trained neural network comprising obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
  • the amending of the output layer of the pre-trained neural network further comprises z-normalizing features located right before the output layer prior to updating each weight.
  • the pre-trained neural network uses softmax logit in the output layer.
  • a method for training a pre-trained neural network comprising obtaining a pre-trained neural network to train; obtaining a dataset suitable for said training; initializing the pre-trained neural network using the method disclosed above; training the initialized pre-trained neural network using the obtained dataset; and providing the trained neural network.
  • the training is a federated learning method.
  • the training is a meta-learning method.
  • the training is a distributed machine learning method.
  • the training is a network architecture search using said pre-trained neural network as a seed.
  • the pre-trained neural network comprises a generative adversarial network, wherein said initializing of the pre-trained neural network using the method disclosed above is performed at the discriminator.
  • a method for training a neural network through federated learning comprising obtaining a shared neural network to train; obtaining at least two datasets suitable for said federated learning, each of the at least two datasets for training a corresponding decentralized training unit; each decentralized training unit performing a first round of training using a corresponding dataset; for each subsequent round of training: each decentralized training unit initializing the shared neural network using the method disclosed above, each decentralized training unit training the initialized shared neural network using the corresponding dataset, globally federating the learning from all decentralized training units to a resulting global shared neural network, and until the global shared neural network converges to a good global model, providing the corresponding global shared neural network to the decentralized training units as the new shared neural network; and providing the trained shared neural network.
  • a method for training a neural network using a reptile meta-learning method comprising obtaining a neural network to train; obtaining a dataset suitable for said reptile meta-learning method; for each iteration of the reptile meta-learning method: initializing the neural network using the method as disclosed above for each task sampled, and training the initialized neural network for said corresponding sampled task using the obtained dataset; and providing the trained neural network.
  • the training of the initialized pre-trained neural network comprising training the initialized pre-trained neural network using a first training batch of the obtained dataset, wherein the first training batch is smaller than a number of features fed to said last layer of said initialized pre-trained neural network.
  • a computer comprising a central processing unit; a graphics processing unit; a communication port; a memory unit comprising an application for initializing a pre-trained neural network, the application comprising: instructions for obtaining a pre-trained neural network having an output layer, instructions for amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and instructions for providing the initialized pre-trained neural network.
  • a computer program comprising computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
  • a non-transitory computer readable storage medium for storing computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
  • a method for initializing a neural network comprising obtaining a neural network having an output layer, amending the output layer of the neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized neural network.
  • An advantage of one or more embodiments of the method disclosed is that they significantly decrease the initial noise that is back-propagated from randomly-initialized parameters toward layers that contain the transferred knowledge.
  • a processing device used for training a neural network according to one or more embodiments of the method disclosed herein will use fewer resources for training the neural network, leaving more resources available to complete other tasks.
  • one or more embodiments of the method disclosed herein may contribute to better performance overall compared to other traditional training methods, and significantly improve performance in training cases involving a small number of training steps, which is particularly useful to evaluate model potential during architecture search and design.
  • one or more embodiments of the method disclosed herein may contribute to decreasing the negative impact of catastrophic forgetting, which may occur during model training across tasks, as they limit the impact of noise propagation.
  • Another advantage of one or more embodiments of the method disclosed is that they are easy to implement and, in one embodiment, can be beneficially applied to any pre-trained neural network that estimates output probabilities using softmax logits.
  • a benefit is a broad applicability and integration across various deep learning frameworks offered by various vendors such as Google TensorFlow and Facebook PyTorch.
  • the optimal parameter initialization is derived for neural networks being fine-tuned on pre-trained models for classification, and it is shown that such an optimal initial loss leads to a significant acceleration in adapting a pre-trained neural network to a new task.
  • Another advantage of one or more embodiments of the method disclosed is that they are independent of the choice of architecture and may be applied to transfer knowledge within any domain. As a consequence, a benefit is that one or more embodiments of the method disclosed herein do not increase the complexity of the overall architecture to be trained (e.g. no additional layer, no multi-stage training process like warm-up methods).
  • Another advantage of one or more embodiments of the method disclosed herein is that they show a significant practical impact on convergence.
  • a processing device used for training a neural network according to one or more embodiments of the method disclosed herein will use fewer resources for training the neural network than with conventional methods, leaving more resources available to complete other tasks.
  • One or more embodiments of the method disclosed herein may contribute to better performance overall compared to other traditional training methods, and significantly improve performance in training cases involving a small number of training steps, which is particularly useful to evaluate model potential during architecture search and design.
  • One or more embodiments of the method disclosed herein may contribute to decreasing the negative impact of catastrophic forgetting, which may occur during model training across tasks, as they limit the impact of noise propagation.
  • Figures 1a and 1b are graphs which show the negative effect of classical transfer learning techniques on the variance of the output layer on two benchmarks (MNIST and CIFAR), using state-of-the-art models. Both graphs highlight the "noise injection" phenomenon.
  • Figures 2a and 2b are diagrams which show respectively an FNN architecture and last layer initialization in a base model and in accordance with one embodiment of the method disclosed.
  • Figure 3 is a flowchart which shows an embodiment of a method for initializing a pre-trained neural network.
  • Figure 4 is a flowchart which shows an embodiment of a method for training a pre-trained neural network which uses an embodiment of the method disclosed in Figure 3.
  • Figure 5 is a diagram which shows an embodiment of a processing device which may be used for initializing a pre-trained neural network in accordance with an embodiment.
  • Figure 6 is a table which illustrates an initial percentage of noise energy to total energy of back-propagated error at the output of the last layer, profiled for models fine-tuned using regular fine-tuning; 95% confidence interval is calculated over 24 seeds.
  • Figure 7 shows a plurality of graphs and illustrates test accuracy progress for fine-tuning models that are pre-trained on ImageNet dataset.
  • Figure 8a is a table which illustrates average initial test accuracy improvement by using one embodiment of the method disclosed herein.
  • Figure 8b is a table which illustrates convergence test accuracy of models trained on CIFAR10 dataset with 95% confidence.
  • Figure 8c is a table which illustrates convergence test accuracy of models trained on CIFAR100 dataset with 95% confidence.
  • Figure 9 is a table which illustrates convergence test accuracy of models trained on Caltech101 dataset with 95% confidence.
  • The term "invention" and the like mean "the one or more inventions disclosed in this application," unless expressly specified otherwise.
  • one or more embodiments of the present invention are directed to a method and a system for initializing a pre-trained neural network and its use for training a pre-trained neural network.
  • Now referring to Fig. 5, there is shown an embodiment of a processing device 500 which may be used for implementing a method for initializing a pre-trained neural network.
  • processing device 500 may be any type of computer.
  • the processing device 500 is selected from a group consisting of desktop computers, laptop computers, tablet PC’s, servers, smartphones, etc.
  • the processing device 500 comprises a central processing unit (CPU) 502, also referred to as a microprocessor, a graphic processing unit (GPU) 503, input/output devices 504, an optional display device 506, communication ports 508, a data bus 510 and a memory unit 512.
  • the central processing unit 502 is used for processing computer instructions. The skilled addressee will appreciate that various embodiments of the central processing unit 502 may be provided.
  • the central processing unit 502 comprises an i9-7920X CPU manufactured by Intel (TM).
  • the graphics processing unit 503 is used for processing specific computer instructions. It will be appreciated that a memory unit 520 is operatively connected to the graphics processing unit 503.
  • the graphics processing unit 503 comprises a Titan V GPU manufactured by Nvidia (TM).
  • the input/output devices 504 are used for inputting/outputting data into the processing device 500.
  • the optional display device 506 is used for displaying data to a user.
  • the skilled addressee will appreciate that various types of display device 506 may be used.
  • the optional display device 506 is a standard liquid crystal display (LCD) monitor.
  • the communication ports 508 are used for operatively connecting the processing device 500 to various processing devices.
  • the communication ports 508 may comprise, for instance, universal serial bus (USB) ports for connecting a keyboard and a mouse to the processing device 500.
  • the communication ports 508 may further comprise a data network communication port such as an IEEE 802.3 port for enabling a connection of the processing device 500 with another processing device.
  • the memory unit 512 is used for storing computer-executable instructions.
  • the memory unit 512 may comprise a system memory such as a high- speed random access memory (RAM) for storing system control program (e.g., BIOS, operating system module, applications, etc.) and a read-only memory (ROM).
  • the memory unit 512 has 128 GB of DDR4 RAM.
  • the memory unit 512 comprises, in one embodiment, an operating system module 514.
  • operating system module 514 may be of various types.
  • the operating system module 514 is Linux Ubuntu 18.04 + Lambda Stack.
  • the memory unit 520 comprises an application 516 for initializing a pre-trained neural network.
  • the memory unit 520 operatively connected to the graphics processing unit 503 and has a size of 24 GB of VRAM.
  • the skilled addressee will appreciate that various alternative embodiments may be possible.
  • the memory unit 520 is further used for storing data 518.
  • the skilled addressee will appreciate that the data 518 may be of various types.
  • the memory unit 520 of the graphics processing unit 503 is further used for storing at least a portion of the training data; the amount of training data transferred to the graphics processing unit 503 at each iteration is referred to as the batch size.
  • a larger batch size may improve the effectiveness of the optimization steps, resulting in more rapid convergence of the model parameters; a larger batch size can also improve performance by reducing the communication overhead caused by moving the training data to the graphics processing unit 503, causing more compute cycles to run on the card with each iteration.
  • the processing device 500 is a 4GPU Deep learning workstation manufactured by Lambda Quad.
  • a Feed-forward Neural Network is usually built by stacking up a number of layers on top of each other.
  • the input of a layer can be composed of any combination of the previous layers’ outputs.
  • the last layer is usually a fully connected one, with a weight matrix $W_L \in \mathbb{R}^{C \times N}$, where $N$ is the number of features fed to the last layer and $C$ is the number of output classes.
  • For a single example, Equation 2 can be rewritten as $x_{out} = W_L\, x_L + b_L$, where $x_L$ denotes the features fed to the last layer and $b_L$ the biases.
  • the posterior of the last layer's neurons, corresponding to each class, is usually estimated using a softmax normalizer, defined as $\hat{y}_j = \exp(x_{out,j}) \big/ \sum_{k=1}^{C} \exp(x_{out,k})$.
  • cross entropy, $\mathcal{L} = -\sum_{j=1}^{C} y_j \log \hat{y}_j$, is the most commonly used loss function for classification tasks and, for one-hot labels, is equal to the Kullback-Leibler divergence between the labels and the estimates.
  • the gradients of the CE loss with respect to the output of the j-th neuron of the deepest layer are equal to $\partial \mathcal{L} / \partial x_{out,j} = \hat{y}_j - y_j$.
  • the gradients of the loss with respect to the rows of the weight matrix are $\partial \mathcal{L} / \partial W_{L,j} = (\hat{y}_j - y_j)\, x_L^{\top}$.
  • each data entry is passed twice through each layer’s weights except the layers fed directly by the raw input.
  • the magnitude of weights in a layer may get affected by the energy of the input visited by the layer, and the error back-propagated up to its output.
  • a term involving the transpose of the layer's input usually appears in the derivative of the loss with respect to the weights of the l-th layer. This is already shown for the last layer in Equation 9.
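  • As an illustration of the relations summarized in the preceding items, the following Python/NumPy sketch computes the softmax estimates, the cross-entropy loss and its gradients with respect to the last layer's outputs and weights. It is provided for clarity only; the variable names (x_L, W_L, b_L) mirror the notation used here, and the averaging over the batch is an assumption rather than part of the original disclosure.

```python
import numpy as np

def last_layer_gradients(x_L, W_L, b_L, y):
    """Illustrative sketch of the last-layer quantities described above.

    x_L : (K, N) features fed to the last layer (K examples, N features)
    W_L : (C, N) last-layer weights (C classes)
    b_L : (C,)   last-layer biases
    y   : (K, C) one-hot labels
    """
    logits = x_L @ W_L.T + b_L                       # last-layer outputs x_out
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability only
    y_hat = np.exp(logits)
    y_hat /= y_hat.sum(axis=1, keepdims=True)        # softmax estimates

    ce = -np.mean(np.sum(y * np.log(y_hat + 1e-12), axis=1))  # cross-entropy loss

    d_logits = (y_hat - y) / len(x_L)                # dL/dx_out: estimate minus label
    d_W = d_logits.T @ x_L                           # dL/dW_L: error times x_L transpose
    d_b = d_logits.sum(axis=0)                       # dL/db_L
    return ce, d_logits, d_W, d_b
```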
  • Weights are distinguished from biases and called such since they involve multiplication. This operation can rapidly increase/decrease the energy of its result, compared to the operands. This intensified/lessened energy of the output may increase/decrease the energy of the weights themselves through the gradient updates as discussed.
  • Fig. 1 shows the sudden initial change in the variance of X_L with an initial learning rate equal to 0.0001, fine-tuned on (a) the MNIST and (b) the CIFAR100 datasets. The horizontal axes show the training steps. In each model the augmented parameters are initialized so as to preserve the variance of the gradient, as recommended in [8]. The color shadows represent the standard deviation through training with 24 different seeds.
  • During the warm-up phase the accuracy of the network is limited since most parts of the network are frozen. Additionally, the effective number of required training steps in the warm-up phase may be large, depending on the learning rate, the initial values of the augmented parameters and the size of the dataset.
  • There is disclosed an initialization technique for fine-tuning in which the noise is initially trapped only within the task-specific augmented parameters.
  • the noise is always minimized after the first update and therefore the parameters can be trained altogether afterward.
  • the method disclosed herein is easier to apply in the sense that the training process is not manipulated in any way.
  • the energy consists of three components, of which only one is directly correlated with the accuracy of the estimator; the two others are the energies of the true labels and of the estimates. The contribution of these components is disclosed and lower and upper bounds for each one are found.
  • Using Equation 7, the total energy of the error over all examples in the batch and all C neurons of the last layer, written as an average over the K examples, is equal to $\frac{1}{K}\sum_{k=1}^{K}\sum_{j=1}^{C}\big(\hat{y}_j^{(k)}-y_j^{(k)}\big)^2 = \frac{1}{K}\sum_{k,j}\big(\hat{y}_j^{(k)}\big)^2 + \frac{1}{K}\sum_{k,j}\big(y_j^{(k)}\big)^2 - \frac{2}{K}\sum_{k,j}\hat{y}_j^{(k)}\,y_j^{(k)}$.
  • with one-hot labels, the third term is proportional to the average probability assignment for the correct labels.
  • the goal of training the model is to maximize the average probability assigned to the correct labels, which is bounded by 1.
  • the total energy of the back-propagated error, per example, is therefore bounded between 0 and 2.
  • to minimize the initial energy of the estimates, and thus maximize their entropy, the neurons of the last layer should have per-example equal output.
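  • A small numerical check of the bounds discussed above is given below. The snippet is illustrative only, with an assumed batch size and class count: maximum-entropy (uniform) estimates keep the per-example energy of the back-propagated error low, whereas confident but label-independent estimates approach the upper bound of 2.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C = 256, 10                                    # assumed batch size and number of classes
labels = np.eye(C)[rng.integers(0, C, size=K)]    # one-hot labels

def error_energy(estimates, labels):
    # per-example energy of the back-propagated error (y_hat - y)
    return np.mean(np.sum((estimates - labels) ** 2, axis=1))

uniform = np.full((K, C), 1.0 / C)                # maximum-entropy estimates
peaked = np.eye(C)[rng.integers(0, C, size=K)]    # confident, label-independent estimates

print(error_energy(uniform, labels))              # 1 - 1/C = 0.9: low initial noise energy
print(error_energy(peaked, labels))               # about 2 * (1 - 1/C) = 1.8: near the bound of 2
```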
  • Now referring to Fig. 3, there is shown an embodiment of a method 100 for initializing a pre-trained neural network.
  • a pre-trained neural network is obtained. It will be appreciated that the pre-trained neural network has an output layer. In one embodiment, the pre-trained neural network uses softmax logit.
  • pre-trained neural network may be provided according to various embodiments.
  • the pre-trained neural network is received from a processing device. In another embodiment, the pre-trained neural network is obtained from a memory unit of the processing device. In another embodiment, the pre-trained neural network is provided by a user interacting with the processing device. The skilled addressee will appreciate that various alternative embodiments may be provided for providing the pre-trained neural network.
  • the output layer of the pre-trained neural network is amended. It will be appreciated that the amending of the output layer of the pre-trained neural network comprises updating each weight of the output layer according to a function that maximizes the entropy of the output classes probability.
  • the function depends on a parameter controlling a proportion of error of the output classes probability such that it decreases the variance of the output classes probability.
  • the amending of the output layer of the pre-trained neural network further comprises z-normalizing features located right before the output layer prior to updating each weight.
  • the initializing of the pre- trained neural network is performed to prevent adverse contamination during a training of the initialized pre-trained neural network.
  • According to processing step 106, the initialized pre-trained neural network is provided.
  • the initialized pre-trained neural network may be provided according to various embodiments.
  • the initialized pre- trained neural network is provided to a processing device.
  • the initialized pre-trained neural network is saved in a memory unit of the processing device.
  • the initialized pre-trained neural network is displayed to a user interacting with the processing device.
  • The skilled addressee will appreciate that various alternative embodiments may be provided for providing the initialized pre-trained neural network. While it has been disclosed in Fig. 3 that the method is used for initializing a neural network which is pre-trained, it will be appreciated that in one or more alternative embodiments, the neural network is not pre-trained.
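  • For clarity, a minimal PyTorch-style sketch of the initialization just described is given below. It assumes the pre-trained network exposes its output layer as a torch.nn.Linear module named fc and uses an illustrative standard deviation of 1e-4 for the small, zero-centered weights; neither assumption is prescribed by the disclosure. The embodiment that also z-normalizes the features fed to the output layer would additionally insert a normalization step such as the one sketched further below.

```python
import torch
import torch.nn as nn

def initialize_pretrained(model: nn.Module, num_classes: int, std: float = 1e-4) -> nn.Module:
    """Sketch of the initialization of Fig. 3: amend the output layer of a
    pre-trained network so that its initial class probabilities are close to
    uniform (maximum entropy), then return the initialized network."""
    in_features = model.fc.in_features               # features fed to the output layer
    new_fc = nn.Linear(in_features, num_classes)
    with torch.no_grad():
        new_fc.weight.normal_(mean=0.0, std=std)     # i.i.d., zero-centered, very small weights
        new_fc.bias.zero_()
    model.fc = new_fc                                 # replace the task-specific output layer
    return model                                      # provide the initialized network
```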
  • a method for initializing a neural network comprises obtaining a neural network having an output layer.
  • neural network may be provided according to various embodiments.
  • the neural network is received from a processing device. In another embodiment, the neural network is obtained from a memory unit of the processing device. In another embodiment, the neural network is provided by a user interacting with the processing device. The skilled addressee will appreciate that various alternative embodiments may be provided for providing the neural network.
  • the method further comprises amending the output layer of the neural network.
  • the amending of the output layer comprises updating each weight of the output layer according to a function that maximizes the entropy of the output classes probability. It will be appreciated that the function depends on a parameter controlling a proportion of error of the output classes probability such that it decreases the variance of the output classes probability.
  • the method further comprises providing the initialized neural network.
  • the initialized neural network may be provided according to various embodiments.
  • the initialized neural network is provided to a processing device.
  • the initialized neural network is saved in a memory unit of the processing device.
  • the initialized neural network is displayed to a user interacting with the processing device.
  • various alternative embodiments may be provided for providing the initialized neural network.
  • the initialized pre-trained neural network may be used when training the pre-trained neural network as disclosed for instance in Fig. 4.
  • a pre-trained neural network to train is obtained.
  • the pre-trained neural network may be obtained according to various embodiments.
  • a dataset suitable for the training is obtained.
  • dataset suitable for the training may be obtained according to various embodiments.
  • the dataset is obtained from a remote processing device operatively connected with a processing device.
  • the pre-trained neural network is initialized.
  • the pre-trained neural network may be initialized according to one or more embodiments of the method disclosed in Fig. 3.
  • the initialized pre-trained neural network is trained. It will be appreciated that the initialized pre-trained neural network is trained using the dataset obtained.
  • the training of the initialized pre-trained neural network comprises training the initialized pre-trained neural network using a first training batch of the obtained dataset, wherein the first training batch is smaller than the number of features fed to the last layer of the initialized pre-trained neural network.
  • the training is a federated learning method. It will be appreciated that the federated learning method is disclosed at https://arxiv.org/pdf/1902.04885.pdf.
  • the training is a meta-learning method. It will be appreciated that meta-learning is disclosed for instance in the article "Human-level concept learning through probabilistic program induction" by Brenden M. Lake et al., Science 350, 1332 (2015).
  • the training is a distributed machine learning method. It will be appreciated that the distributed machine learning method is disclosed at https://arxiv.org/abs/1810.06060.
  • the training is a network architecture search using the pre-trained neural network as a seed. It will be appreciated that the network architecture search is disclosed at https://arxiv.org/pdf/1802.03268.pdf.
  • the pre-trained neural network comprises a generative adversarial network.
  • the initializing of the pre-trained neural network is performed at the discriminator. Still referring to Fig. 4 and according to processing step 208, the trained neural network is provided.
  • trained neural network may be provided according to various embodiments.
  • the trained neural network is provided to a processing device. In another embodiment, the trained neural network is saved in a memory unit of the processing device. The skilled addressee will appreciate that various alternative embodiments may be provided for providing the trained neural network.
  • the method comprises obtaining a shared neural network to train.
  • the method further comprises obtaining at least two datasets suitable for the federated learning. Each of the at least two datasets is used for training a corresponding decentralized training unit.
  • the method further comprises each decentralized training unit performing a first round of training using a corresponding dataset.
  • the method further comprises, for each subsequent round of training, each decentralized training unit initializing the shared neural network using one or more embodiments of the method disclosed above for initializing a pre-trained neural network and each decentralized training unit training the initialized shared neural network using the corresponding dataset.
  • the method further comprises globally federating the learning from all decentralized training units to a resulting global shared neural network, and until the global shared neural network converges to a good global model, providing the corresponding global shared neural network to the decentralized training units as the new shared neural network (a sketch of this round-based procedure is shown below).
  • the method comprises providing the trained shared neural network.
  • the trained shared neural network may be provided according to various embodiments.
  • the trained shared neural network is provided to a processing device.
  • the trained shared neural network is saved in a memory unit of the processing device.
  • the trained shared neural network is displayed to a user interacting with the processing device.
  • various alternative embodiments may be provided for providing the trained shared neural network.
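  • The round-based federated procedure described above can be sketched as follows. This is a simplified illustration only: it assumes a FedAvg-style parameter average for the global federation step, a fixed number of rounds in place of an explicit convergence test, and a model whose output layer is exposed as fc; the client datasets and the local training routine are placeholders.

```python
import copy
import torch
import torch.nn as nn

def reinit_output_layer(model: nn.Module, std: float = 1e-4) -> nn.Module:
    # Disclosed initialization: very small, zero-centered output-layer weights.
    with torch.no_grad():
        model.fc.weight.normal_(0.0, std)
        model.fc.bias.zero_()
    return model

def local_train(model: nn.Module, batches, lr: float = 0.01) -> nn.Module:
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for x, y in batches:                              # one local pass over the unit's data
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

def federated_training(shared_model: nn.Module, client_datasets, rounds: int = 10) -> nn.Module:
    for rnd in range(rounds):
        local_models = []
        for batches in client_datasets:
            local = copy.deepcopy(shared_model)
            if rnd > 0:                               # first round: plain local training
                reinit_output_layer(local)            # subsequent rounds: re-initialize first
            local_models.append(local_train(local, batches))
        with torch.no_grad():                         # globally federate: average the parameters
            for name, param in shared_model.named_parameters():
                stacked = torch.stack([dict(m.named_parameters())[name] for m in local_models])
                param.copy_(stacked.mean(dim=0))
    return shared_model                               # trained shared neural network
```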
  • the method comprises obtaining a neural network to train.
  • the method further comprises obtaining a dataset suitable for the reptile meta-learning method. It will be appreciated that the reptile meta-learning method is disclosed at https://d4mucfpksywv.cloudfront.net/research-covers/reptile/reptile_update.pdf.
  • the method further comprises, for each iteration of the reptile meta-learning method, initializing the neural network using one or more embodiments of the method disclosed above for initializing a pre-trained neural network for each task sampled, and training the initialized neural network for the corresponding sampled task using the obtained dataset (a sketch of this procedure is shown below).
  • the method comprises providing the trained neural network.
  • the trained neural network may be provided according to various embodiments.
  • the trained neural network is provided to a processing device.
  • the trained neural network is saved in a memory unit of the processing device.
  • the trained neural network is displayed to a user interacting with the processing device. The skilled addressee will appreciate that various alternative embodiments may be provided for providing the trained neural network.
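  • A compact sketch of the reptile meta-learning loop with per-task re-initialization is given below. It is illustrative only: sample_task is a placeholder that yields (input, label) batches for one sampled task, the output layer is assumed to be exposed as fc, and the inner and outer step sizes are arbitrary choices rather than values from the disclosure.

```python
import copy
import torch
import torch.nn as nn

def reptile_train(model: nn.Module, sample_task, iterations: int = 1000,
                  inner_steps: int = 5, inner_lr: float = 0.01, meta_lr: float = 0.1) -> nn.Module:
    """Reptile-style outer loop in which the output layer is re-initialized
    (per the disclosed method) for every sampled task."""
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(iterations):
        batches = list(sample_task())                 # placeholder: (x, y) batches for one task
        fast = copy.deepcopy(model)
        with torch.no_grad():                         # disclosed initialization, applied per task
            fast.fc.weight.normal_(0.0, 1e-4)
            fast.fc.bias.zero_()
        opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for x, y in batches[:inner_steps]:            # a few inner SGD steps on the task
            opt.zero_grad()
            loss_fn(fast(x), y).backward()
            opt.step()
        with torch.no_grad():                         # Reptile update: move toward the fast weights
            for p, q in zip(model.parameters(), fast.parameters()):
                p.add_(meta_lr * (q - p))
    return model                                      # trained neural network
```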
  • the energy of the estimates contains pure noise, i.e. it lacks a meaningful relationship with either the inputs or the labels. Its infimum was calculated and it was shown that it can be achieved when all the estimates are exactly equal to each other for each example. This condition is intuitively appealing since it maximizes the entropy of the estimates prior to the training, when the estimates and the labels are independent and/or unaligned.
  • To prevent the noise from contaminating pre-trained layers (see Equation 8), an efficient solution should consider both of these criteria.
  • One or more embodiments of a method are therefore introduced which maximize the initial entropy of the estimates while preventing the pre-trained features from becoming contaminated by the noise.
  • the method can be described as follows.
  • the method requires the features that are fed to the last layer to be normalized. This is done by applying z-normalization across the batch, i.e. subtracting from each feature its mean and dividing by its standard deviation, both computed over the batch.
  • Figure 2 shows an FNN architecture and the last layer's initialization in (a) the base model and (b) ENTAME. According to [8], m is 2 for ReLU networks.
  • the method maximizes the entropy of the estimates by initializing the last layer's weights to values drawn from an independent and identically distributed (i.i.d.), zero-centered normal distribution with a very small standard deviation.
  • with this initialization, each output neuron is approximately zero-centered and the per-example energy of all output neurons is close to zero.
  • the exponential function is close to linear when its input is close to zero. This can easily be shown by using the result of Equation 22 and the Taylor series approximation of the exponential function around zero.
  • the outputs of each neuron of the last layer can get comparably high expected values. This may cause the estimates to have much lower entropy compared to the initial state.
  • multiple rows and columns of weights and corresponding elements of biases from the last layer can possibly get identical first updates (see Equation 24).
  • the exponential functions in the softmax make small differences much larger. Therefore, the expressiveness of the last layer is preserved by initializing its weights to very small numbers instead of zero.
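  • The effect described in the preceding items can be checked numerically. In the sketch below (with assumed sizes for the batch, features and classes), z-normalized features combined with very small, non-zero, zero-centered weights yield estimates whose entropy is close to its maximum log(C), while the rows of the weight matrix remain distinct.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, C = 64, 512, 10                          # assumed batch size, feature count and classes

x = rng.normal(size=(K, N)) * 5.0 + 3.0        # raw features fed to the last layer
x = (x - x.mean(0)) / (x.std(0) + 1e-8)        # z-normalization across the batch

W = rng.normal(scale=1e-4, size=(C, N))        # very small, zero-centered weights (not zero)
logits = x @ W.T
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

entropy = -(p * np.log(p)).sum(axis=1).mean()
print(entropy, np.log(C))                      # initial entropy is close to the maximum log(C)
print(np.abs(p - 1.0 / C).max())               # estimates are nearly uniform, yet not identical
```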
  • the first update makes the energy of W_L large enough to let the error of the next updates back-propagate through it and reach the pre-trained layers. In other words, this automatically opens up the stalled path and lets the error back-propagate to the output of the other layers. This is enough for correctly guiding the pre-trained parameters when an advanced optimization algorithm like Adam [13] is used. Most of the noise is purified and the next errors back-propagated toward the pre-trained features are meaningful and contain both prior and likelihood. In more detail, the energy of the j-th row of W_L becomes:
  • initial N contains only information about the prior, we desire to make its energy initially smaller than . For this to happen, initial N has to be chosen such
  • in Equation 25, N > K (which usually is satisfied).
  • in Equation 25, l should be chosen small, but not so small as to numerically reduce the rank of W_L due to possible similar updates (see Equation 25); l roughly determines the maximum proportion of the remaining energy of the back-propagated error.
  • a feature normalization is performed. It will be appreciated that applying z-normalization on top of the features may increase or decrease the level of average feature-wise energy in X_L, resulting in less need for tweaking the learning rate and F_w for different tasks and even different models. If the values of X_L are too small, it may take a longer time for W_L to grow, which leaves the pre-trained features unchanged for a longer time.
  • z-normalization is applied if the provided initialized pre-trained model is to be further trained on the provided data, where the provided data exhibits an important domain shift with respect to the data on which the model was pre-trained.
  • Z-normalization across batches plays a more important role than just equalization.
  • suppose image classification is performed and two images in a batch contain exactly the same pattern or visual object. If one of the columns of X_L represents a feature that recognizes said pattern, the feature is expected to reflect the presence of the pattern in both of the mentioned images equally.
  • the problem is that raw inputs are usually normalized with statistics that are identically applied on all pixels in all examples. In the best case, such normalization is applied separately for different channels.
  • object-wise normalization does not seem to be feasible prior to detection, which is indirectly done through training the neural network classifier. Therefore, even if the same object is exactly copied in both images, due to the normalization of the raw images one object may get less intensified than the other one. This may be directly reflected in the values of the particular column of X_L responsible for showing the presence of the desired object. Z-normalization compensates for this problem by normalizing the features after they are detected.
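  • One possible way to realize the feature normalization discussed above is sketched below as a small module placed right before the (re-initialized) output layer. This is an assumed implementation, not necessarily the exact one used in the disclosure; the name model.fc in the usage comment is likewise an assumption.

```python
import torch
import torch.nn as nn

class BatchZNorm(nn.Module):
    """Z-normalize each feature fed to the output layer using statistics
    computed over the current batch."""
    def __init__(self, eps: float = 1e-8):
        super().__init__()
        self.eps = eps

    def forward(self, x_L: torch.Tensor) -> torch.Tensor:
        mean = x_L.mean(dim=0, keepdim=True)          # per-feature mean over the batch
        std = x_L.std(dim=0, keepdim=True)            # per-feature standard deviation over the batch
        return (x_L - mean) / (std + self.eps)

# Hypothetical usage: insert the normalization right before the amended output layer.
# model.fc = nn.Sequential(BatchZNorm(), model.fc)
```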
  • ImageNet [18] ILSVRC 2012 is the source dataset used to pre-train the models. Each pre-trained model is fine-tuned on the following datasets: MNIST [16], CIFAR10, CIFAR100 [15] and Caltech101 [5]. The latter dataset is not originally separated into train and test subsets, nor is it balanced, in contrast to the other ones. Each Caltech101 category is split randomly into train and test subsets with a 15 percent chance of drawing each image for the test subset. Prior to feeding the input to the models, each channel is normalized with its mean and standard deviation obtained from all pixels of that channel throughout the corresponding training subset. Training images are also augmented with random horizontal flips.
  • Figure 7 shows the progress of test accuracy of pre-trained models fine- tuned on each dataset.
  • the smaller plot inside each larger one shows the same curves zoomed-in the first steps of training.
  • the colorful shade around each curve shows the standard deviation across 24 different seeds.
  • Each plot includes 4 curves, color-mapped as follows: blue: base; orange: base with a single Warm Up (WU) step; green: the disclosed method's Maximum Entropy Initialization (MEI); red: the full disclosed method, i.e. MEI + Feature Normalization (FN).
  • Figure 8a shows the average increase in the accuracy over the first 10 training steps with 95% confidence. Further improvements have been observed by adjusting l and the batch size, but to show the robustness of the model the same setup has been kept as much as possible.
  • Now referring to Fig. 7, there is shown the test accuracy progress for fine-tuning models that are pre-trained on the ImageNet dataset.
  • the horizontal axes on each plot show the number of training steps. Colorful shades show the standard deviation across different seeds.
  • a superscript * means that all models in corresponding row or column are trained with batch size of 64 instead of 256 to make the model fit into the device.
  • the smaller plots inside the bigger ones are just zoomed-in version of the same curves for the first few steps.
  • Now referring to Fig. 8a, there is shown the average initial test accuracy improvement obtained by using an embodiment of the method disclosed herein instead of the base method.
  • the entries show increase in the mean of test accuracy over first 10 steps of training with 95% confidence calculated over 24 seeds.
  • Now referring to Fig. 8b, there is shown the convergence test accuracy of models trained on the CIFAR10 dataset with 95% confidence.
  • Now referring to Fig. 8c, there is shown the convergence test accuracy of models trained on the CIFAR100 dataset with 95% confidence.
  • Now referring to Fig. 9, there is shown the convergence test accuracy of models trained on the Caltech101 dataset with 95% confidence. It will be appreciated that although the focus was on image classification, the reasoning behind the strong performance of one or more embodiments of the method disclosed herein is not tied to image datasets in any way.
  • the application 516 for initializing a neural network comprises instructions for obtaining a pre-trained neural network having an output layer.
  • the application 516 for initializing a pre-trained neural network further comprises instructions for amending the output layer of the pre-trained neural network.
  • the amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability.
  • the function depends on a parameter controlling a proportion of error of the output classes probability such that it decreases the variance of the output classes probability.
  • the application 516 for initializing a pre-trained neural network further comprises instructions for providing the initialized pre-trained neural network.
  • a non-transitory computer readable storage medium for storing computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein the amending comprises updating each weight of the output layer according to a function that maximizes the entropy of the output classes probability, wherein the function depends on a parameter controlling a proportion of error of said output classes probability such as it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
  • a computer program comprising computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such as it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
  • An advantage of one or more embodiments of the method disclosed is that they significantly decrease the initial noise that is back-propagated from randomly-initialized parameters toward layers that contain the transferred knowledge.
  • a processing device used for training a neural network according to one or more embodiments of the method disclosed herein will use fewer resources for training the neural network, leaving more resources available to complete other tasks.
  • one or more embodiments of the method disclosed herein may contribute to better performance overall compared to other traditional training methods, and significantly improve performance in training cases involving a small number of training steps, which is particularly useful to evaluate model potential during architecture search and design.
  • one or more embodiments of the method disclosed herein may contribute to decreasing the negative impact of catastrophic forgetting, which may occur during model training across tasks, as they limit the impact of noise propagation.
  • Another advantage of one or more embodiments of the method disclosed is that they are easy to implement and, in one embodiment, can be beneficially applied to any pre-trained neural network that estimates output probabilities using softmax logits.
  • a benefit is a broad applicability and integration across various deep learning frameworks offered by various vendors such as Google TensorFlow and Facebook PyTorch.
  • the optimal parameter initialization is derived for neural networks being fine-tuned on pre-trained models for classification, and it is shown that such an optimal initial loss leads to a significant acceleration in adapting a pre-trained neural network to a new task.
  • Another advantage of one or more embodiments of the method disclosed is that they are independent of the choice of architecture and may be applied to transfer knowledge within any domain. As a consequence, a benefit is that one or more embodiments of the method disclosed herein do not increase the complexity of the overall architecture to be trained (e.g. no additional layer, no multi-stage training process like warm-up methods).
  • Another advantage of one or more embodiments of the method disclosed herein is that they show a significant practical impact on convergence.
  • a processing device used for training a neural network according to one or more embodiments of the method disclosed herein will use fewer resources for training the neural network than with conventional methods, leaving more resources available to complete other tasks.
  • One or more embodiments of the method disclosed herein may contribute to better performance overall compared to other traditional training methods, and significantly improve performance in training cases involving a small number of training steps, which is particularly useful to evaluate model potential during architecture search and design.
  • One or more embodiments of the method disclosed herein may contribute to decreasing the negative impact of catastrophic forgetting, which may occur during model training across tasks, as they limit the impact of noise propagation.
  • Clause 1 A method for initializing a pre-trained neural network, the method comprising: obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
  • Clause 2 The method as claimed in clause 1, wherein the amending of the output layer of the pre-trained neural network further comprises z-normalizing features located right before the output layer prior to updating each weight.
  • Clause 3 The method as claimed in clause 1, wherein the pre-trained neural network uses softmax logit in the output layer.
  • Clause 4 A method for training a pre-trained neural network, the method comprising: obtaining a pre-trained neural network to train; obtaining a dataset suitable for said training; initializing the pre-trained neural network using the method as claimed in any one of clauses 1 to 3; training the initialized pre-trained neural network using the obtained dataset; and providing the trained neural network.
  • Clause 5 A method as claimed in clause 4, wherein said training is a federated learning method.
  • Clause 6 A method as claimed in clause 4, wherein said training is a meta-learning method.
  • Clause 7 A method as claimed in clause 4, wherein said training is a distributed machine learning method.
  • Clause 8 The method as claimed in clause 4, wherein said training is a network architecture search using said pre-trained neural network as a seed.
  • Clause 9 The method as claimed in any one of clauses 4 to 8, wherein the pre-trained neural network comprises a generative adversarial network, wherein said initializing of the pre-trained neural network using the method as claimed in clause 1 is performed at the discriminator.
  • Clause 10 A method for training a neural network through federated learning, the method comprising: obtaining a shared neural network to train; obtaining at least two datasets suitable for said federated learning, each of the at least two datasets for training a corresponding decentralized training unit; each decentralized training unit performing a first round of training using a corresponding dataset; for each subsequent round of training: each decentralized training unit initializing the shared neural network using the method as claimed in any one of clauses 1 to 3, each decentralized training unit training the initialized shared neural network using the corresponding dataset, globally federating the learning from all decentralized training units to a resulting global shared neural network, and until the global shared neural network converges to a good global model, providing the corresponding global shared neural network to the decentralized training units as the new shared neural network; and providing the trained shared neural network.
  • Clause 11 A method for training a neural network using a reptile meta-learning method, the method comprising: obtaining a neural network to train; obtaining a dataset suitable for said reptile meta-learning method; for each iteration of the reptile meta-learning method: initializing the neural network using the method as claimed in any one of clauses 1 to 3 for each task sampled, and training the initialized neural network for said corresponding sampled task using the obtained dataset; and providing the trained neural network.
  • Clause 12 The method as claimed in any one of clauses 4 to 9, wherein the training of the initialized pre-trained neural network comprises training the initialized pre-trained neural network using a first training batch of the obtained dataset, wherein the first training batch is smaller than a number of features fed to said last layer of said initialized pre-trained neural network.
  • Clause 13 A method for using a pre-trained neural network trained in accordance with any one of clauses 4 to 9.
  • Clause 14 A computer comprising: a central processing unit; a graphics processing unit; a communication port; a memory unit comprising an application for initializing a pre-trained neural network, the application comprising: instructions for obtaining a pre-trained neural network having an output layer, instructions for amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and instructions for providing the initialized pre-trained neural network.
  • Clause 15 A computer program comprising computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising: obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
  • Clause 16 A non-transitory computer readable storage medium for storing computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising: obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
  • Clause 17 A method for initializing a neural network, the method comprising: obtaining a neural network having an output layer, amending the output layer of the neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized neural network.
  • Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding 106(1), 59-70 (2007).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

A method and a system are disclosed for initializing a pre-trained neural network, the method comprising obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein the amending comprises updating each weight of the output layer according to a function that maximizes the entropy of the output classes probability, wherein the function depends on a parameter controlling a proportion of error of the output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.

Description

METHOD AND SYSTEM FOR INITIALIZING A NEURAL NETWORK
RELATED APPLICATION
The present application claims priority from US provisional application No. 62/844,472 filed on May 7, 2019, the content of which is incorporated herein in its entirety.
FIELD
One or more embodiments of the invention pertain to artificial intelligence. More precisely, one or more embodiments of the invention pertain to a method and a system for initializing a neural network.
BACKGROUND
Artificial Neural Networks (ANNs) have shown a great capacity in learning complicated tasks and have become the first contender to solve many problems in the machine learning community. However, a large training dataset is a key pre-requisite for these networks to achieve good performance. This limitation has opened a new chapter in neural network research, which attempts to make learning possible with limited amounts of data. So far, one of the most widely used techniques to cope with such a hindrance is the initialization of parameters based on the prior knowledge obtained from already trained models.
To adapt a pre-trained model to a new task, usually task-specific, extraneous and random parameters are transplanted onto a meaningful set of representations, resulting in a heterogeneous model [1, 6, 17, 19]. Training these unassociated modules together may contaminate the genuinely learned representations and significantly degrade the maximum transferable knowledge. Current fine-tuning techniques slow down the training process to compensate for this knowledge leak [17], which undermines fast convergence of a model that suffers from data shortage.
Previous studies on parameter initialization of ANNs [2, 7, 8, 14] focus on preserving the variance, or other statistics, of the flowing data along the depth. This stabilizes the model and makes training deeper networks possible. Arpit and Bengio [2] recently showed that the initialization introduced in [8] is the optimal one for a ReLU network trained from scratch. They recommended using the fan-out mode, which preserves the variance of the back-propagated error along the depth. He et al. [8] inconsistently exempted the last layer of the models used for their experiments from the distribution for weights that they recommended. This layer's distribution is stated to have been found experimentally and no justification has been provided for its outcome. Such a strategy can be traced back to earlier practices in constructing deep neural networks [21].
Recent studies on transfer learning use variance-preserving initialization techniques for fine-tuning [17, 20]. However, it can be shown that using such techniques initially contaminates the transferred knowledge, resulting in unguided modification of valuable transferred features.
Careful initialization is also an inevitable part of the self-normalized neural networks introduced in [14]. These networks use Scaled Exponential Linear Units (SELUs) as their activation function.
In feature extraction [3, 4], the pre-trained features are only used in inference mode and corresponding parameters remain intact during training. This protects the learned representations from undesired contamination but also prevents the required new task-specific features to be learned.
Fine-tuning [6] lets the pre-trained features and augmented parameters learn the target task together. Fine-tuning usually performs better than feature extraction and training from scratch with random initialization [6]. However, the pre-trained features are substantially contaminated due to noise flowing from random layers to the loss and from there, back-propagated toward the features.
There is a need for at least one of a method and a system that will overcome at least one of the above-identified drawbacks.
SUMMARY
According to a broad aspect, there is disclosed a method for initializing a pre-trained neural network, the method comprising obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
In accordance with one or more embodiments, the amending of the output layer of the pre-trained neural network further comprises z-normalizing features located right before the output layer prior to updating each weight.
In accordance with one or more embodiments, the pre-trained neural network uses softmax logit in the output layer.
In accordance with one or more embodiments, there is disclosed a method for training a pre-trained neural network, the method comprising obtaining a pre-trained neural network to train; obtaining a dataset suitable for said training; initializing the pre-trained neural network using the method disclosed above; training the initialized pre-trained neural network using the obtained dataset; and providing the trained neural network. In accordance with one or more embodiments, the training is a federated learning method.
In accordance with one or more embodiments, the training is a meta-learning method.
In accordance with one or more embodiments the training is a distributed machine learning method.
In accordance with one or more embodiments the training is a network architecture search using said pre-trained neural network as a seed.
In accordance with one or more embodiments, the pre-trained neural network comprises a generative adversarial network, wherein said initializing of the pre-trained neural network using the method disclosed above is performed at the discriminator.
In accordance with a broad aspect, there is disclosed a method for training a neural network through federated learning, the method comprising obtaining a shared neural network to train; obtaining at least two datasets suitable for said federated learning, each of the at least two datasets for training a corresponding decentralized training unit; each decentralized training unit performing a first round of training using a corresponding dataset; for each subsequent round of training: each decentralized training unit initializing the shared neural network using the method disclosed above, each decentralized training unit training the initialized shared neural network using the corresponding dataset, globally federating the learning from all decentralized training units to a resulting global shared neural network, and until the global shared neural network converges to a good global model, providing the corresponding global shared neural network to the decentralized training units as the new shared neural network; and providing the trained shared neural network. In accordance with a broad aspect, there is disclosed a method for training a neural network using a Reptile meta-learning method, the method comprising obtaining a neural network to train; obtaining a dataset suitable for said Reptile meta-learning method; for each iteration of the Reptile meta-learning method: initializing the neural network using the method as disclosed above for each task sampled, and training the initialized neural network for said corresponding sampled task using the obtained dataset; and providing the trained neural network.
In accordance with one or more embodiments, the training of the initialized pre-trained neural network comprises training the initialized pre-trained neural network using a first training batch of the obtained dataset, wherein the first training batch is smaller than a number of features fed to said last layer of said initialized pre-trained neural network.
In accordance with a broad aspect, there is disclosed a method for using a pre-trained neural network trained in accordance with a method disclosed above.
In accordance with a broad aspect, there is disclosed a computer comprising a central processing unit; a graphics processing unit; a communication port; a memory unit comprising an application for initializing a pre-trained neural network, the application comprising: instructions for obtaining a pre-trained neural network having an output layer, instructions for amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and instructions for providing the initialized pre-trained neural network.
In accordance with a broad aspect, there is disclosed a computer program comprising computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
In accordance with a broad aspect, there is disclosed a non-transitory computer readable storage medium for storing computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein the amending comprises updating each weight of the output layer according to a function that maximizes the entropy of the output classes probability, wherein the function depends on a parameter controlling a proportion of error of the output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
In accordance with a broad aspect, there is disclosed a method for initializing a neural network, the method comprising obtaining a neural network having an output layer, amending the output layer of the neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized neural network.
An advantage of one or more embodiments of the method disclosed is that they significantly decrease the initial noise that is back-propagated from randomly-initialized parameters toward the layers that contain the transferred knowledge. As a consequence, a processing device used for training a neural network according to one or more embodiments of the method disclosed herein will use fewer resources for training a neural network, resulting in more resources being available to complete other tasks. Moreover, one or more embodiments of the method disclosed herein may contribute to better performance overall compared to other traditional training methods, and significantly contribute to better performance in training cases comprising a small number of training steps compared to other traditional training methods, which is particularly useful to evaluate model potential during architecture search and design. Also, one or more embodiments of the method disclosed herein may contribute to decreasing the negative impact of catastrophic forgetting, which may occur during model training across tasks, as they limit the impact of noise propagation.
In fact, experiments show that models trained by one or more embodiments of the method disclosed herein learn substantially faster than those using prior art fine-tuning or even more complicated tricks such as warm up [17].
Another advantage of one or more embodiments of the method disclosed is that they are easy to implement and can be beneficially applied to any pre-trained neural network that estimates output probabilities using softmax logits in one embodiment. As a consequence, a benefit is a broad applicability and integration across various deep learning frameworks offered by various vendors such as Google TensorFlow and Facebook PyTorch.
The optimal parameter initialization is derived for neural networks being fine-tuned on pre-trained models for classification, and it is shown that such an optimal initial loss leads to a significant acceleration in adapting a pre-trained neural network to a new task. Another advantage of one or more embodiments of the method disclosed is that they are independent of the choice of architecture and may be applied to transfer knowledge within any domain. As a consequence, a benefit is that one or more embodiments of the method disclosed herein do not increase the complexity of the overall architecture to be trained (e.g. no additional layer, no multi-stage training process like warm-up methods).
Another advantage of one or more embodiments of the method disclosed herein is that they show a significant practical impact on convergence. As a consequence, a processing device used for training a neural network according to one or more embodiments of the method disclosed herein will use fewer resources for training a neural network than with conventional methods, resulting in more resources being available to complete other tasks. One or more embodiments of the method disclosed herein may contribute to better performance overall compared to other traditional training methods, and significantly contribute to better performance in training cases comprising a small number of training steps compared to other traditional training methods, which is particularly useful to evaluate model potential during architecture search and design. One or more embodiments of the method disclosed herein may contribute to decreasing the negative impact of catastrophic forgetting, which may occur during model training across tasks, as they limit the impact of noise propagation.
BRIEF DESCRIPTION OF THE DRAWINGS
Figures 1a and 1b are graphs which show the negative effect of classical transfer learning techniques on the variance of the output layer on two benchmarks (MNIST and CIFAR), using state of the art models. Both graphs highlight the "noise injection" phenomenon. Figures 2a and 2b are diagrams which show respectively an FNN architecture and last layer initialization in a base model and in accordance with one embodiment of the method disclosed.
Figure 3 is a flowchart which shows an embodiment of a method for initializing a pre-trained neural network.
Figure 4 is a flowchart which shows an embodiment of a method for training a pre-trained neural network which uses an embodiment of the method disclosed in Figure 3.
Figure 5 is a diagram which shows an embodiment of a processing device which may be used for initializing a pre-trained neural network in accordance with an embodiment.
Figure 6 is a table which illustrates an initial percentage of noise energy to total energy of back-propagated error at the output of the last layer, profiled for models fine-tuned using regular fine-tuning; 95% confidence interval is calculated over 24 seeds.
Figure 7 shows a plurality of graphs and illustrates test accuracy progress for fine-tuning models that are pre-trained on ImageNet dataset.
Figure 8a is a table which illustrates average initial test accuracy improvement by using one embodiment of the method disclosed herein.
Figure 8b is a table which illustrates convergence test accuracy of models trained on CIFAR10 dataset with 95% confidence.
Figure 8c is a table which illustrates convergence test accuracy of models trained on CIFAR100 dataset with 95% confidence. Figure 9 is a table which illustrates convergence test accuracy of models trained on Caltech101 dataset with 95% confidence.
DETAILED DESCRIPTION
In the following description of the embodiments, references to the accompanying drawings are by way of illustration of an example by which the invention may be practiced.
Terms
The term "invention" and the like mean "the one or more inventions disclosed in this application,” unless expressly specified otherwise.
The terms“an aspect," "an embodiment,” "embodiment,” "embodiments,” "the embodiment,” "the embodiments,” "one or more embodiments,” "some embodiments,” "certain embodiments,” "one embodiment,” "another embodiment" and the like mean "one or more (but not all) embodiments of the disclosed invention(s),” unless expressly specified otherwise.
A reference to "another embodiment" or“another aspect” in describing an embodiment does not imply that the referenced embodiment is mutually exclusive with another embodiment (e.g., an embodiment described before the referenced embodiment), unless expressly specified otherwise.
The terms "including," "comprising" and variations thereof mean "including but not limited to,” unless expressly specified otherwise.
The terms "a,” "an" and "the" mean "one or more," unless expressly specified otherwise.
The term "plurality" means "two or more,” unless expressly specified otherwise. The term "herein" means "in the present application, including anything which may be incorporated by reference,” unless expressly specified otherwise.
The term "whereby" is used herein only to precede a clause or other set of words that express only the intended result, objective or consequence of something that is previously and explicitly recited. Thus, when the term "whereby" is used in a claim, the clause or other words that the term "whereby" modifies do not establish specific further limitations of the claim or otherwise restricts the meaning or scope of the claim.
The term "e.g." and like terms mean "for example,” and thus do not limit the terms or phrases they explain. For example, in a sentence "the computer sends data (e.g., instructions, a data structure) over the Interet,” the term "e.g." explains that "instructions" are an example of "data" that the computer may send over the Interet, and also explains that "a data structure" is an example of "data" that the computer may send over the Interet. However, both "instructions" and "a data structure" are merely examples of "data” and other things besides "instructions" and "a data structure" can be "data.”
The term "i.e." and like terms mean "that is,” and thus limit the terms or phrases they explain.
Neither the Title nor the Abstract is to be taken as limiting in any way as the scope of the disclosed invention(s). The title of the present application and headings of sections provided in the present application are for convenience only, and are not to be taken as limiting the disclosure in any way.
Numerous embodiments are described in the present application, and are presented for illustrative purposes only. The described embodiments are not, and are not intended to be, limiting in any sense. The presently disclosed invention(s) are widely applicable to numerous embodiments, as is readily apparent from the disclosure. One of ordinary skill in the art will recognize that the disclosed invention(s) may be practiced with various modifications and alterations, such as structural and logical modifications. Although particular features of the disclosed invention(s) may be described with reference to one or more particular embodiments and/or drawings, it should be understood that such features are not limited to usage in the one or more particular embodiments or drawings with reference to which they are described, unless expressly specified otherwise.
With all this in mind, one or more embodiments of the present invention are directed to a method and a system for initializing a pre-trained neural network and its use for training a pre-trained neural network.
It will be appreciated that one or more embodiments of the method disclosed herein may be implemented according to various embodiments.
More precisely and now referring to Fig. 5, there is shown an embodiment of a processing device 500 which may be used for implementing a method for initializing a pre-trained neural network.
In fact, it will be appreciated that the processing device 500 may be any type of computer.
In one embodiment, the processing device 500 is selected from a group consisting of desktop computers, laptop computers, tablet PC’s, servers, smartphones, etc.
In the embodiment shown in Fig. 5, the processing device 500 comprises a central processing unit (CPU) 502, also referred to as a microprocessor, a graphic processing unit (GPU) 503, input/output devices 504, an optional display device 506, communication ports 508, a data bus 510 and a memory unit 512. The central processing unit 502 is used for processing computer instructions. The skilled addressee will appreciate that various embodiments of the central processing unit 502 may be provided.
In one embodiment, the central processing unit 502 comprises an Intel(TM) i9-7920X CPU manufactured by Intel(TM).
The graphics processing unit 503 is used for processing specific computer instructions. It will be appreciated that a memory unit 520 is operatively connected to the graphics processing unit 503.
In one embodiment, the graphics processing unit 503 comprises a Titan V GPU manufactured by Nvidia(TM).
The input/output devices 504 are used for inputting/outputting data into the processing device 500.
The optional display device 506 is used for displaying data to a user. The skilled addressee will appreciate that various types of display device 506 may be used.
In one embodiment, the optional display device 506 is a standard liquid crystal display (LCD) monitor.
The communication ports 508 are used for operatively connecting the processing device 500 to various processing devices.
The communication ports 508 may comprise, for instance, universal serial bus (USB) ports for connecting a keyboard and a mouse to the processing device 500. The communication ports 508 may further comprise a data network communication port such as an IEEE 802.3 port for enabling a connection of the processing device 500 with another processing device.
The skilled addressee will appreciate that various alternative embodiments of the communication ports 508 may be provided.
The memory unit 512 is used for storing computer-executable instructions.
The memory unit 512 may comprise a system memory such as a high-speed random access memory (RAM) for storing system control programs (e.g., BIOS, operating system module, applications, etc.) and a read-only memory (ROM). In one embodiment, the memory unit 512 has 128 GB of DDR4 RAM.
It will be appreciated that the memory unit 512 comprises, in one embodiment, an operating system module 514.
It will be appreciated that the operating system module 514 may be of various types.
In one embodiment, the operating system module 514 is Linux Ubuntu 18.04 + Lambda Stack.
In one embodiment, the memory unit 520 comprises an application 516 for initializing a pre-trained neural network.
In one embodiment, the memory unit 520 is operatively connected to the graphics processing unit 503 and has a size of 24 GB of VRAM. The skilled addressee will appreciate that various alternative embodiments may be possible. The memory unit 520 is further used for storing data 518. The skilled addressee will appreciate that the data 518 may be of various types.
In fact, it will be appreciated that the memory unit 520 of the graphics processing unit 503 is further used for storing at least a portion of the data, referred to as a batch, whose size is the batch size. The skilled addressee will appreciate that a larger batch size may improve the effectiveness of the optimization steps, resulting in more rapid convergence of the model parameters, and that a larger batch size can also improve performance by reducing the communication overhead caused by moving the training data to the graphics processing unit 503, causing more compute cycles to run on the card with each iteration.
In one embodiment, the processing device 500 is a 4GPU Deep learning workstation manufactured by Lambda Quad.
Flow of data and error
It will be appreciated that a Feed-forward Neural Network (FNN) is usually built by stacking up a number of layers on top of each other. The input of a layer can be composed of any combination of the previous layers' outputs. Let the input and output of the l-th layer of an FNN having L layers be X^l and Y^l, respectively. They are related to each other through functions g(·) and h(·) as
Y^l = h(A^l),   where   A^l = g(X^l) = X^l (W^l)^T + 1 (b^l)^T,   (1)
where W^l and b^l are the weights and biases of the l-th layer and V^l is the number of columns in X^l.
If the target task is classification with C classes, then the last layer is usually a fully connected one with W^L ∈ R^{C×V^L} and X^L ∈ R^{N×V^L}, where N is the number of examples passing through the network, known as the batch size. Specifically, for this layer,
A^L = X^L (W^L)^T + 1 (b^L)^T,   (2)
where 1 is a column vector of ones.
Since many formulas are batch-independent and could be easily broadcast, lower-case bold-face letters of the corresponding introduced matrices are separately used to indicate a single example in the batch (N = 1). Therefore, Equation 2 could be rewritten for a single example as
a^L = W^L x^L + b^L.   (3)
The posterior of the last layer's neurons, corresponding to each class, is usually estimated using a softmax normalizer, defined as
ŷ_j = exp(a_j^L) / Σ_{k=1}^{C} exp(a_k^L),   (4)
where a_j^L represents the j-th element of a^L.
It will be appreciated that cross entropy (CE) is the most commonly used loss function ℒ for classification tasks and, for one-hot labels, is equal to the Kullback-Leibler divergence between the labels and the estimates. To train the network using back-propagation [10], gradients of the loss with respect to each parameter are calculated. To make this easier using the chain rule, first, the gradients are computed with respect to the output of each layer, as in
δ^l ≜ ∂ℒ/∂Y^l,   (5)
and from there the desired gradients are calculated:
∂ℒ/∂w_j^l = (∂ℒ/∂Y^l)(∂Y^l/∂w_j^l),   ∂ℒ/∂b^l = (∂ℒ/∂Y^l)(∂Y^l/∂b^l),   (6)
where w_j^l is the j-th row of W^l.
The gradients of the CE loss with respect to the output of the j-th neuron of the deepest layer are equal to:
∂ℒ/∂a_j^L = ŷ_j − y_j,   (7)
and since the last layer is fully connected, the back-propagated error to the previous layer is also easily calculated using the chain rule:
∂ℒ/∂x_k^L = Σ_{j=1}^{C} (ŷ_j − y_j) w_{j,k}^L,   (8)
where w_{j,k}^L is the k-th element of w_j^L. Finally, the weights and biases are updated using gradient descent.
Specifically for the last layer, the gradients of the loss with respect to the rows of the weight matrix are:
∂ℒ/∂w_j^L = E_N[(ŷ_j − y_j) x^L],   (9)
where E_N is the expectation over the examples in the batch. Likewise, the derivative of the loss with respect to a single bias in the last layer is:
∂ℒ/∂b_j^L = E_N[ŷ_j − y_j].   (10)
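Purely by way of illustration and without limitation, the softmax and gradient expressions above (Equations 4 and 7 to 10) may be checked numerically with the following minimal Python (NumPy) sketch; the variable names, sizes and the finite-difference check are illustrative only and form no part of the disclosure:

    import numpy as np

    rng = np.random.default_rng(0)
    C, K, N = 5, 16, 8                          # classes, last-layer features, batch size
    X = rng.normal(size=(N, K))                 # features fed to the last layer, X^L
    W = 0.1 * rng.normal(size=(C, K))           # last-layer weights W^L
    b = np.zeros(C)                             # last-layer biases b^L
    Y = np.eye(C)[rng.integers(0, C, N)]        # one-hot labels

    A = X @ W.T + b                             # logits A^L (Equation 2)
    A -= A.max(axis=1, keepdims=True)           # numerical stability only
    Y_hat = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)   # softmax estimates (Equation 4)

    dA = Y_hat - Y                              # per-example gradient of the loss w.r.t. a^L (Equation 7)
    dW = dA.T @ X / N                           # gradient w.r.t. W^L averaged over the batch (Equation 9)
    db = dA.mean(axis=0)                        # gradient w.r.t. b^L (Equation 10)

    def ce_loss(W_):                            # batch-averaged cross-entropy loss
        A_ = X @ W_.T + b
        A_ = A_ - A_.max(axis=1, keepdims=True)
        P = np.exp(A_) / np.exp(A_).sum(axis=1, keepdims=True)
        return -(Y * np.log(P)).sum(axis=1).mean()

    eps = 1e-6
    W_pert = W.copy(); W_pert[0, 0] += eps
    print(abs((ce_loss(W_pert) - ce_loss(W)) / eps - dW[0, 0]))   # ~0: analytic and numeric gradients agree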
Initialization
During the back-propagation algorithm [10], each data entry is passed twice through each layer's weights, except for the layers fed directly by the raw input. The magnitude of the weights in a layer may get affected by the energy of the input visited by the layer and by the error back-propagated up to its output. Mathematically, it means that in Equation 6 a term of X^l usually appears in the derivative of the loss with respect to the weights of the l-th layer. This is already shown for the last layer in Equation 9. Weights are distinguished from biases and called such since they involve multiplication. This operation can rapidly increase/decrease the energy of its result, compared to the operands. This intensified/lessened energy of the output may increase/decrease the energy of the weights themselves through the gradient updates, as discussed. This loop can lead to numerical problems known as exploding/vanishing gradients. One way of facing these problems is to initialize the weights and biases such that the energy of the flowing data/error is preserved. Currently, energy-preserving initialization [9] is known to be the optimal solution for training ReLU networks [2].
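The energy-preserving initialization referred to above corresponds to what common deep learning frameworks expose as He (Kaiming) initialization. Purely as an illustration, and assuming a PyTorch environment, the fan-out mode recommended in [2] may be applied as follows; the helper name and the example model are illustrative only:

    import torch.nn as nn

    def variance_preserving_init(module):
        # He initialization in fan-out mode for ReLU networks, preserving the
        # variance (energy) of the back-propagated error along the depth.
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')
            if module.bias is not None:
                nn.init.zeros_(module.bias)

    model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(), nn.Conv2d(64, 10, 3))
    model.apply(variance_preserving_init)       # applied recursively to every sub-module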
At the end of training a model on the source task, the magnitude of back- propagated errors goes toward zero. By switching the task and introducing randomly initialized layers, these errors are suddenly increased. Moreover, the optimization algorithm is usually restarted which causes updates to modify all the parameters with the same rate. These large back-propagated errors include considerable amounts of noise as shown in Figure 6. This noise is injected into pre-trained knowledge through the first update. Fig. 1 shows the sudden initial changes in the variance of the input to the last layer when pre-trained models are fine-tuned.
The two common approaches to reduce this contamination are to slow down the training and/or to include a warm-up (WU) phase. However, the former slows down the contamination rather than eliminating it [17], since the small learning rate also updates the augmented parameters slowly, which injects noise into pre-trained layers for a longer time. In the latter solution, the new parameters are updated for a number of steps before jointly training the entire network.
Fig. 1 shows the sudden initial change in the variance of X^L with an initial learning rate equal to 0.0001, fine-tuned on (a) MNIST and (b) CIFAR100 datasets. The horizontal axes show the training steps. In each model the augmented parameters are initialized based on preserving the variance of the gradient recommended in [8]. The color shadows represent the standard deviation through training with 24 different seeds.
During the warm-up phase, the accuracy of the network is limited since most parts of the network are frozen. Additionally, the effective number of required training steps in the warm-up phase may be large, depending on the learning rate, the initial values of the augmented parameters and the size of the dataset.
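For comparison purposes only, the prior-art warm-up phase discussed above may be sketched as follows, assuming a PyTorch environment; `body` denotes the pre-trained layers and `head` the randomly initialized task-specific layer, both illustrative names, and the use of Adam with a learning rate of 1e-4 is an assumption of the sketch:

    from itertools import cycle
    import torch

    def warm_up(body, head, loader, steps, lr=1e-4):
        # Prior-art warm-up: freeze the pre-trained body and update only the randomly
        # initialized head for a fixed number of steps before joint training.
        for p in body.parameters():
            p.requires_grad_(False)
        optimizer = torch.optim.Adam(head.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        batches = cycle(loader)
        for _ in range(steps):
            x, y = next(batches)
            optimizer.zero_grad()
            loss_fn(head(body(x)), y).backward()
            optimizer.step()
        for p in body.parameters():              # unfreeze for the subsequent joint fine-tuning
            p.requires_grad_(True)
        return body, head

As noted above, the number of steps required by such a warm-up phase depends on the learning rate, the initial values of the augmented parameters and the size of the dataset.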
In a more effective approach, an initialization technique is disclosed herein for fine-tuning in which the noise is initially trapped only within the task- specific augmented parameters. In contrast to using the warm-up phase, in the method disclosed herein, the noise is always minimized after the first update and therefore the parameters can be trained altogether afterward. In addition, the method disclosed herein is easier to apply in the sense that the training process is not manipulated in any way.
Energy Components of Back-propagating Error
The energy consists of three components, of which only one is directly correlated with the accuracy of the estimator; the two others are the energies of the true labels and of the estimates. The contribution of these components is disclosed and the lower and upper bounds for each one are found.
Let E(δ_j^L) be the energy of δ_j^L = ∂ℒ/∂a_j^L over all examples of the batch, defined as:
E(δ_j^L) ≜ E_N[(ŷ_j − y_j)²].   (11)
Accordingly, using Equation 7, the total energy of the error over all examples in the batch and all C neurons of the last layer is equal to:
Σ_{j=1}^{C} E(δ_j^L) = E_N[Σ_j ŷ_j²] + E_N[Σ_j y_j²] − 2 E_N[Σ_j ŷ_j y_j].   (12)
Assuming the labels are one-hot encoded, the third term becomes the average probability assignment for the correct labels, E_N[ŷ_c], where c denotes the index of the correct class. The goal of training the model is to maximize this term, which is bounded by 0 and 1. The second term is the energy of the labels and is always equal to one. Finally, the first term is the energy of the estimates. The infimum of this term could be calculated using the Cauchy-Schwarz inequality as follows:
Σ_{j=1}^{C} ŷ_j² ≥ (Σ_{j=1}^{C} ŷ_j)² / C = 1/C.   (13)
This could also be derived directly from the definition of the softmax (Equation 4),
Σ_{j=1}^{C} ŷ_j² = Σ_{j=1}^{C} exp(2 a_j^L) / (Σ_{k=1}^{C} exp(a_k^L))²,   (14)
followed by taking the partial derivatives with respect to the inputs of the softmax, setting them to zero and re-indexing, which gives:
∂(Σ_j ŷ_j²)/∂a_i^L = 2 ŷ_i (ŷ_i − Σ_j ŷ_j²) = 0,   for all i ∈ {1, …, C},   (15)
that results in:
ŷ_i = Σ_{j=1}^{C} ŷ_j²,   for all i ∈ {1, …, C},   (16)
which means
ŷ_1 = ŷ_2 = … = ŷ_C = 1/C.   (17)
Reforming Equation 16 considering equal elements in ŷ finds the minimum of the energy of the estimates as:
min Σ_{j=1}^{C} ŷ_j² = C (1/C)² = 1/C.
The upper bound of the energy of the estimates is equal to one and is achieved when their entropy is minimized per example, i.e. when ŷ is a one-hot vector. All in all, the total energy of the back-propagated error is bounded between 0 and 2.
Here, the bounds for the energy of the estimates are investigated. Using the definition of the softmax, it is shown that to achieve the minimum of the energy of the estimates, all the neurons of the last layer should have per-example equal outputs.
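Purely by way of a numerical illustration of the bounds derived above, and assuming one-hot labels, the following short Python sketch evaluates the three energy components for maximum-entropy (uniform) estimates; the numbers are illustrative only:

    import numpy as np

    C = 10
    y_hat = np.full(C, 1.0 / C)                   # maximum-entropy (uniform) estimates
    y = np.eye(C)[3]                              # an arbitrary one-hot label

    energy_estimates = (y_hat ** 2).sum()         # first term: equals 1/C (the infimum)
    energy_labels = (y ** 2).sum()                # second term: always equals 1
    correct_assignment = (y_hat * y).sum()        # third term: probability assigned to the correct class
    total_error_energy = ((y_hat - y) ** 2).sum() # 1/C + 1 - 2/C = 1 - 1/C, within [0, 2]
    ce_loss = -(y * np.log(y_hat)).sum()          # cross-entropy equals ln C for uniform estimates

    print(energy_estimates, energy_labels, correct_assignment, total_error_energy, ce_loss, np.log(C))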
Now referring to Fig. 3, there is now shown an embodiment of a method for initializing a pre-trained neural network 100.
According to processing step 102, a pre-trained neural network is obtained. It will be appreciated that the pre-trained neural network has an output layer. In one embodiment, the pre-trained neural network uses softmax logit.
It will be appreciated that the pre-trained neural network may be provided according to various embodiments.
In one embodiment, the pre-trained neural network is received from a processing device. In another embodiment, the pre-trained neural network is obtained from a memory unit of the processing device. In another embodiment, the pre-trained neural network is provided by a user interacting with the processing device. The skilled addressee will appreciate that various alternative embodiments may be provided for providing the pre-trained neural network.
Still referring to Fig. 3 and according to processing step 104, the output layer of the pre-trained neural network is amended. It will be appreciated that the amending of the output layer of the pre-trained neural network comprises updating each weight of the output layer according to a function that maximizes the entropy of the output classes probability.
It will be appreciated that the function depends on a parameter controlling a proportion of error of the output classes probability such that it decreases the variance of the output classes probability.
In one embodiment, the amending of the output layer of the pre-trained neural network further comprises z-normalizing features located right before the output layer prior to updating each weight.
It will be appreciated that in one embodiment, the initializing of the pre- trained neural network is performed to prevent adverse contamination during a training of the initialized pre-trained neural network.
According to processing step 106, the initialized pre-trained neural network is provided.
It will be appreciated that the initialized pre-trained neural network may be provided according to various embodiments. In one embodiment, the initialized pre-trained neural network is provided to a processing device. In another embodiment, the initialized pre-trained neural network is saved in a memory unit of the processing device. In another embodiment, the initialized pre-trained neural network is displayed to a user interacting with the processing device. The skilled addressee will appreciate that various alternative embodiments may be provided for providing the initialized pre-trained neural network. While it has been disclosed in Fig. 3 that the method is used for initializing a neural network which is pre-trained, it will be appreciated that in one or more alternative embodiments, the neural network is not pre-trained.
In such embodiments, there is disclosed a method for initializing a neural network. The method comprises obtaining a neural network having an output layer.
It will be appreciated that the neural network may be provided according to various embodiments.
In one embodiment, the neural network is received from a processing device. In another embodiment, the neural network is obtained from a memory unit of the processing device. In another embodiment, the neural network is provided by a user interacting with the processing device. The skilled addressee will appreciate that various alternative embodiments may be provided for providing the neural network.
The method further comprises amending the output layer of the neural network. The amending of the output layer comprises updating each weight of the output layer according to a function that maximizes the entropy of the output classes probability. It will be appreciated that the function depends on a parameter controlling a proportion of error of the output classes probability such that it decreases the variance of the output classes probability.
The method further comprises providing the initialized neural network.
It will be appreciated that the initialized neural network may be provided according to various embodiments. In one embodiment, the initialized neural network is provided to a processing device. In another embodiment, the initialized neural network is saved in a memory unit of the processing device. In another embodiment, the initialized neural network is displayed to a user interacting with the processing device. The skilled addressee will appreciate that various alternative embodiments may be provided for providing the initialized neural network.
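Purely by way of illustration and without limitation, the processing steps of Fig. 3 may be sketched as follows, assuming a PyTorch environment; `maximum_entropy_amend` and `initialize_pretrained` are hypothetical helper names, the default value f_w = 1e-12 follows the recommendation given further below, and the optional z-normalization of the features fed to the output layer is omitted here for brevity:

    import torch
    import torch.nn as nn

    def maximum_entropy_amend(output_layer: nn.Linear, f_w: float = 1e-12):
        # Amend the output layer: re-draw each weight from a zero-centered normal
        # distribution with a very small per-element energy f_w and zero the biases,
        # so that the initial output-class probabilities are nearly uniform
        # (maximum entropy) and their variance is decreased.
        with torch.no_grad():
            output_layer.weight.normal_(mean=0.0, std=f_w ** 0.5)
            output_layer.bias.zero_()

    def initialize_pretrained(model: nn.Module, output_layer_name: str = 'fc'):
        # Processing step 102: obtain the pre-trained neural network (passed in).
        # Processing step 104: amend its output layer.
        maximum_entropy_amend(getattr(model, output_layer_name))
        # Processing step 106: provide the initialized pre-trained neural network.
        return model

For instance, for a torchvision ResNet, whose output layer is exposed as the 'fc' attribute, initialize_pretrained(model, 'fc') would amend that layer; other architectures expose the output layer under different attribute names.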
It will be appreciated that an embodiment of the method disclosed in Fig. 3 is detailed herein below.
It will be appreciated that the initialized pre-trained neural network may be used when training the pre-trained neural network as disclosed for instance in Fig. 4.
More precisely and according to processing step 200, a pre-trained neural network to train is obtained.
As mentioned above, it will be appreciated that the pre-trained neural network may be obtained according to various embodiments.
According to processing step 202, a dataset suitable for the training is obtained.
It will be appreciated that the dataset suitable for the training may be obtained according to various embodiments.
In one embodiment, the dataset is obtained from a remote processing device operatively connected with a processing device.
According to processing step 204, the pre-trained neural network is initialized.
It will be appreciated that the pre-trained neural network may be initialized according to one or more embodiments of the method disclosed in Fig. 3.
According to processing step 206, the initialized pre-trained neural network is trained. It will be appreciated that the initialized pre-trained neural network is trained using the dataset obtained.
In one or more embodiments, the training of the initialized pre-trained neural network comprises training the initialized pre-trained neural network using a first training batch of the obtained dataset, wherein the first training batch is smaller than a number of features fed to the last layer of the initialized pre-trained neural network.
It will be appreciated that in one or more embodiments, the training is a federated learning method. It will be appreciated that the federated learning method is disclosed at https://arxiv.org/pdf/1902.04885.pdf.
It will be appreciated that in one or more embodiments, the training is a meta-learning method. It will be appreciated that meta-learning is disclosed for instance in the article "Human-level concept learning through probabilistic program induction" by Brenden M. Lake et al., Science 350, 1332 (2015).
It will be appreciated that in one or more embodiments, the training is a distributed machine learning method. It will be appreciated that the distributed machine learning method is disclosed at https://arxiv.org/abs/1810.06060.
It will be appreciated that in one or more other embodiments, the training is a network architecture search using the pre-trained neural network as a seed. It will be appreciated that the network architecture search is disclosed at https://arxiv.org/pdf/1802.03268.pdf.
It will be appreciated that in one or more embodiments, the pre-trained neural network comprises a generative adversarial network. In such embodiments, the initializing of the pre-trained neural network is performed at the discriminator. Still referring to Fig. 4 and according to processing step 208, the trained neural network is provided.
It will be appreciated that the trained neural network may be provided according to various embodiments.
In one embodiment, the trained neural network is provided to a processing device. In another embodiment, the trained neural network is saved in a memory unit of the processing device. The skilled addressee will appreciate that various alternative embodiments may be provided for providing the trained neural network.
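Purely by way of illustration, the training of Fig. 4 may be sketched as follows, assuming a PyTorch environment and a model that has already been initialized as described above (e.g. with the hypothetical initialize_pretrained helper sketched earlier); the use of Adam and a learning rate of 1e-4 mirrors the experiments reported below but remains an assumption of the sketch:

    import torch

    def train_initialized(model, loader, epochs=10, lr=1e-4, device='cpu'):
        # Processing step 206: jointly train all parameters of the initialized
        # pre-trained neural network using the obtained dataset.
        model.to(device).train()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()
                optimizer.step()
        return model                              # processing step 208: provide the trained neural network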
It will be appreciated that there is also disclosed a method for training a neural network through federated learning.
It will be appreciated that the method comprises obtaining a shared neural network to train.
The method further comprises obtaining at least two datasets suitable for the federated learning. Each of the at least two datasets is used for training a corresponding decentralized training unit.
The method further comprises each decentralized training unit performing a first round of training using a corresponding dataset.
The method further comprises, for each subsequent round of training, each decentralized training unit initializing the shared neural network using one or more embodiments of the method disclosed above for initializing a pre-trained neural network and each decentralized training unit training the initialized shared neural network using the corresponding dataset.
The method further comprises globally federating the learning from all decentralized training units to a resulting global shared neural network, and until the global shared neural network converges to a good global model, providing the corresponding global shared neural network to the decentralized training units as the new shared neural network.
Finally, the method comprises providing the trained shared neural network. It will be appreciated that the trained shared neural network may be provided according to various embodiments. In one embodiment, the trained shared neural network is provided to a processing device. In another embodiment, the trained shared neural network is saved in a memory unit of the processing device. In another embodiment, the trained shared neural network is displayed to a user interacting with the processing device. The skilled addressee will appreciate that various alternative embodiments may be provided for providing the trained shared neural network.
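Purely by way of illustration, one possible realization of the federated training rounds described above is sketched below, assuming a PyTorch environment; the FedAvg-style parameter averaging, the fixed round budget standing in for the convergence test, and the callable arguments local_fn (local training) and init_fn (the initialization disclosed above, e.g. the hypothetical initialize_pretrained helper) are assumptions of the sketch:

    import copy
    import torch

    def federated_training(shared_model, client_loaders, rounds, local_fn, init_fn):
        for r in range(rounds):                       # fixed budget standing in for "until convergence"
            client_states = []
            for loader in client_loaders:
                local = copy.deepcopy(shared_model)   # each decentralized unit receives the shared network
                if r > 0:                             # the first round is trained without re-initialization
                    init_fn(local)                    # subsequent rounds re-initialize the output layer
                client_states.append(local_fn(local, loader).state_dict())
            # Globally federate the learning: a simple parameter average is assumed here.
            global_state = shared_model.state_dict()
            for key in global_state:
                stacked = torch.stack([s[key].float() for s in client_states])
                global_state[key] = stacked.mean(dim=0).to(global_state[key].dtype)
            shared_model.load_state_dict(global_state)
        return shared_model                           # provide the trained shared neural network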
It will be appreciated that there is also disclosed a method for training a neural network using a Reptile meta-learning method. The method comprises obtaining a neural network to train.
The method further comprises obtaining a dataset suitable for the Reptile meta-learning method. It will be appreciated that the Reptile meta-learning method is disclosed at https://d4mucfpksywv.cloudfront.net/research-covers/reptile/reptile_update.pdf.
The method further comprises, for each iteration of the Reptile meta-learning method, initializing the neural network using one or more embodiments of the method disclosed above for initializing a pre-trained neural network for each task sampled, and training the initialized neural network for the corresponding sampled task using the obtained dataset.
Finally, the method comprises providing the trained neural network. It will be appreciated that the trained neural network may be provided according to various embodiments. In one embodiment, the trained neural network is provided to a processing device. In another embodiment, the trained neural network is saved in a memory unit of the processing device. In another embodiment, the trained neural network is displayed to a user interacting with the processing device. The skilled addressee will appreciate that various alternative embodiments may be provided for providing the trained neural network.
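Purely by way of illustration, the Reptile meta-learning variant described above may be sketched as follows, assuming a PyTorch environment; the interpolation coefficient epsilon, and the callables sample_task, inner_fn (task-specific training) and init_fn (the initialization disclosed above) are assumptions of the sketch:

    import copy
    import torch

    def reptile_train(meta_model, sample_task, inner_fn, init_fn, iterations, epsilon=0.1):
        # Reptile meta-learning: for each iteration, sample a task, re-initialize the output
        # layer, adapt to the task, then move the meta-parameters a fraction epsilon toward
        # the task-adapted parameters.
        for _ in range(iterations):
            task_loader = sample_task()
            adapted = copy.deepcopy(meta_model)
            init_fn(adapted)                          # maximum-entropy amendment for the sampled task
            adapted = inner_fn(adapted, task_loader)
            with torch.no_grad():
                for p_meta, p_task in zip(meta_model.parameters(), adapted.parameters()):
                    p_meta.add_(epsilon * (p_task - p_meta))
        return meta_model                             # provide the trained neural network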
Detailed embodiment of the method for initializing the pre-trained neural network disclosed above
It will be appreciated that initially, the energy of the estimates contains pure noise, i.e. it lacks a meaningful relationship with either the inputs or the labels. Its infimum was calculated above and it was shown that it can be achieved when all the estimates are exactly equal to each other for each example. This condition is intuitively appealing since it maximizes the entropy of the estimates prior to the training, when the estimates and the labels are independent and/or unaligned. The entropy is exactly reflected in the CE loss, which then becomes deterministically ln C regardless of the input.
Another source of contamination of the pre-trained layers is W^L itself. This is because the error back-propagated toward the pre-trained features, ∂ℒ/∂x^L, is affected by both the last layer's error and its weights (see Equation 8). To prevent the noise from contaminating pre-trained layers, an efficient solution should consider both of these criteria. One or more embodiments of a method are therefore introduced which maximize the initial entropy of the estimates while preventing the pre-trained features from becoming contaminated by the noise flowing through W^L.
The method can be described as follows.
First and in accordance with an embodiment, the method requires the features that are fed to the last layer to be normalized. This is done by applying z-normalization across the batch,
x̂_k^L = (x_k^L − μ_N(x_k^L)) / σ_N(x_k^L),
where μ_N(·) and σ_N(·) denote the mean and standard deviation of the k-th feature computed over the N examples of the batch.
Figure 2 shows an FNN architecture and the last layer's initialization in (a) the base model and (b) ENTAME. According to [8], m is 2 for ReLU networks.
This is similar to batch-normalization [12], except that it does not need any learnable parameters. The statistics used in inference mode of the simple z-normalization are versions of the ones obtained in the corresponding training steps, detached from the computational graph.
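Purely by way of illustration, the batch z-normalization of the features fed to the last layer may be realized as follows in a PyTorch environment; the module name is illustrative, and the use of running-average statistics for inference mode is one possible realization of the detached statistics mentioned above:

    import torch
    import torch.nn as nn

    class BatchZNorm(nn.Module):
        # z-normalization across the batch with no learnable parameters; running
        # statistics are kept, detached from the computational graph, for inference mode.
        def __init__(self, num_features, eps=1e-5, momentum=0.1):
            super().__init__()
            self.eps, self.momentum = eps, momentum
            self.register_buffer('running_mean', torch.zeros(num_features))
            self.register_buffer('running_var', torch.ones(num_features))

        def forward(self, x):                     # x has shape (batch, num_features), i.e. X^L
            if self.training:
                mean = x.mean(dim=0)
                var = x.var(dim=0, unbiased=False)
                with torch.no_grad():             # statistics detached from the computational graph
                    self.running_mean.lerp_(mean.detach(), self.momentum)
                    self.running_var.lerp_(var.detach(), self.momentum)
            else:
                mean, var = self.running_mean, self.running_var
            return (x - mean) / torch.sqrt(var + self.eps)

In practice this behaves essentially like nn.BatchNorm1d(num_features, affine=False), i.e. batch-normalization without learnable parameters.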
Second, the method maximizes the entropy of the estimates by initializing the last layer's weights to values drawn from an independent and identically distributed (i.i.d.), zero-centered normal distribution as follows:
w_{i,j}^L ~ N(0, f_w),
where f_w is the energy of each element of W^L, γ is the initial value of the learning rate (recommended default γ = 10^-4) and λ is a hyper-parameter that controls the proportion of noise energy over the total energy of the last layer's weights right after the first update (recommended range is 1 to 1000). f_w is chosen to be a numerically small number (for example, f_w = 10^-12 means that 95% of the values in W^L are initially between −2 × 10^-6 and 2 × 10^-6), but it can be seen that such small randomness may help the expressiveness of the model. If the biases are also initialized constantly to all zeros and if K, the number of features fed to the last layer, is not extremely large, the energy of a^L would be initially very small as well. Concretely, from the distribution selected for W^L, the value of each output neuron is approximately zero-centered, E_N[a_j^L] ≈ 0, and the per-example energy of all output neurons together is
Σ_{j=1}^{C} E[(a_j^L)²] = C K f_w.   (22)
This is derived from x^L being normalized across the batch.
Moreover, the exponential function is close to linear when its input is close to zero. This could be easily shown by using the result of Equation 22 and the Taylor series approximation of the exponential function around zero, exp(a_j^L) ≈ 1 + a_j^L. Plugging this into the softmax definition yields
ŷ_j ≈ (1 + a_j^L) / (C + Σ_{k=1}^{C} a_k^L) ≈ 1/C,
which approximately maximizes the entropy as desired.
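Purely by way of a numerical illustration, and assuming a PyTorch environment, the following sketch checks that, with z-normalized features and last-layer weights drawn from N(0, f_w) with zero biases, the initial estimates are nearly uniform and the initial cross-entropy loss is close to ln C; the sizes C, K and N are illustrative:

    import math
    import torch
    import torch.nn as nn

    C, K, N = 100, 2048, 256                      # classes, last-layer features, batch size
    f_w = 1e-12                                   # per-element energy of W^L

    head = nn.Linear(K, C)
    with torch.no_grad():
        head.weight.normal_(0.0, f_w ** 0.5)      # W^L ~ i.i.d. N(0, f_w)
        head.bias.zero_()                         # biases initialized to zero

    x = torch.randn(N, K)                         # stand-in for the features X^L
    x = (x - x.mean(dim=0)) / x.std(dim=0)        # z-normalized across the batch

    probs = torch.softmax(head(x), dim=1)
    loss = nn.functional.cross_entropy(head(x), torch.randint(0, C, (N,)))

    print(probs.std().item())                     # ~0: estimates are almost exactly uniform (1/C each)
    print(loss.item(), math.log(C))               # initial loss is approximately ln C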
When the estimates are equal for each example, the back-propagated error δ_j^L becomes only a function of the prior, i.e.,
δ_j^L = ŷ_j − y_j = 1/C − y_j,   E_N[δ_j^L] = 1/C − p_j,
where p_j = E_N[y_j] is the prior probability of class j.
Accordingly, the gradients of the loss with respect to the last layer's parameters are simplified to
∂ℒ/∂w_j^L = E_N[(1/C − y_j) x̂^L],   ∂ℒ/∂b_j^L = 1/C − p_j.   (24)
Applying the updates results in
w_j^{L,1} = w_j^{L,0} − γ E_N[(1/C − y_j) x̂^L],   b_j^{L,1} = −γ (1/C − p_j),   (25)
where γ is the initial learning rate and the second number in the superscripts represents the number of updates, e.g. W^{l,u} indicates the weights of the l-th layer after u updates. The error cannot move further backward at this point since W^{L,0} is very small, making ∂ℒ/∂X^L negligible.
After the first update, the outputs of each neuron of the last layer can get comparably high expected values. This may cause the estimates to have much lower entropy compared to the initial state. On the other hand, depending on Y and X, multiple rows and columns of weights and corresponding elements of biases from the last layer can possibly get identical first updates (see Equation 24).
This may cause the entropy of the estimates to stay comparably high for each example. Very small random numbers, used to initialize W^L, help these identical estimates to diverge and make different estimates. It will be appreciated that as the expectation of a_j^L gets away from zero toward positive values, the exponential functions in the softmax make the small differences much larger. Therefore, the expressiveness of the last layer is preserved by initializing its weights to very small numbers instead of zero.
The first update makes the energy of W^L large enough to let the error of the next updates back-propagate through it and reach the pre-trained layers. In other words, this automatically opens up the stalled way and lets the error back-propagate to the output of the other layers. This is enough for correctly guiding the pre-trained parameters when an advanced optimization algorithm like Adam [13] is used. Most of the noise is purified and the next back-propagated errors toward the pre-trained features are meaningful and contain both prior and likelihood. In more detail, the energy of the j-th row in W^{L,1} becomes:
‖w_j^{L,1}‖² = ‖w_j^{L,0} − γ E_N[(1/C − y_j) x̂^L]‖²,   (26)
which means:
‖w_j^{L,1}‖² ≈ K f_w + γ² ‖E_N[(1/C − y_j) x̂^L]‖².   (27)
The energy of b_j^{L,1} is the energy of the j-th bias in the last layer. Since b_j^{L,1} contains only information about the prior, we desire to make its energy initially smaller than that of w_j^{L,1}. For this to happen, the initial batch size N has to be chosen such that N < K (which usually is satisfied). On the other hand, λ should be chosen such that the remaining noise energy K f_w stays a small fraction of the total energy of w_j^{L,1}, but f_w should not be chosen so small as to numerically reduce the rank of W^{L,1} due to possible similar updates (see Equation 25), which may result in a higher entropy of the estimates. λ roughly determines the maximum proportion of the remaining energy of noise to the total energy of the elements of W^{L,1}.
Role of Feature Normalization
As mentioned above and in accordance with one or more embodiments, a feature normalization is performed. It will be appreciated that applying z-normalization on top of the features may increase or decrease the level of average feature-wise energy in X^L, resulting in less need for tweaking the learning rate and f_w for different tasks and even different models. If the values of X^L are too small, it may take a longer time for W^L to grow, which leaves the pre-trained features unchanged for a longer time.
In one or more embodiments, z-normalization is applied if the provided initialized pre-trained model is to be further trained on the provided data, where the provided data exhibits an important domain shift with respect to the data on which the model was pre-trained. Z-normalization across batches plays a more important role than just equalization. To clarify, it is assumed that image classification is done and two images in a batch contain exactly the same pattern or visual object. If one of the columns of X^L represents a feature that recognizes said pattern, the feature is expected to reflect the presence of the pattern in both of the mentioned images equally. The problem is that raw inputs are usually normalized with statistics that are identically applied on all pixels in all examples. In the best case, such normalization is applied separately for different channels. Object-wise normalization does not seem to be feasible prior to detection, which is indirectly done through training the neural network classifier. Therefore, even if the same object is exactly copied in both images, due to normalizing raw images, one object may get less intensified than the other one. This may directly be reflected in the values of the particular column of X^L responsible for showing the presence of the desired object. Z-normalization compensates for this problem by normalizing the features after they are detected.
Batch-normalization layers used in between the hidden layers of some pre-trained models usually need more training steps to adapt to the distribution of the target task's data. Since we also care about the performance of the model in the first training steps, feeding normalized features to the last layer is vital. It will be appreciated that the simple z-normalization applied on X^L directly influences the first update of W^L.
Experiments
ImageNet [18] ILSVRC 2012 is the source dataset used to pre-train the models. Each pre-trained model is fine-tuned on the following datasets: MNIST [16], CIFAR10, CIFAR100 [15] and Caltech101 [5]. The latter dataset, in contrast to the other ones, is neither originally separated into train and test subsets nor balanced. Each Caltech101 category is split randomly into train and test subsets, with a 15 percent chance of drawing each image for the test subset. Prior to feeding the input to the models, each channel is normalized with its mean and standard deviation obtained from all pixels of that channel throughout the corresponding training subset. Training images are also augmented with a random horizontal flip.
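Purely by way of illustration, the preprocessing described above may be expressed with torchvision-style transforms as follows; the per-channel mean and standard deviation shown are placeholders (the experiments compute them from the corresponding training subset), and CIFAR100 is used only as an example target dataset:

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Placeholder per-channel statistics; in the experiments they are obtained from
    # all pixels of each channel throughout the corresponding training subset.
    channel_mean, channel_std = [0.5, 0.5, 0.5], [0.25, 0.25, 0.25]

    train_tf = transforms.Compose([
        transforms.RandomHorizontalFlip(),        # training-time augmentation
        transforms.ToTensor(),
        transforms.Normalize(channel_mean, channel_std),
    ])
    test_tf = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(channel_mean, channel_std),
    ])

    train_set = datasets.CIFAR100('./data', train=True, download=True, transform=train_tf)
    test_set = datasets.CIFAR100('./data', train=False, download=True, transform=test_tf)
    train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=256)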
The set of used architectures is listed in the leftmost column of Figure 7. Among these models, InceptionV3 requires all images to be scaled up to 299 × 299, so due to limitations in the device's memory a batch size of 64 is chosen for this architecture. In addition, the other models that are trained on the Caltech101 dataset are also fed 64 images per batch, owing to the large image sizes. All other models and datasets use a batch size of 256.
The initialization recommended by [8] is used for the augmented layers in the base models. An attempt was made to unify the comparison by applying similar conditions for training the different models as much as possible. This by itself would show the impact of one or more embodiments of the method disclosed and how universally it could help the task adaptation, even without considering hyper-parameter tuning. Accordingly, the learning rate is set to 0.0001 for all models and datasets and the value of f_w is chosen to be 10^-12 everywhere.
Figure 7 shows the progress of the test accuracy of pre-trained models fine-tuned on each dataset. The smaller plot inside each larger one shows the same curves zoomed in on the first steps of training. The colorful shade around each curve shows the standard deviation across 24 different seeds. Each plot includes 4 curves, color mapped as follows; blue: base, orange: base with a single Warm Up (WU) step, green: disclosed method's Maximum Entropy Initialization (MEI), red: full disclosed method, or MEI + Feature Normalization (FN). Experiments were also done with only applying FN, but they mostly perform worse than all other cases, so they are not included to save space and make the plots more readable. To measure how the convergence is sped up initially, the average progressing accuracy is compared over the first ten training steps. A paired t-test suggests that one or more embodiments of the method disclosed herein significantly enhance the test accuracy compared to the base method for all architectures and datasets mentioned herein. Figure 8a shows the average increase in the accuracy of the first 10 training steps with 95% confidence. Further improvements have been observed by adjusting λ and the batch size, but to show the robustness of the model the same setup has been kept as much as possible.
Finally, the converged accuracy of each curve shown in Fig. 7 is listed in Figures 8b, 8c and 9. The convergence test accuracy is recorded after training models for 10 epochs if the target dataset is CIFAR10 or Caltech101 and 15 epochs if the target dataset is CIFAR100. Further experiments have been done on ResNet [9], DenseNet [11] and VGG [21] with other popular sizes, but similar results were obtained, so only the results of the two most common sizes of each are reported in the above-mentioned tables.
Now referring to Fig. 7, there is shown the test accuracy progress for fine-tuning models that are pre-trained on the ImageNet dataset. The horizontal axes on each plot show the number of training steps. Colorful shades show the standard deviation across different seeds. A superscript * means that all models in the corresponding row or column are trained with a batch size of 64 instead of 256 to make the model fit into the device. The smaller plots inside the bigger ones are just zoomed-in versions of the same curves for the first few steps.
Now referring to Fig. 8a, there is shown average initial test accuracy improvement by using an embodiment of the method disclosed herein instead of base method. The entries show increase in the mean of test accuracy over first 10 steps of training with 95% confidence calculated over 24 seeds. Now referring to Fig. 8b, there is shown convergence test accuracy of models trained on CIFAR10 dataset with 95% confidence.
Now referring to Fig. 8c, there is shown convergence test accuracy of models trained on CIFAR100 dataset with 95% confidence.
Now referring to Fig. 9, there is shown the convergence test accuracy of models trained on the Caltech101 dataset with 95% confidence. It will be appreciated that although the focus was on image classification, the reasoning behind the impressive performance of one or more embodiments of the method disclosed herein is not tied to image datasets in any way.
An important outcome of the empirical results is that models fine-tuned on datasets with 100 or even more classes show an initial test accuracy of over 40% by visiting only the first 64 images. This can open up a whole new discussion about the power of few-shot learning algorithms.
It will be appreciated that the application 516 for initializing a neural network comprises instructions for obtaining a pre-trained neural network having an output layer.
The application 516 for initializing a pre-trained neural network further comprises instructions for amending the output layer of the pre-trained neural network. The amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability. The function depends on a parameter controlling a proportion of error of the output classes probability such that it decreases the variance of the output classes probability.
The application 516 for initializing a pre-trained neural network further comprises instructions for providing the initialized pre-trained neural network. It will be appreciated that there is also disclosed a non-transitory computer readable storage medium for storing computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein the amending comprises updating each weight of the output layer according to a function that maximizes the entropy of the output classes probability, wherein the function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
It will be appreciated that there is also disclosed a computer program comprising computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
It will be appreciated that there is also disclosed a method for using a pre-trained neural network trained in accordance with one or more embodiments of the method disclosed herein.
It will be appreciated that one or more embodiments of the method disclosed herein are of great advantage for various reasons.
An advantage of one or more embodiments of the method disclosed herein is that they significantly decrease the initial noise that is back-propagated from randomly initialized parameters toward the layers that contain the transferred knowledge. As a consequence, a processing device used for training a neural network according to one or more embodiments of the method disclosed herein will use fewer resources for training the neural network, leaving more resources available to complete other tasks. Moreover, one or more embodiments of the method disclosed herein may contribute to better overall performance compared to traditional training methods, and significantly improve performance in training scenarios with a small number of training steps, which is particularly useful for evaluating model potential during architecture search and design. Also, one or more embodiments of the method disclosed herein may contribute to decreasing the negative impact of catastrophic forgetting, which may occur during model training across tasks, as they limit the impact of noise propagation.
In fact, experiments show that models trained by one or more embodiments of the method disclosed herein learn substantially faster than those using prior-art fine-tuning or even more elaborate tricks such as warm up [17].
Another advantage of one or more embodiments of the method disclosed herein is that they are easy to implement and, in one embodiment, can be beneficially applied to any pre-trained neural network that estimates output probabilities using softmax logits. As a consequence, a benefit is broad applicability and integration across deep learning frameworks offered by various vendors, such as Google TensorFlow and Facebook PyTorch.
The optimal parameter initialization is derived for neural networks being fine-tuned from pre-trained models for classification, and it is shown that such an optimal initial loss leads to a significant acceleration in adapting a pre-trained neural network to a new task. Another advantage of one or more embodiments of the method disclosed herein is that they are independent of the choice of architecture and may be applied to transfer knowledge within any domain. As a consequence, a benefit is that one or more embodiments of the method disclosed herein do not increase the complexity of the overall architecture to be trained (e.g. no additional layer, no multi-stage training process such as warm-up methods).
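For illustration only, assuming C output classes and a softmax output initialized to be (near-)uniform as described above, the initial entropy and the initial cross-entropy loss are both known in advance:

$$
p_c = \frac{1}{C} \;\; \forall c
\quad\Rightarrow\quad
H(p) = -\sum_{c=1}^{C} \frac{1}{C}\log\frac{1}{C} = \log C,
\qquad
\mathcal{L}_{\mathrm{CE}} = -\log p_{y} = \log C,
$$

where log C is the maximum achievable entropy over C classes, so the initial loss no longer depends on the random draw of the new output-layer parameters.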
Another advantage of one or more embodiments of the method disclosed herein is that they show a significant practical impact on convergence. As a consequence, a processing device used for training a neural network according to one or more embodiments of the method disclosed herein will use fewer resources for training the neural network than with conventional methods, leaving more resources available to complete other tasks. One or more embodiments of the method disclosed herein may contribute to better overall performance compared to traditional training methods, and significantly improve performance in training scenarios with a small number of training steps, which is particularly useful for evaluating model potential during architecture search and design. One or more embodiments of the method disclosed herein may contribute to decreasing the negative impact of catastrophic forgetting, which may occur during model training across tasks, as they limit the impact of noise propagation.
Clause 1: A method for initializing a pre-trained neural network, the method comprising: obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
Clause 2: The method as claimed in clause 1, wherein the amending of the output layer of the pre-trained neural network further comprises z-normalizing features located right before the output layer prior to updating each weight.
Clause 3: The method as claimed in clause 1, wherein the pre-trained neural network uses softmax logits in the output layer.
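Purely as an illustrative reading of clauses 2 and 3, the sketch below z-normalizes the features located right before the (softmax) output layer, using statistics estimated on a batch, prior to updating each weight as sketched above; the split of the model into a backbone and an output layer, and the batch x, are assumptions of the example.

```python
import torch

@torch.no_grad()
def z_normalize_penultimate(backbone, x: torch.Tensor, eps: float = 1e-5):
    """Illustrative sketch: compute z-normalized penultimate features from a
    batch `x`. `backbone` is assumed to map inputs to the features fed to the
    output layer; the returned mean/std can be reused on subsequent batches."""
    feats = backbone(x)                        # features right before the output layer
    mean = feats.mean(dim=0)
    std = feats.std(dim=0).clamp_min(eps)      # guard against zero variance
    return (feats - mean) / std, mean, std
```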
Clause 4: A method for training a pre-trained neural network, the method comprising: obtaining a pre-trained neural network to train; obtaining a dataset suitable for said training; initializing the pre-trained neural network using the method as claimed in any one of clauses 1 to 3; training the initialized pre-trained neural network using the obtained dataset; and providing the trained neural network.
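By way of illustration of clause 4 only, a possible fine-tuning pipeline is sketched below; the use of torchvision's ResNet-18, the number of target classes, the optimizer settings, and the reuse of init_output_layer from the earlier sketch are assumptions of the example, not part of the claimed method.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_and_train(train_loader, num_classes: int = 100, epochs: int = 15, lr: float = 1e-3):
    # Obtain a pre-trained neural network to train (assumed architecture; the
    # `pretrained` flag may be spelled `weights=...` in newer torchvision versions).
    model = models.resnet18(pretrained=True)
    # Replace the output layer for the target task, then initialize it with the
    # entropy-maximizing update sketched earlier (init_output_layer).
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    init_output_layer(model.fc)
    # Train the initialized pre-trained neural network on the obtained dataset.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model  # provide the trained neural network
```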
Clause 5: A method as claimed in clause 4, wherein said training is a federated learning method.
Clause 6: A method as claimed in clause 4, wherein said training is a meta-learning method.
Clause 7: A method as claimed in clause 4, wherein said training is a distributed machine learning method.
Clause 8: The method as claimed in clause 4, wherein said training is a network architecture search using said pre-trained neural network as a seed.
Clause 9: The method as claimed in any one of clauses 4 to 8, wherein the pre-trained neural network comprises a generative adversarial network, wherein said initializing of the pre-trained neural network using the method as claimed in clause 1 is performed at the discriminator.
Clause 10: A method for training a neural network through federated learning, the method comprising: obtaining a shared neural network to train; obtaining at least two datasets suitable for said federated learning, each of the at least two datasets for training a corresponding decentralized training unit; each decentralized training unit performing a first round of training using a corresponding dataset; for each subsequent round of training: each decentralized training unit initializing the shared neural network using the method as claimed in any one of clauses 1 to 3, each decentralized training unit training the initialized shared neural network using the corresponding dataset, globally federating the learning from all decentralized training units to a resulting global shared neural network, and until the global shared neural network converges to a good global model, providing the corresponding global shared neural network to the decentralized training units as the new shared neural network; and providing the trained shared neural network.
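For illustration of clause 10 only, a minimal federated loop is sketched below; the FedAvg-style parameter averaging, the local_train callable, the .fc output-layer attribute, and the fixed round budget (standing in for the convergence test) are assumptions of the sketch.

```python
import copy
import torch

def federate(local_models):
    """Illustrative FedAvg-style aggregation: average the parameters of the
    locally trained models into a global shared model."""
    avg_state = copy.deepcopy(local_models[0].state_dict())
    for key in avg_state:
        stacked = torch.stack([m.state_dict()[key].float() for m in local_models])
        avg_state[key] = stacked.mean(dim=0)
    global_model = copy.deepcopy(local_models[0])
    global_model.load_state_dict(avg_state)
    return global_model

def federated_train(shared_model, client_loaders, local_train, rounds: int):
    """Each decentralized training unit trains on its own dataset; from the
    second round on, it first re-initializes the shared model's output layer
    (see the entropy-maximizing sketch above)."""
    local_models = [local_train(copy.deepcopy(shared_model), loader)
                    for loader in client_loaders]            # first round of training
    global_model = federate(local_models)
    for _ in range(rounds - 1):                              # subsequent rounds
        local_models = []
        for loader in client_loaders:
            local = copy.deepcopy(global_model)
            init_output_layer(local.fc)                      # assumed output-layer attribute
            local_models.append(local_train(local, loader))
        global_model = federate(local_models)                # globally federate the learning
    return global_model                                      # provide the trained shared network
```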
Clause 11: A method for training a neural network using a reptile meta-learning method, the method comprising: obtaining a neural network to train; obtaining a dataset suitable for said reptile meta-learning method; for each iteration of the reptile meta-learning method: initializing the neural network using the method as claimed in any one of clauses 1 to 3 for each task sampled, and training the initialized neural network for said corresponding sampled task using the obtained dataset; and providing the trained neural network.
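For illustration of clause 11 only, a compact Reptile-style outer loop is sketched below; the interpolation rate meta_lr, the sample_task and local_train callables, and the .fc output-layer attribute are assumptions of the sketch.

```python
import copy
import torch

def reptile_train(model, sample_task, local_train, iterations: int, meta_lr: float = 0.1):
    """Illustrative Reptile-style meta-training: for each sampled task,
    re-initialize the output layer (see the sketch above), adapt a copy of the
    model to the task, then move the meta-parameters toward the adapted ones."""
    for _ in range(iterations):
        task_loader = sample_task()                  # sample a task from the obtained dataset
        adapted = copy.deepcopy(model)
        init_output_layer(adapted.fc)                # initialize for the sampled task
        adapted = local_train(adapted, task_loader)  # inner-loop training on the task
        with torch.no_grad():
            for p, q in zip(model.parameters(), adapted.parameters()):
                p.add_(meta_lr * (q - p))            # Reptile update toward adapted weights
    return model                                     # provide the trained neural network
```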
Clause 12: The method as claimed in any one of clauses 4 to 9, wherein the training of the initialized pre-trained neural network comprises training the initialized pre-trained neural network using a first training batch of the obtained dataset, wherein the first training batch is smaller than a number of features fed to said last layer of said initialized pre-trained neural network.
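As a purely illustrative reading of clause 12, the sketch below draws the first training batch with a batch size strictly smaller than the number of features fed to the last layer; the .fc.in_features attribute and the DataLoader construction are assumptions of the example.

```python
from torch.utils.data import DataLoader

def first_batch_loader(dataset, model, requested_batch_size: int = 256) -> DataLoader:
    """Cap the first-batch size below the dimensionality of the features
    fed to the last (output) layer of the initialized model."""
    num_features = model.fc.in_features                 # e.g. 512 for a ResNet-18 backbone
    batch_size = min(requested_batch_size, num_features - 1)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
```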
Clause 13: A method for using a pre-trained neural network trained in accordance with any one of clauses 4 to 9.
Clause 14: A computer comprising: a central processing unit; a graphics processing unit; a communication port; a memory unit comprising an application for initializing a pre-trained neural network, the application comprising: instructions for obtaining a pre-trained neural network having an output layer, instructions for amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and instructions for providing the initialized pre-trained neural network.
Clause 15: Computer program comprising computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising: obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
Clause 16: A non-transitory computer readable storage medium for storing computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising: obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
Clause 17: A method for initializing a neural network, the method comprising: obtaining a neural network having an output layer, amending the output layer of the neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized neural network.

References
1. Agrawal, P., Girshick, R., Malik, J.: Analyzing the performance of multilayer neural networks for object recognition. In: European conference on computer vision pp. 329-344. Springer (2014)
2. Arpit, D., Bengio, Y.: The benefits of over-parameterization at initialization in deep relu networks. arXiv preprint arXiv:1901.03611 (2019)
3. Azizpour, H., Razavian, A.S., Sullivan, J., Maki, A., Carlsson, S.: Factors of transferability for a generic convnet representation. IEEE transactions on pattern analysis and machine intelligence 38(9), 1790-1802 (2016)
4. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition (2013)
5. Fei-Fei, L, Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer vision and Image understanding 106(1), 59-70 (2007)
6. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition pp. 580-587 (2014)
7. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics pp. 249-256 (2010)
8. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision pp. 1026-1034 (2015)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition pp. 770-778 (2016)
10. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural networks 2(5), 359-366 (1989)
11. Huang, G., Liu, Z., Maaten, L.v.d., Weinberger, K.Q.: Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017). https://doi.org/10.1109/cvpr.2017.243, http://dx.doi.org/10.1109/CVPR.2017.243
12. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
13. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
14. Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self- normalizing neural networks (2017)
15. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech rep., Citeseer (2009)
16. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278-2324 (1998)
17. Li, Z., Hoiem, D.: Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12), 2935-2947 (2018)
18. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International journal of computer vision 115(3), 211-252 (2015)
19. Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: Cnn features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops pp. 806-813 (2014)
20. Shermin, T., Murshed, M., Lu, G., Teng, S.W.: Transfer learning using classification layer features of cnn (2018)
21. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014)
22. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition pp. 2818-2826 (2016)

Claims

CLAIMS:
1. A method for initializing a pre-trained neural network, the method comprising: obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
2. The method as claimed in claim 1, wherein the amending of the output layer of the pre-trained neural network further comprises z-normalizing features located right before the output layer prior to updating each weight.
3. The method as claimed in claim 1, wherein the pre-trained neural network uses softmax logits in the output layer.
4. A method for training a pre-trained neural network, the method comprising: obtaining a pre-trained neural network to train; obtaining a dataset suitable for said training; initializing the pre-trained neural network using the method as claimed in any one of claims 1 to 3; training the initialized pre-trained neural network using the obtained dataset; and providing the trained neural network.
5. A method as claimed in claim 4, wherein said training is a federated learning method.
6. A method as claimed in claim 4, wherein said training is a meta-learning method.
7. A method as claimed in claim 4, wherein said training is a distributed machine learning method.
8. The method as claimed in claim 4, wherein said training is a network architecture search using said pre-trained neural network as a seed.
9. The method as claimed in any one of claims 4 to 8, wherein the pre-trained neural network comprises a generative adversarial network, wherein said initializing of the pre-trained neural network using the method as claimed in claim 1 is performed at the discriminator.
10. A method for training a neural network through federated learning, the method comprising: obtaining a shared neural network to train; obtaining at least two datasets suitable for said federated learning, each of the at least two datasets for training a corresponding decentralized training unit; each decentralized training unit performing a first round of training using a corresponding dataset; for each subsequent round of training: each decentralized training unit initializing the shared neural network using the method as claimed in any one of claims 1 to 3, each decentralized training unit training the initialized shared neural network using the corresponding dataset, globally federating the learning from all decentralized training units to a resulting global shared neural network, and until the global shared neural network converges to a good global model, providing the corresponding global shared neural network to the decentralized training units as the new shared neural network; and providing the trained shared neural network.
11. A method for training a neural network using a reptile meta-learning method, the method comprising: obtaining a neural network to train; obtaining a dataset suitable for said reptile meta-learning method; for each iteration of the reptile meta-learning method: initializing the neural network using the method as claimed in any one of claims 1 to 3 for each task sampled, and training the initialized neural network for said corresponding sampled task using the obtained dataset; and providing the trained neural network.
12. The method as claimed in any one of claims 4 to 9, wherein the training of the initialized pre-trained neural network comprises training the initialized pre-trained neural network using a first training batch of the obtained dataset, wherein the first training batch is smaller than a number of features fed to said last layer of said initialized pre-trained neural network.
13. A method for using a pre-trained neural network trained in accordance with any one of claims 4 to 9.
14. A computer comprising: a central processing unit; a graphics processing unit; a communication port; a memory unit comprising an application for initializing a pre-trained neural network, the application comprising: instructions for obtaining a pre-trained neural network having an output layer, instructions for amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and instructions for providing the initialized pre-trained neural network.
15. Computer program comprising computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising: obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
16. A non-transitory computer readable storage medium for storing computer-executable instructions which, when executed, cause a computer to perform a method for initializing a pre-trained neural network, the method comprising: obtaining a pre-trained neural network having an output layer, amending the output layer of the pre-trained neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized pre-trained neural network.
17. A method for initializing a neural network, the method comprising: obtaining a neural network having an output layer, amending the output layer of the neural network, wherein said amending comprises updating each weight of said output layer according to a function that maximizes the entropy of the output classes probability, wherein said function depends on a parameter controlling a proportion of error of said output classes probability such that it decreases the variance of the output classes probability, and providing the initialized neural network.
PCT/IB2020/054350 2019-05-07 2020-05-07 Method and system for initializing a neural network WO2020225772A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CA3134565A CA3134565A1 (en) 2019-05-07 2020-05-07 Method and system for initializing a neural network
US17/609,296 US20220215252A1 (en) 2019-05-07 2020-05-07 Method and system for initializing a neural network
JP2021565987A JP2022531882A (en) 2019-05-07 2020-05-07 Methods and systems for initializing neural networks
EP20801909.1A EP3966741A1 (en) 2019-05-07 2020-05-07 Method and system for initializing a neural network
CN202080034485.XA CN113795850A (en) 2019-05-07 2020-05-07 Method and system for initializing a neural network

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962844472P 2019-05-07 2019-05-07
US62/844,472 2019-05-07

Publications (1)

Publication Number Publication Date
WO2020225772A1 true WO2020225772A1 (en) 2020-11-12

Family

ID=73051560

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2020/054350 WO2020225772A1 (en) 2019-05-07 2020-05-07 Method and system for initializing a neural network

Country Status (6)

Country Link
US (1) US20220215252A1 (en)
EP (1) EP3966741A1 (en)
JP (1) JP2022531882A (en)
CN (1) CN113795850A (en)
CA (1) CA3134565A1 (en)
WO (1) WO2020225772A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4141886A1 (en) * 2021-08-23 2023-03-01 Siemens Healthcare GmbH Method and system and apparatus for quantifying uncertainty for medical image assessment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150072215A1 (en) * 2011-10-27 2015-03-12 Sakti3, Inc. Barrier for thin film lithium batteries made on flexible substrates and related methods
EP2905722A1 (en) * 2014-02-10 2015-08-12 Huawei Technologies Co., Ltd. Method and apparatus for detecting salient region of image
US20170148433A1 (en) * 2015-11-25 2017-05-25 Baidu Usa Llc Deployed end-to-end speech recognition
US20180089587A1 (en) * 2016-09-26 2018-03-29 Google Inc. Systems and Methods for Communication Efficient Distributed Mean Estimation
US20180107926A1 (en) * 2016-10-19 2018-04-19 Samsung Electronics Co., Ltd. Method and apparatus for neural network quantization
US20190012594A1 (en) * 2017-07-05 2019-01-10 International Business Machines Corporation Pre-training of neural network by parameter decomposition
CA3022125A1 (en) * 2017-10-27 2019-04-27 Royal Bank Of Canada System and method for improved neural network training

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112600772A (en) * 2020-12-09 2021-04-02 齐鲁工业大学 OFDM channel estimation and signal detection method based on data-driven neural network
CN112383396A (en) * 2021-01-08 2021-02-19 索信达(北京)数据技术有限公司 Method and system for training federated learning model
CN112383396B (en) * 2021-01-08 2021-05-04 索信达(北京)数据技术有限公司 Method and system for training federated learning model
CN112766491A (en) * 2021-01-18 2021-05-07 电子科技大学 Neural network compression method based on Taylor expansion and data driving
CN113033712A (en) * 2021-05-21 2021-06-25 华中科技大学 Multi-user cooperative training people flow statistical method and system based on federal learning
CN113033712B (en) * 2021-05-21 2021-09-14 华中科技大学 Multi-user cooperative training people flow statistical method and system based on federal learning

Also Published As

Publication number Publication date
JP2022531882A (en) 2022-07-12
CN113795850A (en) 2021-12-14
CA3134565A1 (en) 2020-11-12
EP3966741A1 (en) 2022-03-16
US20220215252A1 (en) 2022-07-07

Similar Documents

Publication Publication Date Title
US20220215252A1 (en) Method and system for initializing a neural network
Ruthotto et al. Deep neural networks motivated by partial differential equations
Hinton et al. A better way to pretrain deep boltzmann machines
Zhuo et al. Scsp: Spectral clustering filter pruning with soft self-adaption manners
Singh et al. Layer-specific adaptive learning rates for deep networks
CN111882040A (en) Convolutional neural network compression method based on channel number search
Kristan et al. Online kernel density estimation for interactive learning
CN108985457B (en) Deep neural network structure design method inspired by optimization algorithm
Fujino et al. Deep convolutional networks for human sketches by means of the evolutionary deep learning
CN108154235A (en) A kind of image question and answer inference method, system and device
Sajjadi et al. Tempered adversarial networks
Arnekvist et al. The effect of target normalization and momentum on dying relu
Lopes et al. Deep belief networks (DBNs)
Zheng Smoothly approximated support vector domain description
Stanitsas et al. Active convolutional neural networks for cancerous tissue recognition
Gui et al. A fast adaptive algorithm for training deep neural networks
Varno et al. Efficient neural task adaptation by maximum entropy initialization
WO2021253938A1 (en) Neural network training method and apparatus, and video recognition method and apparatus
US20220382038A1 (en) Microscope and Method with Implementation of a Convolutional Neural Network
Witzgall Rapid Class Augmentation for Continuous Deep Learning Applications
Georgiou et al. Norm loss: An efficient yet effective regularization method for deep neural networks
Wang et al. Adaptive normalized risk-averting training for deep neural networks
MOHAMMED et al. A New Image Classification System Using Deep Convolution Neural Network And Modified Amsgrad Optimizer
Rusiecki et al. Effectiveness of unsupervised training in deep learning neural networks
Liao A random matrix framework for large dimensional machine learning and neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20801909

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
ENP Entry into the national phase

Ref document number: 3134565

Country of ref document: CA

ENP Entry into the national phase

Ref document number: 2021565987

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020801909

Country of ref document: EP

Effective date: 20211207