WO2020220191A1 - Method and apparatus for training and applying a neural network - Google Patents
- Publication number: WO2020220191A1 (PCT/CN2019/084969)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/048—Activation functions
Abstract
A neural network training method comprises: determining parameters for a weight matrix of each layer of a network, based on the number of shared elements in a structured implementation with shared elements of each layer; determining the weight matrix of each layer based on the determined parameters; decompressing the determined weight matrix to derive an integrated matrix; iterating over forward propagation and backward propagation on all layers of the network based on the integrated matrix until a preset condition is satisfied, wherein the integrated matrix is updated in each iteration; and storing the iterated parameters of the network.
Description
The present disclosure relates generally to artificial intelligence (AI) techniques, and more particularly, to a method and apparatus for the initialization of compressed neural networks with structured matrices with shared elements.
In recent years, deep learning methods have achieved remarkable results in a broad range of tasks including object classification, natural language processing, and speech recognition. In 2016, AlphaGo, a Go software powered by deep learning algorithms, became the first Go software to beat a human champion in a series of 5 Go games. How to effectively train deep neural networks is an active research area. In the seminal work of Xavier and Yoshua, they observed that a proper initialization of the learnable parameters plays an important role in training a neural network. Later, He et al. extended the method of Xavier and Yoshua to consider the ReLU activation function. Nowadays these two initialization methods are widely used in deep learning software packages like TensorFlow, PyTorch and Keras. Since these two initialization methods only differ by a factor called "gain", which is decided by the activation function, they are considered as one method, called the Xavier/He initialization.
The main idea of the Xavier/He initialization is to maintain the activation variances and back-propagated gradient variances across different layers. However, when there is heavy weight sharing for certain parameters, the activation variances and the back-propagated gradients are not good indicators of the variances of the learnable parameters. As a result, the updates of heavily shared parameters may be much faster than those of other parameters, which leads to difficulties in training. If one layer of a neural network is multiplied by a positive constant and that constant is divided out in another layer, the network becomes very difficult to train even if the Xavier/He initialization is used.
SUMMARY
An initialization method is proposed based on the Xavier/He initialization. For a fully connected layer with no weight sharing, the embodiments in the present application are the same as the Xavier/He initialization method. But the embodiments can also cope with weight sharing, which is a common phenomenon in CirCNN implementations of neural networks, and various numerical experiments are shown to verify the effectiveness of the initialization method in the present application.
The method can also be used to adjust the learning rates of parameters layer by layer, by multiplying the weight matrix of each layer by a positive constant scalar and, at the same time, changing the initialization of that layer accordingly.
In the first embodiment of the present application, a neural network training method is disclosed, comprising: determining parameters for a weight matrix of each layer of a network, based on the number of shared elements in a structured implementation with shared elements of each layer; determining the weight matrix of each layer based on the determined parameters; decompressing the determined weight matrix to derive an integrated matrix; iterating over forward propagation and backward propagation on all layers of the network based on the integrated matrix until a preset condition is satisfied, wherein the integrated matrix is updated in each iteration; and storing the iterated parameters of the network.
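The claimed flow above can be sketched as follows. This is a minimal illustration, not the patent's exact procedure: the adjusting parameter is assumed to be C = 1/B (one of the candidates mentioned later), the helper names are hypothetical, and the decompression step expands each block-circulant block from its shared elements.

```python
# Hypothetical sketch of the claimed training flow; the adjusting
# parameter C = 1/B and all function names are assumptions.
import random

def init_layer(n_out, n_in, block_size):
    """Determine parameters for one layer from its block size B."""
    c = 1.0 / block_size                      # assumed adjusting parameter
    n_blocks = (n_out // block_size) * (n_in // block_size)
    # one shared vector of length B per circulant block
    return [[c * random.uniform(-1, 1) for _ in range(block_size)]
            for _ in range(n_blocks)]

def decompress(v, block_size):
    """Expand the shared elements into integrated (full) circulant blocks."""
    blocks = []
    for first_row in v:
        blocks.append([[first_row[(j - i) % block_size]
                        for j in range(block_size)]
                       for i in range(block_size)])
    return blocks

v = init_layer(4, 4, block_size=2)
full = decompress(v, 2)
# a 4x4 layer is compressed to 4 blocks x 2 shared elements = 8 parameters
assert sum(len(row) for row in v) == 8
assert len(full) == 4 and len(full[0]) == 2
```

The forward/backward iteration and the convergence test are omitted here; the point is only that the compressed parameters v are what is stored, while the integrated matrix is rebuilt from them in each iteration.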
In a feasible implementation, the parameters comprise an adjusting parameter, and the adjusting parameter is negatively correlated with the number of shared elements.
It is noted that the calculation of the adjusting parameter can be 1/B or 1/(B×B) and so on, which is not limited in the present application.
In a feasible implementation, the number of shared elements in the structured implementation with shared elements is a block size in a circulant implementation.
It is noted that the circulant implementation is one kind of structured implementation with shared elements; the number of shared elements, or the block size, relates to the compression ratio of the layer.
In a feasible implementation, the adjusting parameter is calculated as follows:
wherein m is the number of input channels of a layer of the network, n is the number of output channels of the layer of the network, C_nm is the adjusting parameter, and B_nm is the number of shared elements.
In a feasible implementation, the parameters further comprise a random number drawn from a uniform distribution between -Xavier_value and Xavier_value.
It is noted that the uniform distribution is one candidate implementation and is not limiting in the present application. Other distributions may also be used for this solution.
In a feasible implementation, Xavier_value is determined by the following:
In a feasible implementation, the weight matrix is determined by the following:
W_nm = C_nm · v_λ(n, m),
where W_nm is the weight matrix.
In a feasible implementation, updating the integrated matrix in each iteration comprises updating the random number in each iteration.
In a feasible implementation, the preset condition comprises convergence of the training of the network.
In a feasible implementation, convergence of the training comprises that a difference of the weight matrix is smaller than a preset threshold.
It is noted that any criterion which can be used to judge convergence in a neural network training process can be applied in the present application.
In a feasible implementation, an input of the network in training comprises image information data or sound information data.
In a feasible implementation, the network is used to classify objects, process language or recognize speech.
So it is clear that the present application is useful in industry, for example in the fields of object classification, natural language processing, and speech recognition.
In the second embodiment of the present application, a neural network training device is disclosed, comprising: a first calculation module, to determine parameters for a weight matrix of each layer of a network, based on the number of shared elements in a structured implementation with shared elements of each layer; a second calculation module, to determine the weight matrix of each layer based on the determined parameters; a decompression module, to decompress the determined weight matrix to derive an integrated matrix; an iteration module, to iterate over forward propagation and backward propagation on all layers of the network based on the integrated matrix until a preset condition is satisfied, wherein the integrated matrix is updated in each iteration; and a storage module, to store the iterated parameters of the network.
In a feasible implementation, the parameters comprise an adjusting parameter, and the adjusting parameter is negatively correlated with the number of shared elements.
In a feasible implementation, the number of shared elements in the structured implementation with shared elements is a block size in a circulant implementation.
In a feasible implementation, the adjusting parameter is calculated as follows:
wherein m is the number of input channels of a layer of the network, n is the number of output channels of the layer of the network, C_nm is the adjusting parameter, and B_nm is the number of shared elements.
In a feasible implementation, the parameters further comprise a random number drawn from a uniform distribution between -Xavier_value and Xavier_value.
In a feasible implementation, Xavier_value is determined by the following:
In a feasible implementation, the weight matrix is determined by the following:
W_nm = C_nm · v_λ(n, m),
where W_nm is the weight matrix.
In a feasible implementation, updating the integrated matrix in each iteration comprises updating the random number in each iteration.
In a feasible implementation, the preset condition comprises convergence of the training of the network.
In a feasible implementation, convergence of the training comprises that a difference of the weight matrix is smaller than a preset threshold.
In a feasible implementation, an input of the network in training comprises image information data or sound information data.
In a feasible implementation, the network is used to classify objects, process language or recognize speech.
In the third embodiment of the present application, an apparatus for training a neural network is disclosed, the apparatus comprising: one or more processors; and a non-transitory computer-readable storage medium coupled to the processors and storing programming for execution by the processors, wherein the programming, when executed by the processors, configures the apparatus to carry out the method according to any one of the implementations of the first embodiment.
In the fourth embodiment of the present application, a computer program product is disclosed, comprising a program code for performing the method according to any one of the implementations of the first embodiment.
For a more complete understanding of the present disclosure, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, wherein like numbers designate like objects, and in which:
Figure 1 shows the losses of networks summarized in Table 1;
Figure 2 shows the losses of the network summarized in formula (12a);
Figure 3 illustrates an exemplary method for neural network training;
Figure 4 shows an exemplary compression processing of a block-circulant matrix;
Figure 5 shows an exemplary block diagram of a neural network training device;
Figure 6 shows an exemplary block diagram of a neural network training apparatus.
Figures 1 through 6, discussed below, and the various embodiments used to describe the principles of the present invention in this patent document are by way of illustration only and should not be construed in any way to limit the scope of the invention. Those skilled in the art will understand that the principles of the invention may be implemented in any type of suitably arranged device or system.
The following documents are hereby incorporated into the present disclosure as if fully set forth herein, the reference numbers of these documents will be used in the following part of this application.
[1] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of machine learning research, 12 (Aug) : 2493–2537, 2011.
[2] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, et al. CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 395–408. ACM, 2017.
[3] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In The IEEE International Conference on Computer Vision (ICCV) , December 2015.
[6] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.
[7] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[9] Dmytro Mishkin and Jiri Matas. All you need is a good init. In International Conference on Learning Representations, 2016.
[10] Andrew Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. 2013.
[11] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[12] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv: 1409.1556, 2014.
[13] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pages 5389–5398, 2018.
[14] Liang Zhao, Siyu Liao, Yanzhi Wang, Zhe Li, Jian Tang, and Bo Yuan. Theoretical properties for neural networks with weight matrices of low displacement rank. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 4082–4090. JMLR.org, 2017.
Some notations and terms used in the present application are introduced here first.
I. Single layer structure
For a typical neural network, one layer of the network is of the form

y = act(Wx + b),   (1)

where x is the input vector, W is the weight matrix, b is the bias vector, and act is a pointwise activation function (for example, the identity function or the ReLU function). Entries in W and b are usually parameters to be learned in the training process.
In the training process, a neural network will go through iterations of forward and backward propagations. Suppose the loss function is denoted by L. In a forward propagation, one calculates y based on a given x. In the backward propagation, one calculates ∂L/∂x and ∂L/∂v based on a given ∂L/∂y. Suppose that v is a learnable parameter for this layer. Then after one backward propagation with learning rate r, v will be updated by

v ← v − r · ∂L/∂v.

For any variable z in this layer, z^(0) is used to denote its value after initialization, and z^(1) is used to denote its value after the first forward and backward propagation. A definition is introduced here. Assuming the notations and single layer structure as in formula (1), let z be a variable in this layer. Then the increment of z, denoted as Δz, is defined to be

Δz = z^(1) − z^(0)

when the learning rate is set to be 1.

For a learnable variable v, this gives Δv = −∂L/∂v.

For an intermediate variable W_nm which corresponds to the learnable variable v, if it is assumed that v only appears in W in this layer, the chain rule gives

ΔW_nm = Δv = −Σ_(n′, m′) ∂L/∂W_n′m′,

where the last summation is over all entries in W that share the same learnable parameter v. If v also appears in weights of other layers, the last summation also needs to include the corresponding partial derivatives.
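The accumulation of partial derivatives over shared entries can be checked numerically. A minimal sketch, assuming a 2×2 circulant block and a toy loss L = ΣW_nm so that every ∂L/∂W_nm = 1 (both assumptions are illustrative, not from the patent):

```python
# Numeric check that the gradient of a shared parameter accumulates one
# partial derivative per matrix entry that shares it. The sharing map
# lambda(n, m) = (m - n) mod B corresponds to a 2x2 circulant block.
B = 2
lam = {(n, m): (m - n) % B for n in range(B) for m in range(B)}

dL_dW = {(n, m): 1.0 for n in range(B) for m in range(B)}  # toy dL/dW_nm
dL_dv = [0.0] * B
for (n, m), k in lam.items():
    dL_dv[k] += dL_dW[(n, m)]          # chain rule over shared entries

# each v_k is shared by B = 2 entries, so its gradient is 2 times larger
assert dL_dv == [2.0, 2.0]
```

This factor of B in the gradient is exactly the imbalance the later initialization conditions are designed to cancel.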
II. CirCNN networks
Most state-of-the-art neural networks contain an enormous number of learnable parameters. For example, the VGG16 network introduced in the reference document [12] contains more than 1 million learnable parameters. How to find efficient ways to decrease the number of parameters in deep neural networks is an active research area. A promising approach is to replace the matrices and convolutions in a deep neural network by structured matrices and structured convolutions. This approach has the advantage that the architecture of the network can be preserved. In reference document [2], it is proposed to replace unstructured matrices in a network by block-circulant matrices, which is called a CirCNN network. Accordingly, a matrix is a circulant matrix if it has the form
Note that unlike an unstructured matrix, a circulant matrix is determined once the first row (or first column) of the matrix is known. For a fully connected layer of the form y = ReLU(Wx + b), W is a block-circulant matrix if W consists of block matrices, where each block matrix is a circulant matrix. For a fully connected layer, suppose an unstructured matrix is replaced by a block-circulant one with block size B. Then for this matrix, the number of parameters is reduced by a factor of B. The number B is called the compression ratio of this layer. Note that for one layer, the number of parameters in the weight matrix W will dominate the number of parameters in this layer.
It is noted that the unstructured matrix can also be replaced by other structured implementations with shared elements; the compression ratio B relates to the number of shared elements. The circulant implementation is one kind of structured implementation with shared elements. Other structured implementations with shared elements include the Toeplitz matrix, the Hankel matrix and so on. An M×N structured implementation with shared elements can be represented by fewer than M×N elements.
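The parameter counts of such structured matrices can be illustrated directly. A sketch assuming small 3×3 examples: a circulant matrix is rebuilt from its first row (B elements instead of B×B), a Toeplitz matrix from its first column and first row (2B−1 elements):

```python
# A circulant matrix is fixed by its first row; a Toeplitz matrix by its
# first column and first row (which must agree at position 0).
def circulant(first_row):
    B = len(first_row)
    return [[first_row[(j - i) % B] for j in range(B)] for i in range(B)]

def toeplitz(first_col, first_row):
    B = len(first_row)
    return [[first_row[j - i] if j >= i else first_col[i - j]
             for j in range(B)] for i in range(B)]

C = circulant([1, 2, 3])              # 3 shared elements -> 9 entries
T = toeplitz([1, 4, 5], [1, 2, 3])    # 5 shared elements -> 9 entries
assert C == [[1, 2, 3], [3, 1, 2], [2, 3, 1]]
assert T == [[1, 2, 3], [4, 1, 2], [5, 4, 1]]
```

In both cases every entry of the full matrix is a view onto one of the few shared elements, which is the structural property the initialization method must account for.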
Compressing a neural network by using block-circulant matrices can significantly reduce the amount of parameters. For block-circulant matrices, fast algorithms can be employed, like the FFT, to speed up matrix vector multiplication. It has been proved in the reference document [14] that the Universal Approximation Property can be preserved if unstructured matrices are replaced by block-circulant ones, which guarantees the expressive power of CirCNN implementation.
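The product of a circulant matrix with a vector can be computed from the first row alone, without materializing the full matrix; the FFT-based algorithms mentioned above exploit exactly this structure to reach O(B log B). A sketch of the direct O(B²) form, checked against a dense multiply (function names are illustrative):

```python
# Multiplying by a circulant matrix needs only its first row:
# y_i = sum_j first_row[(j - i) % B] * x[j], a circular correlation.
def circ_matvec(first_row, x):
    B = len(first_row)
    return [sum(first_row[(j - i) % B] * x[j] for j in range(B))
            for i in range(B)]

def dense_matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

first_row = [1, 2, 3]
x = [1, 0, 2]
# materialize the circulant matrix only to verify the shortcut
W = [[first_row[(j - i) % 3] for j in range(3)] for i in range(3)]
assert circ_matvec(first_row, x) == dense_matvec(W, x) == [7, 7, 4]
```

Replacing the direct sum by FFTs turns this into pointwise products in the frequency domain, which is where the speed-up cited above comes from.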
III. The Xavier/He initialization and limitation
An example of a network with a parameter c is given to show that the Xavier/He initialization may not be sufficient to lead to fast and stable training.
For a fully connected layer of the form in formula (1), the entries in the matrix W and the vector b are the learnable parameters. Denote the final loss function as L; then the Xavier/He initialization has the following requirements:

Var(Y) = Var(X) and Var(∂L/∂X) = Var(∂L/∂Y),

where it is assumed that all entries in x follow the distribution X, all entries in y follow the distribution Y, all entries in ∂L/∂x follow the distribution ∂L/∂X, and all entries in ∂L/∂y follow the distribution ∂L/∂Y. Unless W is a square matrix, the above conditions cannot be satisfied simultaneously, but the following initialization rules are widely used: the bias vector b should be initialized as the zero vector, and entries in W should be identically and independently distributed with mean 0 and variance

gain · 2 / (M + N),

where gain is a factor determined by the activation function. For the identity function, gain should be 1, while for the ReLU activation, gain should be 2. In practice the distributions for entries in W are chosen as either the uniform distribution or the normal distribution.
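A sketch of the uniform variant of this rule. The Xavier_value bound is assumed to be sqrt(6 · gain / (M + N)), which is the bound that yields the stated variance gain · 2 / (M + N) for a uniform distribution; the function name and shapes are illustrative:

```python
# Assumed Xavier_value = sqrt(6 * gain / (M + N)); a uniform variable on
# [-a, a] has variance a^2 / 3, so this bound gives gain * 2 / (M + N).
import math
import random

def xavier_uniform(n_out, n_in, gain=1.0):
    bound = math.sqrt(6.0 * gain / (n_in + n_out))   # assumed Xavier_value
    W = [[random.uniform(-bound, bound) for _ in range(n_in)]
         for _ in range(n_out)]
    return W, bound

W, bound = xavier_uniform(10, 784, gain=2.0)   # ReLU layer: gain = 2
assert all(-bound <= w <= bound for row in W for w in row)
assert abs(bound - math.sqrt(12.0 / 794.0)) < 1e-12
```

For a normal distribution one would instead sample with standard deviation sqrt(gain · 2 / (M + N)) directly.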
Considering a network with no bias of the form

y = W_2 · ReLU(W_1 x),   (4)

where x is the input vector, and W_1 and W_2 are the parameter matrices of the corresponding sizes. Entries in the parameter matrices are to be learned. Decomposing the above model into two networks, each with two layers, in the following two ways
where c is a fixed positive scalar. With the same W_1, W_2 and x, these two decompositions will give the same y. By applying Xavier/He initializations to both networks, it can be ensured that for both networks, the mean and variance of the inputs and outputs of each layer are the same, and the mean and variance of the partial derivatives of each layer are also the same. Routine calculations show that the initializations should be as follows
It can be seen that when c = 1, the two decompositions and initializations are the same. To test the effect of the positive scalar c, the above two networks are tested on the MNIST dataset (a set of handwritten digits). In this case M = 784 and N = 10. Setting c = 0.01, SGD (Stochastic Gradient Descent, without momentum) is used as the optimizer. The outputs y of each network are connected to soft-max layers, and the cross entropy is used as the loss function in both cases (for details see the reference document [4]). The learning rates for the two cases are tuned by trial and error to find the best value. It is observed that the training of the second network (with c = 0.01) is much more difficult than the training of the first network. As a result, a much smaller learning rate needs to be used, which results in slow convergence and a higher final loss value. It is clear that for the c = 0.01 case, there is no observable change in the distributions. This indicates that the parameters in W_1 do not learn after iterations, because of the very small learning rate that must be used and the small value c = 0.01. On the other hand, for the c = 1 case, the distribution evolves from the uniform distribution to a bell-curved distribution, which indicates a successful learning process of the parameters in W_1. It is noticed that with c = 0.01, the unstable training process cannot be remedied by batch normalization, which is introduced in the reference document [7] and is widely used to remedy unstable training.
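The source of the difficulty can be reproduced in one dimension. A toy sketch with identity activations (an assumption made to keep the gradient computable by hand): scaling the first layer by c and the second by 1/c leaves the output unchanged but shrinks the relative gradient of the first-layer parameter by a factor of c²:

```python
# Toy illustration: splitting y = w2 * (w1 * x) as w1' = c * w1 and
# w2' = w2 / c leaves y unchanged but makes the relative update of the
# first-layer parameter c**2 times smaller (here c = 0.01, as in the text).
c, x, w1, w2 = 0.01, 1.0, 0.5, 0.3
w1s, w2s = c * w1, w2 / c          # the scaled decomposition

# with loss L = y, dL/d(first-layer weight) = (second-layer weight) * x
g_plain = w2 * x
g_scaled = w2s * x

rel_plain = g_plain / w1           # relative update per unit learning rate
rel_scaled = g_scaled / w1s
assert abs(rel_scaled / rel_plain * c**2 - 1.0) < 1e-9
```

With c = 0.01 the first layer's parameters therefore move 10,000 times more slowly relative to their magnitude, which matches the frozen distributions observed in the experiment.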
A new initialization method for fully connected layers is proposed in the present application.
According to the issues described, a network should be decomposed properly in order to obtain stable training and better network performance. To solve this problem, a new condition is added to the Xavier/He initialization for layers where the weight matrix does not have sparsity, which can guarantee a more balanced initialization.
It is assumed that a fully connected layer of a multi-layer network is of the form

y = act(αWx + b),   (7)

where α is a positive fixed scalar, x is the input, y is the output, b is the bias, v is the collection of learnable parameters, act is the activation function, and

λ: {0, 1, …, N} × {0, 1, …, M} → {0, 1, …, Λ}

is a function that builds the matrix W from the vector v, with W_nm = v_λ(n, m). In this way, it is guaranteed that each entry in W corresponds to one element in v, but multiple entries in W are allowed to share the same entry in v. L is used to denote the loss function (in the examples in the above description, L is the combination of soft-max and cross-entropy). Only SGD (without momentum) is considered as the optimizer. The parameters to be trained are v and b.
Assuming the notations and single layer structure in formula (7), it is required that any initialization of formula (7) should satisfy the following conditions
It is noted that for a standard fully connected layer, where α equals 1 and entries in the matrix W are in one-to-one relation to entries in v, the last condition in formula (8) will be satisfied automatically. The significance of the initialization formula (8) will become apparent when more general fully connected layers with parameter sharing are considered.
For the splitting formula (5b), having α = c, a calculation shows that
In order to satisfy the initialization conditions of formula (8), c should be equal to 1. For c = 0.01, even though both initializations, formula (6a) and (6b), satisfy the Xavier/He initialization, only the initialization of formula (6a) satisfies the initialization conditions, and it gives a better network decomposition.
In another embodiment, for the CirCNN network in reference document [2], the fully connected layers can be expressed as formula (7) with α = 1. A routine calculation shows the increments, denoting

V_nm = {W_n′m′ : λ(n′, m′) = λ(n, m)},

and using B_nm to denote the number of elements in V_nm. It is noted that the weight sharing comes from the circulant matrices in W.

Assuming that the partial derivatives ∂L/∂W_n′m′ are uncorrelated, it follows that the increments of the shared weights scale with B_nm.
The only way to satisfy the initialization conditions formula (8) is to have Bnm = 1. However, this corresponds to circulant matrices of size 1×1, which simply implies that the CirCNN does not compress the original network at all.
In order to achieve a desirable compression rate, an extra positive scalar α needs to be introduced and the fully connected layer with CirCNN in the reference document [2] reformulated in the form of formula (7). It is stressed that the scalar α will be a constant for this layer, and the learnable variables are still v and b. Thus, by introducing α, the number of learnable parameters is not increased, and the increment in the number of operations for this layer is negligible.
Assuming that it is required to compress this layer by a factor of B, the circulant matrices in W should have size B×B. A calculation as before shows that
where B_nm = B. Then, to achieve the initialization conditions of formula (8), α is set as
where gain should be determined by the activation function. To see the effect of the positive scalar α on the updates of v_λ(n, m), the calculation is as follows. Supposing the learning rate of the SGD optimizer is r, then after one forward and backward propagation,
It shows that the effective learning rate for v_λ(n, m) changes from r to αr. Thus, by formulating each layer as in formula (7), the effective learning rate of the learnable parameters can be changed layer by layer.
The effectiveness of the proposed initialization method is shown by numerical experiments.
Example 1. A training process of a simple network with the initialization formula (10) is shown on the MNIST dataset. The network structure is summarized in Table 1, and CirCNN compression is used for the first fully connected layer. It is noted that after the compression, the total number of learnable parameters is only about 0.76% of the original model. From the plot of the loss function in Figure 1, it can be seen that the proposed initialization method better suits this CirCNN network.
Table 1: Network structure of Example 1. For a compression ratio B ≥ 1, a circulant implementation with block size B is used; the number of parameters for the CirCNN implementation should be divided by B.
Example 2. A simple network in which the weight matrices of many layers share a common weight matrix V. Details of the network, together with the initialization of the common weight V by the proposed formula (8), are summarized in formulas (12a) and (12b), where, as before, for the proposed initialization the weight matrices have the form Wi = αV for i = 1, 2, …, 50. A comparison between the initialization (12b) and the Xavier/He initialization is summarized in Figure 2.
Example 3. The proposed initialization method is tested on a CirCNN implementation of the VGG16 network. For the second fully connected layer in VGG16, which has 4096 input channels and 4096 output channels, a circulant matrix with block size 4096 is used. Both the Xavier/He and the proposed initialization methods are tested on the Cifar10 dataset with 5 runs each. The original VGG16 network achieves 92.24% top-1 accuracy. Over the 5 runs, the network with the proposed initialization consistently outperforms the network with the Xavier/He initialization on top-1 accuracy: the average top-1 accuracy for the proposed method is 92.00%, while that for the Xavier/He initialization is 90.98%.
Figure 3 illustrates a method for neural network training. The training solution with the proposed initialization method can be summarized as follows:
S101: Determining parameters for an initialization weight matrix of a layer of a neural network, according to the block size in the circulant implementation of the layer.
It is noted that for other structured implementations with shared elements, S101 comprises determining parameters for a weight matrix of each layer of a network, based on the number of shared elements in the structured implementation with shared elements of each layer.
In an implementation, the parameters include:
where gain equals 1 for the identity function and equals 2 in the initialization for ReLU; M or m represents the number of input channels; N or n represents the number of output channels; and Bnm represents the amount of parameter sharing of Wnm, which depends on the method of parameter sharing. For example, the circulant matrix mentioned above is a matrix with a parameter-sharing form; in a CirCNN network, Bnm = B, where B is the block size in the circulant implementation of the layer. Xavier_value is used to generate vλ(n, m), a random number drawn from a uniform distribution between -Xavier_value and Xavier_value, that is, on [-Xavier_value, Xavier_value].
It is noted that Cnm is negatively correlated with the number of shared weights: the more weight sharing, the smaller the value of Cnm. In this case, the learning rates of the shared weights become lower, and the stability of training the neural network is improved.
In some other embodiments, Cnm can be equal to 1/Bnm, 1/(Bnm×Bnm), and so on.
S102: Determining the initialization weight matrix according to the parameters.
More specifically,
Wnm = Cnm·vλ(n, m)
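Steps S101 and S102 can be sketched as follows. The exact formulas for Cnm and Xavier_value are given as figures in the original and are not reproduced in this text, so the sketch assumes Cnm = 1/√Bnm (one choice that is negatively correlated with the number of shared elements, as required) and the standard Xavier uniform bound gain·√(6/(m + n)); both are assumptions, not the literal formulas of the embodiments.

```python
import numpy as np

def init_shared_weights(m, n, B, gain=1.0, rng=None):
    """Initialize the defining values v and scaled weights W for one layer.

    m, n : numbers of input / output channels
    B    : number of shared elements (the block size for a circulant layer)
    gain : 1 for the identity function; the text uses 2 for ReLU
    """
    if rng is None:
        rng = np.random.default_rng(0)
    C = 1.0 / np.sqrt(B)                          # assumed adjusting parameter Cnm
    xavier_value = gain * np.sqrt(6.0 / (m + n))  # assumed Xavier uniform bound
    v = rng.uniform(-xavier_value, xavier_value, size=(n, m))
    W = C * v                                     # Wnm = Cnm * v_lambda(n, m)
    return v, W

v, W = init_shared_weights(m=512, n=256, B=8)
```

With more sharing (larger B), C shrinks, which lowers the effective learning rate of the shared weights as described above.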
S103: Decompressing Wnm to derive the integrated block-circulant matrix Wt.
Figure 4 shows an example of the compression of a block-circulant matrix. The decompression is the inverse process of the compression; in a lossless compression, it is an exact inverse.
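Step S103 can be sketched as follows, assuming the block-circulant layout of Figure 4: each B×B block of Wt is a circulant matrix fully determined by a single length-B defining vector, so decompression expands each stored vector into its block (NumPy assumed; the storage layout is an illustrative choice, not the one fixed by the embodiments).

```python
import numpy as np

def decompress_block_circulant(blocks):
    """Expand a compressed block-circulant matrix.

    blocks : array of shape (p, q, B) holding the defining vector of each
             B x B circulant block.
    Returns the integrated matrix Wt of shape (p*B, q*B).
    """
    p, q, B = blocks.shape
    Wt = np.zeros((p * B, q * B))
    for i in range(p):
        for j in range(q):
            vec = blocks[i, j]
            # Row k of a circulant block is its defining vector rolled by k.
            block = np.stack([np.roll(vec, k) for k in range(B)])
            Wt[i * B:(i + 1) * B, j * B:(j + 1) * B] = block
    return Wt

Wt = decompress_block_circulant(np.arange(8.0).reshape(2, 1, 4))
```

The compression stores p·q·B values instead of p·q·B², i.e. a factor-of-B reduction, matching the compression ratio discussed above.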
S104: Forward propagation of training.
In an implementation, the training data are processed by formula (1), which is Y = Wt·X + B, where X represents the input data, Y represents the output data, B represents the bias, and Wt is derived from step S103. It is noted that when the layer is a convolution layer, X and B are data transformed by the "im2col" operation.
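A minimal sketch of the forward pass of formula (1), Y = Wt·X + B, for the fully connected case (NumPy assumed; the shapes are illustrative, and for a convolution layer the input would first be flattened by an im2col-style operation):

```python
import numpy as np

rng = np.random.default_rng(0)
Wt = rng.standard_normal((8, 4))    # integrated matrix derived in step S103
X = rng.standard_normal((4, 16))    # a batch of 16 input columns
bias = rng.standard_normal((8, 1))  # per-output-channel bias B

Y = Wt @ X + bias                   # formula (1); bias broadcasts over the batch
```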
S105: Backward propagation of training.
In an implementation, the output data of step S104 are trained by
where L is the loss function of X and Y, and W′t is an updated value of Wt. Accordingly, vλ(n, m) is updated.
S106: Performing iterations of steps S104 and S105 until the result of the training converges.
It is noted that the criterion of convergence is not limited in the present application. For example, the criterion can be that the difference of any parameter involved in the training (vλ(n, m), Wt, the loss, and so on) between two neighboring iterations is smaller than a threshold.
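A minimal sketch of one such convergence criterion for step S106 (assuming the monitored quantity is the integrated matrix Wt and the Frobenius norm is used; the text leaves both choices open):

```python
import numpy as np

def converged(Wt_prev, Wt_curr, threshold=1e-6):
    """Stop when the change in Wt between two neighboring iterations is
    smaller than a threshold (Frobenius norm assumed)."""
    return np.linalg.norm(Wt_curr - Wt_prev) < threshold

# Identical matrices (up to a tiny perturbation) count as converged;
# a large change does not.
assert converged(np.ones((2, 2)), np.ones((2, 2)) + 1e-9)
assert not converged(np.zeros((2, 2)), np.ones((2, 2)))
```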
S107: Storing the trained parameters of the neural network.
In an implementation, steps S101-S103 are implemented on each layer of the neural network, and steps S104-S107 are implemented based on the whole network.
It is noted that the training data can be images, videos, voice signals, shapes, colors, and other to-be-recognized information; accordingly, the trained neural network can be used for the related applications, such as object classification, natural language processing, and speech recognition, as mentioned in the above descriptions.
In an embodiment of the present application, a neural network training method is disclosed, comprising: determining parameters for a weight matrix of each layer of a network, based on the number of shared elements in a structured implementation with shared elements of each layer; determining the weight matrix of each layer based on the determined parameters; decompressing the determined weight matrix to derive an integrated matrix; iterating over forward propagation and backward propagation on all layers of the network based on the integrated matrix until a preset condition is satisfied, wherein the integrated matrix is updated in each iteration; and storing the iterated parameters of the network.
In a feasible implementation, the parameters comprise an adjusting parameter, and the adjusting parameter is negatively correlated with the number of shared elements.
In a feasible implementation, the number of shared elements in the structured implementation with shared elements is a block size in a circulant implementation.
In a feasible implementation, the adjusting parameter is calculated as follows:
wherein m is the number of input channels of a layer of the network, n is the number of output channels of the layer of the network, Cnm is the adjusting parameter, and Bnm is the number of shared elements.
In a feasible implementation, the parameters further comprise a random number in a uniform distribution between -Xavier_value and Xavier_value.
In a feasible implementation, Xavier_value is determined by the following:
In a feasible implementation, the weight matrix is determined by the following:
Wnm = Cnm·vλ(n, m)
where Wnm is the weight matrix.
In a feasible implementation, the integrated matrix being updated in each iteration comprises the random number being updated in each iteration.
In a feasible implementation, the preset condition comprises a training result of the network reaching convergence.
In a feasible implementation, the training result of the network reaching convergence comprises a difference of the weight matrix being smaller than a preset threshold.
In a feasible implementation, an input of the network in training comprises image information data or sound information data.
In a feasible implementation, the network is used to classify objects, process language, or recognize speech.
In an embodiment of the present application, as shown in Figure 5, a neural network training device 400 is disclosed, comprising: a first calculation module 401, to determine parameters for a weight matrix of each layer of a network, based on the number of shared elements in a structured implementation with shared elements of each layer; a second calculation module 402, to determine the weight matrix of each layer based on the determined parameters; a decompression module 403, to decompress the determined weight matrix to derive an integrated matrix; an iteration module 404, to iterate over forward propagation and backward propagation on all layers of the network based on the integrated matrix until a preset condition is satisfied, wherein the integrated matrix is updated in each iteration; and a storage module 405, to store the iterated parameters of the network.
In a feasible implementation, the parameters comprise an adjusting parameter, and the adjusting parameter is negatively correlated with the number of shared elements.
In a feasible implementation, the number of shared elements in the structured implementation with shared elements is a block size in a circulant implementation.
In a feasible implementation, the adjusting parameter is calculated as follows:
wherein m is the number of input channels of a layer of the network, n is the number of output channels of the layer of the network, Cnm is the adjusting parameter, and Bnm is the number of shared elements.
In a feasible implementation, the parameters further comprise a random number in a uniform distribution between -Xavier_value and Xavier_value.
In a feasible implementation, Xavier_value is determined by the following:
In a feasible implementation, the weight matrix is determined by the following:
Wnm = Cnm·vλ(n, m)
where Wnm is the weight matrix.
In a feasible implementation, the integrated matrix being updated in each iteration comprises the random number being updated in each iteration.
In a feasible implementation, the preset condition comprises a training result of the network reaching convergence.
In a feasible implementation, the training result of the network reaching convergence comprises a difference of the weight matrix being smaller than a preset threshold.
In a feasible implementation, an input of the network in training comprises image information data or sound information data.
In a feasible implementation, the network is used to classify objects, process language, or recognize speech.
In an embodiment of the present application, an apparatus for training a neural network is disclosed, the apparatus comprising: one or more processors; and a non-transitory computer-readable storage medium coupled to the processors and storing programming for execution by the processors, wherein the programming, when executed by the processors, configures the apparatus to carry out the method according to any one of claims 1-12 in the present application.
In an embodiment of the present application, a computer program product is disclosed, comprising a program code for performing the method according to any one of claims 1-12 in the present application.
Figure 6 is a simplified block diagram of an apparatus 500 that may be used as the apparatus for training a neural network according to the exemplary embodiment.
A processor 502 in the apparatus 500 can be a central processing unit. Alternatively, the processor 502 can be any other type of device, or multiple devices, capable of manipulating or processing information now-existing or hereafter developed. Although the disclosed implementations can be practiced with a single processor as shown, e.g., the processor 502, advantages in speed and efficiency can be achieved using more than one processor.
A memory 504 in the apparatus 500 can be a read only memory (ROM) device or a random access memory (RAM) device in an implementation. Any other suitable type of storage device can be used as the memory 504. The memory 504 can include code and data 506 that is accessed by the processor 502 using a bus 512. The memory 504 can further include an operating system 508 and application programs 510, the application programs 510 including at least one program that permits the processor 502 to perform the methods described here. For example, the application programs 510 can include applications 1 through N, which further include a video coding application that performs the methods described here.
The apparatus 500 can also include one or more output devices, such as a display 518. The display 518 may be, in one example, a touch sensitive display that combines a display with a touch sensitive element that is operable to sense touch inputs. The display 518 can be coupled to the processor 502 via the bus 512.
Although depicted here as a single bus, the bus 512 of the apparatus 500 can be composed of multiple buses. Further, the secondary storage 514 can be directly coupled to the other components of the apparatus 500 or can be accessed via a network and can comprise a single integrated unit such as a memory card or multiple units such as multiple memory cards. The apparatus 500 can thus be implemented in a wide variety of configurations.
In some embodiments, some or all of the functions or processes of the one or more of the devices are implemented or supported by a computer program that is formed from computer readable program code and that is embodied in a computer readable medium. The phrase “computer readable program code” includes any type of computer code, including source code, object code, and executable code. The phrase “computer readable medium” includes any type of medium capable of being accessed by a computer, such as read only memory (ROM) , random access memory (RAM) , a hard disk drive, a compact disc (CD) , a digital video disc (DVD) , or any other type of memory.
It may be advantageous to set forth definitions of certain words and phrases used throughout this patent document. The terms “include” and “comprise, ” as well as derivatives thereof, mean inclusion without limitation. The term “or” is inclusive, meaning and/or. The phrases “associated with” and “associated therewith, ” as well as derivatives thereof, mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like.
While this disclosure has described certain embodiments and generally associated methods, alterations and permutations of these embodiments and methods will be apparent to those skilled in the art. Accordingly, the above description of example embodiments does not define or constrain this disclosure. Other changes, substitutions, and alterations are also possible without departing from the scope of this disclosure, as defined by the following claims.
Claims (26)
- A neural network training method, comprising: determining parameters for a weight matrix of each layer of a network, based on the number of shared elements in a structured implementation with shared elements of each layer; determining the weight matrix of each layer based on the determined parameters; decompressing the determined weight matrix to derive an integrated matrix; iterating over forward propagation and backward propagation on all layers of the network based on the integrated matrix until a preset condition is satisfied, wherein the integrated matrix is updated in each iteration; and storing the iterated parameters of the network.
- The method of claim 1, wherein the parameters comprise an adjusting parameter, and wherein the adjusting parameter is negatively correlated with the number of shared elements.
- The method of claim 1 or 2, wherein the number of shared elements in the structured implementation with shared elements is a block size in a circulant implementation.
- The method of any one of claims 1-3, wherein the adjusting parameter is calculated as follows: wherein m is the number of input channels of a layer of the network, n is the number of output channels of the layer of the network, Cnm is the adjusting parameter, and Bnm is the number of shared elements.
- The method of claim 4, wherein the parameters further comprise a random number in a uniform distribution between -Xavier_value and Xavier_value.
- The method of claim 6, wherein the weight matrix is determined by the following: Wnm = Cnm·vλ(n, m), where Wnm is the weight matrix.
- The method of any one of claims 5-7, wherein the integrated matrix being updated in each iteration comprises the random number being updated in each iteration.
- The method of any one of claims 5-8, wherein the preset condition comprises a training result of the network reaching convergence.
- The method of claim 9, wherein the training result of the network reaching convergence comprises a difference of the weight matrix being smaller than a preset threshold.
- The method of any one of claims 1-10, wherein an input of the network in training comprises image information data or sound information data.
- The method of any one of claims 1-11, wherein the network is used to classify objects, process language, or recognize speech.
- A neural network training device, comprising: a first calculation module, to determine parameters for a weight matrix of each layer of a network, based on the number of shared elements in a structured implementation with shared elements of each layer; a second calculation module, to determine the weight matrix of each layer based on the determined parameters; a decompression module, to decompress the determined weight matrix to derive an integrated matrix; an iteration module, to iterate over forward propagation and backward propagation on all layers of the network based on the integrated matrix until a preset condition is satisfied, wherein the integrated matrix is updated in each iteration; and a storage module, to store the iterated parameters of the network.
- The device of claim 13, wherein the parameters comprise an adjusting parameter, and wherein the adjusting parameter is negatively correlated with the number of shared elements.
- The device of claim 13 or 14, wherein the number of shared elements in the structured implementation with shared elements is a block size in a circulant implementation.
- The device of any one of claims 13-15, wherein the adjusting parameter is calculated as follows: wherein m is the number of input channels of a layer of the network, n is the number of output channels of the layer of the network, Cnm is the adjusting parameter, and Bnm is the number of shared elements.
- The device of claim 16, wherein the parameters further comprise a random number in a uniform distribution between -Xavier_value and Xavier_value.
- The device of claim 18, wherein the weight matrix is determined by the following: Wnm = Cnm·vλ(n, m), where Wnm is the weight matrix.
- The device of any one of claims 17-19, wherein the integrated matrix being updated in each iteration comprises the random number being updated in each iteration.
- The device of any one of claims 17-20, wherein the preset condition comprises a training result of the network reaching convergence.
- The device of claim 21, wherein the training result of the network reaching convergence comprises a difference of the weight matrix being smaller than a preset threshold.
- The device of any one of claims 13-22, wherein an input of the network in training comprises image information data or sound information data.
- The device of any one of claims 13-23, wherein the network is used to classify objects, process language, or recognize speech.
- An apparatus for training a neural network, the apparatus comprising: one or more processors; and a non-transitory computer-readable storage medium coupled to the processors and storing programming for execution by the processors, wherein the programming, when executed by the processors, configures the apparatus to carry out the method according to any one of claims 1-12.
- A computer program product comprising a program code for performing the method according to any one of claims 1-12.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201980093322.6A CN113508401A (en) | 2019-04-29 | 2019-04-29 | Method and apparatus for training and applying neural networks |
PCT/CN2019/084969 WO2020220191A1 (en) | 2019-04-29 | 2019-04-29 | Method and apparatus for training and applying a neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/084969 WO2020220191A1 (en) | 2019-04-29 | 2019-04-29 | Method and apparatus for training and applying a neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020220191A1 true WO2020220191A1 (en) | 2020-11-05 |
Family
ID=73028749
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/084969 WO2020220191A1 (en) | 2019-04-29 | 2019-04-29 | Method and apparatus for training and applying a neural network |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113508401A (en) |
WO (1) | WO2020220191A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113050077A (en) * | 2021-03-18 | 2021-06-29 | 电子科技大学长三角研究院(衢州) | MIMO radar waveform optimization method based on iterative optimization network |
CN113077853A (en) * | 2021-04-06 | 2021-07-06 | 西安交通大学 | Double-loss-value network deep reinforcement learning KVFD model mechanical parameter global optimization method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104899641A (en) * | 2015-05-25 | 2015-09-09 | 杭州朗和科技有限公司 | Deep neural network learning method, processor and deep neural network learning system |
US20170103308A1 (en) * | 2015-10-08 | 2017-04-13 | International Business Machines Corporation | Acceleration of convolutional neural network training using stochastic perforation |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113050077A (en) * | 2021-03-18 | 2021-06-29 | 电子科技大学长三角研究院(衢州) | MIMO radar waveform optimization method based on iterative optimization network |
CN113050077B (en) * | 2021-03-18 | 2022-07-01 | 电子科技大学长三角研究院(衢州) | MIMO radar waveform optimization method based on iterative optimization network |
CN113077853A (en) * | 2021-04-06 | 2021-07-06 | 西安交通大学 | Double-loss-value network deep reinforcement learning KVFD model mechanical parameter global optimization method and system |
CN113077853B (en) * | 2021-04-06 | 2023-08-18 | 西安交通大学 | Global optimization method and system for mechanical parameters of double loss value network deep reinforcement learning KVFD model |
Also Published As
Publication number | Publication date |
---|---|
CN113508401A (en) | 2021-10-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19927258; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 19927258; Country of ref document: EP; Kind code of ref document: A1 |