CN111062474A - Neural network optimization method by solving lifted proximal operator machines - Google Patents

Neural network optimization method by solving lifted proximal operator machines

Info

Publication number
CN111062474A
CN111062474A
Authority
CN
China
Prior art keywords
neural network
optimization method
lpom
training
network optimization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811203464.7A
Other languages
Chinese (zh)
Other versions
CN111062474B (en)
Inventor
林宙辰
李嘉
方聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN201811203464.7A priority Critical patent/CN111062474B/en
Publication of CN111062474A publication Critical patent/CN111062474A/en
Application granted granted Critical
Publication of CN111062474B publication Critical patent/CN111062474B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a neural network optimization method by solving lifted proximal operator machines (LPOM), and relates to the technical field of deep learning neural network optimization. In the training of a feed-forward neural network, the LPOM model is solved by a block coordinate descent method; each subproblem of the LPOM model has a convergence guarantee, the weights and activations of each layer of the neural network can be updated in parallel, and no extra memory space is occupied. By adopting the technical scheme of the invention, the parallelism, applicability, and training effect of neural network training can be improved while using relatively little storage.

Description

Neural network optimization method by solving lifted proximal operator machines
Technical Field
The invention relates to the technical field of deep learning neural network optimization, and in particular to a method for neural network optimization by solving a lifted proximal operator machine (LPOM).
Background
A feed-forward deep neural network is composed of a hierarchy of fully connected layers with no feedback connections. With recent advances in hardware and dataset size, feed-forward deep neural networks have become the standard for many tasks, for example image recognition [16], speech recognition [12], and natural language understanding [6], and they serve as an important component of the Go-playing system of [22].
For decades, the objective for training a feed-forward neural network has typically been a highly non-convex function that is nested with respect to the network weights. The main method for optimizing feed-forward neural networks is stochastic gradient descent (SGD) [21], whose effectiveness has been verified by its success in various practical applications. In recent years, various variants of SGD have been proposed. They use adaptive learning rates or momentum terms, such as Nesterov momentum [23], AdaGrad [8], RMSProp [7], and Adam [15]. SGD and its variants use a small number of training samples to estimate the gradient, which makes each iteration cheap to compute. Furthermore, since the estimated gradient contains noise, this helps in escaping saddle points [9]. However, these methods also have disadvantages. The main problem is that the magnitude of the gradient decreases or increases exponentially with the number of network layers, causing the gradient to vanish or explode. This phenomenon leads to slow or unstable convergence and is particularly severe in deeper neural networks. The disadvantage can be mitigated by using non-saturating activation functions, such as rectified linear units (ReLUs), and modified network architectures, such as ResNet [11]. However, the fundamental problem still remains [24]. In addition, these methods can neither directly handle non-differentiable activation functions (such as those of binarized neural networks [13]), nor update the weights of different layers in parallel.
The shortcomings of SGD have motivated the study of new approaches to training feed-forward neural networks. Recently, training a feed-forward neural network has been formulated as a constrained optimization problem. It introduces the network activations as auxiliary variables, and the network structure is enforced by layer-wise constraints [3]. This decouples the nested dependency of the functions into equality constraints, which can then be handled by a number of standard optimization algorithms. The main difference among methods of this type is how the equality constraints are handled. Document [4] approximates the equality constraints by quadratic penalty terms and alternately optimizes the network weights and activations. Document [25] introduces one more auxiliary variable per layer and also uses quadratic penalty terms to approximate the equality constraints. However, both of these approaches either only approximate the equality constraints or introduce more auxiliary variables. Inspired by the alternating direction method [16], documents [24] and [27] use the augmented Lagrangian method to enforce the equality constraints exactly. However, both of these methods involve Lagrange multipliers and nonlinear constraints, which require more memory and make the optimization more difficult. Based on the fact that the ReLU activation function is equivalent to a simple constrained convex optimization problem, document [26] relaxes the nonlinear constraints into penalty terms that characterize the network structure and the ReLU activation function, so the nonlinear constraints no longer exist. However, this method is limited to the ReLU activation function and cannot be used for other activation functions. Document [2] adopts a similar idea but discusses general monotonically increasing activation functions; nevertheless, its algorithms for updating the weights and activations are still limited to the ReLU function, and the method can only be used to initialize SGD and cannot exceed the performance of SGD. Patent [1] proposes a new model that approximates the feed-forward neural network, called the lifted proximal operator machine (LPOM). LPOM rewrites the activation function as an equivalent proximal operator and adds the proximal operator as a penalty term to the objective function to approximate the feed-forward neural network. However, the solving algorithm presented in patent [1] does not exploit the property that the model is block-convex with respect to the per-layer weights and activations: updating the network activations with the alternating direction method introduces many auxiliary variables, and it is very difficult to select a proper learning rate when updating the weights with gradient descent.
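For intuition, the proximal-operator rewriting on which LPOM is based can be summarized as follows; this is a standard derivation sketched here for illustration (using the definition of f given with formula 1 below), not text quoted from patent [1].

```latex
% For a monotonically increasing activation \phi, let f(x) = \int_0^x (\phi^{-1}(y) - y)\,dy.
% Then \phi is exactly the proximal operator of f:
\begin{align*}
\operatorname{prox}_f(x) &= \operatorname*{arg\,min}_u \; f(u) + \tfrac{1}{2}(u - x)^2, \\
0 &= \big(\phi^{-1}(u) - u\big) + (u - x) \;\Longrightarrow\; u = \phi(x),
\end{align*}
% so the layer relation X_i = \phi(W_{i-1}X_{i-1}) is recovered as the minimizer of
% f(X_i) + \tfrac{1}{2}\|X_i - W_{i-1}X_{i-1}\|_F^2, the penalty term that LPOM adds to the objective.
```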
Cited documents:
[1] A lifted proximal operator machine neural network optimization method. Chinese patent application 201711156691.4.
[2] Askari, A.; Negiar, G.; Sambharya, R.; and Ghaoui, L. E. 2018. Lifted neural networks. arXiv preprint arXiv:1805.01532.
[3] Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 183–202.
[4] Carreira-Perpinan, M., and Wang, W. 2014. Distributed optimization of deeply nested systems. In International Conference on Artificial Intelligence and Statistics, 10–19.
[5] Clevert, D.-A.; Unterthiner, T.; and Hochreiter, S. 2015. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.
[6] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.
[7] Dauphin, Y.; de Vries, H.; and Bengio, Y. 2015. Equilibrated adaptive learning rates for non-convex optimization. In NIPS, 1504–1512.
[8] Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.
[9] Ge, R.; Huang, F.; Jin, C.; and Yuan, Y. 2015. Escaping from saddle points - online stochastic gradient for tensor decomposition. In COLT, 797–842.
[10] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.
[11] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
[12] Hinton, G.; Deng, L.; Yu, D.; Dahl, G. E.; Mohamed, A.-R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. N.; et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6):82–97.
[13] Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2016. Binarized neural networks. In Advances in NIPS, 4107–4115.
[14] Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, 675–678. ACM.
[15] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[16] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1097–1105.
[17] Lin, Z.; Liu, R.; and Su, Z. 2011. Linearized alternating direction method with adaptive penalty for low-rank representation. In NIPS, 612–620.
[18] Nesterov, Y., ed. 2004. Introductory Lectures on Convex Optimization: A Basic Course. Springer.
[19] Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, 5.
[20] Parikh, N.; Boyd, S.; et al. 2014. Proximal algorithms. Foundations and Trends in Optimization 1(3):127–239.
[21] Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature 323(6088):533.
[22] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484.
[23] Sutskever, I.; Martens, J.; Dahl, G.; and Hinton, G. 2013. On the importance of initialization and momentum in deep learning. In ICML, 1139–1147.
[24] Taylor, G.; Burmeister, R.; Xu, Z.; Singh, B.; Patel, A.; and Goldstein, T. 2016. Training neural networks without gradients: A scalable ADMM approach. In ICML, 2722–2731.
[25] Zeng, J.; Ouyang, S.; Lau, T. T.-K.; Lin, S.; and Yao, Y. 2018. Global convergence in deep learning with variable splitting via the Kurdyka-Lojasiewicz property. arXiv preprint arXiv:1803.00225.
[26] Zhang, Z., and Brand, M. 2017. Convergent block coordinate descent for training Tikhonov regularized deep neural networks. In NIPS, 1721–1730.
[27] Zhang, Z.; Chen, Y.; and Saligrama, V. 2016. Efficient training of very deep neural networks for supervised hashing. In CVPR, 1487–1495.
Disclosure of Invention
To overcome the above-mentioned deficiencies of the prior art, the present invention provides a new method of solving the lifted proximal operator machine (LPOM) for training feed-forward neural networks. Unlike existing neural network optimization methods, this solution has a convergence guarantee for each subproblem, can update the variables in parallel, and occupies memory comparable to that of the stochastic gradient descent (SGD) method during solving.
For convenience of description, the invention first introduces the LPOM model, specifically as shown in formula 1:
$$\min_{\{W_i\},\{X_i\}} \; \ell(X_n, L) + \sum_{i=2}^{n} \mu_i \Big( \mathbf{1}^{\top} f(X_i)\, \mathbf{1} + \mathbf{1}^{\top} g(W_{i-1} X_{i-1})\, \mathbf{1} + \tfrac{1}{2}\, \| X_i - W_{i-1} X_{i-1} \|_F^2 \Big) \qquad (1)$$
where W_{i-1} is the weight of layer i-1, X_i is the activation of layer i, i = 2, …, n, ℓ(X_n, L) is the loss function, n is the number of layers of the neural network, X_1 is the training sample (when i = 2, X_{i-1} is X_1), and L is the class labels corresponding to X_1;
$$f(x) = \int_0^x \big(\phi^{-1}(y) - y\big)\, dy, \qquad g(x) = \int_0^x \big(\phi(y) - y\big)\, dy;$$
for matrix inputs, f(x) and g(x) are applied element-wise; φ(x) is the activation function and φ^{-1} is the inverse function of φ; μ_i > 0 is the parameter of the i-th penalty term; 1 is the all-ones column vector; ‖·‖_F is the Frobenius norm. If ℓ(X_n, L) is convex with respect to X_n and φ(x) is monotonically increasing, then LPOM is block-convex with respect to W_i and X_i, i.e., the objective function of formula 1 is convex with respect to W_i and with respect to X_i if the remaining variables remain unchanged.
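For concreteness, the following minimal NumPy sketch evaluates the LPOM objective of formula 1 for a fully-connected network with the ReLU activation; the helper names (`lpom_objective`, `f_int`, `g_int`) and the least-squares loss used here are illustrative assumptions rather than requirements of formula 1.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def g_int(z):
    # g(x) = ∫_0^x (φ(y) − y) dy; for φ = ReLU this is −x²/2 for x < 0 and 0 for x ≥ 0
    return np.where(z < 0.0, -0.5 * z ** 2, 0.0)

def f_int(x):
    # f(x) = ∫_0^x (φ⁻¹(y) − y) dy; for ReLU it reduces to the indicator of x ≥ 0,
    # which is 0 on the non-negative activations produced by the updates, so 0 is returned here.
    return np.zeros_like(x)

def lpom_objective(W, X, L, mu):
    """Formula 1: ℓ(X_n, L) + Σ_{i=2}^n μ_i (1ᵀf(X_i)1 + 1ᵀg(W_{i−1}X_{i−1})1 + ½‖X_i − W_{i−1}X_{i−1}‖_F²).

    W: [W_1, ..., W_{n−1}]; X: [X_1, ..., X_n] with X_1 the input batch (columns are samples);
    L: labels for the loss; mu: [μ_2, ..., μ_n].
    """
    value = 0.5 * np.linalg.norm(X[-1] - L) ** 2           # least-squares loss as an example ℓ
    for i in range(1, len(X)):                             # zero-based index i corresponds to X_{i+1}
        pre = W[i - 1] @ X[i - 1]
        value += mu[i - 1] * (f_int(X[i]).sum() + g_int(pre).sum()
                              + 0.5 * np.linalg.norm(X[i] - pre) ** 2)
    return value
```

For example, `lpom_objective(W, X, L, [20.0] * (len(X) - 1))` evaluates the penalty with μ_i = 20, the value used in the embodiments described below.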
The technical scheme provided by the invention is as follows:
a neural network optimization method by solving lifted proximal operator machines, wherein, in the training of a feed-forward neural network, the LPOM model (formula 1) is solved by a new block coordinate descent method; each subproblem of the LPOM model has a convergence guarantee, the variables can be updated in parallel, and no additional memory space is occupied; the method comprises the following steps:
1) Randomly select m_1 training samples X_1 and their class labels L from the neural network training samples, where m_1 is the batch size and L is the class labels corresponding to the training samples X_1;
2) Update the network activations X_i layer by layer, i = 2, …, n, by performing operations 21) and 22); the symbols in these steps have the same meanings as in formula 1:
21) Update X_i sequentially in the order i = 2, …, n-1, iterating formula 2 until convergence:
$$X_i \leftarrow \phi\Big( W_{i-1} X_{i-1} - \frac{\mu_{i+1}}{\mu_i}\, W_i^{\top} \big( \phi(W_i X_i) - X_{i+1} \big) \Big) \qquad (2)$$
In formula 2, μ_i and μ_{i+1} are the parameters of the i-th and (i+1)-th penalty terms, respectively.
22) Update X_n by iterating formula 3 until convergence:
$$X_n \leftarrow \phi\Big( W_{n-1} X_{n-1} - \frac{1}{\mu_n} \frac{\partial \ell(X_n, L)}{\partial X_n} \Big) \qquad (3)$$
In formula 3, μ_n is the parameter of the n-th penalty term in formula 1.
3) Update the network weights W_i, i = 1, …, n-1.
It is assumed here that the function
$$h(Z) = \mathbf{1}^{\top} g(Z)\, \mathbf{1} + \tfrac{1}{2}\, \| X_{i+1} - Z \|_F^2$$
is β-smooth, i.e., the following inequality holds:
$$\| \nabla h(Z_1) - \nabla h(Z_2) \|_F = \| \phi(Z_1) - \phi(Z_2) \|_F \le \beta\, \| Z_1 - Z_2 \|_F .$$
W_i is updated by the following procedure.
Initialization: W_{i,0}, W_{i,1}, θ_0 = 0, and t = 1; where W_{i,0} and W_{i,1} are the initial values for iteratively updating W_i, θ_0 is the initial value of the parameter θ, and t is the iteration counter.
31) Compute θ_t:
$$\theta_t = \frac{1 + \sqrt{1 + 4\theta_{t-1}^2}}{2}$$
where θ_t > 0 is the value of the parameter θ at the t-th iteration;
32) Compute Y_{i,t}:
$$Y_{i,t} = W_{i,t} + \frac{\theta_{t-1} - 1}{\theta_t} \big( W_{i,t} - W_{i,t-1} \big)$$
where Y_{i,t} denotes the value of Y_i at the t-th iteration;
33) Compute W_{i,t+1}:
$$W_{i,t+1} = Y_{i,t} - \frac{1}{\beta} \big( \phi(Y_{i,t} X_i) - X_{i+1} \big)\, X_i^{\dagger}$$
where W_{i,t+1} denotes the value of W_i at the (t+1)-th iteration and X_i^† denotes the pseudo-inverse of X_i;
34)t←t+1;
Steps 21), 22), and 3) have convergence guarantees and realize layer-by-layer updating of the network activations and network weights.
In this way, neural network optimization is realized by the block coordinate descent method for solving lifted proximal operator machines.
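As a concrete illustration of steps 2) and 3), the following NumPy sketch performs one block coordinate descent pass on a mini-batch. It assumes a generic increasing activation `phi`, a caller-supplied loss gradient for formula 3, and fixed inner iteration counts; the function names, the step size 1/β in step 33), and the stopping rule are illustrative assumptions, not a verbatim implementation of the patent.

```python
import numpy as np

def update_activations(W, X, L, mu, phi, loss_grad, n_iter=100):
    """Steps 21)-22): layer-by-layer fixed-point updates of X_2, ..., X_n (formulas 2 and 3).

    Zero-based lists: X[0] = X_1 (input), W[j] maps X[j] to the pre-activation of X[j+1],
    mu[k] = μ_{k+2}; loss_grad(Xn, L) returns ∂ℓ(X_n, L)/∂X_n.
    """
    n = len(X)
    for i in range(1, n - 1):                       # X_2, ..., X_{n-1}
        pre = W[i - 1] @ X[i - 1]                   # W_{i-1} X_{i-1} stays fixed in the inner loop
        for _ in range(n_iter):                     # formula 2, iterated until (approximate) convergence
            X[i] = phi(pre - (mu[i] / mu[i - 1]) * (W[i].T @ (phi(W[i] @ X[i]) - X[i + 1])))
    pre = W[n - 2] @ X[n - 2]
    for _ in range(n_iter):                         # formula 3 for the last layer X_n
        X[n - 1] = phi(pre - loss_grad(X[n - 1], L) / mu[n - 2])
    return X

def update_weight(W_i, X_in, X_out, phi, beta=1.0, n_iter=5):
    """Step 3) with the accelerated iterations 31)-34); X_in = X_i, X_out = X_{i+1}."""
    X_pinv = np.linalg.pinv(X_in)                   # pseudo-inverse of X_i
    W_old, W, theta = W_i, W_i, 0.0                 # W_{i,0} = W_{i,1} = W_i, θ_0 = 0
    for _ in range(n_iter):
        theta_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2))        # step 31)
        Y = W + (theta - 1.0) / theta_new * (W - W_old)                   # step 32)
        W_old, W = W, Y - (phi(Y @ X_in) - X_out) @ X_pinv / beta         # step 33)
        theta = theta_new                                                 # step 34): t ← t + 1
    return W
```

Each weight update depends only on the fixed activations X_i and X_{i+1}, so the calls to `update_weight` for i = 1, …, n-1 can run in parallel, which is the parallelism referred to above.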
Compared with the prior art, the invention has the beneficial effects that:
the method optimizes the forward neural network by solving and improving the adjacent computer machine, and can be used for specific tasks such as image recognition, voice recognition, natural language understanding and the like. The method for solving and improving the block coordinate descent of the adjacent operator machine can improve the parallelism, the applicability and the training effect of neural network training under the condition of using relatively less storage.
Specifically, the method provided by the invention can update the weights and activations of each layer in parallel. In addition, the algorithm uses only the activation function itself and not its derivative, which avoids the gradient vanishing or exploding problem of gradient-based training methods and can improve the training effect of the neural network. The proposed method for optimizing feed-forward neural networks is applicable to general monotonically increasing, Lipschitz-continuous activation functions, which may be saturating and non-differentiable. No auxiliary variables are required other than the activations of each layer, so essentially the same amount of memory is used as with SGD. Furthermore, experiments verify the convergence of the algorithm's layer-wise weight and activation updates. Image recognition experiments on the MNIST, CIFAR-10, and SVHN [19] datasets also verify that the algorithm achieves high accuracy when used for neural network optimization.
Drawings
FIG. 1 compares the results of the proposed algorithm for solving LPOM with the SGD method on the MNIST and CIFAR-10 datasets.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a neural network optimization method by solving lifted proximal operator machines, which adopts a new block coordinate descent method to solve the LPOM model in the training of a feed-forward neural network; each subproblem of the LPOM model has a convergence guarantee, the variables can be updated in parallel, the accuracy of neural network training is improved, and no extra memory space is occupied. The neural network optimization method provided by the invention can be applied to specific tasks such as image recognition, speech recognition, and natural language processing.
The following describes an embodiment using image recognition as an example and compares it with current best results. The method of the invention uses the least-squares loss function
$$\ell(X_n, L) = \tfrac{1}{2}\, \| X_n - L \|_F^2$$
and the ReLU activation function, ReLU(x) = max(x, 0), without using any regularization on the weights. The proposed method for solving LPOM uses the same inputs as the SGD method and uses the random initialization method described in document [10]. The LPOM solving method and the SGD method are compared on the image recognition task on three datasets: MNIST, CIFAR-10, and SVHN [19]. For both SGD and LPOM, all training images in each dataset are used only once per training pass (epoch). Optimizing the training of the image recognition neural network with the method comprises the following steps:
1) Randomly select m_1 training images X_1 and their class labels L from the training samples of the image recognition neural network, where m_1 is the batch size (a typical value is 100 or 256) and L is the class labels corresponding to X_1; the commonly used MNIST and CIFAR-10 datasets each contain 10 classes;
2) Update the activations X_i of the feed-forward neural network layer by layer, i = 2, …, n, by performing operations 21) and 22); the symbols in these steps have the same meanings as in formula 1:
21) Update X_i sequentially in the order i = 2, …, n-1, repeating formula 4 for 100 iterations:
$$X_i \leftarrow \phi\Big( W_{i-1} X_{i-1} - \frac{\mu_{i+1}}{\mu_i}\, W_i^{\top} \big( \phi(W_i X_i) - X_{i+1} \big) \Big) \qquad (4)$$
22) Update the activation X_n of the feed-forward neural network, repeating formula 5 for 100 iterations:
$$X_n \leftarrow \phi\Big( W_{n-1} X_{n-1} - \frac{1}{\mu_n} \frac{\partial \ell(X_n, L)}{\partial X_n} \Big) \qquad (5)$$
3) Update the weights W_i of the feed-forward neural network, i = 1, …, n-1.
For the ReLU activation function, the function
$$h(Z) = \mathbf{1}^{\top} g(Z)\, \mathbf{1} + \tfrac{1}{2}\, \| X_{i+1} - Z \|_F^2$$
is β-smooth with β = 1, i.e., the following inequality holds:
$$\| \mathrm{ReLU}(Z_1) - \mathrm{ReLU}(Z_2) \|_F \le \| Z_1 - Z_2 \|_F .$$
Therefore, W_i can be updated by the following procedure, for a total of 5 iterations.
Initialization: W_{i,0} = W_i, W_{i,1} = W_i, θ_0 = 0, t = 1; that is, W_{i,0} and W_{i,1} are both initialized to the current W_i.
31) Compute θ_t:
$$\theta_t = \frac{1 + \sqrt{1 + 4\theta_{t-1}^2}}{2}$$
where θ_{t-1} is the value of the parameter θ at iteration t-1;
32) Compute Y_{i,t}:
$$Y_{i,t} = W_{i,t} + \frac{\theta_{t-1} - 1}{\theta_t} \big( W_{i,t} - W_{i,t-1} \big)$$
33) Compute W_{i,t+1}:
$$W_{i,t+1} = Y_{i,t} - \frac{1}{\beta} \big( \phi(Y_{i,t} X_i) - X_{i+1} \big)\, X_i^{\dagger}$$
where X_i^† denotes the pseudo-inverse of X_i;
34)t←t+1;
Steps 21), 22), and 3) have convergence guarantees.
In this way, optimization of the image recognition neural network is realized by the block coordinate descent method for solving lifted proximal operator machines.
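Putting steps 1) to 3) together, a hypothetical mini-batch training loop for this ReLU / least-squares embodiment might look as follows, reusing the `update_activations` and `update_weight` sketches given in the Disclosure section above; the layer sizes, initialization scale, and epoch count are placeholders, not values prescribed by the patent.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ls_grad(Xn, L):
    # gradient of the least-squares loss ℓ(X_n, L) = ½‖X_n − L‖_F² with respect to X_n
    return Xn - L

def train_lpom(images, labels, layer_sizes, mu_val=20.0, batch=100, n_epochs=100, seed=0):
    """Sketch of LPOM training: images of shape (d_1, N), one-hot labels of shape (d_n, N)."""
    rng = np.random.default_rng(seed)
    n = len(layer_sizes)
    W = [rng.normal(0.0, np.sqrt(2.0 / (layer_sizes[i] + layer_sizes[i + 1])),
                    (layer_sizes[i + 1], layer_sizes[i]))
         for i in range(n - 1)]                      # simple random initialization (stand-in for [10])
    mu = [mu_val] * (n - 1)                          # μ_2 = ... = μ_n = 20 as in the embodiment
    for _ in range(n_epochs):                        # each training image is used once per epoch
        for start in range(0, images.shape[1], batch):
            X1 = images[:, start:start + batch]      # step 1): a mini-batch of training images
            L = labels[:, start:start + batch]
            X = [X1]
            for i in range(n - 1):                   # forward pass to initialize the activations
                X.append(relu(W[i] @ X[i]))
            X = update_activations(W, X, L, mu, relu, ls_grad, n_iter=100)        # step 2)
            for i in range(n - 1):                   # step 3); these updates can run in parallel
                W[i] = update_weight(W[i], X[i], X[i + 1], relu, beta=1.0, n_iter=5)
    return W
```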
Specifically, on the MNIST dataset, the 784 raw pixels are used as input to both the proposed LPOM solving method and SGD. The dataset contains 60,000 training images and 10,000 test images in total. No pre-processing or data augmentation is used in the implementation. As in document [25], the invention uses a feed-forward fully-connected neural network with 784 input units. For LPOM, μ_i is simply set to 20. In the experiments, the LPOM solving method and the SGD method are both run for 100 epochs with a batch size of 100. On the CIFAR-10 dataset, similarly to document [25], the invention uses a 3072-4000-1000-4000-10 feed-forward fully-connected neural network. The color images are normalized by subtracting the mean of the red, green, and blue channels; no other pre-processing or data augmentation is used. For the LPOM solving method, μ_i is set in the same way, and both the LPOM solving method and SGD are run for 100 epochs with a batch size of 100.
When comparing with document [2] on the MNIST dataset, the invention uses the same network structures as document [2]. In its actual computations, document [2] uses only the ReLU activation function. As in document [2], the LPOM solving method is run for 17 epochs with a batch size of 100; for LPOM, μ_i is set uniformly on all network structures, and no pre-processing or data augmentation is used in the implementation. When comparing with document [24] on the SVHN dataset [19], the settings of document [24] regarding the network structure and the dataset are followed. For the proposed LPOM solving method, μ_i = 20 is set.
The training and test accuracies of the LPOM solving method and the SGD method on the MNIST dataset are shown in FIGS. 1(a) and (b). It can be seen that the training accuracy of both methods is close to 100%, while the test accuracy obtained by solving LPOM with the method of the invention (98.2%) is slightly better than that of SGD (98.0%). The training and test accuracies of LPOM and SGD on the CIFAR-10 dataset are shown in FIGS. 1(c) and (d). It can be seen that the training accuracy of both methods is close to 100%, while the test accuracy of LPOM (52.5%) is higher than that of SGD (47.5%).
The test accuracies of LPOM solved by the method of the invention and of document [2] on MNIST are shown in Table 1. It can be seen that the results of LPOM are significantly better than those of document [2]. The test accuracies of LPOM, SGD, and document [24] on the SVHN dataset are shown in Table 2. It can be seen that the results of solving LPOM by the method of the invention are better than those of SGD and document [24].
Table 1: comparison of LPOM and literature [2] solved by the method of the invention on MNIST data set
Table 2: Comparison of LPOM solved by the method of the invention, SGD, and document [24] on the SVHN dataset (test accuracy)
SGD: 95.0%
Document [24]: 96.5%
LPOM: 98.3%
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (5)

1. A neural network optimization method by solving lifted proximal operator machines, wherein, in the training of a feed-forward neural network, the lifted proximal operator machine (LPOM) model is solved by a block coordinate descent method; each subproblem of the LPOM model has a convergence guarantee; the weights and the network activations of each layer of the neural network can be updated in parallel; and no additional memory space is occupied;
the objective function of the lifted proximal operator machine (LPOM) model is expressed as formula 1:
$$\min_{\{W_i\},\{X_i\}} \; \ell(X_n, L) + \sum_{i=2}^{n} \mu_i \Big( \mathbf{1}^{\top} f(X_i)\, \mathbf{1} + \mathbf{1}^{\top} g(W_{i-1} X_{i-1})\, \mathbf{1} + \tfrac{1}{2}\, \| X_i - W_{i-1} X_{i-1} \|_F^2 \Big) \qquad (1)$$
where W_{i-1} is the weight of layer i-1 of the neural network; X_i is the activation of layer i, i = 2, …, n; ℓ(X_n, L) is the loss function and n is the number of layers of the neural network; X_1 is the training sample (when i = 2, X_{i-1} is X_1); L is the class labels corresponding to X_1;
$$f(x) = \int_0^x \big(\phi^{-1}(y) - y\big)\, dy, \qquad g(x) = \int_0^x \big(\phi(y) - y\big)\, dy;$$
for matrix inputs, f(x) and g(x) are applied element-wise; φ(x) is the activation function and φ^{-1} is the inverse function of φ; μ_i > 0 is the parameter of the i-th penalty term; 1 is the all-ones column vector; ‖·‖_F is the Frobenius norm; if ℓ(X_n, L) is convex with respect to X_n and φ(x) is monotonically increasing, then LPOM is block-convex with respect to W_i and X_i, i.e., the objective function expressed by formula 1 is convex with respect to W_i and with respect to X_i if the remaining variables remain unchanged;
the neural network optimization method for solving and improving the adjacent computer machines comprises the following steps:
1) randomly selecting m_1 training samples X_1 and their class labels L from the neural network training samples, where m_1 is the batch size and L is the class labels corresponding to the training samples X_1;
2) updating the network activations X_i layer by layer, i = 2, …, n, by performing operations 21) and 22):
21) updating X_i sequentially in the order i = 2, …, n-1, iterating formula 2 until convergence:
$$X_i \leftarrow \phi\Big( W_{i-1} X_{i-1} - \frac{\mu_{i+1}}{\mu_i}\, W_i^{\top} \big( \phi(W_i X_i) - X_{i+1} \big) \Big) \qquad (2)$$
in formula 2, μ_{i+1} is the parameter of the (i+1)-th penalty term;
22) updating X_n by iterating formula 3 until convergence:
$$X_n \leftarrow \phi\Big( W_{n-1} X_{n-1} - \frac{1}{\mu_n} \frac{\partial \ell(X_n, L)}{\partial X_n} \Big) \qquad (3)$$
in formula 3, μ_n is the parameter of the n-th penalty term;
3) updating the network weights W_i, i = 1, …, n-1:
in particular, assume that the function
$$h(Z) = \mathbf{1}^{\top} g(Z)\, \mathbf{1} + \tfrac{1}{2}\, \| X_{i+1} - Z \|_F^2$$
is β-smooth, i.e., the inequality
$$\| \nabla h(Z_1) - \nabla h(Z_2) \|_F = \| \phi(Z_1) - \phi(Z_2) \|_F \le \beta\, \| Z_1 - Z_2 \|_F$$
holds; W_i is then updated by the following procedure:
initialization: W_{i,0}, W_{i,1}, θ_0 = 0, t = 1; where W_{i,0} and W_{i,1} are the initial values for iteratively updating W_i, θ_0 is the initial value of the parameter θ, and t is the number of iterations;
31) computing θ_t:
$$\theta_t = \frac{1 + \sqrt{1 + 4\theta_{t-1}^2}}{2}$$
where θ_t > 0 is the value of the parameter θ at the t-th iteration;
32) computing Y_{i,t}:
$$Y_{i,t} = W_{i,t} + \frac{\theta_{t-1} - 1}{\theta_t} \big( W_{i,t} - W_{i,t-1} \big)$$
where Y_{i,t} denotes the value of Y_i at the t-th iteration;
33) computing W_{i,t+1}:
$$W_{i,t+1} = Y_{i,t} - \frac{1}{\beta} \big( \phi(Y_{i,t} X_i) - X_{i+1} \big)\, X_i^{\dagger}$$
where W_{i,t+1} denotes the value of W_i at iteration t+1 and X_i^† denotes the pseudo-inverse of X_i;
34) adding 1 to the number of iterations: t ← t + 1;
steps 21), 22), and 3) have convergence guarantees;
through the above steps, neural network optimization is realized by the block coordinate descent method for solving lifted proximal operator machines.
2. The neural network optimization method of claim 1, wherein the neural network optimization method performs parallel update of weights and network activations of each layer of the neural network.
3. The neural network optimization method of claim 1, wherein the neural network optimization method is applied to image recognition, speech recognition and natural language processing neural networks.
4. The neural network optimization method of claim 1, wherein the neural network optimization method is applied to image recognition;
using the least-squares loss function
$$\ell(X_n, L) = \tfrac{1}{2}\, \| X_n - L \|_F^2$$
and the ReLU activation function ReLU(x) = max(x, 0), without using any regularization on the weights; the training samples are training images in an image dataset.
5. A neural network optimization method as claimed in claim 4, wherein the image dataset is a MNIST, CIFAR-10 and/or SVHN dataset.
CN201811203464.7A 2018-10-16 2018-10-16 Neural network optimization method by solving lifted proximal operator machines Active CN111062474B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811203464.7A CN111062474B (en) 2018-10-16 2018-10-16 Neural network optimization method by solving lifted proximal operator machines

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811203464.7A CN111062474B (en) 2018-10-16 2018-10-16 Neural network optimization method by solving lifted proximal operator machines

Publications (2)

Publication Number Publication Date
CN111062474A true CN111062474A (en) 2020-04-24
CN111062474B CN111062474B (en) 2023-04-28

Family

ID=70296459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811203464.7A Active CN111062474B (en) 2018-10-16 2018-10-16 Neural network optimization method by solving lifted proximal operator machines

Country Status (1)

Country Link
CN (1) CN111062474B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132760A (en) * 2020-09-14 2020-12-25 北京大学 Image recovery method based on learnable differentiable matrix inversion and matrix decomposition
CN112183742A (en) * 2020-09-03 2021-01-05 南强智视(厦门)科技有限公司 Neural network hybrid quantization method based on progressive quantization and Hessian information
CN113313175A (en) * 2021-05-28 2021-08-27 北京大学 Image classification method of sparse regularization neural network based on multivariate activation function

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110019693A1 (en) * 2009-07-23 2011-01-27 Sanyo North America Corporation Adaptive network system with online learning and autonomous cross-layer optimization for delay-sensitive applications
CN107784361A (en) * 2017-11-20 2018-03-09 Peking University A lifted proximal operator machine neural network optimization method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110019693A1 (en) * 2009-07-23 2011-01-27 Sanyo North America Corporation Adaptive network system with online learning and autonomous cross-layer optimization for delay-sensitive applications
CN107784361A (en) * 2017-11-20 2018-03-09 Peking University A lifted proximal operator machine neural network optimization method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GEOFFREY HINTON et al.: "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups"
亢良伊; 王建飞; 刘杰; 叶丹: "A survey of parallel and distributed optimization algorithms for scalable machine learning"
李晓宇; 周铭; 袁晓彤; 罗琦; 刘青山: "Parallel estimation of Gaussian graphical model structure based on neighborhood selection with coordinate descent"
李智; 杨洪耕: "A parallel decomposition and coordination algorithm for reactive power optimization based on block coordinate descent"
谢佩; 游科友; 洪奕光; 谢立华: "Research progress on networked distributed convex optimization algorithms"

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183742A (en) * 2020-09-03 2021-01-05 南强智视(厦门)科技有限公司 Neural network hybrid quantization method based on progressive quantization and Hessian information
CN112183742B (en) * 2020-09-03 2023-05-12 南强智视(厦门)科技有限公司 Neural network hybrid quantization method based on progressive quantization and Hessian information
CN112132760A (en) * 2020-09-14 2020-12-25 北京大学 Image recovery method based on learnable differentiable matrix inversion and matrix decomposition
CN112132760B (en) * 2020-09-14 2024-02-27 北京大学 Image recovery method based on matrix inversion and matrix decomposition capable of learning and differentiating
CN113313175A (en) * 2021-05-28 2021-08-27 北京大学 Image classification method of sparse regularization neural network based on multivariate activation function
CN113313175B (en) * 2021-05-28 2024-02-27 北京大学 Image classification method of sparse regularized neural network based on multi-element activation function

Also Published As

Publication number Publication date
CN111062474B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
Rastegari et al. Xnor-net: Imagenet classification using binary convolutional neural networks
Cortes et al. Adanet: Adaptive structural learning of artificial neural networks
Cai et al. Efficient architecture search by network transformation
Balaji et al. Metareg: Towards domain generalization using meta-regularization
Guberman On complex valued convolutional neural networks
CN110288030B (en) Image identification method, device and equipment based on lightweight network model
Xu et al. Deep neural network compression with single and multiple level quantization
Huang et al. Deep networks with stochastic depth
US12001918B2 (en) Classification using quantum neural networks
Lee et al. Deeply-supervised nets
Godin et al. Dual rectified linear units (DReLUs): A replacement for tanh activation functions in quasi-recurrent neural networks
Singh et al. Layer-specific adaptive learning rates for deep networks
CN111062474B (en) Neural network optimization method by solving lifted proximal operator machines
CN110443372B (en) Transfer learning method and system based on entropy minimization
Li et al. Lifted proximal operator machines
Mosca et al. Deep incremental boosting
US20220215252A1 (en) Method and system for initializing a neural network
Hayou et al. Mean-field behaviour of neural tangent kernel for deep neural networks
Wang et al. Enresnet: Resnet ensemble via the feynman-kac formalism
Wu et al. Steepest descent neural architecture optimization: Escaping local optimum with signed neural splitting
Basheer et al. Alternating layered variational quantum circuits can be classically optimized efficiently using classical shadows
Roth et al. Variational inference in neural networks using an approximate closed-form objective
Wani et al. Training supervised deep learning networks
Wani et al. Supervised deep learning architectures
Chavan et al. A hybrid deep neural network for online learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant