CN109919301A - Deep neural network batch optimization method based on InfoMax criterion - Google Patents
Info
- Publication number
- CN109919301A (application number CN201910141284.9A)
- Authority
- CN
- China
- Prior art keywords
- batch
- random signal
- power
- random
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a deep neural network batch optimization method based on the InfoMax criterion. First, the principle of injecting a random signal is analyzed from a dynamics perspective to clarify the role the random signal plays in the deep neural network; on this basis, a random signal is injected into the input samples in units of batches. Second, the total random signal power is adjusted per batch: as the number of iterations increases, the total random signal power on each batch is reduced, helping the algorithm escape saddle points and converge to a minimum. Finally, at each iteration, the random signal power is reasonably allocated over the batch based on the InfoMax criterion, so that the finally obtained model achieves balanced performance and good recognition results.
Description
Technical Field
The invention belongs to the technical field of deep neural network optimization problems, and particularly relates to a deep neural network batch optimization method based on an information transmission maximization criterion.
Background
The deep neural network is the basis of deep learning. By performing a series of nonlinear operations on input data, it extracts representative features from the input samples and learns the internal structure and regularities of the data; it is widely applied to tasks such as image classification and image recognition and shows good performance. In deep neural networks, the loss function is typically non-convex, so a large number of saddle points are encountered during training, and as the number of network layers increases, the number of saddle points also increases. Therefore, common optimization algorithms easily fall into saddle points, and the resulting trained models perform poorly. To solve this problem, a great deal of current work studies how to escape saddle points so that the algorithm converges to a minimum. Methods for escaping saddle points based on Hessian information use the geometric information about the saddle point provided by the second-order information of the loss function to find an escape direction and train along that direction, thereby converging to a minimum. However, such methods are computationally expensive and complex in high-dimensional spaces. Another class of methods escapes saddle points using random signals: exploiting the unstable nature of saddle points, a random signal is added in each direction of the gradient descent so that, when the algorithm falls into a saddle point, it can continue to update the parameters under the push of the random signal and thereby escape. Although this class of methods is simple and effective, the performance of the trained models still needs to be improved.
Disclosure of Invention
The invention provides a deep neural network batch optimization method based on using random signals to escape saddle points. The principle of injecting random signals is analyzed from a dynamics perspective to determine the role the random signal plays in the deep neural network; on this basis, random signals are injected into the input samples in units of batches, and the total random signal power on a batch is reduced as the number of iterations increases, which ultimately helps the algorithm escape saddle points. Meanwhile, in order to further optimize the deep neural network and balance the performance of the trained model, the invention adjusts the random signal power within a batch based on the information transmission maximization criterion, so that the information in all kinds of input samples is fully utilized during training. The specific idea of the method is as follows: first, taking the optimization algorithms used in deep neural networks, namely the Gradient Descent (GD) algorithm and the batch Stochastic Gradient Descent (SGD) algorithm, as examples, the dynamics principle of injecting random signals is explained; then, based on this analysis, the input samples are randomly divided into several batches and a random signal is injected into each sample in units of batches; next, the total power of the random signals on a batch is reduced as the number of iterations increases; finally, at each iteration, the random signal power over the batch is allocated based on the information transmission maximization criterion.
For the convenience of describing the present invention, the following terms are first defined:
Definition 1: the dynamics principle of injecting random signals
First, the loss function of the deep neural network is expressed by the following formula

$$F(\omega) = \frac{1}{N}\sum_{i=1}^{N} f(x_i, \omega) \qquad (1)$$

where $F(\omega)$ represents the loss function over all input samples and is assumed to be continuous and twice differentiable, $F: \mathbb{R}^d \to \mathbb{R}$, with $d$ denoting the parameter dimension, a finite positive integer. $f(x_i, \omega)$ represents the loss corresponding to the ith input sample and is likewise assumed to be continuous and twice differentiable, $f: \mathbb{R}^d \to \mathbb{R}$. $\omega$ denotes the network parameters, $x_i$ represents the ith input sample, $i \in [1, N]$, and $N$ denotes the total number of training samples, a finite positive integer.
Assuming that the total number of iterations is T, the GD algorithm updates the parameters at iteration t as follows

$$\omega_{t+1} = \omega_t - \varepsilon \nabla F(\omega_t) \qquad (2)$$

where $\nabla F(\omega_t)$ is the gradient of the loss function $F(\omega)$ at the point $\omega_t$, i.e. the limit of the ratio of the increment of the function value to the increment of the argument as the argument increment approaches 0. $\varepsilon$ represents the learning rate, $0 < \varepsilon < 1$, and $t$ represents the current iteration, $0 < t < T$.
In the training process, the learning rate $\varepsilon$ is much smaller than 1 but not 0. Rearranging formula (2) gives

$$\frac{\omega_{t+1} - \omega_t}{\varepsilon} = -\nabla F(\omega_t) \qquad (3)$$

Assuming that the parameter ω is a function of time t and that the left-hand side of equation (3) approximates the derivative of ω, an ordinary differential dynamical system can be derived, expressed by the following ordinary differential equation

$$\frac{d\omega}{dt} = -\nabla F(\omega) \qquad (4)$$

where $d\omega/dt$ denotes the derivative of ω with respect to t, $d\omega$ is the infinitesimal increment of ω, and $dt$ is the infinitesimal increment of t.

Formula (3) can then be regarded as the Euler method applied to formula (4); that is, the parameter update of the GD algorithm actually solves the dynamical system (4) with the Euler method. Because (4) is an ordinary differential dynamical system, the solution for ω is a deterministic trajectory, and ω eventually converges to a stationary point $\omega^*$ with $\nabla F(\omega^*) = 0$. This point may be a minimum, but it may also be a saddle point. Since the deep neural network has a large number of saddle points in the training process, the GD algorithm, which relies only on first-order gradient information for parameter updates, very easily falls into a saddle point.
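As an illustration (not part of the patent) of why the Euler iteration (2) that implements GD can settle at a saddle point, the following sketch runs GD on a toy two-dimensional loss with a saddle at the origin, starting on the saddle's stable manifold:

```python
import numpy as np

def grad_F(w):
    # toy loss F(w) = w[0]**2 - w[1]**2 with a saddle point at the origin
    return np.array([2.0 * w[0], -2.0 * w[1]])

eps = 0.1                       # learning rate epsilon
w = np.array([1.0, 0.0])        # initialized on the stable manifold of the saddle (w[1] = 0)
for t in range(200):            # Euler discretization of dw/dt = -grad F(w), i.e. plain GD
    w = w - eps * grad_F(w)
print(w)                        # converges to the saddle point [0, 0], not a minimum
```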
The batch SGD algorithm differs from the GD algorithm in that, during gradient descent, the gradient over a batch $\nabla F_B(\omega)$ is used in place of the overall gradient $\nabla F(\omega)$ of the GD algorithm. The parameter update of the batch SGD algorithm at iteration t is as follows

$$\omega_{t+1} = \omega_t - \varepsilon \nabla F_B(\omega_t) \qquad (5)$$

where $\nabla F_B(\omega_t)$ is the gradient of the batch loss $F_B(\omega)$ at the point $\omega_t$, i.e. the limit of the ratio of the increment of the function value to the increment of the argument as the argument increment approaches 0, and B represents the size of the batch, a finite positive integer.
Therefore, compared with the GD algorithm, the batch SGD algorithm introduces a certain random signal at each iteration. Denoting the random signal introduced at iteration t by $\xi_t$, we have

$$\nabla F_B(\omega_t) = \nabla F(\omega_t) + \xi_t \qquad (6)$$

where $\xi_t$ is a random signal with mean 0 and variance $\sigma^2$. Since the mean of $\xi_t$ is 0, the variance $\sigma^2$ of $\xi_t$ is equal to the power of $\xi_t$.
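As a concrete numerical illustration of equation (6) (a sketch with made-up names, not the patent's code), the following computes a full gradient and a minibatch gradient for a toy per-sample loss and treats their difference as the injected random signal ξ_t:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, d = 1000, 32, 5
X = rng.normal(size=(N, d))          # input samples x_i
omega = rng.normal(size=d)           # network parameters (toy linear model)

def per_sample_grad(x, w):
    # gradient of f(x_i, w) = 0.5 * (x_i . w)**2 with respect to w
    return (x @ w) * x

full_grad = np.mean([per_sample_grad(x, omega) for x in X], axis=0)

batch_idx = rng.choice(N, size=B, replace=False)
batch_grad = np.mean([per_sample_grad(X[i], omega) for i in batch_idx], axis=0)

xi_t = batch_grad - full_grad        # the random signal introduced by the batch, eq. (6)
print("empirical power of xi_t:", np.mean(xi_t ** 2))
```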
At this time, after the random signal is added, the original ordinary differential dynamical system (4) becomes a stochastic dynamical system, expressed by the following Langevin equation

$$\frac{d\omega}{dt} = -\nabla F(\omega) + \xi(t) \qquad (7)$$

where ξ(t) is a Gaussian process.
According to the solution method for the Langevin equation, the solution for ω in the equilibrium state can be obtained as

$$P(\omega) \propto \exp\left(-\frac{F(\omega)}{\sigma^2}\right) \qquad (8)$$

It can be seen from equation (8) that the parameter ω is no longer a single deterministic value but follows a probability distribution that depends on the random signal power $\sigma^2$; the minimum point of the original F(ω) now becomes the maximum point of P(ω). The larger $\sigma^2$ is, the flatter P(ω) becomes and the more ω may "run" around, so the algorithm traverses more regions; the smaller $\sigma^2$ is, the "sharper" P(ω) becomes and the more prominent the region around the maximum point of P(ω), so once ω falls into this region it does not "run out" easily and gradually converges to a minimum.
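The qualitative effect of σ² described above can be checked numerically. The sketch below (an illustration under the assumed form of equation (8), not part of the patent) evaluates P(ω) ∝ exp(−F(ω)/σ²) for a one-dimensional double-well loss at two noise powers and shows that a larger σ² flattens the distribution:

```python
import numpy as np

omega = np.linspace(-2.5, 2.5, 1001)
F = (omega ** 2 - 1.0) ** 2             # double-well loss: minima at +/-1, barrier at 0

def stationary_density(F, sigma2):
    p = np.exp(-F / sigma2)             # assumed equilibrium form P(omega) ~ exp(-F/sigma^2)
    return p / np.trapz(p, omega)       # normalize so the density integrates to 1

for sigma2 in (0.05, 1.0):
    p = stationary_density(F, sigma2)
    print(f"sigma^2={sigma2}: peak density={p.max():.3f}, density at omega=0: {p[len(p)//2]:.3f}")
# larger sigma^2 -> flatter P(omega): the peak is lower and the barrier region keeps more mass
```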
Definition 2: method of injecting random signals in batches
In the batch SGD algorithm, the magnitude of the random signal power depends on the batch size and the learning rate: the smaller the batch and the larger the learning rate, the larger the random signal power; the larger the batch and the smaller the learning rate, the smaller the random signal power. The study by Smith et al. shows that slowly increasing the batch size as the number of iterations increases, or reducing the learning rate, reduces the random signal power and helps the algorithm converge to a minimum. The reason is that in the early stage of iteration, a somewhat smaller batch and a higher learning rate give a higher random signal power, which helps the algorithm traverse more regions; as the number of iterations increases, the algorithm slowly approaches the region of the minimum, and reducing the random signal power at that point prevents the step from being so large that it overshoots the minimum.
However, increasing the batch size places higher demands on computer hardware and increases the training cost, and reducing the learning rate slows the convergence of the algorithm. Therefore, in order to escape saddle points while saving cost and keeping the training speed as high as possible, the present invention does not change the batch size or the learning rate of the deep neural network during training; instead, random signals are added directly to the input samples, and the power of the random signals is reduced as the number of iterations increases, so that the algorithm finally converges to a minimum. Specifically:
1) dividing all N input samples into K batches, wherein the size of each batch is B;
2) a random signal is added to the input samples in units of batches. The relationship between each input sample and the random signal on the kth batch (k ∈ [1, K]) is as follows

$$\tilde{x}_i = x_i + \sigma_i \xi_i, \quad i \in [1, B] \qquad (9)$$

where $x_i$ represents the ith input sample on the kth batch, $\tilde{x}_i$ is the newly generated sample, $\xi_i \sim N(0, 1)$, and $\sigma_i^2$ represents the power of the random signal allocated to that sample;
3) the total power of the random signal on each batch, $\sigma_B^2 = \sum_{i=1}^{B} \sigma_i^2$, is reduced as the number of iterations increases (a sketch of steps 1) to 3) follows this list).
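A minimal sketch of steps 1) to 3) is given below; the linear decay schedule for the total power and the equal per-sample split are assumptions made for illustration (the patent does not fix a particular schedule, and Definition 3 below replaces the equal split with the InfoMax allocation):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_batches(X, B):
    """Step 1): randomly split all N samples into K batches of size B."""
    idx = rng.permutation(len(X))
    return [X[idx[k:k + B]] for k in range(0, len(X) - B + 1, B)]

def inject_noise(batch, sigma2_per_sample):
    """Step 2): x_tilde_i = x_i + sigma_i * xi_i with xi_i ~ N(0, 1)."""
    xi = rng.standard_normal(batch.shape)
    return batch + np.sqrt(sigma2_per_sample)[:, None] * xi

def total_power_schedule(t, T, sigma2_max=1.0, sigma2_min=0.01):
    """Step 3): total power sigma_B^2 on a batch decays as iteration t grows (assumed linear decay)."""
    return sigma2_max + (sigma2_min - sigma2_max) * t / T

# usage: equal per-sample allocation here; Definition 3 replaces this with the InfoMax allocation
X = rng.normal(size=(1000, 5))
B, T = 32, 30
batches = make_batches(X, B)
for t, batch in enumerate(batches[:T]):
    sigma_B2 = total_power_schedule(t, T)
    sigma2_per_sample = np.full(B, sigma_B2 / B)
    noisy_batch = inject_noise(batch, sigma2_per_sample)
```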
Definition 3: method of adjusting random signal power based on the information transmission maximization criterion
Random signals are injected in batches and the total power of the random signals on the batches is reduced as the number of iterations increases. However, the total random signal power over the batch is constant at each iteration, thus involving the problem of random signal power allocation. In order to fully utilize information contained in various samples in the network training process and enable the model performance to reach balance, the random signal power on a batch is reasonably distributed based on the information transmission maximization criterion. Specifically, an objective function based on the information transmission maximization criterion is provided, and then the function is solved to obtain the method for distributing the random signal power in batches.
First, assume that after the random signal is introduced, the total amount of information in the deep neural network is $C = \sum_{k=1}^{K} c_k$, where $c_k$ is the amount of information on each batch and C equals the sum of the amounts of information over all batches. Thus, maximizing the total amount of information is equivalent to maximizing the amount of information on each batch, $\max c_k$.
In the water-filling algorithm, when the total signal power is fixed, reasonably allocating the signal power maximizes the channel capacity. Inspired by this idea, the following objective function based on information transmission maximization is proposed

$$\max_{\{\sigma_i^2\}} c_k = \sum_{i=1}^{B} \log\left(1 + \frac{\sigma_i^2}{r_i}\right) + \lambda\left(\sigma_B^2 - \sum_{i=1}^{B} \sigma_i^2\right) \qquad (10)$$

where $\sigma_i^2$ is the random signal power assigned to the ith input sample and λ is a Lagrangian constant. $p_i$ represents the signal power, $n_i$ represents the noise power, and $p_i + n_i = r_i$. Here $r_i$ denotes the power of the ith input sample; because a batch optimization strategy is adopted in this method, a certain amount of noise exists in each gradient update, so the sample power $r_i$ is regarded as being composed of two parts, the signal power $p_i$ (without batch optimization, $r_i = p_i$) and the noise power $n_i$.
The meaning of the objective function here is: at each iteration, the total power $\sigma_B^2$ of the random signal injected on the batch is fixed, and reasonably allocating the random signal powers $\sigma_i^2$ maximizes the information transmission.
To maximize $c_k$, let $\partial c_k / \partial \sigma_i^2 = 0$; after simplification, the method for reasonably allocating the random signal power on a batch is obtained as follows

$$\sigma_i^2 = \left(\frac{1}{B}\Big(\sigma_B^2 + \sum_{j=1}^{B} r_j\Big) - r_i\right)^+ \qquad (11)$$

where $(a)^+$ denotes max(a, 0). Since $\sigma_i^2 \ge 0$, when $\frac{1}{B}\big(\sigma_B^2 + \sum_{j=1}^{B} r_j\big) - r_i$ is negative, $\sigma_i^2$ is set to 0, i.e., no random signal power is allocated to the ith input sample.
The meaning of formula (11) is: when the power $r_i$ of an input sample is larger, a smaller random signal power $\sigma_i^2$, or even no power at all, is allocated to it so that the information already contained in such samples can be fully utilized; when $r_i$ is smaller, a larger $\sigma_i^2$ is allocated so that the network pays more attention to such input samples during training and the performance of the finally trained model is balanced.
Therefore, the final method of this invention for adding random signals to the input samples and allocating the random signal power on a batch at each iteration is as follows

$$\tilde{x}_i = x_i + \sigma_i \xi_i, \quad \sigma_i^2 = \left(\frac{1}{B}\Big(\sigma_B^2 + \sum_{j=1}^{B} r_j\Big) - r_i\right)^+ \qquad (12)$$

where $\tilde{x}_i$ represents the new sample generated after adding the random signal to the input sample $x_i$.
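A sketch of the allocation rule (11), written as a single water-filling step under the assumption that the per-sample power $r_i$ is measured as the mean squared value of the sample, might look as follows:

```python
import numpy as np

def infomax_allocation(batch, sigma_B2):
    """Allocate the total random signal power sigma_B2 over a batch of samples.

    Implements sigma_i^2 = ((sigma_B2 + sum_j r_j) / B - r_i)^+  (formula (11)),
    where r_i is taken here as the mean squared value of sample i (an assumption).
    """
    B = len(batch)
    r = np.mean(batch ** 2, axis=1)               # per-sample power r_i
    water_level = (sigma_B2 + r.sum()) / B
    sigma2 = np.maximum(water_level - r, 0.0)     # (a)^+ = max(a, 0)
    return sigma2

# usage: samples with low power r_i receive more random signal power
rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 5)) * np.linspace(0.5, 2.0, 8)[:, None]
print(infomax_allocation(batch, sigma_B2=4.0))
```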
The method comprises the following specific steps:
Step one: dividing all N training samples into K batches, wherein the size of each batch is B;
Step two: random signals are injected into the samples in units of batches. The random signal is injected into each sample on each batch as follows

$$\tilde{x}_i = x_i + \sigma_i \xi_i, \quad \xi_i \sim N(0,1), \quad i \in [1, B]$$

where the total power of the random signal on each batch is denoted $\sigma_B^2$, with $\sigma_B^2 = \sum_{i=1}^{B} \sigma_i^2$;
Step three: from the whole iterative process, the algorithm is far away from the minimum value point in the initial training period of the model, and the algorithm is carried out on each batchLarger, helping the algorithm to escape from the saddle point, and to go quickly towards the minimum point; as the number of training increases, the algorithm approaches a minimum value gradually, at which point it is setSmaller, preventIf the value is too large, the value is over the minimum value, and the next saddle point is sunk;
Step four: at each iteration, $\sigma_B^2$ is held fixed and the random signal power corresponding to each input sample on the batch is allocated based on the information transmission maximization criterion, specifically

$$\sigma_i^2 = \left(\frac{1}{B}\Big(\sigma_B^2 + \sum_{j=1}^{B} r_j\Big) - r_i\right)^+$$

If $\frac{1}{B}\big(\sigma_B^2 + \sum_{j=1}^{B} r_j\big) - r_i < 0$, the random signal power allocated to sample $x_i$ is 0 (a sketch combining steps one to four follows).
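Putting steps one to four together, a hedged end-to-end sketch of one pass over the data is given below; the toy model, the loss gradient, the decay schedule, and the way $r_i$ is measured are all illustrative assumptions, not prescribed by the invention:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_with_infomax_noise(X, y, grad_fn, omega, B, T, eps=0.1,
                             sigma2_max=1.0, sigma2_min=0.01):
    """One pass over the data following steps one to four (a sketch, not the definitive method)."""
    # Step one: split the N samples into K batches of size B
    idx = rng.permutation(len(X))
    batches = [idx[k:k + B] for k in range(0, len(X) - B + 1, B)]
    for t, b in enumerate(batches[:T]):
        # Step three: total power on the batch shrinks as iterations grow (assumed linear decay)
        sigma_B2 = sigma2_max + (sigma2_min - sigma2_max) * t / T
        # Step four: InfoMax (water-filling) allocation of sigma_B2 over the batch
        xb, yb = X[b], y[b]
        r = np.mean(xb ** 2, axis=1)
        sigma2 = np.maximum((sigma_B2 + r.sum()) / len(b) - r, 0.0)
        # Step two: inject the random signal into the input samples
        xb_noisy = xb + np.sqrt(sigma2)[:, None] * rng.standard_normal(xb.shape)
        # ordinary batch SGD update on the noisy batch
        omega = omega - eps * grad_fn(xb_noisy, yb, omega)
    return omega

# usage with a toy linear regression model
def linreg_grad(xb, yb, w):
    return xb.T @ (xb @ w - yb) / len(xb)

X = rng.normal(size=(1000, 5)); w_true = rng.normal(size=5); y = X @ w_true
omega = train_with_infomax_noise(X, y, linreg_grad, np.zeros(5), B=32, T=30)
```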
The invention has the following advantages: aiming at the problem that deep neural networks fall into saddle points during training, a batch optimization algorithm based on information transmission maximization is provided, in which random signals are injected into the input samples and the total random signal power on a batch is reduced as the number of iterations increases, helping the algorithm converge quickly to a minimum. Meanwhile, in order to make full use of the information in all kinds of input samples, the random signal power is allocated over the batch based on the information transmission maximization criterion, so that the finally trained model has good, balanced performance.
Drawings
Fig. 1 is a diagram illustrating the allocation of random signal power over batches based on information transfer maximization criteria, assuming that each batch size is equal to 8.
Detailed Description
The following detailed description is further illustrative of the methods and techniques provided by the present invention and should not be construed as limiting the invention.
Taking the case that the size of each batch is equal to 8 as an example, the implementation flow of the invention is shown in fig. 1, and the specific implementation steps are as follows:
Step one: dividing all N training samples into K batches, wherein the size of each batch is 8;
Step two: random signals are injected into the samples in units of batches. The random signal is injected into each sample on each batch as follows

$$\tilde{x}_i = x_i + \sigma_i \xi_i, \quad \xi_i \sim N(0,1), \quad i \in [1, 8]$$

where the total power of the random signal on each batch is denoted $\sigma_B^2$, with $\sigma_B^2 = \sum_{i=1}^{8} \sigma_i^2$;
Step three: from the whole iterative process, the algorithm is far away from the minimum value point in the initial training period of the model, and the algorithm is carried out on each batchLarger, helping the algorithm to escape from the saddle point, and to go quickly towards the minimum point; as the number of training increases, the algorithm approaches a minimum value gradually, at which point it is setSmaller, preventIf the value is too large, the value is over the minimum value, and the next saddle point is sunk;
Step four: at each iteration, $\sigma_B^2$ is held fixed and the random signal power corresponding to each input sample on the batch is allocated based on the information transmission maximization criterion, specifically

$$\sigma_i^2 = \left(\frac{1}{B}\Big(\sigma_B^2 + \sum_{j=1}^{B} r_j\Big) - r_i\right)^+$$

If $\frac{1}{B}\big(\sigma_B^2 + \sum_{j=1}^{B} r_j\big) - r_i < 0$, the random signal power allocated to sample $x_i$ is 0 (a numeric illustration follows).
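For the batch size of 8 used in this embodiment (as in Fig. 1), a small numeric illustration of the step-four allocation, with made-up per-sample powers, is:

```python
import numpy as np

# hypothetical per-sample powers r_i for one batch of 8 samples
r = np.array([0.2, 0.5, 0.8, 1.0, 1.3, 1.7, 2.4, 3.1])
sigma_B2 = 4.0                                   # total random signal power on this batch

water_level = (sigma_B2 + r.sum()) / len(r)      # (sigma_B^2 + sum_j r_j) / B
sigma2 = np.maximum(water_level - r, 0.0)        # formula (11): low-power samples get more noise
print(np.round(sigma2, 3))
# samples with r_i above the water level (here the last two) receive no random signal power
```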
The above description of the embodiments is only intended to facilitate the understanding of the method of the invention and its core ideas. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
Claims (3)
1. The deep neural network batch optimization method based on the information transmission maximization criterion is characterized by comprising the following steps of:
step one, dividing all N training samples into K batches, wherein the size of each batch is B;
step two, injecting a random signal into the samples in units of batches, wherein on the kth batch (k ∈ [1, K]) the relationship between each input sample and the random signal is:

$$\tilde{x}_i = x_i + \sigma_i \xi_i, \quad i \in [1, B]$$

wherein $x_i$ represents the ith input sample on the kth batch, $\tilde{x}_i$ is the newly generated sample, $\xi_i \sim N(0, 1)$, and $\sigma_i^2$ represents the power of the random signal;

the total power of the random signal on each batch is denoted $\sigma_B^2$, with $\sigma_B^2 = \sum_{i=1}^{B} \sigma_i^2$;
step three, iterative computation: the total power of the random signal on each batch is reduced as the number of iterations increases, helping the algorithm escape saddle points and converge to a minimum;
step four, at each iteration, $\sigma_B^2$ is held fixed and the random signal power corresponding to each input sample on the batch is allocated based on the information transmission maximization criterion, specifically:

let $\sigma_i^2 = \left(\frac{1}{B}\big(\sigma_B^2 + \sum_{j=1}^{B} r_j\big) - r_i\right)^+$, wherein $(a)^+$ represents max(a, 0), $r_i$ represents the power of the ith input sample, and $\sum_{j=1}^{B} r_j$ represents the total power of the input samples on the batch;

if $\frac{1}{B}\big(\sigma_B^2 + \sum_{j=1}^{B} r_j\big) - r_i < 0$, the random signal power allocated to sample $x_i$ is 0.
2. The method of claim 1, wherein in step three, the total power $\sigma_B^2$ of the random signal on each batch is reduced as the number of iterations increases, preventing the update step from crossing the minimum and falling into the next saddle point.
3. The method of claim 1, wherein in step four, the objective function based on information transmission maximization is as follows:

$$\max_{\{\sigma_i^2\}} \sum_{i=1}^{B} \log\left(1 + \frac{\sigma_i^2}{p_i + n_i}\right) + \lambda\left(\sigma_B^2 - \sum_{i=1}^{B} \sigma_i^2\right)$$

wherein λ is a Lagrangian constant, $p_i$ represents the signal power, $n_i$ represents the noise power, and $p_i + n_i = r_i$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910141284.9A CN109919301A (en) | 2019-02-26 | 2019-02-26 | Deep neural network batch optimization method based on InfoMax criterion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910141284.9A CN109919301A (en) | 2019-02-26 | 2019-02-26 | Deep neural network batch optimization method based on InfoMax criterion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109919301A true CN109919301A (en) | 2019-06-21 |
Family
ID=66962418
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910141284.9A Pending CN109919301A (en) | 2019-02-26 | 2019-02-26 | Deep neural network batch optimization method based on InfoMax criterion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109919301A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111027717A (en) * | 2019-12-11 | 2020-04-17 | 支付宝(杭州)信息技术有限公司 | Model training method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190621 |